[Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes

Gus Correa gus at ldeo.columbia.edu
Thu Sep 3 10:25:01 PDT 2009

Rahul Nabar wrote:
> On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa<gus at ldeo.columbia.edu> wrote:
>> See these small SDR switches:
>> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13
>> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10
>> And SDR HCA card:
> Thanks Gus! This info was very useful. A 24-port switch is $2400 and
> the card $125. Thus each compute node would be approximately $300 more
> expensive. (How about InfiniBand cables? Are those special, and how
> expensive are they? I did google but was overwhelmed by the variety available.)

Hi Rahul

IB cables (0.5-8m,$40-$109):

etc ...
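
As a rough check of the "approximately $300 per node" figure, here is a
small Python sketch that uses only the prices quoted in this thread
(24-port switch $2400, HCA $125, cables $40-$109). The function name is
made up for illustration, and it assumes one HCA and one cable per node
and ignores any ports or cables used for switch-to-switch links, so treat
it as a lower bound rather than a quote.

```python
def ib_cost_per_node(switch_price, switch_ports, hca_price, cable_price):
    """Per-node IB cost: one share of a switch port, one HCA, one cable.

    Assumption: every switch port serves a compute node (no uplinks).
    """
    return switch_price / switch_ports + hca_price + cable_price


# Prices quoted in the thread; the cable price range is $40 to $109.
low = ib_cost_per_node(2400, 24, 125, 40)
high = ib_cost_per_node(2400, 24, 125, 109)
print(f"per-node IB cost: ${low:.0f} to ${high:.0f}")
# prints: per-node IB cost: $265 to $334
```

So the $300 estimate sits in the middle of the range, before any
switch-to-switch cabling is counted.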

> This isn't bad at all I think. If I base it on my current node price
> it would require only about a 20% performance boost to justify this
> investment. I feel Infy could deliver that. When I had calculated it
> the economics was totally off; maybe I had wrong figures.
> The price-scaling seems tough though. Stacking 24 port switches might
> get a bit too cumbersome for 300 servers. 

It probably will.
I will defer any comments to the network pros on the list.

Here is a suggestion.
I would guess that if you don't intend to run the codes,
say, on more than 24-36 nodes at once, you might as well not stack all 
the small IB switches.
I.e., you could divide the cluster
IB-wise into smaller units, of perhaps 36 nodes or so, with 2-3
switches serving each unit.
Not sure how to handle the IB subnet manager(s) in such a configuration,
but there may be ways around it.
This scheme may take some scheduler configuration to
handle MPI job submission,
but it may save you money and hardware/cabling complexity,
and still let you run MPI programs with a substantial
number of processes.
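
The "smaller units" idea above can be sketched in a few lines of Python.
This is only an illustration under assumptions: the unit size of 36 comes
from the text, but the function names and the list of free nodes are
hypothetical, and a real scheduler would express the same constraint
through its own queue or partition configuration rather than code like
this.

```python
def partition_into_islands(n_nodes, island_size):
    """Split node indices 0..n_nodes-1 into consecutive IB islands."""
    return [range(start, min(start + island_size, n_nodes))
            for start in range(0, n_nodes, island_size)]


def pick_island(free_nodes_per_island, n_job_nodes):
    """Return the index of the first island that can hold the whole job.

    An MPI job must stay inside one island, because islands are not
    connected to each other over IB. Returns None if no island fits.
    """
    for idx, free in enumerate(free_nodes_per_island):
        if free >= n_job_nodes:
            return idx
    return None


islands = partition_into_islands(300, 36)
print(len(islands), "islands; last one has", len(islands[-1]), "nodes")
# prints: 9 islands; last one has 12 nodes

# Hypothetical snapshot of free nodes in three islands, and a 24-node job.
print(pick_island([10, 36, 20], 24))
# prints: 1
```

With 300 nodes and units of 36, the last unit is only partly filled, so
the largest job that can always be placed is bounded by the unit size,
not by the cluster size.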

You can still fully connect the 300 nodes through Gbit Ether, for admin
and I/O purposes, stacking 48-port GigE switches.
IB is a separate (set of) network(s),
which I assume will be dedicated to MPI only.

You may want to check the 36-port IB switches also, but IIRR they are
only DDR and QDR, not SDR, and somewhat more expensive.

> But when I look at
> corresponding 48 or 96 port switches the per-port-price seems to shoot
> up. Is that typical?

I was told the current IB switch price threshold is 36-port.
Above that it gets too expensive, and the cost-effective
solution is stacking smaller switches.
I'm just passing the information/gossip along.

>> For a 300-node cluster you need to consider
>> optical fiber for the IB uplinks,
> You mean compute-node-to-switch and switch-to-switch connections?
> Again, any $$$ figures, ballpark?

I would guess you may need optical fiber for switch-switch connections.
Depending on the distance, of course,
say, across two racks, if this type of connection is needed.
Regular IB cables are probably able to handle the node-switch links,
if the switches are distributed across the racks.

>> I don't know about your computational chemistry codes,
>> but for climate/oceans/atmosphere (and probably for CFD)
>> IB makes a real difference w.r.t. Gbit Ethernet.
> I have a hunch (just a hunch) that the computational chemistry codes
> we use haven't been optimized to get the full advantage of the latency
> benefits etc. Some of the stuff they do is pretty bizarre and
> inefficient if you look at their source codes (e.g. writing to large I/O
> files all the time). I know this ought to be fixed but that
> seems a problem for another day!

Not only your Chem codes.
Brute force I/O is rampant here also.
Some codes take pains to improve MPI communication on the domain 
decomposition side, with asynchronous communication, etc,
then squander it all by letting everybody do I/O in unison.
(Hence, keep in mind Joshua's posting about educating users and
adjusting codes to do I/O gently.)
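
One possible reading of "do I/O gently" is to let only a few processes
touch the file system at a time instead of all of them in unison. The
sketch below is plain Python with no MPI calls; the function names and
the choice of 4 writers per round are assumptions for illustration, not
something taken from any of the codes discussed here. In a real MPI
program the rounds would be separated by a barrier or by passing a token
between ranks.

```python
def io_round(rank, writers_per_round):
    """Round (0-based) in which this rank is allowed to do its I/O."""
    return rank // writers_per_round


def io_schedule(n_ranks, writers_per_round):
    """Group ranks into rounds; ranks in the same round write together."""
    n_rounds = -(-n_ranks // writers_per_round)  # ceiling division
    return [[r for r in range(n_ranks)
             if io_round(r, writers_per_round) == k]
            for k in range(n_rounds)]


# Hypothetical example: 10 ranks, at most 4 writing at once.
schedule = io_schedule(10, 4)
for k, ranks in enumerate(schedule):
    print(f"round {k}: ranks {ranks}")
# prints:
# round 0: ranks [0, 1, 2, 3]
# round 1: ranks [4, 5, 6, 7]
# round 2: ranks [8, 9]
```

The trade-off is that I/O takes more rounds, but the file server sees a
bounded number of simultaneous writers.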

I hope this helps.
Gus Correa
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
