Fwd: [Beowulf] QDR InfiniBand interconnect architectures ... approaches ...

Craig Tierney Craig.Tierney at noaa.gov
Thu Apr 8 15:12:26 PDT 2010


richard.walsh at comcast.net wrote:
> On Thursday, April 8, 2010 2:42:49 PM Craig Tierney wrote:
>
>> We have been telling our vendors to design a multi-level tree using
>> 36 port switches that provides approximately 70% bisection bandwidth.
>> On a 448 node Nehalem cluster, this has worked well (weather, hurricane,
>> and some climate modeling). This design (15 up/21 down) allows us to
>> scale the system to 714 nodes.
>
> Hey Craig,
>
> Thanks for the information. So are you driven mostly by the need
> for incremental expandability with this design, or do you disagree
> with Greg and think that the cost is as good or better than a chassis
> based approach? What about reliability (assuming the vendor is
> putting it together for you) and maintenance headaches? Not so
> bad? What kind of cabling are you using?
>

It was cheaper at the time by a lot.

The vendor did not put it together for us.  However, we have had the
same team doing this stuff (building clusters) for 10 years.  So we
tell vendors what to do as we have all the value-add we need.  That
will end sometime, but it works for now.

We have had no maintenance headaches.  The reliability is fine.  We
are using copper QDR cables for the short runs and go to fibre for some
of the longer ones.  I can get specifics on the cable manufacturers if
you need them.

>
> Trying to do the math on the design ... for the 448 nodes you would
> need 22 switches for the first tier (22 * 21 = 462 down). That gives
> you (15 * 22 = 330 uplinks), so you need at least 10 switches in the
> second tier (10 * 36 = 360) which leaves you some spare ports for
> other things. Am I getting this right? Could you lay out the design
> in a bit more detail? Did you consider building things from medium
> size switches (say 108 port models)? Are you paying a premium
> for incremental expandability or not? How many ports are you using
> for your file server?

We have 7 racks of compute nodes, each with 64 nodes.  Each rack has
three 36-port switches.  21 nodes plug into each switch, and the last
one plugs into a switch in the main IB switch rack.

We run 15 cables from each of the node switches to the spines.  For
a full tree that leads to 34 ports used on each spine.  The other two
ports (from each of the 15 spine switches) have cables that run up to a
higher tier for IO.
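
If it helps to see the arithmetic, here is a quick back-of-the-envelope
sketch (plain Python, my own variable names, nothing from our actual
configs) of the port accounting behind the 15 up/21 down layout:

  # Rough accounting for the 15-up/21-down tree described above.
  SWITCH_PORTS = 36                            # 36-port QDR switches
  DOWN_PER_LEAF = 21                           # node links per leaf switch
  UP_PER_LEAF = SWITCH_PORTS - DOWN_PER_LEAF   # 15 uplinks, one per spine
  SPINES = UP_PER_LEAF                         # 15 spine switches

  # Bisection bandwidth relative to a full fat tree
  bisection = UP_PER_LEAF / DOWN_PER_LEAF      # 15/21, about 71%

  # Each leaf runs one cable to each spine; 2 spine ports are kept for
  # the IO tier, so a spine tops out at 34 leaves.
  MAX_LEAVES = SWITCH_PORTS - 2                # 34
  MAX_NODES = MAX_LEAVES * DOWN_PER_LEAF       # 34 * 21 = 714

  # The 448-node system: 7 racks x 64 nodes, 3 leaf switches per rack.
  racks, leaves_per_rack = 7, 3
  leaves = racks * leaves_per_rack             # 21 leaf switches
  nodes_on_leaves = leaves * DOWN_PER_LEAF     # 441; the 64th node in each
                                               # rack lands in the main IB rack
  spine_ports_used = leaves                    # 21 of the 34 per spine so far

  print(f"bisection ~ {bisection:.0%}")
  print(f"max nodes with IO ports reserved: {MAX_NODES}")
  print(f"leaf switches today: {leaves}, spine ports used: {spine_ports_used}/{MAX_LEAVES}")

That is where the roughly 70% bisection and the 714-node ceiling come from.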

As far as the IO goes, you need to visualize that we have 4 clusters.
Three of them (360 node, 252 node, 448 node) have two levels to the
tree.  The last is a GPU cluster with 16 nodes.  All of these systems
connect up to another level of IB switches (small ones, not large ones).

Our filesystems plug into this tree as well.  We used to have Rapidscale
(ugh), but now we have 3 DDN/Lustre solutions and 1 Panasas solution.
The Rapidscale hardware has been repurposed for Lustre testing as well, so
in aggregate we have about 30 GB/s of IO across all the systems.

We consider every technical configuration that can save us money.  There
was no design that used larger switches as building blocks that would
reduce the price.  We paid extra for the expandability, but it still
wasn't as much as buying the big switches.

Yes, the 2 cables from each spine are overkill for performance.  The
designer plans not to use 2 from each switch next time.

>
> Our system is likely to come in at 192 nodes with some additional
> ports for file server connection. I would like to compare the cost
> of a 216 port switch to your 15/21 design using 36 port switches.
>

So if you did a design like ours, you would have 4 racks.  Three would
be for compute nodes (if you use the twin-type Supermicro solution or
similar), each with three 36-port switches.  The fourth rack would be
for the IB switches and other equipment.  That system is small enough
that you shouldn't need any fibre.

So you would have 9 switches in the racks, and 5 spine switches (3 cables
from each rack switch to each spine).  Each spine would use 27 ports for
the compute side, and you would have 9 extra (overkill) for your IO
system; the quick tally after the parts list works that out.

Total parts:

14 36-port switches
36*15 cables, but all copper.  Cost doesn't change much by length.
Additional cables to connect the IO system.
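
As a sanity check on those totals, here is the same kind of sketch for the
192-node case (Python again, my names; IO cables left out since they depend
on how your file servers attach):

  # Tally for the 192-node variant: 9 leaf switches, 5 spines,
  # 15 up / 21 down on each leaf.
  LEAF_PORTS = 36
  DOWN_PER_LEAF = 21
  UP_PER_LEAF = LEAF_PORTS - DOWN_PER_LEAF       # 15

  nodes, leaves, spines = 192, 9, 5
  cables_leaf_to_spine = UP_PER_LEAF // spines   # 3 per leaf per spine

  spine_ports_to_compute = leaves * cables_leaf_to_spine   # 27
  spine_ports_spare = LEAF_PORTS - spine_ports_to_compute  # 9 left for IO

  switches = leaves + spines                     # 14 switches total
  node_cables = nodes                            # node -> leaf, all copper
  uplink_cables = leaves * UP_PER_LEAF           # 135 leaf -> spine

  print(f"switches: {switches}")
  print(f"spine ports to compute: {spine_ports_to_compute}, spare per spine: {spine_ports_spare}")
  print(f"copper cables (nodes + uplinks): {node_cables + uplink_cables}")

Whatever the IO connections need comes on top of that.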

If you find that is cheaper than a single switch, please let me know.

Who sells a 216 port switch?  Are you looking at the larger Voltaire
chassis where you install a number of line boards?

Craig



