[Beowulf] Multirail Clusters: need comments

Mon Dec 5 03:33:24 PST 2005

On Sun, 2005-12-04 at 13:13 -0500, Mark Hahn wrote:
> so you're talking about 6 GB/s over quad-rail.  it's hard to 
> imagine what you would do with that kind of bandwidth

This is a very bold assertion...

> only you can answer this.  my experience is that very few applications
> need anything like that much bandwidth - I don't think anyone ever saturated 
> 1x quadrics on our alpha/elan3 clusters (~300 MB/s), and even on
> opteron+myri-d, latency seems like more of an issue.

Your alpha/elan3 clusters would have been quad CPU machines.

> > Note: I found just one line in a CLRC Daresbury lab presentation about
> > quadrail Quadrics on Alpha (probably QsNet-1?) Any update on QsNet2/Eagle?
> 
> at least several qsnet2/elan4 clusters ran with dual-rail.  there seem to be 
> lots of dual-port IB cards out there, but I have no idea how many sites are
> using both ports.
> 
> as far as I can tell, dual-rail is quite a specialized thing, simply because
> the native interconnects are pretty damn fast at 1x, and because when you 
> double the number of switch ports, you normally _more_ than double the cost.
> this is mostly an issue when you hit certain thresholds related to the switch
> granularity of the fabric.  128x for quadrics, for instance.  once you start
> federating switches, you're swimming in cables.  it's often quite reasonable
> to go with not-full-bisection fabrics at this scale, but if you're doing 
> multirail in the first place, that doesn't make sense!

I can't speak for IB, but with Quadrics the second rail is *exactly* the
same in terms of topology as the first one, and hence the cost is double.
Not-full-bisection and federation relate to the size (number of ports)
of the network, not the number of rails.  Each rail needs its own host
bus to plug into, however (the bus is the bottleneck, not the network),
so you need to have the right machine in the first place, which may cost
more money.

The one point you have missed with multi-rail, however, is that network
bandwidth is per *node*, whereas the number of CPUs per node is for the
most part increasing.  1Gb/s seems like a lot (or at least it did), but
put it in a 16 CPU machine and all of a sudden you have *less* per-CPU
bandwidth than you had seven years ago in your alpha/elan3.  Couple that
with CPUs being n times faster to boot, and all of a sudden multi-rail
is starting to look less pie-in-the-sky and more like a good idea.
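
To put rough numbers on the per-CPU argument, here is a minimal sketch.
It takes the ~300 MB/s elan3 figure and the quad-CPU nodes from earlier
in the thread, and reads the 16 CPU example as roughly 1000 MB/s per
node, which is an assumption on my part.  The function name and the
model (bandwidth shared evenly between CPUs, scaling linearly with
rails) are illustrative only.

```python
def per_cpu_bandwidth(rail_mb_s, cpus_per_node, rails=1):
    """Node bandwidth divided evenly across CPUs, in MB/s.

    Assumes bandwidth scales linearly with the number of rails and
    that every CPU in the node gets an equal share.
    """
    return rail_mb_s * rails / cpus_per_node


# Quad-CPU alpha/elan3 node at ~300 MB/s (figures from the thread).
old = per_cpu_bandwidth(300, 4)

# 16 CPU node at an ASSUMED 1000 MB/s per rail, single and dual rail.
new_single = per_cpu_bandwidth(1000, 16)
new_dual = per_cpu_bandwidth(1000, 16, rails=2)

print(old, new_single, new_dual)
```

Under those assumptions the single-rail 16 CPU node has less bandwidth
per CPU than the old quad-CPU node, and a second rail is what puts it
back ahead.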

It's true that it won't buy you latency; how could it?  Bandwidth, for
the most part, does what you would expect: it increases linearly
with the number of rails.  There are some cases where this doesn't
appear to hold true.  For example, given a 16*16 machine, average
bandwidth between two CPUs won't quite double as you double the number
of rails, because 15/256 ranks are local to any given process and so get
the same bandwidth regardless of the number of rails.  This, however, is
simply a matter of understanding the topology of the machine.  Another
odd case is broadcast: assuming the network can deliver into a 16 CPU
machine at 2GB/s, shepherding this data inside the node to 16 distinct
memory locations at 2GB/s *each* isn't possible, and the network ends up
waiting for the node.
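
The 16*16 case can be sketched as a weighted average.  This is a toy
model, not a measurement: the function name, the unit bandwidths, and
the assumption that intra-node traffic runs at a fixed rate independent
of the rails are all mine.  It averages over the other 255 ranks, of
which 15 are on the same node.

```python
def average_pair_bandwidth(nodes, cpus_per_node, rails, rail_bw, local_bw):
    """Mean bandwidth from one rank to each other rank in the machine.

    Local peers (same node) get local_bw no matter how many rails
    there are; remote peers get rails * rail_bw.
    """
    peers = nodes * cpus_per_node - 1
    local_peers = cpus_per_node - 1
    remote_peers = peers - local_peers
    total = local_peers * local_bw + remote_peers * rails * rail_bw
    return total / peers


# Hypothetical unit bandwidths, chosen only to show the ratio.
one_rail = average_pair_bandwidth(16, 16, rails=1, rail_bw=1.0, local_bw=1.0)
two_rail = average_pair_bandwidth(16, 16, rails=2, rail_bw=1.0, local_bw=1.0)

speedup = two_rail / one_rail
print(speedup)
```

With these inputs the speedup from the second rail comes out a little
under 2, which is the "won't quite double" effect: the 15 local peers
gain nothing from the extra rail.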

The greatest number of rails I've ever seen in one machine was seven;
however, this was an old alpha test cluster and was done as a proof of
concept rather than an actual product.  It only had two nodes.

Ashley,