[Beowulf] Multirail Clusters: need comments

Mark Hahn hahn at physics.mcmaster.ca
Sun Dec 4 10:13:25 PST 2005


> esp. quad rail networks. Are quad rail networks practical to implement
> (say InfiniBand, Quadrics, Myrinet (GE, maybe not?))?

sure they're practical; do you _really_ have a need for that much bandwidth?

> - What are the issues?

cost, cabling, lack of need?

> - Are there any quad rail HPC clusters? (me and google couldn't find any)

good lord!  current quadrics is limited to about 1 GB/s aggregate,
but decent IB (including Infinipath) manages >1.5 GB/s (rx+tx) over one 
link, so you're talking about 6 GB/s over quad-rail.  it's hard to 
imagine what you would do with that kind of bandwidth, not to mention
how much CPU it would take to push around the bytes, and whether you'd
have anything left over for _computation_.

or are you asking in the context of a fabric for large SMPs?
(that is, very fat nodes).  or perhaps some kind of block-data server
that actually doesn't do any computation?  perhaps my mind is feeble,
but it's hard to see how even 3 GB/s would make sense today: you'd 
have a hard time sustaining that with disks, even $$$ arrays.
and how much memory do you have per node?  even if you have 128 GB,
you'll suck it dry in 43 seconds.
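
back of the envelope (python; the link rates are the round numbers quoted
above - illustrative arithmetic, not measurements of any particular hardware):

  per_rail_gbs = 1.5        # ~rx+tx of one decent IB/Infinipath link, GB/s
  rails = 4
  aggregate_gbs = per_rail_gbs * rails          # ~6 GB/s quad-rail
  node_mem_gb = 128.0
  sustained_gbs = 3.0                           # even at half the quad-rail rate
  drain_secs = node_mem_gb / sustained_gbs      # ~43 s to stream all of memory
  print(aggregate_gbs, round(drain_secs))       # 6.0 43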

> - Are drivers an issue?

afaict, no.  I've never seen any application which could justify 
dual-rail, but quadrics, IB and Myri seem to do it nicely.  it is,
after all, not conceptually difficult.  (and sure, dual-rail GE is 
pretty doable as well, though things like LACP need multiple streams
to divide the traffic over, since they don't stripe like raid0.)
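
to illustrate the LACP point, a toy sketch - hypothetical hash, not the
actual 802.3ad frame-distribution rules:

  links = 2

  def lacp_link(src, dst):
      # LACP-style: the whole flow hashes to one link, so a single
      # stream never gets more than one link's worth of bandwidth
      return hash((src, dst)) % links

  def raid0_link(frame_number):
      # raid0-style striping would spread even a single stream
      return frame_number % links

  print({lacp_link("hostA", "hostB") for _ in range(1000)})   # one link used
  print({raid0_link(n) for n in range(1000)})                 # both links used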

> - Does the performance increase significantly to justify the cost and
> complexity?

only you can answer this.  my experience is that very few applications
need anything like that much bandwidth - I don't think anyone ever saturated 
1x quadrics on our alpha/elan3 clusters (~300 MB/s), and even on
opteron+myri-d, latency seems like more of an issue.

personally, I find that latency is more interesting: as algorithms get
smarter, they tend to spend more of their time latency-dominated.
as programs scale to more cpus, the latency of collectives becomes more 
important.  I've never seen a study showing that multirail helps
latency - in fact, fractional-bisection fabrics are often claimed 
to _not_ hurt latency...
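
to put a rough number on the collectives point, the usual log-tree cost
model for a small-message allreduce (made-up constants, not a benchmark):

  from math import ceil, log2

  def allreduce_usec(procs, latency_us=5.0, msg_bytes=64, bytes_per_us=1500.0):
      # each of ~log2(P) steps pays one latency plus a tiny bandwidth term
      return ceil(log2(procs)) * (latency_us + msg_bytes / bytes_per_us)

  for p in (16, 256, 4096):
      print(p, round(allreduce_usec(p), 1))     # ~20.2, 40.3, 60.5 usec
  # doubling bandwidth barely moves these; halving latency nearly halves them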

> The assumption is that the plan is to solve comm or BW crippled apps, on
> large SMP nodes (say 8+ CPUs).

you need to look closely at your apps, not guess about this kind of thing.
for instance, how much of your per-cpu bandwidth need is satisfied 
in-box, without resorting to the interconnect?  if the granularity of your 
SMP grows, but apps don't scale to more cpus, you need _less_ interconnect
bandwidth...
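
a toy model of the in-box effect, assuming a 3D nearest-neighbour/halo app
with ranks packed into cubic blocks per node (purely illustrative):

  # only the surface of each node's block of ranks touches the interconnect,
  # so the off-node share of traffic falls as nodes get fatter
  for ranks_per_node in (1, 8, 64, 512):
      side = round(ranks_per_node ** (1.0 / 3.0))
      print(ranks_per_node, 1.0 / side)         # 1.0, 0.5, 0.25, 0.125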

> Note: I found just one line in a CLRC Daresbury lab presentation about
> quadrail Quadrics on Alpha (probably QsNet-1?) Any update on QsNet2/Eagle?

at least several qsnet2/elan4 clusters ran with dual-rail.  there seem to be 
lots of dual-port IB cards out there, but I have no idea how many sites are
using both ports.

as far as I can tell, dual-rail is quite a specialized thing, simply because
the native interconnects are pretty damn fast at 1x, and because when you 
double the number of switch ports, you normally _more_ than double the cost.
this is mostly an issue when you hit certain thresholds related to the switch
granularity of the fabric.  128 ports for quadrics, for instance.  once you start
federating switches, you're swimming in cables.  it's often quite reasonable
to go with not-full-bisection fabrics at this scale, but if you care enough
about bandwidth to be doing multirail in the first place, cutting bisection
doesn't make sense!
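
rough port/cable counting for a federated, full-bisection two-stage fabric
built from fixed-radix switches (idealized Clos arithmetic; real products
package things differently):

  def federated(nodes, radix=128):
      leaf = -(-nodes // (radix // 2))    # ceil: radix/2 nodes per leaf switch
      spine = -(-nodes // radix)          # ~one uplink per node needs a spine port
      cables = nodes * 2                  # node->leaf, plus one uplink per node
      ports = (leaf + spine) * radix
      return leaf, spine, cables, ports

  print(federated(512))   # (8, 4, 1024, 1536): 2x the cables and ~3x the
                          # switch ports of a single 512-port switch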



