[Beowulf] How to configure a cluster network

Mark Hahn hahn at mcmaster.ca
Thu Jul 24 14:00:33 PDT 2008

> Well the top configuration (and the one that I suggested) is the one
> that we have tested and know works. We have implemented it in
> hundreds of clusters. It also provides redundancy for the core
> switches.

just for reference, it's commonly known as "fat tree", and is indeed
widely used.

> With any network you need to avoid like the plague any kind of loop,
> they can cause weird problems and are pretty much unnecessary. for

well, I don't think that's true - the most I'd say is that given
the usual spanning-tree protocol for eth switches, loops are a bug.
but IB doesn't use eth's STP, and even smarter eth networks can take
good advantage of multiple paths, even loopy ones.
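
to make the STP point concrete, here is a rough sketch of what a
spanning tree does to a triangle of switches: it keeps a loop-free
subset of the links and blocks the rest, so the redundant link carries
no traffic. this is only a breadth-first-search stand-in, not the real
802.1D election (no bridge IDs, port costs or timers); the switch names
and the function name are made up for illustration.

```python
from collections import deque

def spanning_tree_links(links, root):
    """Return the set of links kept by a BFS spanning tree rooted at `root`.

    Simplified stand-in for STP: real STP elects the root by bridge ID
    and picks paths by port cost, which this sketch does not model.
    """
    # Build an adjacency list from the undirected link list.
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    seen = {root}
    active = set()
    queue = deque([root])
    while queue:
        sw = queue.popleft()
        for nb in sorted(adj[sw]):
            if nb not in seen:
                seen.add(nb)
                active.add(frozenset((sw, nb)))  # link stays forwarding
                queue.append(nb)
    return active

# Three hypothetical switches wired in a triangle (a loop).
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
active = spanning_tree_links(triangle, "A")
blocked = {frozenset(link) for link in triangle} - active

print("forwarding:", sorted(sorted(l) for l in active))
print("blocked:   ", sorted(sorted(l) for l in blocked))
```

with "A" as root, the B-C link ends up blocked, which is why a loop
buys nothing under plain STP; a fabric that routes over all links can
use that third side.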

> instance, why would you put a line between the two core switches? Why
> would that line carry any traffic?

indeed - those examples don't make much sense.  but there are many others
that involve loops that could be quite nice.  consider 36 nodes: with
two 24-port switches, you get 3:1 blocking (18 nodes per switch sharing
6 inter-switch links).  with 3 switches, you can do 2:1 blocking (12 nodes
per switch, with 6 links on each side of a triangle, forming a loop.)
dual-port nics provide even more entertainment (FNN, i.e. flat neighborhood
networks, but also the ability to tolerate a leaf-switch failure...)
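
the blocking figures above are just port arithmetic, and can be checked
with a short sketch. it assumes nodes are split evenly across the
switches, that every spare port becomes an inter-switch link, and that
the worst case is every node on one switch talking to nodes on one
other switch over the direct links only. the function names are made up
for illustration.

```python
from fractions import Fraction

def blocking_two_switches(nodes, ports):
    """Two switches joined by all spare ports.

    Returns (blocking ratio, number of inter-switch links).
    """
    per_switch = nodes // 2
    links = ports - per_switch          # spare ports all go to the peer
    return Fraction(per_switch, links), links

def blocking_triangle(nodes, ports):
    """Three switches in a triangle, spare ports split between both peers.

    Returns (blocking ratio, links on each side of the triangle).
    """
    per_switch = nodes // 3
    links_per_side = (ports - per_switch) // 2
    return Fraction(per_switch, links_per_side), links_per_side

two = blocking_two_switches(36, 24)
tri = blocking_triangle(36, 24)
print(f"2 x 24-port: {two[0]}:1 blocking, {two[1]} inter-switch links")
print(f"3 x 24-port: {tri[0]}:1 blocking, {tri[1]} links per side")
```

for 36 nodes on 24-port switches this gives 3:1 with two switches and
2:1 with the triangle, matching the numbers above.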

> When you consider that it takes 2-4µs for an mpi message to get from

depends on the nic - mellanox claims ~1 us for connectx (haven't seen it 
myself yet.)  I see 4-4.5 us latency (worse than myri 2g mx!) on pre-connectx
mellanox systems.

> one node to another on the same switch, each extra hop will only
> introduce another 0.02µs (I think?) to that latency so it's not really

with current hardware, I think 100ns per hop is about right.  mellanox claims
60ns for the latest stuff.
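
to put the per-hop cost in proportion, here is a small sketch that adds
switch hops to a same-switch latency. the end-to-end and per-hop
figures are the ones quoted in this thread; the choice of 2 extra hops
(leaf -> core -> leaf instead of a single switch) is an assumption, as
is the function name.

```python
def added_latency(base_ns, extra_hops, per_hop_ns):
    """Return (total latency in ns, fractional overhead of the extra hops)."""
    extra_ns = extra_hops * per_hop_ns
    return base_ns + extra_ns, extra_ns / base_ns

EXTRA_HOPS = 2  # assumed: two more switches than the same-switch case

# (label, same-switch latency in ns, per-hop latency in ns)
cases = [
    ("pre-connectx, 100ns/hop", 4000, 100),
    ("connectx claim, 100ns/hop", 1000, 100),
    ("connectx claim, 60ns/hop", 1000, 60),
]

for label, base_ns, per_hop_ns in cases:
    total_ns, overhead = added_latency(base_ns, EXTRA_HOPS, per_hop_ns)
    print(f"{label}: {total_ns} ns total, {overhead:.0%} over same-switch")
```

the extra hops matter more as nic latency drops: 200ns is a small
fraction of 4 us but a much larger one of 1 us.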

> Most applications don't use anything like the full bandwidth of the
> interconnect so the half bisectionalness of everything can generally
> be safely ignored.

everything is simple for single-purpose clusters.  for a shared cluster
with a variety of job types, especially for large user populations, large 
jobs and large clusters, you want to think carefully about how much to
compromise the fabric.  consider, for instance, interference between a
bw-heavy weather code and some latency-sensitive application (big and/or
