[Beowulf] Infiniband modular switches

Patrick Geoffray patrick at myri.com
Mon Jun 16 15:21:51 PDT 2008

Gilad Shainer wrote:
> Static routing is the best approach if your pattern is known. In other

If your pattern is known, and if it is persistent, and if it is
perfectly synchronized, and if you have a single job running on the
fabric, and if you have total control of the process/node mapping, and
if there are no down or bad links, and if there is no other traffic
pattern in the application, then yes, static routing is the best.

In real life, where multiple jobs run at once on various parts of a
cluster, where there are always some marginal links, where you cannot
guarantee which nodes a job will be allocated, where applications have
multiple communication patterns (collectives), and where load is
usually unbalanced, static routing is the worst.

> cases it depends on the applications. LANL and Mellanox have presented a
> paper on static routing and how to get the maximum of it last ISC. There

Single app, dedicated machine, total control of the network. Similarly,
I would have a pretty good shot at predicting the next lotto numbers if
I knew the position (and speed) of all atoms in the universe (Dr Brown,
this is for you!).

> are cases where adaptive routing will show a benefit, and this is why we
> see the IB vendors add adaptive routing support as well. But in general,
> the average effective bandwidth is much much higher than the 40% you
> claim.  

Have a look at slides 17 and 19 of the following set of slides (and
slides 21 and 22 to illustrate my point above):

Hoefler et al. have shown an average effective bisection of ~40% on
Infiniband (OMNeT simulations) in a paper submitted to Cluster2008. In a
paper to be presented at Hot Interconnects this year, I have measured
the effective bisection (SendRecv on random pairs) on a 512-node
Myri-10G cluster (single enclosure, 32-port crossbars) under various
routing implementations. Below is the link to pretty graphs with static
and probing adaptive routing:

You can see that worst-case static routing quickly drops below 40%,
and the average eventually gets there as well.
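
To make the random-pairs experiment concrete, here is a rough sketch of
how static routing loses bandwidth to link contention. Everything in it
is an assumption for illustration: a two-level Clos of 32-port crossbars
(32 leaves, 16 hosts and 16 uplinks each, 16 spines), a destination-modulo
static route, and a crude model where each flow gets 1/k of its most
loaded link. It is neither the measurement setup nor the OMNeT model
mentioned above, and it is not expected to reproduce the ~40% figure.

```python
import random
from collections import Counter

# Assumed topology (not from the measurements): 512 hosts on a
# two-level Clos built from 32-port crossbars.
HOSTS_PER_LEAF = 16
NUM_LEAVES = 32
NUM_SPINES = 16
NUM_HOSTS = HOSTS_PER_LEAF * NUM_LEAVES


def static_spine(dst):
    """Hypothetical static route: the spine depends only on the destination."""
    return dst % NUM_SPINES


def effective_bandwidth(dest_of):
    """dest_of[src] = dst. Return the mean per-flow share of link bandwidth.

    Crude model: a flow gets 1/k of the bandwidth, where k is the number
    of flows on its most loaded link (leaf uplink or spine downlink).
    """
    up, down, routes = Counter(), Counter(), []
    for src, dst in enumerate(dest_of):
        src_leaf, dst_leaf = src // HOSTS_PER_LEAF, dst // HOSTS_PER_LEAF
        if src_leaf == dst_leaf:
            routes.append(None)  # stays inside one crossbar, no contention
            continue
        spine = static_spine(dst)
        up[(src_leaf, spine)] += 1
        down[(spine, dst_leaf)] += 1
        routes.append(((src_leaf, spine), (spine, dst_leaf)))
    shares = [
        1.0 if r is None else 1.0 / max(up[r[0]], down[r[1]])
        for r in routes
    ]
    return sum(shares) / len(shares)


def random_pairs(n, rng):
    """Pair hosts at random; each host sends to its partner (SendRecv)."""
    nodes = list(range(n))
    rng.shuffle(nodes)
    dest_of = [0] * n
    for a, b in zip(nodes[0::2], nodes[1::2]):
        dest_of[a], dest_of[b] = b, a
    return dest_of


if __name__ == "__main__":
    rng = random.Random(2008)
    samples = [effective_bandwidth(random_pairs(NUM_HOSTS, rng))
               for _ in range(100)]
    print("average share:", sum(samples) / len(samples))
    print("worst share:  ", min(samples))
```

A pattern that happens to match the routes, such as every host sending
to the host 16 positions further on, gets full bandwidth in this model.
Random pairings do not, which is the whole point.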

> There are some vendors that uses only the 24 port switches to build very
> large scale clusters - 3000 nodes and above, without any
> oversubscription, and they find it more cost effective. Using single
> enclosures is easier, but the cables are not expensive and you can use

The price of a cable usually depends on its length (copper and fiber).
Using small switches at the edge lets you use very short cables to the
hosts (in-rack), but you still need the same number of longer cables to
connect to the spine. With a single enclosure, you may need longer
cables to reach the hosts (different rack), but you don't need cables to
the spine, as those links are on the switch backplane (and PCB is free).
Short cables may not be expensive, but they are not free. Furthermore,
physical cables are much less reliable than traces on a PCB, and they
take more space and more power.
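
The cable arithmetic above can be sketched as follows. The prices are
made-up placeholders, and the model assumes a two-level,
non-oversubscribed fabric (one uplink cable per host link); more levels
would only add cables on the edge-switch side.

```python
def cable_costs(num_hosts, short_price, long_price):
    """Compare cable spend for the two layouts (hypothetical prices).

    Edge switches: one short in-rack cable per host, plus one long
    cable to the spine per host link (no oversubscription).
    Single enclosure: one long cable per host; the spine links are on
    the backplane and need no cables.
    """
    edge = num_hosts * short_price + num_hosts * long_price
    enclosure = num_hosts * long_price
    return {"edge_switches": edge, "single_enclosure": enclosure}


if __name__ == "__main__":
    # Placeholder figures, for illustration only.
    print(cable_costs(num_hosts=512, short_price=20, long_price=80))
```

In this simple model the difference is exactly the cost of the short
cables, before counting reliability, space, and power.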

