[Beowulf] Infiniband modular switches
Patrick Geoffray
patrick at myri.com
Mon Jun 16 15:21:51 PDT 2008
Gilad Shainer wrote:
> Static routing is the best approach if your pattern is known. In other
If your pattern is known, and if it is persistent, and it is perfectly
synchronized, and if you have a single job running on the fabric, and if
you have total control of the process/node mapping, and if there are no
down or bad links, and if there is no other traffic pattern in the
application, then yes, static routing is the best.
In real life, where there are multiple jobs running at once on various
parts of a cluster, where there are always some marginal links, where you
cannot guarantee on which nodes a job will be allocated, and where
applications have multiple communication patterns (collectives) and load
is usually unbalanced, static routing is the worst.
> cases it depends on the applications. LANL and Mellanox have presented a
> paper on static routing and how to get the maximum of it last ISC. There
Single app, dedicated machine, total control of the network. Similarly,
I would have a pretty good shot at predicting the next lotto numbers if
I knew the position (and speed) of all atoms in the universe (Dr.
Brown, this is for you!).
> are cases where adaptive routing will show a benefit, and this is why we
> see the IB vendors add adaptive routing support as well. But in general,
> the average effective bandwidth is much much higher than the 40% you
> claim.
Have a look at slides 17 and 19 of the following set (and slides 21 and
22 to illustrate my point above):
http://www.openib.org/archives/spring2007sonoma/Monday%20April%2030/Leininger-Seager-Adaptive-Routing-OFA-Sonoma-2007-v03.pdf
Hoefler et al. have shown an average effective bisection of ~40% on
Infiniband (OMNeT simulations) in a paper submitted to Cluster2008. In a
paper to be presented at Hot Interconnects this year, I have measured
the effective bisection (SendRecv on random pairs) on a 512-node
Myri-10G cluster (single enclosure, 32-port crossbars) under various
routing implementations. Below is the link to pretty graphs with static
and probing adaptive routing:
http://patrick.geoffray.googlepages.com/staticvsadaptiverouting
You can see that worst-case static routing quickly drops below 40%, and
the average eventually gets there as well.
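To make the "SendRecv on random pairs" measurement concrete, here is a toy model of it in Python. It is only a sketch under loud assumptions, not the method used in either paper: the two-level Clos layout (32 leaf crossbars, each with 16 hosts and 16 uplinks), the destination-based choice of spine (`dst % num_spines`), and the rule that a flow gets 1/(load of its busiest link) are all illustrative guesses, and the function name `effective_bisection` is hypothetical. It will not reproduce the measured or simulated figures above; it only shows why fixed routes plus random pairs lose bandwidth to link collisions.

```python
import random
from collections import Counter

def effective_bisection(num_leaves=32, hosts_per_leaf=16, trials=20, seed=0):
    """Toy estimate of effective bisection under static routing.

    Returns (worst, average) fraction of full link bandwidth per flow,
    taken over `trials` random pairings. All modeling choices here are
    assumptions made for illustration.
    """
    rng = random.Random(seed)
    n = num_leaves * hosts_per_leaf
    num_spines = hosts_per_leaf  # assumed: one uplink per host port
    fractions = []
    for _ in range(trials):
        # Random pairs: shuffle the hosts and pair neighbours in the list.
        hosts = list(range(n))
        rng.shuffle(hosts)
        flows = []
        for i in range(0, n, 2):
            a, b = hosts[i], hosts[i + 1]
            flows.append((a, b))  # SendRecv is bidirectional,
            flows.append((b, a))  # so each pair yields two flows.
        up, down, paths = Counter(), Counter(), []
        for src, dst in flows:
            src_leaf, dst_leaf = src // hosts_per_leaf, dst // hosts_per_leaf
            if src_leaf == dst_leaf:
                paths.append(None)  # stays inside one crossbar
                continue
            spine = dst % num_spines  # static route: fixed by destination
            up[(src_leaf, spine)] += 1
            down[(spine, dst_leaf)] += 1
            paths.append(((src_leaf, spine), (spine, dst_leaf)))
        total = 0.0
        for path in paths:
            if path is None:
                total += 1.0
            else:
                # Assumed sharing rule: the busiest link on the path
                # is split evenly between the flows that cross it.
                total += 1.0 / max(up[path[0]], down[path[1]])
        fractions.append(total / len(flows))
    return min(fractions), sum(fractions) / len(fractions)

if __name__ == "__main__":
    worst, average = effective_bisection()
    print(f"worst {worst:.2f}, average {average:.2f}")
```

A probing adaptive scheme would replace the fixed `dst % num_spines` choice with a search for a less loaded spine, which is the comparison the graphs make.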
> There are some vendors that uses only the 24 port switches to build very
> large scale clusters - 3000 nodes and above, without any
> oversubscription, and they find it more cost effective. Using single
> enclosures is easier, but the cables are not expensive and you can use
Price of cables usually depends on the length (copper and fiber). Using
small switches at the edges lets you use very short cables to the hosts
(in-rack) but you still have to use the same number of longer cables to
connect to the spine. With a single enclosure, you may need longer
cables to reach the hosts (different rack), but you don't need cables to
the spine as they are on the switch backplane (and PCB is free). Short
cables may not be expensive, but they are not free. Furthermore,
physical cables are much less reliable than wires on a PCB, and they take
more space and more power.
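The cable arithmetic can be sketched as follows. This is a rough count, assuming a fully populated folded Clos (fat tree) with no oversubscription, where each level above the edge needs about one inter-switch link per host; the function name `fat_tree_cables` is hypothetical. In a single enclosure the inter-switch links counted here are on the backplane rather than physical cables.

```python
def fat_tree_cables(hosts, radix):
    """Rough cable count for a non-oversubscribed fat tree.

    Assumes a folded Clos built from `radix`-port switches, where a tree
    with L levels supports at most 2 * (radix / 2) ** L hosts.
    """
    half = radix // 2
    levels = 1
    capacity = radix  # a single switch connects `radix` hosts
    while capacity < hosts:
        levels += 1
        capacity = 2 * half ** levels
    return {
        "levels": levels,
        "host_cables": hosts,                        # one per host
        "inter_switch_cables": hosts * (levels - 1)  # about one per host per extra level
    }

if __name__ == "__main__":
    # 3000 hosts on 24-port switches need three levels, because two
    # levels top out at 2 * 12**2 = 288 hosts.
    print(fat_tree_cables(3000, 24))
```

Under these assumptions, 3000 hosts on 24-port switches need about 6000 inter-switch cables on top of the 3000 host cables.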
Patrick