[Beowulf] Infiniband modular switches

Gilad Shainer Shainer at mellanox.com
Thu Jun 26 17:16:52 PDT 2008

Patrick Geoffray wrote:

> > There are cases where adaptive routing will show a benefit, and this
> > is why we see the IB vendors add adaptive routing support as well.
> > But in general, the average effective bandwidth is much much higher
> > than the 40% you claim.
> Have a look at the slides 17 and 19 of the following set of slides
> (and slides 21 and 22 to illustrate my point above):
> http://www.openib.org/archives/spring2007sonoma/Monday%20April

Not only was I there, but I also had conversations afterwards. It is
a really "fair" comparison when the injection rate and network capacity
parameters differ. You could just as well take 10Mb and inject it into
a 10Gb/s network to show the same thing, and you can always create the
network pattern to show what you want to show, but that proves nothing.
I am not in favor of static routing only or adaptive routing only;
having both options is the most flexible solution.

> Hoefler et al. have shown an average effective bisection of ~40% on
> Infiniband (OMNeT simulations) in a paper submitted to Cluster2008.
> In a paper to be presented at Hot Interconnects this year, I have
> measured the effective bisection (SendRecv on random pairs) on a
> 512-node Myri-10G cluster (single enclosure, 32-port crossbars) under
> various routing implementations. Below is the link to pretty graphs
> with static and probing adaptive routing:
> http://patrick.geoffray.googlepages.com/staticvsadaptiverouting
> You can see that the worst case static routing goes quickly below
> 40%, but the average eventually goes there as well.

So what is your proof point here? I am sure you will find many cases
where static routing does better (definitely on other interconnects)
and cases where adaptive routing does better.
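
To make the quantity under discussion concrete, here is a toy model of
"effective bisection" for SendRecv on random pairs with static routing.
It is only a sketch under assumptions that are not from either paper: a
non-blocking two-level leaf/spine fat tree, destination-modulo uplink
selection as the static routing rule, and an equal split of a link
among the flows sharing it. The function name and parameters are
hypothetical, and the model is not meant to reproduce the numbers
quoted above.

```python
import random
from collections import Counter


def effective_bandwidth(num_leaves, hosts_per_leaf, rng):
    """Mean per-flow bandwidth (fraction of link rate) for random pairs.

    Assumed topology: each leaf switch has `hosts_per_leaf` hosts and one
    uplink to each of `hosts_per_leaf` spine switches (non-blocking).
    Assumed static routing: a flow leaves its leaf on the uplink indexed
    by (destination id mod hosts_per_leaf).
    """
    n = num_leaves * hosts_per_leaf
    hosts = list(range(n))
    rng.shuffle(hosts)

    # Pair hosts at random; each pair exchanges data in both directions.
    flows = []
    for i in range(0, n - 1, 2):
        a, b = hosts[i], hosts[i + 1]
        flows.append((a, b))
        flows.append((b, a))

    # Count flows per (source leaf, uplink). Downlinks never collide in
    # this model: each spine-to-leaf link serves exactly one destination
    # host, and every host receives exactly one flow.
    uplink_load = Counter()
    for src, dst in flows:
        if src // hosts_per_leaf != dst // hosts_per_leaf:
            uplink_load[(src // hosts_per_leaf, dst % hosts_per_leaf)] += 1

    # Approximation: a flow gets 1/load of its uplink; flows that stay
    # inside one leaf switch run at full rate.
    total = 0.0
    for src, dst in flows:
        src_leaf = src // hosts_per_leaf
        if src_leaf == dst // hosts_per_leaf:
            total += 1.0
        else:
            total += 1.0 / uplink_load[(src_leaf, dst % hosts_per_leaf)]
    return total / len(flows)


if __name__ == "__main__":
    rng = random.Random(2008)
    samples = [effective_bandwidth(32, 16, rng) for _ in range(100)]
    print("mean over 100 random pairings:", sum(samples) / len(samples))
```

Changing the pairing pattern or the ratio of injection rate to link
capacity changes the result, which is exactly the sensitivity being
argued about in this thread.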

> > There are some vendors that use only the 24 port switches to build
> > very large scale clusters - 3000 nodes and above, without any
> > oversubscription, and they find it more cost effective. Using single
> > enclosures is easier, but the cables are not expensive and you can use
> Price of cables usually depends on the length (copper and fiber).
> Using small switches at the edges allows very short cables to the
> hosts (in-rack) but you still have to use the same number of longer
> cables to connect to the spine. With a single enclosure, you may need
> longer cables to reach the hosts (different rack), but you don't need
> cables to the spine as they are on the switch backplane (and PCB is
> free). Short cables may not be expensive, but they are not free.
> Furthermore, physical cables are much less reliable than wire on PCB,
> and they take more space, more power.

Again, case by case. You can build a large cluster with very short
cables. Some vendors find that better and some prefer to use large
switches - the largest one is the 3456 port switch from Sun - used in
the #4 system on the Top500 (TACC).
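
The cable trade-off can be put into numbers with standard fat-tree
arithmetic. The sketch below assumes a non-blocking three-level fat
tree built only from identical k-port switches; the thread does not say
which topology those vendors actually use, and the function name is
hypothetical.

```python
def three_level_fat_tree(k):
    """Size of a non-blocking three-level fat tree of k-port switches.

    Assumption (not from the thread): the classic k-ary fat tree, where
    half of each edge and aggregation switch's ports face down and half
    face up.
    """
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even port count")
    hosts = k ** 3 // 4
    return {
        "hosts": hosts,
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": k * k // 4,
        "switches_total": 5 * k * k // 4,
        # One cable per host, plus one full set of links between each
        # pair of adjacent switch levels.
        "host_cables": hosts,
        "inter_switch_cables": 2 * hosts,
    }


if __name__ == "__main__":
    for name, value in three_level_fat_tree(24).items():
        print(f"{name}: {value}")
    # hosts: 3456, switches_total: 720, inter_switch_cables: 6912
```

With 24-port switches this gives 3456 hosts, 720 switches and 6912
switch-to-switch cables under the stated assumption. Those
switch-to-switch links are the ones that a single enclosure moves onto
the backplane.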
