[Beowulf] Infiniband modular switches

Gilad Shainer Shainer at mellanox.com
Sun Jun 15 09:08:20 PDT 2008

> > latency difference here matters to many codes).  Perhaps of more
> > significance, though, is that you can use oversubscription to lower
> > the cost of your fabric.  Instead of connecting 12 ports of a leaf
> > switch to nodes and using the other 12 ports as uplinks, you might
> > get away with 18 nodes and 6 uplinks, or 20 nodes and 4 uplinks.
> > As core counts are increasing, this is becoming more and more
> > viable for some applications.
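
To put numbers on the leaf splits quoted above, here is a rough Python
sketch. It assumes a 24-port leaf switch with every port running at the
same link rate; the function names are made up for illustration, and the
"share" figure is only the worst case where every node on the leaf sends
off-leaf at once.

```python
from fractions import Fraction

def oversubscription(node_ports: int, uplink_ports: int) -> Fraction:
    """Ratio of node-facing bandwidth to uplink bandwidth on one leaf switch."""
    if uplink_ports <= 0:
        raise ValueError("a leaf switch needs at least one uplink")
    return Fraction(node_ports, uplink_ports)

# The three splits of a 24-port leaf switch mentioned in the quoted text.
SPLITS = [(12, 12), (18, 6), (20, 4)]

def describe(splits=SPLITS):
    """Return (nodes, uplinks, ratio, worst-case share of link rate) per split."""
    rows = []
    for nodes, uplinks in splits:
        ratio = oversubscription(nodes, uplinks)
        # Share of link rate each node gets if all nodes send off-leaf at once.
        share = 1 / ratio
        rows.append((nodes, uplinks, ratio, share))
    return rows

for nodes, uplinks, ratio, share in describe():
    print(f"{nodes} nodes / {uplinks} uplinks -> "
          f"{ratio.numerator}:{ratio.denominator} oversubscribed, "
          f"{float(share):.0%} of link rate per node off-leaf")
```

So 18/6 is 3:1 and 20/4 is 5:1. Whether that hurts depends on how much
of the traffic actually leaves the leaf switch.
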
> It's important to note that the "full-bisection" touted by vendors
> is on paper only. In reality, static routing provides full-bisection
> for a very small subset of patterns, the average effective bisection
> on a diameter-3 Clos is ~40% of link rate (adaptive routing improves
> that a lot, but breaks packet order on the wire which is a
> requirement for some network protocols).

Static routing is the best approach if your pattern is known. In other
cases it depends on the application. LANL and Mellanox presented a paper
at the last ISC on static routing and how to get the most out of it.
There are cases where adaptive routing will show a benefit, and this is
why we see the IB vendors adding adaptive routing support as well. But in
general, the average effective bandwidth is much, much higher than the
40% you

> In practice, "paper" full-bisection is near free when using a 
> single enclosure, since all spine cables are on the 
> backplane. For larger networks, where you have to pay for 
> real cables to the spine level, then it may make sense to be 
> oversubscribed if the effective bisection is already bad 
> (static routing), or if your collective communication on 
> large jobs are not bandwidth bounded. However, the latter is
> often false on many-cores.

Some vendors use only 24-port switches to build very large clusters,
3000 nodes and above, without any oversubscription, and they find it
more cost effective. Using a single enclosure is easier, but the cables
are not expensive and you can use the smaller components. I used 24-port
switches to connect my 96-node cluster. I will replace my current setup
with the new 36-port InfiniBand switches this month, since they provide
lower latency and adaptive routing capabilities. And if you are
bandwidth bound, using IB QDR will help. You will be able to drive more
than 3 GB/s from each server.
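
For a feel of how far fixed-radix switches scale without
oversubscription, here is a small Python sketch. It assumes a textbook
folded-Clos (fat-tree) layout in which every switch below the top level
splits its ports evenly between down and up; the function names are
hypothetical, and real deployments may be cabled differently.

```python
def max_nodes(radix: int, levels: int) -> int:
    """Largest non-blocking folded-Clos built from switches with `radix` ports.

    Switches below the top level use half their ports downward and half
    upward; top-level switches use all of their ports downward.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return radix * (radix // 2) ** (levels - 1)

def levels_needed(radix: int, nodes: int) -> int:
    """Fewest switch levels that can host `nodes` without oversubscription."""
    levels = 1
    while max_nodes(radix, levels) < nodes:
        levels += 1
    return levels

for radix in (24, 36):
    for levels in (1, 2, 3):
        print(f"{radix}-port switches, {levels} level(s): "
              f"up to {max_nodes(radix, levels)} nodes")
```

Under those assumptions, three levels of 24-port switches top out at
3456 nodes, which is in line with the 3000-node clusters mentioned
above, and a 96-node cluster fits in two levels with either switch size.
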
