[Beowulf] Infiniband modular switches

Patrick Geoffray patrick at myri.com
Thu Jun 26 21:49:10 PDT 2008

Gilad Shainer wrote:
> Not only that I was there, but also had conversations afterwards. It is
> a really "fair" comparison when you have different injection
> rate/network capacity parameters. You can also take 10Mb and inject it
> into 10Gb/s network to show the same, and you always can create the
> network pattern to show what you want to show, but you prove nothing

The injection rate is irrelevant in these tests and the network pattern 
is well defined: *random* pairwise exchange. In both cases (IB and 
Quadrics in the slides), the fabric is full bisection, i.e. there are 
enough links in the network to support the aggregate traffic of all 
ports. The test consists of measuring the MPI bandwidth between random 
pairs of nodes simultaneously.

Logically, you would expect to reach the full bandwidth between all 
pairs, because there are enough links in the fabric to support this 
traffic. If you measure each pair independently, you will always get the 
link rate, no problem. However, if you measure them simultaneously, you 
will have contention: a few pairs may still reach full bandwidth but 
most will only get a fraction of it. You can measure the min, max and 
average of the bandwidth between these pairs for a large number of 
different pairs to evaluate the efficiency of the routing.

The link bandwidth (injection rate) is irrelevant because the results 
are normalized (efficiency). What the slides show is that the efficiency 
of Quadrics is better (the average bandwidth is higher despite a lower 
link bandwidth) and the bandwidth distribution is very narrow for 
Quadrics (spread between min and max pairwise bandwidth). This is a 
direct result of adaptive routing in Quadrics vs static routing in IB. 
Woven Systems reported similar results at Sandia using adaptive routing 
in Ethernet vs static routing in IB.
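
To make the normalization concrete, here is a minimal sketch of how the 
per-pair numbers could be reduced to efficiencies. The function name and 
the sample bandwidths are made up for illustration, they are not taken 
from the slides.

```python
def efficiency_stats(pair_bandwidths, link_rate):
    """Return (min, avg, max) per-pair bandwidth as a fraction of link rate.

    pair_bandwidths: bandwidths measured simultaneously, one per pair.
    link_rate: nominal link bandwidth, in the same unit.
    """
    if not pair_bandwidths:
        raise ValueError("need at least one measured pair")
    if link_rate <= 0:
        raise ValueError("link_rate must be positive")
    eff = [bw / link_rate for bw in pair_bandwidths]
    return min(eff), sum(eff) / len(eff), max(eff)


if __name__ == "__main__":
    # Hypothetical measurements in MB/s on a hypothetical 1000 MB/s link.
    lo, avg, hi = efficiency_stats([500, 1000, 750], link_rate=1000)
    print(f"min={lo:.2f} avg={avg:.2f} max={hi:.2f} spread={hi - lo:.2f}")
```

Because everything is divided by the link rate, two fabrics with 
different link speeds can be compared on the same scale: a higher 
average and a smaller spread mean better routing efficiency.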

With static routing, you can find *one* set of routes that will provide 
full bandwidth between all pairs for a given set of pairs. If you change 
the set of pairs without changing the set of routes, then you will get 
much less than full bandwidth. On average, if you measure with enough 
random sets of pairs, you will get an aggregate efficiency of ~40% with 
static routing, on several interconnects using full bisection topologies 
(Clos or Fat Tree), single virtual channel, wormhole switching and 
static routing. It has nothing to do with link rate, it is due to 
Head-of-Line (HOL) blocking: 
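
The contention under fixed routes can be illustrated with a toy model. 
This is only a sketch under stated assumptions: a two-level full 
bisection fat tree whose sizes are hypothetical, destination-mod-k 
routing as one possible static scheme, and every flow getting an equal 
share of its most loaded link. It does not model wormhole switching or 
how HOL blocking spreads congestion to other links, so it is likely 
optimistic and should not be expected to reproduce the ~40% figure.

```python
import random
from collections import Counter


def route(src, dst, hosts_per_leaf):
    """Static route: the spine is chosen from the destination only."""
    src_leaf, dst_leaf = src // hosts_per_leaf, dst // hosts_per_leaf
    if src_leaf == dst_leaf:
        return []  # stays inside one leaf switch, no shared link
    spine = dst % hosts_per_leaf  # assumed destination-mod-k scheme
    return [("up", src_leaf, spine), ("down", spine, dst_leaf)]


def simulate(num_leaves=16, hosts_per_leaf=16, trials=100, seed=0):
    """Return (min, avg, max) per-flow efficiency over random pairings."""
    rng = random.Random(seed)
    n = num_leaves * hosts_per_leaf
    if n % 2:
        raise ValueError("need an even number of hosts to form pairs")
    hosts = list(range(n))
    eff = []
    for _ in range(trials):
        rng.shuffle(hosts)  # new random set of pairs, same routes
        paths = []
        for i in range(0, n, 2):
            a, b = hosts[i], hosts[i + 1]
            paths.append(route(a, b, hosts_per_leaf))
            paths.append(route(b, a, hosts_per_leaf))
        load = Counter()
        for path in paths:
            load.update(path)  # number of flows crossing each link
        for path in paths:
            worst = max((load[link] for link in path), default=1)
            eff.append(1.0 / worst)  # crude equal-share bandwidth model
    return min(eff), sum(eff) / len(eff), max(eff)


if __name__ == "__main__":
    lo, avg, hi = simulate()
    print(f"min={lo:.2f} avg={avg:.2f} max={hi:.2f}")
```

Even in this simplified model, a few flows keep the full link rate while 
others share an uplink with several neighbours, which is the min/max 
spread described above.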

> here. I am not favor of static routing only or adaptive routing only,
> and having both options is the most flexible solution.

It's not as simple as that. If you have a cluster that will run multiple 
jobs, most likely at the same time, which routing do you use? If you 
use static routing, efficiency may be good for one job, and bad for 
another. Worse, the efficiency will change if I run the same job on 
different nodes, or depending on what other job is running at the same 
time on the cluster. If you use adaptive routing, efficiency will most 
likely be higher (maybe not by much) but, more important, it will be 
more deterministic. Determinism means less load imbalance, predictable 
time to completion, and higher job throughput.

So far, IB has only used static routing. If it still relies on packet 
order on the wire for a given Queue Pair, then the only way to do some 
sort of adaptive routing is to use a different QP for each possible 
route (LID). This is what Panda's group tried in a paper. However, the 
number of QPs explodes, each QP is still subject to HOL blocking, and 
the QP interleaving is static.
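
A back-of-the-envelope sketch of the two problems, QP count and static 
interleaving. It assumes one connected QP per peer per route and a 
plain round-robin over the routes; the function names and numbers are 
hypothetical and this is not the scheme from the paper.

```python
from itertools import cycle


def qp_count(num_peers, routes_per_peer):
    """QPs one process needs with one QP per (peer, route)."""
    return num_peers * routes_per_peer


def round_robin_schedule(routes, num_messages):
    """Static interleaving: the route sequence is fixed in advance
    and never reacts to congestion on any of the routes."""
    picker = cycle(routes)
    return [next(picker) for _ in range(num_messages)]


if __name__ == "__main__":
    # Hypothetical job: 1024 processes, 8 routes (LIDs) per destination.
    print(qp_count(num_peers=1023, routes_per_peer=8))
    print(round_robin_schedule(routes=[0, 1, 2], num_messages=5))
```

The QP count grows with the product of peers and routes, and the 
round-robin order is the same whether or not a given route is congested.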

>> You can see that the worst case static routing goes quickly 
>> below 40%, but the average eventually goes there as well.
> So what is your proof point here? I am sure you will find many cases
> that static routing will do better (definitely on other interconnects)
> and cases for adaptive routing. 

No, static routing is static routing, on all interconnects. There is no 
magic here, HOL blocking applies to everybody. My point is that under 
*random* structured patterns (such as pairwise exchange), static routing 
sucks. There are no other cases of random, it's just random.

If you want to argue that structured traffic patterns across multiple 
jobs running simultaneously on the same fabric are not equivalent to 
random structured traffic, then this will go nowhere.

