[Beowulf] Infiniband: beyond 24 ports

Joe Landman landman at scalableinformatics.com
Mon Aug 25 15:38:50 PDT 2008

Gus Correa wrote:
> Hello Rocks network experts
> Consider a cluster with 24 compute nodes, one head node, and one storage 
> node.
> Imagine that one wants to install Infiniband (IB) and use it for MPI 
> and/or
> for NFS or parallel file system services.
> IB switches larger than 24 ports are said to be significantly more 
> expensive than the 24-port ones.
> Questions:
> 1) What is the cost-effective yet efficient way to connect this cluster 
> with IB?

Understand that the most cost-effective option is not necessarily the
highest-performance one.  This is where oversubscription comes in.

> 2) How many switches are required, and of which size?

With the new 36-port switch chips from Mellanox, you should need only 1.
For a reasonable oversubscription with 24-port switches (16 ports down to
nodes, 8 up to the core), you would need 3 switches ... 1 master and 2
leaf switches.
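As a sanity check on those counts, here is a small sketch of the port
arithmetic (the 26-node total and the 2:1 ratio are assumptions matching
the cluster described in the question):

```python
# Port arithmetic for a two-level (leaf + core) IB fabric built from
# 24-port switches at 2:1 oversubscription: 16 downlinks, 8 uplinks per leaf.
import math

nodes = 26   # assumed: 24 compute + 1 head + 1 storage
ports = 24   # ports per switch
down = 16    # node-facing ports per leaf switch
up = 8       # uplinks per leaf switch

assert down + up == ports
oversub = down / up                   # 2.0, i.e. "2:1" oversubscription
leaves = math.ceil(nodes / down)      # leaf switches needed
uplinks = leaves * up                 # total uplinks into the core
cores = math.ceil(uplinks / ports)    # core switches needed

print(f"oversubscription {oversub:.0f}:1, "
      f"{leaves} leaf + {cores} core = {leaves + cores} switches")
# -> oversubscription 2:1, 2 leaf + 1 core = 3 switches
```

The 16 uplinks (2 leaves x 8) fit comfortably in a single 24-port core
switch, which is where the "3 switches" figure above comes from.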

> 3) How should these switches be connected to the nodes and to each 
> other, which topology?

Hard to draw in ascii art.  Basically a two-level tree: each leaf switch
uses its downlinks (16 here) to connect nodes and its uplinks (8 here) to
the core switch, and the core switch connects only to leaf uplinks.

> 4) Does the same principle and topology apply to Ethernet switches?

Sort of, though with Ethernet switches there is usually less stress on
oversubscription of links.  If you are building a gigabit MPI cluster,
you really want the network as flat as possible.  Daisy-chaining is
fine for offices, but it is a bad idea for MPI networks.

> If anyone has a pointer to an article or a link to web page that 
> explains this,
> just send it to me please, don't bother to answer the questions.
> My (in)experience is limited to small clusters with a single switch,
> but hopefully the information will help other folks in the same situation.
> I saw a 24+1-node IB cluster with the characteristics above -
> except that the head node seems to double as storage node.
> The cluster has *four* 24-port IB switches.  One switch has 24 ports 
> connected, two others have 16 ports connected, and the last one has 17 
> ports connected.
> Hard to figure out the topology just looking at the connectors and the 
> tightly bundled cables.
> In my naive thoughts the job could be done with two switches only.

You could, if you don't really care about bandwidth and oversubscription.

Since these networks are designed for high performance, it makes sense to
run them at full speed, oversubscribing only if you must, and only by the
amount you need.  Extra contention usually means timing jitter, delays,
and slower runs.

If your storage node can handle multiple IB links in, that might not be a
bad idea in some cases.  If you are looking to use the high speed net for
storage, be aware that 2.6.25 and later kernels contain support for NFS
over RDMA (needed on both client and server).  We have test kernels we
are using with JackRabbit for this.  Over SDR IB, we see ~460 MB/s for a
link that gets ~750 MB/s using the ib_rdma_bw tool.  Compare this to NFS
over IPoIB, which gets about 250 MB/s or so.  Other modalities for high
speed storage are possible.
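For reference, wiring up NFS/RDMA on those 2.6.25+ kernels looks roughly
like this (module names, /proc path, and port 20049 follow the kernel's
nfs-rdma documentation; the server name and export paths are
placeholders):

```shell
# Server side: load the NFS/RDMA server transport and tell nfsd to
# listen for RDMA connections on port 20049.
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# Client side: load the client transport and mount with -o rdma.
# (server:/export and /mnt/scratch are placeholder names)
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 server:/export /mnt/scratch
```

This is a configuration sketch, not a benchmark recipe; both ends also
need working IB interfaces and the usual NFS exports in place.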


> Thank you
> Gus Correa

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
