[Beowulf] Infiniband: MPI and I/O?

Mark Hahn hahn at mcmaster.ca
Thu May 26 13:13:07 PDT 2011

>>> Wondering if anyone out there is doing both I/O to storage as well as
>>> MPI over the same IB fabric.
>> I would say that is the norm.  we certainly connect local storage (Lustre) 
>> to nodes via the same fabric as MPI.  gigabit is completely
>> inadequate for modern nodes, so the only alternatives would be 10G
>> or a secondary IB fabric, both quite expensive propositions, no?
>> I suppose if your cluster does nothing but IO-light serial/EP jobs,
>> you might think differently.
> Really?  I'm surprised by that statement.  Perhaps I'm just way behind on the 
> curve though.  It is typical here to have local node storage, local 
> lustre/pvfs storage, local NFS storage, and global GPFS storage running over 
> the GigE network.

sure, we use Gb as well, but only as a crutch, since it's so slow.
or does each node have, say, a 4x bonded Gb for this traffic?

or are we disagreeing on whether Gb is "slow"?  80-ish MB/s seems pretty 
slow to me, considering that's less than any single disk on the market...
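
for scale, here is the back-of-envelope arithmetic behind that, as a rough sketch. the 80 MB/s figure is the one quoted above; the function names are made up for illustration, and the bonded number assumes perfect scaling across links, which bonding rarely delivers for a single stream.

```python
# Back-of-envelope GigE numbers. Function names are hypothetical.

OBSERVED_GIGE_MB_S = 80.0  # the "80-ish MB/s" figure from the post


def gige_line_rate_mb_s():
    """Raw gigabit line rate in MB/s: 1e9 bits/s divided by 8 bits per byte."""
    return 1e9 / 8 / 1e6


def bonded_mb_s(links, per_link=OBSERVED_GIGE_MB_S):
    """Best-case aggregate for N bonded links (assumes perfect scaling)."""
    return links * per_link


if __name__ == "__main__":
    print("GigE line rate:    %.0f MB/s" % gige_line_rate_mb_s())
    print("observed per link: %.0f MB/s" % OBSERVED_GIGE_MB_S)
    print("4x bonded, ideal:  %.0f MB/s" % bonded_mb_s(4))
```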

>> how much inter-chassis MPI do you do?  how much IO do you do?
>> IB has a small MTU, so I don't really see why mixed traffic would be a big 
>> problem.  of course, IB also doesn't do all that wonderfully
>> with hotspots.  but isn't this mostly an empirical question you can
>> answer by direct measurement?
> How would I measure by direct measurement?

I meant collecting the byte counters from NICs and/or switches
while real workloads are running.  that tells you the actual data rates,
and should show how close you are to creating hotspots.
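
a minimal sketch of that kind of sampling on a Linux node, assuming the HCA exposes its port counters under /sys/class/infiniband. the device name "mlx4_0", port 1, and the 10 second interval are placeholders, not anything from this thread. note that port_xmit_data and port_rcv_data count 4-byte words rather than bytes, and on older HCAs they may be narrow counters that fill up quickly, so keep the interval short.

```python
import time
from pathlib import Path

# port_xmit_data / port_rcv_data are in units of 4 bytes, not bytes.
IB_WORD_BYTES = 4


def read_counter(dev, port, name):
    """Read one port counter from sysfs (path layout assumed for Linux)."""
    path = (Path("/sys/class/infiniband") / dev / "ports" / str(port)
            / "counters" / name)
    return int(path.read_text())


def rate_mb_per_s(before, after, seconds):
    """Convert two counter samples into MB/s over the sampling interval."""
    return (after - before) * IB_WORD_BYTES / seconds / 1e6


def sample(dev="mlx4_0", port=1, interval=10.0):
    """Sample transmit and receive counters twice and return (tx, rx) in MB/s."""
    tx0 = read_counter(dev, port, "port_xmit_data")
    rx0 = read_counter(dev, port, "port_rcv_data")
    time.sleep(interval)
    tx1 = read_counter(dev, port, "port_xmit_data")
    rx1 = read_counter(dev, port, "port_rcv_data")
    return (rate_mb_per_s(tx0, tx1, interval),
            rate_mb_per_s(rx0, rx1, interval))


if __name__ == "__main__":
    tx, rx = sample()
    print("tx %.1f MB/s, rx %.1f MB/s" % (tx, rx))
```

run it on a few nodes (storage servers included) while real jobs are going, and compare the rates against what the link can carry.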

> My question really was twofold: 1) is anyone doing this successfully and 2) 
> does anyone have an idea of how loudly my users will scream when their MPI 
> jobs suddenly degrade.   You've answered #1 and seem to believe that for #2, 
> no one will notice.

we've always done it, though our main experience is with clusters that have 
full-bisection fabrics.  our two more recent clusters have half-bisection 
fabrics, but I suspect that most users are not looking closely enough at 
performance to notice and/or complain.
