[Beowulf] Infiniband: MPI and I/O?
bill at Princeton.EDU
Thu May 26 12:20:19 PDT 2011
Mark Hahn wrote:
>> Wondering if anyone out there is doing both I/O to storage as well as
>> MPI over the same IB fabric.
> I would say that is the norm. we certainly connect local storage
> (Lustre) to nodes via the same fabric as MPI. gigabit is completely
> inadequate for modern nodes, so the only alternatives would be 10G
> or a secondary IB fabric, both quite expensive propositions, no?
> I suppose if your cluster does nothing but IO-light serial/EP jobs,
> you might think differently.
Really? I'm surprised by that statement. Perhaps I'm just way behind
the curve. It is typical here to have local node storage, local
lustre/pvfs storage, local NFS storage, and global GPFS storage
running over the GigE network. Depending on I/O loads, users can pick
the storage layer that fits. Yes, users fill the 1Gbps pipe to the
storage per node, but as we now build all new clusters with IB, I'm
hoping to increase that bandwidth considerably. If you and everyone
else are doing this already, that's a good sign! Lol! As we move closer
to making this happen, perhaps there will be plenty of answers for
any QOS setup questions I may have.
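For anyone finding this thread later: the subnet-manager side of that QOS setup comes down to a handful of knobs in opensm's config. A minimal sketch, assuming OpenSM manages the fabric, with MPI on SL0 and Lustre on SL1 — the VL weights are illustrative, not tuned recommendations:

```
# /etc/opensm/opensm.conf -- illustrative values, not recommendations
qos TRUE                 # enable QoS setup on the fabric
qos_max_vls 2            # two data VLs: VL0 for MPI, VL1 for Lustre
qos_high_limit 255       # packets served from the high-priority table per cycle
qos_vlarb_high 0:192     # VL0 (MPI) weighted in the high-priority arbitration table
qos_vlarb_low 1:64       # VL1 (Lustre) served from the low-priority table
qos_sl2vl 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0   # SL0->VL0, SL1->VL1
```

Traffic still has to get tagged with the right SL (via OpenSM's QoS policy file or the ULP's own settings), which is what the QOS section of the Mellanox guide mentioned above walks through.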
>> Following along in the Mellanox User's
>> Guide, I see a section on how to implement the QOS for both MPI and my
>> lustre storage. I am curious though as to what might happen to the
>> performance of the MPI traffic when high I/O loads are placed on the
>> [...]
> to me, the real question is whether your IB fabric is reasonably close
> to full-bisection (and/or whether your storage nodes are sensibly placed,
> [...]
>> In our current implementation, we are using blades which are 50%
>> blocking (2:1 oversubscribed) when moving from a 16 blade chassis to
>> other nodes. Would trying to do storage on top dictate moving to a
>> totally non-blocking fabric?
> how much inter-chassis MPI do you do? how much IO do you do?
> IB has a small MTU, so I don't really see why mixed traffic would
> be a big problem. of course, IB also doesn't do all that wonderfully
> with hotspots. but isn't this mostly an empirical question you can
> answer by direct measurement?
How would I make that direct measurement? I don't have the switching
infrastructure to compare a 2:1 fabric against a 1:1, unless you're
talking about staying inside a chassis. And since my storage would
connect into that same switching infrastructure, how and what would I
compare?
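Before any measurement, the 2:1 oversubscription can at least be turned into a worst-case per-node budget on paper. A back-of-envelope sketch, assuming QDR-class links (~32 Gb/s data rate after encoding) and a 16-blade chassis with 8 uplinks — swap in the real constants for your hardware:

```python
# Back-of-envelope worst case: every blade in a chassis talks
# off-chassis at once, sharing the uplinks evenly.
# All constants below are assumptions, not measured values.

def per_node_uplink_gbps(link_gbps, blades, uplinks):
    """Aggregate uplink bandwidth split across all blades at once."""
    return link_gbps * uplinks / blades

QDR_GBPS = 32.0   # assumed QDR data rate (40 Gb/s signaling, 8b/10b)
BLADES = 16       # blades per chassis, from the post
UPLINKS = 8       # 2:1 oversubscribed: 16 blades behind 8 uplinks

budget = per_node_uplink_gbps(QDR_GBPS, BLADES, UPLINKS)
print(f"worst-case inter-chassis budget: {budget:.0f} Gb/s per blade")
```

Even that worst case is an order of magnitude more per node than the 1 Gb/s GigE pipe the storage rides on today, which is some intuition for why mixed traffic may not hurt as much as feared.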
Jobs are not guaranteed to run on a single chassis; the scheduler
tries to place them that way, but won't hold a job more than 10
minutes waiting. So there are lots of wide jobs running between
chassis, and some don't even fit on a single chassis. As for how much
data, I don't have an answer. I know that a 10Gbps pipe to our central
storage hits 4Gbps from the cluster for sustained periods. I also know
that I can totally overwhelm a 10G-connected OSS, which is currently
I/O bound.
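Putting those observed numbers in fabric terms is a quick sanity check. A rough sketch — the 32 Gb/s link rate is an assumption about the IB generation, while the 4 Gb/s and 10 Gb/s figures come from the post:

```python
# Rough sanity check: how big is the observed storage load relative
# to the fabric?  Link rate is an assumed QDR data rate.

LINK_GBPS = 32.0              # assumed per-link IB data rate
SUSTAINED_STORAGE_GBPS = 4.0  # observed aggregate to central storage
OSS_LINK_GBPS = 10.0          # the 10G-connected OSS from the post

fabric_fraction = SUSTAINED_STORAGE_GBPS / LINK_GBPS
print(f"sustained storage load vs one IB link: {fabric_fraction:.1%}")

# Clients at today's 1 Gb/s each needed to fill the OSS link:
clients_to_saturate = OSS_LINK_GBPS / 1.0
print(f"GigE clients to fill the OSS link: {clients_to_saturate:.0f}")
```

On those assumptions the sustained storage traffic is a small fraction of a single IB link, which lines up with Mark's point that mixed traffic tends to be tolerable unless the fabric has hotspots.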
My question really was twofold: 1) is anyone doing this successfully and
2) does anyone have an idea of how loudly my users will scream when
their MPI jobs suddenly degrade. You've answered #1 and seem to
believe that for #2, no one will notice.
> regards, mark hahn.
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing