[Beowulf] Infiniband: MPI and I/O?
greg at keller.net
Thu May 26 15:30:52 PDT 2011
On 5/26/2011 4:23 PM, Mark Hahn wrote:
>> Agreed. Just finished telling another vendor, "It's not high speed
>> storage unless it has an IB/RDMA interface". They love that. Except
> what does RDMA have to do with anything? why would straight 10G ethernet
> not qualify? I suspect you're really saying that you want an efficient
> interface, as well as enough bandwidth, but that doesn't necessitate
RDMA over IB is definitely a nice feature. Not required, but IP over IB
has enough limits that we prefer to avoid it.
>> for some really edge cases, I can't imagine running IO over GbE for
>> anything more than trivial IO loads.
> well, it's a balance issue. if someone was using lots of Atom boards
> lashed into a cluster, 1Gb apiece might be pretty reasonable. but for
> fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all
> that generous.
> as an interesting case in point, SeaMicro was in the news again with an
> Atom system: either 64 GbE links or 16 10GbE links. the former (.128
> Gb/core) seems low even for Atoms, but .3 Gb/core might be reasonable.
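For reference, the per-core numbers above work out as below; the 512-core count is an assumption based on the SeaMicro SM10000 spec, so the quoted figures round slightly differently:

```python
# Back-of-envelope per-core uplink bandwidth for the SeaMicro box.
# 512 Atom cores is an assumed figure (SM10000 spec), not from the post.
cores = 512
total_gbe = 64 * 1.0      # 64 x 1 GbE uplinks, total Gb/s
total_10g = 16 * 10.0     # 16 x 10 GbE uplinks, total Gb/s

print(total_gbe / cores)  # -> 0.125 Gb/core
print(total_10g / cores)  # -> 0.3125 Gb/core
```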
>> I am curious if anyone is doing IO over IB to SRP targets or some
>> similar "block device" approach. The integration into the filesystem by
>> Lustre/GPFS and others may be the best way to go, but we are not 100%
>> convinced yet. Any stories to share?
> you mean you _like_ block storage? how do you make a shared FS namespace
> out of it, manage locking, etc?
Well, it's a use-case issue for us. You don't build a shared FS on the
block devices (well, maybe you could, just not in a scalable way)... but
we envision leasing block devices to customers with known
capacity/performance characteristics. Then the customer can decide
whether to use them as a CIFS/NFS backend through a single server,
possibly even lashed together via MD (Linux software RAID). They can
also lease multiple block devices and build a Lustre-style system on top.
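A rough sketch of the single-server variant: log in to the leased SRP targets over IB, stripe the resulting LUNs with MD, and export over NFS. All device names, mount points, and the IPoIB subnet below are illustrative assumptions, not values from our setup:

```shell
# Hypothetical sketch: attach leased SRP block devices, lash with MD, export via NFS.
modprobe ib_srp          # load the SRP initiator module
srp_daemon -o -e         # discover SRP targets on the fabric and log in once

# Suppose the two leased LUNs appear as /dev/sdb and /dev/sdc (example names):
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.xfs /dev/md0
mkdir -p /export/lease0
mount /dev/md0 /export/lease0

# Export to the customer's compute nodes (subnet is an example):
echo '/export/lease0 192.168.0.0/24(rw,async,no_root_squash)' >> /etc/exports
exportfs -ra
```

If the customer's front-end node later dies, any other server on the IB fabric can repeat the SRP login and reassemble the same MD array, which is the attach-anywhere property described below.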
The flexibility is that if they disappear and come back they may not get
the same compute/storage nodes, but they can attach any server to their
dedicated block storage devices. There are also some multi-tenancy
security requirements that are easier to satisfy definitively when the
customer has absolute control over a block device. So in this case, they
would semi-permanently lease the block devices, and then fire up
front-end storage nodes and compute nodes on an "as needed / as
available" basis anywhere in our compute farm. Effectively we get the
benefits of a massive Fibre Channel-style SAN over the IB infrastructure
we already have to every node. If we can get the performance and cost of
the block storage right, it will be compelling for some of our customers.
We are still prototyping how it would work and characterizing
performance options... but it's interesting to us.
> regards, mark hahn.