[Beowulf] RDMA and MPI Matching [Was Re: Re: Beowulf Digest, Vol 37, Issue 58]

Mon Apr 16 22:44:41 PDT 2007

On Thu, 12 Apr 2007, Håkon Bugge wrote:

> The dataset is fixed, elapsed time includes 
> initialization, write of animation files and 
> more. Hence, slower per node performance would
> _scale_ better.

My comparison and measured scalability are based on each
configuration's speedup relative to its own 2p performance.  Both show
the same *relative* speedup until 32p, where one of the two
configurations no longer matches the other in relative scalability.
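
To make the normalization concrete, here is a minimal Python sketch of
the relative-speedup calculation.  The timings are made-up placeholders
(not measurements from either configuration), and the function name is
only for illustration.

```python
def relative_speedup(elapsed_by_procs, base_procs=2):
    """Speedup of each run relative to the same configuration's base run."""
    base = elapsed_by_procs[base_procs]
    return {p: base / t for p, t in sorted(elapsed_by_procs.items())}


# Hypothetical elapsed times in seconds, keyed by process count.
# config_b is twice as slow as config_a at every process count.
config_a = {2: 1000.0, 4: 500.0, 8: 250.0}
config_b = {2: 2000.0, 4: 1000.0, 8: 500.0}

speedup_a = relative_speedup(config_a)
speedup_b = relative_speedup(config_b)
```

With these placeholder numbers the two configurations have identical
relative speedup curves even though one is twice as slow in absolute
terms, which is all a relative comparison can say.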

> what I have shown is that an RDMA interconnect 
> performs faster than a message passing 
> interconnect which has roughly  3x lower latency 
> and 20x (?) higher message rate upto a scaling 
> point where the RDMA _implementation_ collapses. 

I don't know about the 3x/20x numbers.  I can tell you that in the
ls-dyna message profiles that I've looked at (for 4p to 32p), the
application is dominated by large messages with the neon_reference
dataset, so latency and per-message overhead are not likely to be
important performance determinants.
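
A rough way to see why: under the usual latency-plus-bandwidth cost
model, the fixed per-message cost is a shrinking fraction of the total
as messages grow.  The sketch below assumes that simple model and uses
illustrative parameters (2 us latency, 1 GB/s bandwidth) that are not
measurements of any particular interconnect.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Simple cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def latency_fraction(nbytes, latency_s, bandwidth_bytes_per_s):
    """Share of the modeled transfer time that is due to latency alone."""
    return latency_s / transfer_time(nbytes, latency_s, bandwidth_bytes_per_s)


# Assumed, illustrative parameters.
LATENCY_S = 2e-6   # 2 microseconds
BANDWIDTH = 1e9    # 1 GB/s

small = latency_fraction(64, LATENCY_S, BANDWIDTH)       # 64-byte message
large = latency_fraction(1 << 20, LATENCY_S, BANDWIDTH)  # 1 MiB message
```

Under these assumptions latency is most of the cost of the 64-byte
message and well under one percent of the cost of the 1 MiB message.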

> And this _despite_ the fact the RDMA based MPI 
> has to perform the MPI message matching.

I wouldn't overstate the cost of the matching like that.  The fact that
an MPI implementation employs RDMA to send MPI envelopes makes the
matching cost apparent to that implementation, but everyone
implementing MPI has to pay the non-zero cost of message matching
somewhere. 

> I doubt you're missing anything;-) But let me 
> stress that as the number of cores per node 
> scale, a message passing semantics HCA with 
> message matching in the HCA will have a constant 
> message matching rate. An RDMA based MPI which 
> uses the cores for message matching, the message 
> matching rate would be almost proportional to the number of
> cores...

Your point brings up a few interesting questions, but I'd further
contribute to it by separating interface from implementation.  Since
RDMA is really a pure form of low-level data movement with very
little implied control, there's no specification as to how to do
the message matching.  With what most people agree to call RDMA,
the matching has to be done as a separate operation once the data
movement has happened.  A message matching interface implies
elements of data movement and control, and "matching in the NIC" is
just an implementation of one of these control operations.  However,
a message matching interface is broader in its specification: the
message matching can happen on either side of the PCI bus.
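
For readers less familiar with the control operation in question, here
is a minimal Python sketch of MPI-style matching as a step separate
from data movement.  It assumes the usual two-queue arrangement (posted
receives and unexpected messages) and matches on communicator, source
and tag; the class and constant names are hypothetical and do not come
from any particular MPI implementation.

```python
from collections import namedtuple

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE
ANY_TAG = -1     # stand-in for MPI_ANY_TAG

Envelope = namedtuple("Envelope", "comm source tag")


class MatchEngine:
    def __init__(self):
        self.posted = []      # receives waiting for a message, in post order
        self.unexpected = []  # envelopes that arrived before a matching receive

    @staticmethod
    def _matches(recv, msg):
        return (recv.comm == msg.comm
                and recv.source in (ANY_SOURCE, msg.source)
                and recv.tag in (ANY_TAG, msg.tag))

    def post_recv(self, recv):
        """Post a receive; return the earliest matching unexpected envelope, or None."""
        for i, msg in enumerate(self.unexpected):
            if self._matches(recv, msg):
                return self.unexpected.pop(i)
        self.posted.append(recv)
        return None

    def arrive(self, msg):
        """Handle an arriving envelope; return the earliest matching posted receive, or None."""
        for i, recv in enumerate(self.posted):
            if self._matches(recv, msg):
                return self.posted.pop(i)
        self.unexpected.append(msg)
        return None


engine = MatchEngine()
engine.post_recv(Envelope(comm=0, source=ANY_SOURCE, tag=7))
engine.post_recv(Envelope(comm=0, source=3, tag=7))
first = engine.arrive(Envelope(comm=0, source=3, tag=7))   # claims the wildcard receive
second = engine.arrive(Envelope(comm=0, source=3, tag=7))  # claims the source-3 receive
```

Nothing in this logic says where it runs: the same list walk could
live in host software or in NIC firmware, which is the
interface-versus-implementation distinction above.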

To my knowledge, only a fraction of interconnects with "message
matching" APIs do the message matching on "the I/O side" of the PCI
bus.  I'd be interested in hearing what their take is on pursuing
matching in the NIC in the face of an increasing number of cores per
node.  Matching in the NIC can be extremely painful to implement:
there are memory constraints for potentially long match lists
(although those long lists are rare), and MPI_ANY_SOURCE turns the
match lists into serialization points between shared-memory and
interconnect communication (more complexity and synchronization over
the PCI bus, etc.).  I would have said that matching in the NIC was a
clear win a few years ago but now that processing cores are a-plenty
and that the NIC has become a serialization point for more of these
cores, the design space has changed considerably.
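
The MPI_ANY_SOURCE problem can be sketched in a few lines of Python.
This is only a toy model under stated assumptions: the two "transports"
are threads, and a Python lock stands in for whatever synchronization a
real design would need across the PCI bus.  All names are hypothetical.

```python
import threading

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE


class SharedPostedList:
    """One posted-receive list that every transport must consult."""

    def __init__(self):
        self._lock = threading.Lock()
        self._posted = []  # (source, tag) pairs in post order

    def post(self, source, tag):
        with self._lock:
            self._posted.append((source, tag))

    def try_match(self, source, tag):
        """Atomically claim the first posted receive matching (source, tag)."""
        with self._lock:
            for i, (want_src, want_tag) in enumerate(self._posted):
                if want_tag == tag and want_src in (ANY_SOURCE, source):
                    del self._posted[i]
                    return True
        return False


posted = SharedPostedList()
posted.post(ANY_SOURCE, tag=7)  # a single wildcard receive

claims = {}


def deliver(transport, source):
    # Each transport tries to claim a receive for its incoming message.
    claims[transport] = posted.try_match(source, tag=7)


threads = [
    threading.Thread(target=deliver, args=("shared_memory", 1)),
    threading.Thread(target=deliver, args=("interconnect", 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the wildcard receive may be satisfied by either path, both have
to go through the same list and exactly one of them wins; which one is
not determined in this sketch.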

    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)