[Beowulf] Q: IB message rate & large core counts (per node)?

Greg Lindahl lindahl at pbm.com
Tue Feb 23 14:32:21 PST 2010


On Tue, Feb 23, 2010 at 04:57:23PM -0500, Mark Hahn wrote:

> in the interests of less personal/posturing/pissing, let me ask:
> where does the win from coalescing come from?  I would have thought
> that coalescing is mainly a way to reduce interrupts, a technique
> that's familiar from ethernet interrupt mitigation, NAPI, even basic disk 
> scheduling.

The coalescing we're talking about here is more like TCP's Nagle
algorithm: The sending side defers sending a packet so that it can
send a single larger packet instead of several small ones.
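
Roughly, the idea looks like this (a minimal sketch with made-up names,
not anyone's actual NIC or driver code):

  #include <stddef.h>
  #include <string.h>

  /* Nagle-style send-side coalescing: small messages to one destination
     are staged in a buffer and go out as a single larger packet, rather
     than each hitting the wire immediately. */
  struct coalesce_buf {
      int    dest;         /* destination node/rank */
      size_t used;         /* bytes staged so far */
      char   data[8192];   /* staging area */
  };

  /* hypothetical stub: would hand one big packet to the NIC;
     here it just resets the staging buffer */
  static void flush_to_wire(struct coalesce_buf *b) { b->used = 0; }

  static void coalesced_send(struct coalesce_buf *b, const void *msg, size_t len)
  {
      if (b->used + len > sizeof(b->data))
          flush_to_wire(b);
      memcpy(b->data + b->used, msg, len);
      b->used += len;
      /* a short timer also flushes a partly-full buffer; that wait is
         exactly the delay HPC codes dislike */
  }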

In HPC we mostly hate the Nagle algorithm, because it isn't
omniscient: it always delays our messages hoping a 2nd one to the same
target will show up, but in practice we rarely send a 2nd message that
could be combined with the first. People don't write much MPI code
that works like that; it's always better to do the combining yourself.
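
A trivial example of doing the combining yourself (made-up variables,
C bindings):

  #include <mpi.h>

  /* instead of two tiny sends to the same rank ... */
  void send_separate(double a, double b, int dest)
  {
      MPI_Send(&a, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
      MPI_Send(&b, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
  }

  /* ... pack them yourself and send once */
  void send_combined(double a, double b, int dest)
  {
      double buf[2] = { a, b };
      MPI_Send(buf, 2, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
  }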

> to me it looks like the key factor would be "propagation of desire" -  
> when the app sends a message and will do nothing until the reply,
> it probably doesn't make sense to coalesce that message.

Yes, that's one way to think about it.

> assuming MPI is the application-level interface, are there interesting
> issues related to knowing where to deliver messages?  I don't have a  
> good understanding about where things stand WRT things like QP usage
> (still N*N?  is N node count or process count?) or unexpected messages.

A traditional MPI implementation uses roughly N QPs in each of N
processes (N being the process count, not the node count), so the
global number of QPs is N^2. InfiniPath's pm library for MPI uses a
much smaller endpoint than a QP. Using a ton of QPs does slow down
things (hurts scaling), and that's why SRQ (shared receive queues) was
added to IB.  MVAPICH has several different ways it can handle
messages, configured (last I looked) at compile time: checking memory
for delivered messages for tiny clusters, ordinary QPs at medium size,
SRQ at large cluster sizes. The reason it switches is scalability;
SRQs scale better but are fairly expensive in the Mellanox silicon.
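
To put back-of-the-envelope numbers on it (mine, purely for
illustration): with 1024 nodes running 8 MPI processes each, N = 8192,
so a fully connected job wants about 8191 QPs per process, roughly
65,000 QPs per node, and N^2 ~= 67 million QPs across the cluster.
Even a few KB of dedicated receive buffers per QP turns into memory
and NIC-cache pressure you'd rather spend on the application; with an
SRQ the receive buffers are shared across all those connections, so
the per-peer cost largely goes away.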

Since latency/bandwidth benchmarks are generally run at only 2 nodes,
well, you can fill in the rest of this paragraph.

InfiniPath's pm library uses a lighter-weight thing that's somewhat
like an SRQ -- at all cluster sizes. This is why it scales so nicely.
It wasn't a novel invention -- the T3E MPI implementation used a
similar gizmo.

> now that I'm inventorying ignorance, I don't really understand why RDMA 
> always seems to be presented as a big hardware issue.  wouldn't it be 
> pretty easy to define an eth or IP-level protocol to do remote puts,
> gets, even test-and-set or reduce primitives, where the interrupt handler
> could twiddle registered blobs of user memory on the target side?

That approach is called Active Messages, and can be bolted on to
pretty much every messaging implementation. Doesn't OpenMX provide
that kind of interface?
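
The shape of it is easy to sketch (made-up names; I'm not claiming
this is OpenMX's actual interface):

  #include <stddef.h>
  #include <string.h>

  /* Active messages: each packet names a handler ID; the receive path
     (interrupt handler or polling loop) runs that handler against a
     registered region of user memory. */
  typedef void (*am_handler_t)(void *reg_mem, const void *payload, size_t len);

  static am_handler_t handlers[256];   /* handler table, indexed by ID */

  void am_register(int id, am_handler_t h) { handlers[id] = h; }

  /* called from the receive path for each arriving packet */
  void am_dispatch(void *reg_mem, int handler_id,
                   const void *payload, size_t len)
  {
      handlers[handler_id](reg_mem, payload, len);
  }

  /* e.g. a remote put: payload carries an offset followed by the data */
  void am_put_handler(void *reg_mem, const void *payload, size_t len)
  {
      size_t off;
      memcpy(&off, payload, sizeof(off));
      memcpy((char *)reg_mem + off,
             (const char *)payload + sizeof(off), len - sizeof(off));
  }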

The NoSQL distributed computing thingie we built for Blekko's search
engine uses active messages.

-- greg
