[Beowulf] Re: Re: Home beowulf - NIC latencies
patrick at myri.com
Mon Feb 14 15:09:27 PST 2005
Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
>>Let me ask some stupid's question: which MPI implementations allow
>>a) to overlap MPI_Isend w/computations
>>b) to perform a set of subsequent MPI_Isend calls faster than "the
>>same" set of MPI_Send calls ?
>>I say only about sending of large messages.
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had
Many believe you just need RDMA support to overlap com and comp, but
it's not enough. Zero-copy is needed because the copy is obviously a
waste of host CPU (along with cache trashing), but the real problem is
matching. Ron did a lot of work in Portals to offload the matching,
because it is a big synchronization point: if you send a message and you
need the CPU on the receive side to find the appropriate receive buffer,
you cannot tell the user that it can have the CPU between the time he
posts the MPI_Irecv() and the time he checks on it with MPI_Wait(). What
will happen is the matching occurs in the MPI_Wait() and overlap goes to
There are several ways to work around it:
1) You can have a thread on the receive side and wake it up with an
interrupt. If you do that for all receives, then you add ~10 us in the
critical path and the small message latency goes to the same place the
overlap went before. This was what I believe the commercial MPI was
doing at first.
2) If you can take decisions at the NIC level, you can receive small
messages eagerly (with a copy) and fire an interrupt only for large
messages (you want to steal some CPU cycles for matching). This is not
bad, you steal (~5 us + cost of matching) worth of CPU cycles for large
messages, that's not much for most people.
3) You can have the NIC doing the matching. Obviously the NIC is not as
fast as the host CPU, so it's more expensive: you don't want to do that
for small messages, it will hurt your latency. But you still has to do
it for all messages to keep the matching order. One solution is to still
receive small messages eagerly but match them in the shadow of the
NIC->host DMA just to keep the list of posted receives consistent. For
large messages, you match in the NIC in the critical path and you don't
need the host CPU (assuming that the matched receive is in the small
number that is kept on the NIC).
It's still not obvious if 3) is worth it, it's much more complex to
implement and 5us per large receive is not that big. And you can reduce
that overhead with MSIs (on PCIe, only the Alpha Marvel provided MSI on
There are more exotic work-arounds, like using 1) and polling at the
same time, and hiding the interrupt overhead with some black magic on
another processor. The one with the best potential would be to use
HyperThreading on Intel chips to have a polling thread burning cycles
continuously; it will run in-cache, won't use the FP unit or waste
memory cycles. A perfect use for the otherwise useless HT feature. I
wonder why nobody went that way...
> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.
That's the classical chicken-egg problem: Are people not trying to
overlap in MPI because it is not implemented, or MPI implementations
don't implement it because applications don't try to overlap ?
I think it's the later, too complicated for most. Do you know the
story/joke about the Physicist and unexpected messages ?
More information about the Beowulf