[Beowulf] Re: Re: Home beowulf - NIC latencies

Mon Feb 14 15:09:27 PST 2005

Hi Greg,

Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
> 
> 
>>Let me ask some stupid's question: which MPI implementations allow
>>really
>> 
>>a) to overlap MPI_Isend w/computations
>>and/or 
>>b) to perform a set of subsequent MPI_Isend calls faster than "the 
>>same" set of MPI_Send calls ?
>>
>>I say only about sending of large messages.
> 
> 
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had

Many believe you just need RDMA support to overlap com and comp, but 
it's not enough. Zero-copy is needed because the copy is obviously a 
waste of host CPU (along with cache trashing), but the real problem is 
matching. Ron did a lot of work in Portals to offload the matching, 
because it is a big synchronization point: if you send a message and you 
need the CPU on the receive side to find the appropriate receive buffer, 
you cannot tell the user that it can have the CPU between the time he 
posts the MPI_Irecv() and the time he checks on it with MPI_Wait(). What 
will happen is the matching occurs in the MPI_Wait() and overlap goes to 
the toilettes.

There are several ways to work around it:
1) You can have a thread on the receive side and wake it up with an 
interrupt. If you do that for all receives, then you add ~10 us in the 
critical path and the small message latency goes to the same place the 
overlap went before. This was what I believe the commercial MPI was 
doing at first.
2) If you can take decisions at the NIC level, you can receive small 
messages eagerly (with a copy) and fire an interrupt only for large 
messages (you want to steal some CPU cycles for matching). This is not 
bad, you steal (~5 us + cost of matching) worth of CPU cycles for large 
messages, that's not much for most people.
3) You can have the NIC doing the matching. Obviously the NIC is not as 
fast as the host CPU, so it's more expensive: you don't want to do that 
for small messages, it will hurt your latency. But you still has to do 
it for all messages to keep the matching order. One solution is to still 
receive small messages eagerly but match them in the shadow of the 
NIC->host DMA just to keep the list of posted receives consistent. For 
large messages, you match in the NIC in the critical path and you don't 
need the host CPU (assuming that the matched receive is in the small 
number that is kept on the NIC).

It's still not obvious if 3) is worth it, it's much more complex to 
implement and 5us per large receive is not that big. And you can reduce 
that overhead with MSIs (on PCIe, only the Alpha Marvel provided MSI on 
PCI-X, AFAIK).

There are more exotic work-arounds, like using 1) and polling at the 
same time, and hiding the interrupt overhead with some black magic on 
another processor. The one with the best potential would be to use 
HyperThreading on Intel chips to have a polling thread burning cycles 
continuously; it will run in-cache, won't use the FP unit or waste 
memory cycles. A perfect use for the otherwise useless HT feature. I 
wonder why nobody went that way...

> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.

That's the classical chicken-egg problem: Are people not trying to 
overlap in MPI because it is not implemented, or MPI implementations 
don't implement it because applications don't try to overlap ?

I think it's the later, too complicated for most. Do you know the 
story/joke about the Physicist and unexpected messages ?

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com