[Beowulf] Re: Re: Home beowulf - NIC latencies

Tue Feb 15 07:41:32 PST 2005

> 
> So if you run an MPI application and it sucks, this is because the 
> application is poorly written ?

Patrick, here the argument is about whether and how you "measure" the 
"performance of MPI". I guess you may have missed some of the preceding 
postings.

> 
> You don't want to benchmark an application to evaluate MPI, you want to 
> benchmark an application to find the best set of resources to get the 
> job done. If the code stinks, it's not an excuse. Good MPI 
> implementations are good with poorly written applications, but still let 
> smart people do smart things if they want.

This is exactly my point made in my previous posting - you cannot design 
a system that is optimal in a single mode for all cases of its use when 
there are multiple parameters defining the usage and performance 
evaluation spaces. And this is the reason why we provide both {polling 
synchronization/polling progress} and {interrupt-driven 
synchronization/independent progress} MPI modes (we have published 
papers defining a space based on MPI design choices). With these modes 
we can at least increase the chance that the user can get a better match 
to his scenario.

>> - What application algorithm developers experience when they attempt to
>> use the ever so nebulous "overlapping" with a polling MPI library and
> 
> Overlapping is completely orthogonal with polling. Overlapping means that 
> you split the communication initiation from the communication 
> completion. Polling means that you test for completion instead of waiting 
> for completion. You can perfectly overlap and check for completion of 
> the asynchronous requests by polling, nothing wrong with that.

Well, I would have to say that I don't agree with this. First, 
I think it is fairly easy to show that overlapping and polling (or any 
kind of communication completion synchronization) are not orthogonal. If 
they were, you would see codes that show perfect overlapping 
running on any MPI implementation/network pair. I am sure there is 
plenty of evidence that this is not the case.

There is an important point here that needs to be clarified: when I say 
"polling" library, I assume that this library does both polling 
completion synchronization and polling progress. There is not much room 
here to define these terms, but I am sure MPI developers know what they are.

If polling and overlapping were orthogonal, all of the following would 
have to be true:
1. You have a perfect network engine that takes no resources that might 
be used by computation, whether you push bytes out or poll for completion.
2. Once you start a request (e.g., MPI_Isend), the execution of this 
communication request takes no CPU.
3. You have a very cheap polling operation of bounded duration that 
returns immediately after it checks your particular communication request.
4. You have something else to do when the poll reports that your request 
is not done.

I would argue that none of these are true in practical scenarios, even 
including very smart polling schemes or networks with DMA engines, like 
Myrinet.
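
To make the argument concrete, here is a toy cost model in Python. It is
only a sketch under my own simplifying assumptions, not a measurement of
any MPI library: the function names, the host_fraction parameter (the
share of the transfer that needs the host CPU), and the per-poll cost are
all hypothetical. It assumes the remainder of the transfer proceeds in
the background while the host computes.

```python
def overlapped_time(compute, transfer, host_fraction=0.0, polls=0, poll_cost=0.0):
    """Elapsed time for one compute step overlapped with one transfer.

    Toy assumptions: host_fraction of the transfer time consumes host CPU
    (violating conditions 1 and 2 when it is > 0), and each poll costs
    poll_cost of host CPU (violating condition 3 when it is > 0).
    """
    host_busy = compute + host_fraction * transfer + polls * poll_cost
    # The step ends when both the host work and the transfer have finished.
    return max(host_busy, transfer)


def overlap_gain(compute, transfer, host_fraction=0.0, polls=0, poll_cost=0.0):
    """Time saved relative to a blocking send (compute, then transfer)."""
    blocking = compute + transfer
    return blocking - overlapped_time(compute, transfer, host_fraction, polls, poll_cost)
```

With all four conditions met (host_fraction=0 and free polling), the gain
equals the full transfer time. As soon as the transfer or the polling
eats host CPU, the gain shrinks, and in this model it can reach zero.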

I am not even bringing up multithreaded applications here. These 
are still a fairly small minority.

> 
>> how this experience has contributed to the overwhelming use of
>> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
>> (even better) persistent MPI calls, thus killing any hope that these
>> codes can run faster on systems that actually facilitate overlapping.
> 
> There are 2 reasons why developers use blocking operations rather than 
> non-blocking ones:
> 1) they don't know about non-blocking operations.
> 2) MPI_Send is shorter than MPI_Isend().

Here is a third one: writing your code for overlapping with non-blocking 
MPI calls and segmentation/pipelining, testing the code, and not seeing 
any benefit from it.

> 
> 
> Looking for overlapping is actually not that hard:
> a) look for medium/large messages, don't waste time on small ones.
> b) replace all MPI_Send() by a pair MPI_Isend() + MPI_Wait()
> c) move the MPI_Isend() as early as possible (as soon as data is ready).
> d) move the MPI_Wait() as late as possible (just before the buffer is 
> needed).
> e) do same for receive.

Not quite. Most of the time the message-passing segment of the code you 
optimize for overlapping is in the innermost loop of the algorithm - the 
one that is most overhead sensitive and usually most optimized. It is 
rare that you can "pull" the send much earlier or push the MPI_Wait much 
later than where the original MPI_Send is. So what you usually end up 
doing is introducing another loop inside the innermost one, breaking up 
the MPI_Send message into a number of segments and pipelining them with 
MPI_Isend (or even better MPI_Start) by initiating segment i+1 while 
computing with segment i, thus attempting to overlap computation in 
stage i with communication in stage i+1. Then there is the question of 
how many segments to break the message into for maximum speedup. 
Pipelining theory says the more stages the better, provided that they 
are of equal duration, there are no inter-stage dependencies, and the 
stage setup time is low in proportion to the stage execution time. Also, 
the size of the segments should be such that the transmission time (not 
the whole latency) of a segment is as close as possible to the time of 
the computation performed on that segment. I could continue with other 
factors that one needs to take into account in order to write a good 
algorithm with overlapping.
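
The segment-count trade-off can be sketched with a small Python model.
This is a back-of-the-envelope illustration under my own assumptions, not
a model of any particular MPI library or network: a two-stage pipeline
(transfer, then compute), equal-size segments, no inter-stage
dependencies, and a fixed per-segment setup cost. All names and
parameters are hypothetical, and all times are in the same arbitrary
unit.

```python
def pipelined_time(compute, transfer, setup, segments):
    """Estimated elapsed time when a message is split into equal segments.

    compute  : total computation time over the whole message
    transfer : total transmission time of the whole message
    setup    : fixed cost paid once per segment (assumed, e.g. request startup)
    """
    if segments < 1:
        raise ValueError("segments must be >= 1")
    comm_stage = transfer / segments + setup   # one segment's transfer plus setup
    comp_stage = compute / segments            # computation on one segment
    bottleneck = max(comm_stage, comp_stage)
    # The first segment fills the pipeline; every further segment adds
    # one pass through the slower of the two stages.
    return comm_stage + comp_stage + (segments - 1) * bottleneck


def best_segment_count(compute, transfer, setup, max_segments=64):
    """Segment count in 1..max_segments with the smallest estimated time."""
    return min(range(1, max_segments + 1),
               key=lambda n: pipelined_time(compute, transfer, setup, n))
```

With zero setup cost the estimate keeps improving as the segment count
grows. With a non-zero setup cost there is an optimum: for example,
best_segment_count(100, 100, 4) picks 5 segments in this model, beyond
which the per-segment setup outweighs the extra overlap.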

The "degree of overlapping" metric I mentioned earlier, with some 
additional analysis, can help designers _predict_ whether a design is 
good and whether it will work well on a particular system 
of interest (including the MPI library).

This is too much detail for this forum, though, as most of the 
postings here discuss much more practical issues :)

Rossen