[Beowulf] Re: Re: Home beowulf - NIC latencies
Rossen Dimitrov
rossen at VerariSoft.Com
Tue Feb 15 07:41:32 PST 2005
>
> So if you run an MPI application and it sucks, this is because the
> application is poorly written ?
Patrick, the argument here is about whether and how you "measure" the
"performance of MPI". I guess you may have missed some of the preceding
postings.
>
> You don't want to benchmark an application to evaluate MPI, you want to
> benchmark an application to find the best set of resources to get the
> job done. If the code stinks, it's not an excuse. Good MPI
> implementations are good with poorly written applications, but still let
> smart people do smart things if they want.
This is exactly the point I made in my previous posting - you cannot
design a system that is optimal in a single mode for all cases of its use
when multiple parameters define the usage and performance-evaluation
spaces. This is why we provide both {polling synchronization/polling
progress} and {interrupt-driven synchronization/independent progress} MPI
modes (we have published papers defining a design space based on such MPI
design choices). With these modes we can at least increase the chance
that the user gets a better match to his scenario.
>> - What application algorithm developers experience when they attempt to
>> use the ever so nebulous "overlapping" with a polling MPI library and
>
> Overlapping is completely orthogonal to polling. Overlapping means that
> you split the communication initiation from the communication
> completion. Polling means that you test for completion instead of
> waiting for completion. You can perfectly overlap and check for
> completion of the asynchronous requests by polling; nothing wrong with
> that.
Well, I would have to say that I don't agree with this. First, I think it
is fairly easy to show that overlapping and polling (or any kind of
communication completion synchronization) are not orthogonal. If they
were, you would see codes achieving perfect overlapping on any MPI
implementation/network pair. I am sure there is plenty of evidence that
this is not the case.
There is an important point that needs to be clarified here: when I say a
"polling" library, I mean a library that does both polling completion
synchronization and polling progress. There is not enough room here to
define these precisely, but I am sure MPI developers know what they are.
If polling and overlapping were orthogonal, the following would have to
be true:
1. You have a perfect network engine that consumes no resources that
could otherwise be used by computation, either when you push bytes out or
when you poll for completion.
2. Once you start a request (e.g., MPI_Isend), the execution of this
communication request takes no CPU.
3. You have a very cheap, bounded-duration polling operation that returns
immediately after it checks your particular communication request.
4. You have something else to do when the polling call returns indicating
that your request is not yet complete.
I would argue that none of these holds in practical scenarios, even with
very smart polling schemes or networks with DMA engines such as Myrinet.
The sketch below illustrates why.
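To make this concrete, here is a minimal sketch of what an overlap
attempt typically looks like against a polling library. The names
compute_chunk() and the buffer parameters are placeholders of mine, not
code from any particular application:

/* Sketch: attempted send/compute overlap with a polling MPI library.    */
#include <mpi.h>

void compute_chunk(double *work, int n);   /* hypothetical local kernel  */

void overlap_attempt(double *sendbuf, int count, int dest,
                     double *work, int chunks, int chunk_n)
{
    MPI_Request req;
    int done = 0;

    MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < chunks; i++) {
        compute_chunk(&work[i * chunk_n], chunk_n);

        /* With polling progress, these calls are what actually advance
         * the transfer; their cost is paid by the same CPU that runs
         * compute_chunk(), so conditions 1-4 above are violated.        */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
}

Note also that the loop only works because there happen to be more chunks
to compute whenever MPI_Test reports the request incomplete (condition
4); if there are none, the code degenerates into a busy-wait.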
Here I don't even bring up the case of multithreaded applications; these
are still a fairly small minority.
>
>> how this experience has contributed to the overwhelming use of
>> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
>> (even better) persistent MPI calls, thus killing any hope that these
>> codes can run faster on systems that actually facilitate overlapping.
>
> There are two reasons why developers use blocking operations rather
> than non-blocking ones:
> 1) they don't know about non-blocking operations.
> 2) MPI_Send is shorter than MPI_Isend().
Here is a third one: writing your code for overlapping with non-blocking
MPI calls and segmentation/pipelining, testing the code, and not seeing
any benefit from it.
>
>
> Looking for overlapping is actually not that hard:
> a) look for medium/large messages, don't waste time on small ones.
> b) replace all MPI_Send() by a pair MPI_Isend() + MPI_Wait()
> c) move the MPI_Isend() as early as possible (as soon as data is ready).
> d) move the MPI_Wait() as late as possible (just before the buffer is
> needed).
> e) do same for receive.
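In isolation, that recipe amounts to roughly the following sketch, where
prepare(), other_work(), use_buffer() and the buffer handling are
placeholder names of mine:

#include <mpi.h>

void prepare(double *buf, int count);      /* hypothetical: fills buf       */
void other_work(void);                     /* hypothetical: work not on buf */
void use_buffer(double *buf, int count);   /* hypothetical: reuses buf      */

void exchange(double *buf, int count, int dest)
{
    MPI_Request req;

    prepare(buf, count);                        /* data is ready here        */
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0,  /* c) initiate as early as   */
              MPI_COMM_WORLD, &req);            /*    possible               */

    other_work();                               /* anything not touching buf */

    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* d) complete just before   */
    use_buffer(buf, count);                     /*    the buffer is reused   */
}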
Not quite. Most of the time, the message-passing segment of the code you
optimize for overlapping sits in the innermost loop of the algorithm -
the one that is most overhead-sensitive and usually most heavily
optimized. You will not commonly see cases where you can "pull" the
MPI_Isend much earlier or push the MPI_Wait much later than where the
original MPI_Send was. So what you usually end up doing is introducing
another loop inside the innermost one, breaking the MPI_Send message up
into a number of segments and pipelining them with MPI_Isend (or, even
better, MPI_Start) by initiating segment I+1 while computing with segment
I, thus attempting to overlap computation in stage I with communication
in stage I+1.
Then there is the question of how many segments to break the message into
for maximum speedup. Pipelining theory says the more the better, provided
the stages have equal duration, there are no inter-stage dependencies,
and the stage setup time is small in proportion to the stage execution
time. Also, the size of a segment should be such that its transmission
time (not the whole latency) is as close as possible to the time of the
computation performed on it. I could continue with other factors that one
needs to take into account in order to write a good overlapping
algorithm.
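Schematically, one form of such a segment pipeline with persistent
requests looks like the sketch below (here each segment is produced and
its transfer overlapped with the computation of the next one). The
segment count, the tags, and compute_segment() are illustrative
assumptions of mine, and the matching receive side is omitted:

#include <mpi.h>

#define NSEG 8                                     /* assumed segment count */

void compute_segment(double *seg, int n);          /* hypothetical kernel   */

void pipelined_send(double *buf, int seg_n, int dest)
{
    MPI_Request req[NSEG];

    /* Persistent requests: the per-message setup is done once, outside
     * the inner loop, which is why MPI_Start is preferable here.          */
    for (int i = 0; i < NSEG; i++)
        MPI_Send_init(&buf[i * seg_n], seg_n, MPI_DOUBLE, dest,
                      i, MPI_COMM_WORLD, &req[i]);

    for (int i = 0; i < NSEG; i++) {
        compute_segment(&buf[i * seg_n], seg_n);   /* produce segment i     */
        MPI_Start(&req[i]);                        /* put it in flight      */
        /* The wait for segment i-1 comes only now, so its transfer
         * (ideally) overlapped the computation of segment i above.         */
        if (i > 0)
            MPI_Wait(&req[i - 1], MPI_STATUS_IGNORE);
    }
    MPI_Wait(&req[NSEG - 1], MPI_STATUS_IGNORE);

    for (int i = 0; i < NSEG; i++)
        MPI_Request_free(&req[i]);
}

Whether the computation of one segment really hides the transmission of
the neighboring one depends on everything above: the progress model, the
segment size, and the cost of each MPI_Start/MPI_Wait pair.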
The metric I mentioned earlier, "degree of overlapping", together with
some additional analysis, can help designers _predict_ whether a design
is good and whether it will work well on a particular system of interest
(including the MPI library).
This is, however, too much detail for this forum, as most of the postings
here discuss much more practical issues :)
Rossen