[Beowulf] Re: Re: Home beowulf - NIC latencies

Mon Feb 14 22:20:52 PST 2005

Hi Rossen,

Rossen Dimitrov wrote:
> Of course, there is always the case of running the actual application
> code and then evaluating the MPI performance by seeing which MPI library
> (or library mode) makes the application run faster. Unfortunately, this
> method for evaluating MPI often suffers from various inefficiencies some
> of which originate from the parallel algorithm developers, who throughout
> the years have sometimes adopted the most trivial ways of using MPI.

So if you run an MPI application and it sucks, it is because the 
application is poorly written?

You don't want to benchmark an application to evaluate MPI, you want to 
benchmark an application to find the best set of resources to get the 
job done. If the code stinks, it's not an excuse. Good MPI 
implementations are good with poorly written applications, but still let 
smart people do smart things if they want.

> these in one way or another depend on CPU processing. Also, today's 
> processor architectures have many independent processing units and 
> complex memory hierarchies. When the MPI library polls for completion of 
> a communication request, most of this specialized hardware is virtually 
> unused (wasted). The processor architecture trends indicate that this 
> kind of internal CPU concurrency will continue to increase, thus making 
> the cost of MPI polling even higher.

When you poll, you have nothing else to do: you are stuck in a Wait or 
in a blocking call (collectives, for example). Why do you care about the 
lost cycles? The only way to rescue them would be to oversubscribe your 
processor and hope that the cycles you recycle (no pun intended) are 
worth the context switches and the associated cache thrashing. I would 
argue that polling should be the cheapest MPI operation ever (if 
nothing is found). This is the case for most half-decent MPI 
implementations.

> In this regard, a parallel application developer might actually very
> much care what is actually happening in the MPI library even when he 
> makes a call to MPI_Send. If he doesn't, he probably should.

He absolutely should not. It's one thing to work around clueless 
developers, but it's way more difficult to work around someone who 
assumes wrong things about the MPI implementation.

> - What application algorithm developers experience when they attempt to
> use the ever so nebulous "overlapping" with a polling MPI library and

Overlapping is completely orthogonal to polling. Overlapping means that 
you split the communication initiation from the communication 
completion. Polling means that you test for completion instead of 
waiting for completion. You can perfectly well overlap and check for 
completion of the asynchronous requests by polling; nothing wrong with that.

> how this experience has contributed to the overwhelming use of
> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
> (even better) persistent MPI calls, thus killing any hope that these
> codes can run faster on systems that actually facilitate overlapping.

There are two reasons why developers use blocking operations rather than 
non-blocking ones:
1) they don't know about non-blocking operations.
2) MPI_Send() is shorter than MPI_Isend().

Looking for overlapping is actually not that hard:
a) look for medium/large messages, don't waste time on small ones.
b) replace each MPI_Send() with an MPI_Isend() + MPI_Wait() pair.
c) move the MPI_Isend() as early as possible (as soon as the data is ready).
d) move the MPI_Wait() as late as possible (just before the buffer is 
needed).
e) do the same for receives.

Most of the time, that would speed up things quite a bit, or not change 
anything. I am still looking for some tuning tool to do that 
automatically though.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com