[Beowulf] Re: Re: Home beowulf - NIC latencies
patrick at myri.com
Mon Feb 14 22:20:52 PST 2005
Rossen Dimitrov wrote:
> Of course, there is always the case of running the actual application
> code and then evaluating the MPI performance by seeing which MPI library
> (or library mode) makes the application run faster. Unfortunately, this
> method for evaluating MPI often suffers from various inefficiencies, some
> of which originate from the parallel algorithm developers, who throughout
> the years have sometimes adopted the most trivial ways of using MPI.
So if you run an MPI application and it sucks, it's because the
application is poorly written?
You don't want to benchmark an application to evaluate MPI, you want to
benchmark an application to find the best set of resources to get the
job done. If the code stinks, it's not an excuse. Good MPI
implementations are good with poorly written applications, but still let
smart people do smart things if they want.
> these in one way or another depend on CPU processing. Also, today's
> processor architectures have many independent processing units and
> complex memory hierarchies. When the MPI library polls for completion of
> a communication request, most of this specialized hardware is virtually
> unused (wasted). The processor architecture trends indicate that this
> kind of internal CPU concurrency will continue to increase, thus making
> the cost of MPI polling even higher.
When you poll, you have nothing else to do: you are stuck in a Wait or
in a blocking call (collectives, for example). Why do you care about the
lost cycles? The only way to rescue them would be to oversubscribe your
processor, and hope that the cycles you recycle (no pun intended) are
worth the context switches and the associated cache thrashing. I would
argue that polling should be the cheapest MPI operation ever (if
nothing is found). This is the case in most half-decent MPI implementations.
> In this regard, a parallel application developer might actually very
> much care what is actually happening in the MPI library even when he
> makes a call to MPI_Send. If he doesn't, he probably should.
He absolutely should not. It's one thing to work around clueless
developers, but it's way more difficult to work around someone who
assumes wrong things about the MPI implementation.
> - What application algorithm developers experience when they attempt to
> use the ever so nebulous "overlapping" with a polling MPI library and
Overlapping is completely orthogonal to polling. Overlapping means that
you split the communication initiation from the communication
completion. Polling means that you test for completion instead of
waiting for completion. You can perfectly well overlap and then check for
completion of the asynchronous requests by polling; nothing wrong with that.
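To make the distinction concrete, here is a minimal sketch of overlap *with* polling completion: the send is initiated with MPI_Isend(), computation proceeds, and MPI_Test() polls for completion. This assumes an MPI environment (compile with mpicc, launch with mpirun); the do_some_work() helper is a hypothetical stand-in for useful computation.

```c
/* Sketch: overlap communication with computation, polling for completion.
 * do_some_work() is a hypothetical placeholder for one unit of computation. */
#include <mpi.h>

static void do_some_work(void)
{
    /* placeholder: some useful computation unrelated to the send buffer */
    volatile double x = 0.0;
    for (int i = 0; i < 1000; i++)
        x += (double)i;
}

void overlapped_send(const double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    /* Initiation: post the send; it may progress in the background. */
    MPI_Isend(buf, count, MPI_DOUBLE, dest, /*tag=*/0, comm, &req);

    /* Overlap: keep computing while polling (MPI_Test) for completion. */
    while (!done) {
        do_some_work();
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
}
```

The same structure with MPI_Wait() instead of the MPI_Test() loop would still overlap, but would block (or spin inside the library) once the computation runs out.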
> how this experience has contributed to the overwhelming use of
> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
> (even better) persistent MPI calls, thus killing any hope that these
> codes can run faster on systems that actually facilitate overlapping.
There are 2 reasons why developers use blocking operations rather than
non-blocking ones:
1) they don't know about non-blocking operations.
2) MPI_Send is shorter to type than MPI_Isend().
Looking for overlapping is actually not that hard:
a) look for medium/large messages, don't waste time on small ones.
b) replace all MPI_Send() by a pair MPI_Isend() + MPI_Wait().
c) move the MPI_Isend() as early as possible (as soon as the data is ready).
d) move the MPI_Wait() as late as possible (just before the buffer is reused).
e) do the same for receives.
Most of the time, that will speed things up quite a bit, or at worst not
change anything. I am still looking for a tuning tool to do that automatically.
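The transformation steps above can be sketched as follows, assuming an MPI environment; the buffer names and the comment marking where the overlapped computation would go are illustrative, not taken from any particular code.

```c
/* Sketch of steps a)-e): hoist MPI_Isend()/MPI_Irecv() as early as possible
 * and sink MPI_Wait() as late as possible, leaving room for overlap between. */
#include <mpi.h>

void exchange(const double *out, double *in, int n, int peer, MPI_Comm comm)
{
    MPI_Request sreq, rreq;

    /* b) + c): post the receive and the send as soon as the data is ready. */
    MPI_Irecv(in, n, MPI_DOUBLE, peer, /*tag=*/0, comm, &rreq);
    MPI_Isend(out, n, MPI_DOUBLE, peer, /*tag=*/0, comm, &sreq);

    /* ... unrelated computation goes here, overlapping the transfers ... */

    /* d) + e): wait only just before the buffers are reused or read. */
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
}
```

The blocking original, MPI_Send() followed by MPI_Recv(), computes nothing during the transfer; this version gives the library (and the NIC, on hardware that facilitates overlap) a window to move the data in the background.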