[Beowulf] Re: Re: Home beowulf - NIC latencies
Rossen Dimitrov
rossen at VerariSoft.Com
Fri Feb 11 22:52:03 PST 2005
I think that the mere definition of the term "MPI performance", and
focusing too much on it, can potentially have a negative impact on the
overall discussion of parallel performance. Accepting the premise that
all MPI can do is push individual messages between user processes as
fast as possible (as measured by ping-pong), regardless of how this is
achieved, unnecessarily and, I'd say, unjustifiably restricts the field
of discussion. I agree that today MPI libraries are commonly measured by
their ping-pong "performance" and not by their CPU utilization or other
factors, but that does not necessarily make this form of performance
evaluation right.
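For concreteness, here is a minimal sketch of the kind of ping-pong
microbenchmark I mean (my own illustration, not any particular benchmark
suite; the message size and iteration count are arbitrary, and it
assumes exactly two ranks):

    /* Minimal ping-pong sketch: reports half the round-trip time of a
     * message bounced between rank 0 and rank 1. Run with 2 processes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NITER   1000
    #define MSGSIZE 8          /* bytes; vary to sweep latency vs. bandwidth */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(MSGSIZE);
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * NITER) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Note that this number says nothing about what the CPU was doing while
the messages moved, which is exactly the point below.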
I would support the idea of discussing isolated "MPI performance", but
only in the context of a broader performance parameter space, including
at least communication overhead, communication bandwidth, processor
overhead, and the ability to perform asynchronous communication (i.e.,
compliance with the MPI Progress Rule). Only in such a broader
evaluation space can one hope to fit the large number of combinations of
processor/memory/peripheral-fabric architectures, network interconnects,
system software/middleware, and application algorithms.
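One way to probe the asynchronous axis of this space, as a rough sketch
of my own (two ranks assumed; the message size and dummy workload are
arbitrary), is to time a non-blocking exchange alone, a computation
alone, and then both together. If the library truly progresses the
transfer in the background, the combined time approaches the larger of
the two rather than their sum:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSGSIZE (4*1024*1024)   /* large enough to take measurable time */

    static double compute(long n)   /* dummy work touching no MPI state */
    {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += (double)i * 0.5;
        return s;
    }

    int main(int argc, char **argv)
    {
        int rank, peer;
        char *sbuf = malloc(MSGSIZE), *rbuf = malloc(MSGSIZE);
        MPI_Request req[2];
        double t0, t_comm, t_comp, t_both;
        volatile double sink;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;             /* assumes exactly 2 processes */

        /* 1: communication alone */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Irecv(rbuf, MSGSIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, MSGSIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        t_comm = MPI_Wtime() - t0;

        /* 2: computation alone */
        t0 = MPI_Wtime();
        sink = compute(50L * 1000 * 1000);
        t_comp = MPI_Wtime() - t0;

        /* 3: both, with the compute between post and wait */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Irecv(rbuf, MSGSIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, MSGSIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
        sink = compute(50L * 1000 * 1000);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        t_both = MPI_Wtime() - t0;

        if (rank == 0)
            printf("comm %.3f s, comp %.3f s, both %.3f s (sum %.3f)\n",
                   t_comm, t_comp, t_both, t_comm + t_comp);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }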
Of course, there is always the option of running the actual application
code and then evaluating MPI performance by seeing which MPI library
(or library mode) makes the application run faster. Unfortunately, this
method for evaluating MPI often suffers from various inefficiencies,
some of which originate from the parallel algorithm developers, who
throughout the years have sometimes adopted the most trivial ways of
using MPI.
Here are a couple of arguments for why it is important to look at MPI
(and the whole communication system) from different angles. If certain
MPI optimizations are achieved at the cost of excessive use of resources
that could otherwise be used for computation or for enabling overall
"application progress", the actual application performance may fall
below its potential, or even degrade. Here are some "application
progress" activities that can benefit from having these resources at
their disposal: OS/kernel processing, other communication, I/O
operations, memory operations (prefetching, etc.), and peripheral
bus/fabric operations. All of
these in one way or another depend on CPU processing. Also, today's
processor architectures have many independent processing units and
complex memory hierarchies. When the MPI library polls for completion of
a communication request, most of this specialized hardware is virtually
unused (wasted). The processor architecture trends indicate that this
kind of internal CPU concurrency will continue to increase, thus making
the cost of MPI polling even higher.
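A hypothetical sketch of the bind this puts the application in: with a
polling library, a pending transfer progresses only inside calls into
MPI, so the developer must interleave work units with MPI_Test and trade
work granularity against transfer progress (two ranks assumed; all
constants here are arbitrary):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSGSIZE (16*1024*1024)

    static double work_unit(void)      /* one chunk of application work */
    {
        double s = 0.0;
        for (int i = 0; i < 100000; i++)
            s += (double)i;
        return s;
    }

    int main(int argc, char **argv)
    {
        int rank, done = 0;
        long units = 0;
        char *buf = malloc(MSGSIZE);
        MPI_Request req;
        volatile double sink;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Irecv(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            while (!done) {
                sink = work_unit();    /* application work ...           */
                units++;
                MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* ... vs. progress */
            }
            printf("completed %ld work units while the receive was pending\n",
                   units);
        } else if (rank == 1) {
            MPI_Send(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Shrink the work unit and you burn cycles polling inside the library;
grow it and the transfer stalls between MPI_Test calls.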
In this regard, a parallel application developer might very much care
what is actually happening in the MPI library even when he makes a call
to MPI_Send. If he doesn't, he probably should.
Some related topics (not covered here to avoid further bloviating) are:
- How an MPI library tuned to maximize ping-pong performance alone can
cause unexpected behavior and make a fully functional parallel system
work far below its realistic efficiency.
- What application algorithm developers experience when they attempt to
use the ever so nebulous "overlapping" with a polling MPI library, and
how this experience has contributed to the overwhelming use of
MPI_Send/MPI_Recv even in codes that could benefit from non-blocking or
(even better) persistent MPI calls (a minimal sketch of the persistent
style follows below), thus killing any hope that these codes can run
faster on systems that actually facilitate overlapping.
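For reference, a minimal sketch of the persistent style mentioned in the
second point (a ring exchange of my own invention; buffer sizes, tags,
and neighbor choice are illustrative): the fixed pattern is set up once
with MPI_Send_init/MPI_Recv_init and then only started and completed
each iteration, which gives the library its best chance to prepare the
transfer.

    #include <mpi.h>
    #include <stdlib.h>

    #define N      1024
    #define NSTEPS 100

    int main(int argc, char **argv)
    {
        int rank, size, right, left;
        double *sbuf = malloc(N * sizeof(double));
        double *rbuf = malloc(N * sizeof(double));
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        right = (rank + 1) % size;
        left  = (rank + size - 1) % size;

        /* set up the fixed communication pattern once */
        MPI_Send_init(sbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Recv_init(rbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);

        for (int step = 0; step < NSTEPS; step++) {
            /* ... fill sbuf for this step ... */
            MPI_Startall(2, req);
            /* ... computation that does not touch sbuf/rbuf ... */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            /* ... use rbuf ... */
        }

        MPI_Request_free(&req[0]);
        MPI_Request_free(&req[1]);
        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }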
Rossen
Rob Ross wrote:
> Hi Isaac,
>
> On Fri, 11 Feb 2005, Isaac Dooley wrote:
>
>
>>>>Using MPI_ISend() allows programs to not waste CPU cycles waiting on the
>>>>completion of a message transaction.
>>>
>>>No, it allows the programmer to express that it wants to send a message
>>>but not wait for it to complete right now. The API doesn't specify the
>>>semantics of CPU utilization. It cannot, because the API doesn't have
>>>knowledge of the hardware that will be used in the implementation.
>>>
>>
>>That is partially true. The context for my comment was under your
>>assumption that everyone uses MPI_Send(). These people, as I stated
>>before, do not care about what the CPU does during their blocking calls.
>
>
> I think that it is completely true. I made no assumption about everyone
> using MPI_Send(); I'm a late-comer to the conversation.
>
> I was not trying to say anything about what people making the calls care
> about; I was trying to clarify what the standard does and does not say.
> However, I agree with you that it is unlikely that someone calling
> MPI_Send() is too worried about what the CPU utilization is during the
> call.
>
>
>>I was trying to point out that programs utilizing non-blocking IO may
>>have work that will be adversely impacted by CPU utilization for
>>messaging. These are the people who care about CPU utilization for
>>messaging. This, I hope, answers your prior question, at least partially.
>
>
> I agree that people using MPI_Isend() and related non-blocking operations
> are sometimes doing so because they would like to perform some
> computation while the communication progresses. People also use these
> calls to initiate a collection of point-to-point operations before
> waiting, so that multiple communications may proceed in parallel. The
> implementation has no way of really knowing which of these is the case.
>
> Greg just pointed out that for small messages most implementations will do
> the exact same thing as in the MPI_Send() case anyway. For large messages
> I suppose that something different could be done. In our implementation
> (MPICH2), to my knowledge we do not differentiate.
>
> You should understand that the way MPI implementations are measured is by
> their performance, not CPU utilization, so there is pressure to push the
> former as much as possible at the expense of the latter.
>
>
>>Perhaps your applications demand low latency with no concern for the CPU
>>during the time spent blocking. That is fine. But some applications
>>benefit from overlapping computation and communication, and the cycles
>>not wasted by the CPU on communication can be used productively.
>
>
> I wouldn't categorize the cycles spent on communication as "wasted"; it's
> not like we code in extraneous math just to keep the CPU pegged :).
>
> Regards,
>
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf