[Beowulf] Re: Re: Home beowulf - NIC latencies

Mon Feb 14 14:32:57 PST 2005

Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
> 
> 
>>Let me ask some stupid's question: which MPI implementations allow
>>really
>> 
>>a) to overlap MPI_Isend w/computations
>>and/or 
>>b) to perform a set of subsequent MPI_Isend calls faster than "the 
>>same" set of MPI_Send calls ?
>>
>>I say only about sending of large messages.
> 
> 
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had
> very slow short message performance because they thought getting (a)
> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.

There is quite a bit of published data showing that, for a number of
real application codes, a modest increase in MPI latency for very short
messages has no impact on application performance. This can also be seen
by characterizing the traffic, weighing the relative impact of the
increased latency, and taking into account the computation/communication
ratio. On the other hand, what an interrupt-driven MPI library gives
application developers is a higher potential for effective overlapping,
which they can choose to use or not. Unless they send only very short
messages, they will not see a negative performance impact from using
such a library.
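
As a rough illustration of the argument above, here is a minimal sketch
of the usual linear cost model (time = latency + size / bandwidth). The
function names and all of the numbers (5 us base latency, 5 us of added
latency, a 250 MB/s link) are assumptions picked for illustration, not
measurements of any particular MPI library or network.

```python
# Illustrative point-to-point cost model: time = latency + size / bandwidth.
# All numbers below are assumed for illustration, not measurements.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send one message under the simple linear model."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def relative_slowdown(size_bytes, base_latency_s, extra_latency_s,
                      bandwidth_bytes_per_s):
    """Fractional increase in per-message time caused by extra latency."""
    base = transfer_time(size_bytes, base_latency_s, bandwidth_bytes_per_s)
    slow = transfer_time(size_bytes, base_latency_s + extra_latency_s,
                         bandwidth_bytes_per_s)
    return (slow - base) / base


if __name__ == "__main__":
    BASE_LATENCY = 5e-6    # assumed 5 us latency in polling mode
    EXTRA_LATENCY = 5e-6   # assumed 5 us added by interrupt-driven progress
    BANDWIDTH = 250e6      # assumed 250 MB/s link
    for size in (8, 1024, 64 * 1024, 1024 * 1024):
        pct = 100 * relative_slowdown(size, BASE_LATENCY, EXTRA_LATENCY,
                                      BANDWIDTH)
        print(f"{size:>8} bytes: +{pct:.1f}% per message")
```

Under these assumed numbers the added latency dominates only for the
smallest messages. The effect on a whole application would be smaller
still, since it is scaled by the fraction of run time spent
communicating.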

There is also evidence that re-coding the MPI part of an application to
take advantage of overlapping and asynchrony, when the MPI library (and
network) support them well, leads to a real performance benefit.

There is evidence that, even without changing anything in the code, just
running the same code with an MPI library that plays nicer with the
system leads to better application performance by improving the overall
"application progress", a loose term I use to describe all of the
complex system activities that need to occur during the life cycle of a
parallel application, not only on a single node but on all nodes
collectively.

The question of short message latency is connected to system scalability
in at least one important scenario: running the same problem size as
fast as possible by adding more processors. Doing so leads to smaller
messages, which are much more sensitive to per-message overhead, and
that hurts scalability.

In other practical scenarios, though, users increase the problem size as
the cluster size grows, or they solve multiple instances of the same
problem concurrently. Either way, the message sizes stay away from the
extremely small sizes that result from maximum-scale runs, which limits
the impact of shortest-message latency. I have seen many large clusters
whose only job run across all nodes is HPL for the Top500 number. After
that, the system is either controlled by a job scheduler, which limits
the size of jobs to about 30% of all processors (an empirically derived
number that supposedly improves the overall job throughput), or it is
physically or logically divided into smaller sub-clusters.
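
The contrast between the two scaling scenarios above can be sketched
with a small calculation. It assumes a cubic grid split into equal cubic
subdomains, one 8-byte value per cell, and a one-cell-deep face
exchange. The decomposition, the grid sizes, and the function name are
all hypothetical and chosen only for illustration.

```python
# Halo-exchange message size for a cubic grid split into cubic subdomains.
# The 3D decomposition and 8-byte cells are assumptions for illustration.

BYTES_PER_CELL = 8  # assumed: one double-precision value per cell


def face_message_bytes(global_cells_per_side, num_procs):
    """Bytes in one face exchange when num_procs is a perfect cube."""
    procs_per_side = round(num_procs ** (1.0 / 3.0))
    if procs_per_side ** 3 != num_procs:
        raise ValueError("num_procs must be a perfect cube")
    local_side = global_cells_per_side // procs_per_side
    return local_side * local_side * BYTES_PER_CELL


if __name__ == "__main__":
    for procs in (1, 8, 64, 512):
        # Fixed problem size: the global grid stays at 512 cells per side.
        fixed = face_message_bytes(512, procs)
        # Growing problem size: each process keeps a 512-cell-per-side block.
        side = 512 * round(procs ** (1.0 / 3.0))
        grown = face_message_bytes(side, procs)
        print(f"{procs:>4} procs: fixed {fixed:>8} B, grown {grown:>8} B")
```

In this sketch the fixed-size run shrinks each face message by a factor
of four every time the process count grows eightfold, while the run that
grows the problem with the machine keeps the message size constant.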

All this being said, there is obviously a large group of codes that use 
small messages no matter what size problem they solve or what the 
cluster size is. For these, the lowest latency will be the most 
important (if not the only) optimization parameter. For these cases, 
users can just run the MPI library in polling mode.

With regard to the assessment that every MPI library does (a) partly
right, I'd like to mention that I have seen behavior where attempting to
overlap computation and communication leads to no performance
improvement at all or, even worse, to performance degradation. This is
one example of how a particular implementation of a standard API can
affect the way users code against it. I use a metric called "degree of
overlapping", which for "good" systems approaches 1, for "bad" systems
approaches 0, and for terrible systems becomes negative... Here goodness
is measured as how well the system facilitates overlapping.
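
The exact formula for the metric is not given here, so the following is
only one plausible definition that behaves as described: it compares the
time hidden by overlapping with the most that could have been hidden.
The function name and the three timing inputs are assumptions, and the
sample timings are invented.

```python
# One possible "degree of overlapping" metric. This definition is an
# assumption for illustration; the post does not state the formula.

def degree_of_overlap(t_compute, t_comm, t_combined):
    """Fraction of the hideable time that was actually hidden.

    t_compute:  time of the computation alone
    t_comm:     time of the communication alone
    t_combined: time when both are issued together
    """
    hideable = min(t_compute, t_comm)
    if hideable <= 0:
        raise ValueError("both phases must take positive time")
    hidden = t_compute + t_comm - t_combined
    return hidden / hideable


if __name__ == "__main__":
    print(degree_of_overlap(10.0, 4.0, 10.0))  # communication fully hidden
    print(degree_of_overlap(10.0, 4.0, 14.0))  # no overlap at all
    print(degree_of_overlap(10.0, 4.0, 16.0))  # slower than doing them in turn
```

With this definition the three cases give 1, 0, and a negative value
respectively. In practice the three timings would come from running the
computation alone, the communication alone, and the two together.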

Rossen

> 
> -- greg