[Beowulf] Re: Re: Home beowulf - NIC latencies

Mon Feb 14 21:12:36 PST 2005

Rossen,

It would be good to mention that you work for a company that sells an
implementation specifically designed for facilitating overlapping, in case
people don't know that.  Clearly you guys have thought a lot about this.

At the last two Scalable OS workshops (the only two I've had a chance to 
attend), there was a contingent of people who are certain that MPI isn't 
going to last too much longer as a programming model for very large 
systems.  The issue, as they see it, is that MPI simply imposes too much 
latency on communication, and because we (as MPI implementors) cannot 
decrease that latency fast enough to keep up with processor improvements, 
MPI will soon become too expensive to be of use on these systems.

Now, I don't personally think that this is going to happen as quickly as
some predict, but it is certainly a reason for us to pay very careful
attention to the latency issue, because for MPI implementors this is an
argument that never seems to end.

Also, there is additional overhead in the Isend()/Wait() pair over the
simple Send() (two function calls rather than one, allocation of a Request
structure at the least) that means that a naive attempt at overlapping
communication and computation will result in a slower application.  So
that doesn't surprise me at all.
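
A toy cost model makes the point above concrete. This is only a sketch:
the function names, the overlap_fraction parameter, and every number are
hypothetical, not measurements from any MPI implementation.

```python
def blocking_time(t_comp, t_comm):
    """Send() followed by computation: nothing overlaps, so costs add."""
    return t_comm + t_comp


def nonblocking_time(t_comp, t_comm, t_extra, overlap_fraction):
    """Isend(); compute; Wait().

    t_extra          -- assumed added cost of the Isend()/Wait() pair
                        (second call, Request allocation, ...)
    overlap_fraction -- assumed share (0..1) of the hideable transfer
                        time that the library really hides
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    hidden = overlap_fraction * min(t_comp, t_comm)
    return t_extra + t_comp + t_comm - hidden


if __name__ == "__main__":
    # Hypothetical times in microseconds.
    t_comp, t_comm, t_extra = 10.0, 2.0, 0.5
    print(blocking_time(t_comp, t_comm))
    # Library makes no progress during the computation: overlap is 0.
    print(nonblocking_time(t_comp, t_comm, t_extra, 0.0))
    # Library hides the whole transfer: overlap is 1.
    print(nonblocking_time(t_comp, t_comm, t_extra, 1.0))
```

In this model the non-blocking version wins only when the hidden transfer
time exceeds t_extra; with no overlap it is strictly slower than Send().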

I think that the theme from this thread should be that "it's a good thing
that we have more than one MPI implementation, because they all do
different things best."

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab

On Mon, 14 Feb 2005, Rossen Dimitrov wrote:

> There is quite a bit of published data showing that, for a number of real 
> application codes, a modest increase in MPI latency for very short messages 
> has no impact on application performance. This can also be seen by 
> doing traffic characterization, weighing the relative impact of the 
> increased latency, and taking into account the computation/communication 
> ratio. On the other hand, what you give the application developers with 
> an interrupt-driven MPI library is a higher potential for effective 
> overlapping, which they could choose to utilize or not, but unless they 
> send only very short messages, they will not see a negative performance 
> impact from using this library.
> 
> There is evidence that re-coding the MPI part of an application to take 
> advantage of overlapping and asynchrony when the MPI library (and 
> network) supports these well actually leads to real performance benefit.
> 
> There is evidence that, even without changing anything in the code, 
> just running the same code with an MPI library that plays nicer with 
> the system leads to better application performance by improving the 
> overall "application progress" - a loose term I use to describe all of 
> the complex system activities that need to occur during the life-cycle 
> of a parallel application not only on a single node, but on all nodes 
> collectively.
> 
> The question of short message latency is connected to system scalability 
> in at least one important scenario - running the same problem size as 
> fast as possible by adding more processors. This will lead to smaller 
> messages, much more sensitive to overhead, thus negatively impacting 
> scalability.
> 
> In other practical scenarios though, users increase the problem size as 
> the cluster size grows, or they solve multiple instances of the same 
> problem concurrently. This keeps message sizes away from the 
> extremely small sizes that result from maximum-scale runs and limits 
> the impact of shortest-message latency. I have seen many large clusters 
> whose only job run across all nodes is HPL for the top500 number. After 
> that, the system is either controlled by a job scheduler, which limits 
> the size of jobs to about 30% of all processors (an empirically derived 
> number that supposedly improves the overall job throughput), or it is 
> physically or logically divided into smaller sub-clusters.
> 
> All this being said, there is obviously a large group of codes that use 
> small messages no matter what size problem they solve or what the 
> cluster size is. For these, the lowest latency will be the most 
> important (if not the only) optimization parameter. For these cases, 
> users can just run the MPI library in polling mode.
> 
> With regard to the assessment that every MPI library does (a) partly 
> right I'd like to mention that I have seen behavior where attempting to 
> overlap computation and communication can lead to no performance 
> improvement at all, or even worse, to performance degradation. This is 
> one example of how a particular implementation of a standard API can 
> affect the way users code against it. I use a metric called "degree of 
> overlapping" which for "good" systems approaches 1, for "bad" systems 
> approaches 0, and for terrible systems becomes negative... Here goodness 
> is measured as how well the system facilitates overlapping.
> 
> Rossen
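
The quoted message names a "degree of overlapping" metric but does not
define it. One formulation that matches the description (near 1 for good
systems, near 0 for bad ones, negative for terrible ones) is sketched
below. The formula and all names are assumptions, not Rossen's definition.

```python
def degree_of_overlap(t_comp, t_comm, t_measured):
    """Assumed metric: fraction of the hideable time that was hidden.

    t_comp     -- time of the computation run on its own
    t_comm     -- time of the communication run on its own
    t_measured -- wall time when the two are issued so they can overlap

    Perfect overlap gives t_measured == max(t_comp, t_comm)  ->  1.0
    No overlap gives      t_measured == t_comp + t_comm      ->  0.0
    Overlap attempt costing more than running serially       ->  < 0
    """
    hideable = min(t_comp, t_comm)
    if hideable <= 0:
        raise ValueError("t_comp and t_comm must both be positive")
    return (t_comp + t_comm - t_measured) / hideable


if __name__ == "__main__":
    # Hypothetical timings in microseconds.
    print(degree_of_overlap(10.0, 2.0, 10.0))  # fully hidden transfer
    print(degree_of_overlap(10.0, 2.0, 12.0))  # no overlap at all
    print(degree_of_overlap(10.0, 2.0, 13.0))  # slower than serial
```

Normalizing by min(t_comp, t_comm) is a design choice: only the shorter of
the two activities can be hidden, so that is the most an implementation
could save.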