[Beowulf] Re: Re: Home beowulf - NIC latencies

Mon Feb 14 22:27:41 PST 2005

Rob, I agree that by now it is well understood that by providing a very 
flexible API with a rich set of semantics, MPI may have missed some 
opportunities for accelerating message passing in some constrained 
cases. Many of us have seen codes that not only use just the famous 6 
MPI functions, but also avoid wild cards and out-of-order messages. As a 
result, these codes pay for services they don't use.

As for the predicted end-of-life of MPI, I wouldn't necessarily bet 
on it. As often happens, the technical reasons may have little to do 
with the outcome. By now MPI has penetrated so many long-term 
programs that it will be around for quite a while. Of course, this does 
not mean that there will be no attempts to "fix" it or replace it with 
something else. This might in fact be a good thing - the natural 
evolution of technology.

Rossen
Verari Systems Software

Rob Ross wrote:
> Rossen,
> 
> It would be good to mention that you work for a company that sells an
> implementation specifically designed for facilitating overlapping, in case
> people don't know that.  Clearly you guys have thought a lot about this.
> 
> At the last two Scalable OS workshops (the only two I've had a chance to 
> attend), there was a contingent of people who are certain that MPI isn't 
> going to last too much longer as a programming model for very large 
> systems.  The issue, as they see it, is that MPI simply imposes too much 
> latency on communication, and because we (as MPI implementors) cannot 
> decrease that latency fast enough to keep up with processor improvements, 
> MPI will soon become too expensive to be of use on these systems.
> 
> Now, I don't personally think that this is going to happen as quickly as
> some predict, but it is certainly an argument that we should be paying
> very careful attention to the latency issue, because as MPI implementors 
> this is an argument that never seems to end.
> 
> Also, there is additional overhead in the Isend()/Wait() pair over the
> simple Send() (two function calls rather than one, allocation of a Request
> structure at the least) that means that a naive attempt at overlapping
> communication and computation will result in a slower application.  So
> that doesn't surprise me at all.
> 
> I think that the theme from this thread should be that "it's a good thing
> that we have more than one MPI implementation, because they all do
> different things best."
> 
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
> 
> 
> On Mon, 14 Feb 2005, Rossen Dimitrov wrote:
> 
> 
>>There is quite a bit of published data showing that, for a number of 
>>real application codes, a modest increase in MPI latency for very short 
>>messages has no impact on application performance. This can also be 
>>seen by doing traffic characterization, weighing the relative impact of 
>>the increased latency, and taking into account the 
>>computation/communication ratio. On the other hand, what you give 
>>application developers with an interrupt-driven MPI library is a higher 
>>potential for effective overlapping, which they can choose to utilize 
>>or not; unless they send only very short messages, they will not see a 
>>negative performance impact from using this library.
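
The weighing described above can be sketched as a back-of-the-envelope calculation. This is only an illustration: the linear cost model (latency plus size divided by bandwidth, with no overlap), the function name relative_slowdown, and every number below are assumptions, not data from this thread.

```python
def relative_slowdown(compute_s, n_messages, msg_bytes,
                      latency_s, bandwidth_Bps, extra_latency_s):
    """Fractional run-time increase when per-message latency grows.

    Assumed model (hypothetical): run time = compute time plus
    n_messages * (latency + msg_bytes / bandwidth), with no overlap.
    """
    comm_s = n_messages * (latency_s + msg_bytes / bandwidth_Bps)
    baseline_s = compute_s + comm_s
    return n_messages * extra_latency_s / baseline_s


# Compute-heavy code with 64 KiB messages: extra latency barely matters.
large = relative_slowdown(10.0, 100_000, 65536, 5e-6, 1e9, 5e-6)

# Communication-heavy code with 8-byte messages: extra latency dominates.
small = relative_slowdown(0.1, 100_000, 8, 5e-6, 1e9, 5e-6)

print(f"large messages: {large:.1%}, small messages: {small:.1%}")
```

With these made-up inputs, doubling the latency costs the first code a few percent and the second code most of its run time, which is the distinction the paragraph draws between typical codes and codes that send only very short messages.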
>>
>>There is evidence that re-coding the MPI part of an application to take 
>>advantage of overlapping and asynchrony when the MPI library (and 
>>network) supports these well actually leads to real performance benefit.
>>
>>There is evidence that, even without changing anything in the code, 
>>just running the same code with an MPI library that plays more nicely 
>>with the system leads to better application performance by improving 
>>the overall "application progress" - a loose term I use to describe all 
>>of the complex system activities that need to occur during the 
>>life-cycle of a parallel application, not only on a single node but on 
>>all nodes collectively.
>>
>>The question of short message latency is connected to system scalability 
>>in at least one important scenario - running the same problem size as 
>>fast as possible by adding more processors. This will lead to smaller 
>>messages, much more sensitive to overhead, thus negatively impacting 
>>scalability.
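
To make the fixed-problem-size scenario concrete, here is a small sketch of how message size might shrink as processors are added. Everything in it is an assumption for illustration: a cubic grid split into cubic sub-domains, one face exchanged per message, 8 bytes per cell, and the same latency-plus-size-over-bandwidth cost model with invented numbers.

```python
def halo_message_bytes(grid_side, n_procs, bytes_per_cell=8):
    """Bytes in one face exchange for a cubic grid split into n_procs cubes.

    Hypothetical decomposition: each process owns a cube whose side is
    grid_side / n_procs**(1/3) and sends one full face per neighbor.
    """
    sub_side = grid_side / n_procs ** (1.0 / 3.0)
    return sub_side ** 2 * bytes_per_cell


def latency_share(msg_bytes, latency_s, bandwidth_Bps):
    """Fraction of one message's transfer time that is fixed latency."""
    return latency_s / (latency_s + msg_bytes / bandwidth_Bps)


# Same 256^3 problem on more and more processors (assumed 5 us, 1 GB/s).
for p in (8, 64, 512, 4096):
    size = halo_message_bytes(256, p)
    share = latency_share(size, 5e-6, 1e9)
    print(f"{p:5d} procs: {size:9.0f} bytes/message, latency share {share:.2f}")
```

Under these assumptions the per-message payload falls as the processor count rises, so the fixed latency becomes a larger part of every transfer.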
>>
>>In other practical scenarios though, users increase the problem size as 
>>the cluster size grows, or they solve multiple instances of the same 
>>problem concurrently. Either way, the message sizes stay away from the 
>>extremely small sizes that result from maximum-scale runs, which limits 
>>the impact of shortest-message latency. I have seen many large clusters 
>>whose only job run across all nodes is HPL for the top500 number. After 
>>that, the system is either controlled by a job scheduler, which limits 
>>the size of jobs to about 30% of all processors (an empirically derived 
>>number that supposedly improves the overall job throughput), or it is 
>>physically or logically divided into smaller sub-clusters.
>>
>>All this being said, there is obviously a large group of codes that use 
>>small messages no matter what size problem they solve or what the 
>>cluster size is. For these, the lowest latency will be the most 
>>important (if not the only) optimization parameter. For these cases, 
>>users can just run the MPI library in polling mode.
>>
>>With regard to the assessment that every MPI library does (a) partly 
>>right, I'd like to mention that I have seen behavior where attempting to 
>>overlap computation and communication can lead to no performance 
>>improvement at all, or even worse, to performance degradation. This is 
>>one example of how a particular implementation of a standard API can 
>>affect the way users code against it. I use a metric called "degree of 
>>overlapping" which for "good" systems approaches 1, for "bad" systems 
>>approaches 0, and for terrible systems becomes negative... Here goodness 
>>is measured as how well the system facilitates overlapping.
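
The message does not give a formula for the "degree of overlapping", so the following is only one possible definition that reproduces the behavior described (near 1 for good systems, near 0 for bad ones, negative for terrible ones). The function name, its arguments, and the sample timings are hypothetical.

```python
def degree_of_overlap(t_compute, t_comm, t_combined):
    """One possible overlap metric (an assumption, not the author's formula).

    t_compute  : time of the computation phase run alone
    t_comm     : time of the communication phase run alone (blocking)
    t_combined : measured time when the code attempts to overlap the two

    Returns the time saved by overlapping, divided by the most that could
    have been saved, which is the shorter of the two phases.
    """
    hideable = min(t_compute, t_comm)
    if hideable <= 0:
        raise ValueError("both phases must take a positive amount of time")
    return (t_compute + t_comm - t_combined) / hideable


print(degree_of_overlap(4.0, 1.0, 4.0))  # communication fully hidden
print(degree_of_overlap(4.0, 1.0, 5.0))  # no benefit: phases just add up
print(degree_of_overlap(4.0, 1.0, 5.5))  # overlap attempt made things slower
```

With this definition, a combined time equal to the longer phase gives 1, a combined time equal to the sum of the phases gives 0, and anything slower than the sum gives a negative value.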
>>
>>Rossen
> 