[Beowulf] Re: Re: Home beowulf - NIC latencies
Rossen Dimitrov
rossen at VerariSoft.Com
Mon Feb 14 14:32:57 PST 2005
Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
>
>
>>Let me ask a stupid question: which MPI implementations really allow
>>
>>a) overlapping MPI_Isend with computation
>>and/or
>>b) performing a set of subsequent MPI_Isend calls faster than "the
>>same" set of MPI_Send calls?
>>
>>I am asking only about sending large messages.
>
>
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had
> very slow short message performance because they thought getting (a)
> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.
There is quite a bit of published data showing that, for a number of real
application codes, a modest increase in MPI latency for very short
messages has no impact on application performance. This can also be seen
by characterizing the traffic, weighing the relative impact of the
increased latency, and taking the computation/communication ratio into
account. On the other hand, what an interrupt-driven MPI library gives
application developers is a higher potential for effective overlapping,
which they can choose to exploit or not; unless they send only very
short messages, they will not see a negative performance impact from
using such a library.
There is evidence that re-coding the MPI part of an application to take
advantage of overlapping and asynchrony, when the MPI library (and
network) support them well, leads to real performance benefit. There is
also evidence that, even without changing anything in the code, simply
running it with an MPI library that plays nicer with the system improves
application performance by improving the overall "application progress" -
a loose term I use to describe all of the complex system activities that
need to occur during the life-cycle of a parallel application, not only
on a single node but on all nodes collectively.
The question of short message latency is connected to system scalability
in at least one important scenario: running the same problem size as
fast as possible by adding more processors. This leads to smaller
messages, which are much more sensitive to overhead, and thus hurts
scalability.
In other practical scenarios, though, users increase the problem size as
the cluster grows, or they solve multiple instances of the same problem
concurrently. This keeps message sizes away from the extremely small
sizes that result from maximum-scale runs and limits the impact of the
shortest-message latency. I have seen many large clusters whose only job
run across all nodes is HPL, for the top500 number. After that, the
system is either controlled by a job scheduler that limits jobs to about
30% of all processors (an empirically derived number that supposedly
improves overall job throughput), or it is physically or logically
divided into smaller sub-clusters.
All this being said, there is obviously a large group of codes that use
small messages no matter what problem size they solve or how large the
cluster is. For these, the lowest latency will be the most important (if
not the only) optimization parameter, and users can simply run the MPI
library in polling mode.
With regard to the assessment that every MPI library does (a) at least
partly right, I'd like to mention that I have seen behavior where
attempting to overlap computation and communication leads to no
performance improvement at all or, even worse, to performance
degradation. This is one example of how a particular implementation of a
standard API can affect the way users code against it. I use a metric
called "degree of overlapping", which approaches 1 for "good" systems,
approaches 0 for "bad" systems, and becomes negative for terrible ones...
Here goodness is measured by how well the system facilitates overlapping.
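One way to measure something like this is a two-rank micro-benchmark:
time communication alone, computation alone, and then the two together,
and see what fraction of the cheaper phase got hidden. The sketch below
is just that idea in C; the message and work sizes are arbitrary
placeholders, and the final formula is one reasonable reading of the
metric, not a published definition:

#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)                /* 8 MB message */

static double buf[N];

/* A fixed chunk of purely local work (placeholder). */
static void compute(void)
{
    static double a[1 << 20];
    for (int r = 0; r < 50; r++)
        for (int i = 0; i < (1 << 20); i++)
            a[i] = a[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {               /* peer just sinks the two sends */
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else if (rank == 0) {
        MPI_Request req;
        double t0, t_comm, t_comp, t_both;

        t0 = MPI_Wtime();          /* communication alone */
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        t_comm = MPI_Wtime() - t0;

        t0 = MPI_Wtime();          /* computation alone */
        compute();
        t_comp = MPI_Wtime() - t0;

        t0 = MPI_Wtime();          /* the two together */
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        compute();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        t_both = MPI_Wtime() - t0;

        /* ~1: cheaper phase fully hidden; ~0: phases serialized;
         * negative: overlapping actively hurt. */
        printf("degree of overlapping ~ %.2f\n",
               (t_comm + t_comp - t_both) /
               (t_comm < t_comp ? t_comm : t_comp));
    }

    MPI_Finalize();
    return 0;
}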
Rossen
>
> -- greg