[Beowulf] Re: Re: Home beowulf - NIC latencies
patrick at myri.com
Wed Feb 16 03:28:00 PST 2005
Rossen Dimitrov wrote:
>> So if you run an MPI application and it sucks, this is because the
>> application is poorly written ?
> Patrick, here the argument is about whether and how you "measure" the
> "performance of MPI". I guess you may have missed some of the preceding
No, I was pulling your leg :-) The bigger picture is that MPI has no
performance in itself; it's middleware. You can only measure how well
an MPI implementation enables a specific application to perform. Only
benchmarking of applications is meaningful; you can argue that
everything else is futile and bogus.
>> You don't want to benchmark an application to evaluate MPI, you want
>> to benchmark an application to find the best set of resources to get
>> the job done. If the code stinks, it's not an excuse. Good MPI
>> implementations are good with poorly written applications, but still
>> let smart people do smart things if they want.
> This is exactly my point made in my previous posting - you cannot design
> a system that is optimal in a single mode for all cases of its use when
> there are multiple parameters defining the usage and performance
I agree completely: being able to apply different assumptions to the
whole code and see which one best matches the application's behavior is
better than nothing. However, I believe that some tradeoffs are just too
intrusive: you should not have to choose between low latency for small
messages and progress by interrupt for large ones, especially when you
can have both at the same time.
> I think it is fairly easy to show that overlapping and polling (or any
> kind of communication completion synchronization) are not orthogonal. If
> this was the case, you would see codes that show perfect overlapping
> running on any MPI implementation/network pair. I am sure there is
> plenty of evidence this is not the case.
I can show you codes where people sprinkled some MPI_Test()s in some
loops. They don't poll to death, just a little from time to time to
improve overlap by improving progression. They poll and they overlap.
They could as well block and not overlap. Polling/blocking and
overlap/no overlap are not linked. Interrupts are useful to get overlap
without help from the application, but they are not required to overlap.
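To illustrate the point about occasional polling, here is a toy Python
model, not real MPI code. ToyRequest, run() and poll_every are
hypothetical names, and the model assumes a transfer that advances only
when the application calls into the library (no interrupts, no progress
thread). It counts how much of the transfer is left for the final
blocking wait.

```python
class ToyRequest:
    """Hypothetical stand-in for a non-blocking request in a library
    that makes progress only when the application calls into it."""

    def __init__(self, units):
        self.remaining = units  # transfer work not yet pushed through

    def test(self):
        # One call pushes one unit of progress, then reports completion.
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0

    def wait(self):
        # Blocking completion: everything still pending is done here,
        # while the application is idle. Returns that exposed amount.
        exposed = self.remaining
        self.remaining = 0
        return exposed


def run(compute_steps, units, poll_every):
    """Return the transfer units left for the final wait().

    poll_every == 0 means the compute loop never polls.
    """
    req = ToyRequest(units)
    for step in range(1, compute_steps + 1):
        # ... one unit of application computation would go here ...
        if poll_every and step % poll_every == 0:
            req.test()  # a little polling from time to time
    return req.wait()
```

Under these assumptions, run(100, 10, 10) leaves nothing for the wait,
while run(100, 10, 0) leaves all 10 units exposed. The same loop either
overlaps or does not, depending only on whether it gives the library
some cycles.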
> There is an important point here that needs to be clarified: when I say
> "polling" library, I assume that this library does both: polling
> completion synchronization and polling progress. There is not much room
> to define here these but I am sure MPI developers know what they are.
I think this is where we don't understand each other. For me, polling
means no interrupts. Whether you progress in the context of MPI calls
or in the context of a progression thread, you pay for the same CPU
cycles. If the application is providing CPU cycles to the MPI lib at the
right time, you can overlap perfectly without wasting cycles.
> Here is a third one. Writing your code for overlapping with non-blocking
> MPI calls and segmentation/pipelining, testing the code, and not seeing
> any benefit of it.
Yes. This is very true. But if it's not worse than with blocking, they
should stick with non-blocking, even if the code is bigger and more
confusing.
> stage I with communication in stage I+1. Then, there is the question how
> many segments you use to break up the message for maximum speedup. The
> pipelining theory says the more you can get the better, when they are
> with equal duration, there aren't inter-stage dependencies, and the
> stage setup time is low in proportion to the stage execution time. Also,
The more steps, the more overhead. Small pipeline stages decrease your
startup overhead (when the second stage is empty) but increase the
number of segments and the total cost of the pipeline. The best is to
find a piece of computation long enough to hide the communication.
Pipelining would be overkill in my opinion.
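The segment-count tradeoff can be sketched with a simple cost model in
Python. This is an illustration under stated assumptions, not a
measurement: a two-stage pipeline (communicate a segment, then compute
on it), equal-sized segments, and a fixed per-segment overhead. The
function names and parameters are hypothetical.

```python
def pipeline_time(comm, compute, overhead, n):
    """Total time for n equal segments in a two-stage pipeline.

    comm, compute: total communication and computation time.
    overhead: fixed cost paid once per segment (assumed constant).
    """
    c = comm / n + overhead   # time to communicate one segment
    w = compute / n           # time to compute on one segment
    # First segment arrives while the compute stage is empty (startup),
    # then the slower stage paces the middle, then the last compute drains.
    return c + (n - 1) * max(c, w) + w


def best_segments(comm, compute, overhead, max_n=64):
    """Segment count in 1..max_n with the smallest modeled time."""
    return min(range(1, max_n + 1),
               key=lambda n: pipeline_time(comm, compute, overhead, n))
```

With comm = 100, compute = 100 and overhead = 5, this model gives 205
for one segment, 145 for four, and about 421.6 for 64: more segments
shrink the startup but the per-segment overhead eventually dominates.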
> The metric I mentioned earlier "degree of overlapping" with some
> additional analysis can help designers _predict_ whether the design is
> good or not and whether it will work well or not on a particular system
> of interest (including the MPI library).
Temporal dependency between buffers and computation is the metric for
overlapping. The longer you don't need a buffer, the better you can
overlap a communication to/from it. Compilers could know that.
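As a rough sketch of that metric in Python: assume the transfer makes
steady progress on its own once posted, and call "slack" the time
between posting the transfer and first touching the buffer again. Both
function names and the slack parameter are hypothetical.

```python
def exposed_comm_time(comm_time, slack):
    """Communication time the application still has to wait for.

    slack: time between posting the transfer and next needing the buffer.
    """
    return max(0.0, comm_time - slack)


def overlap_fraction(comm_time, slack):
    """Share of the communication hidden behind computation (0.0 to 1.0)."""
    if comm_time <= 0:
        return 1.0  # nothing to hide
    return min(slack, comm_time) / comm_time
```

For a 10-unit transfer, a slack of 4 hides 40% of it and leaves 6 units
exposed; any slack of 10 or more hides it completely.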
> This is however too much detail for this forum though, as most of the
> postings here discuss much more practical issues :)
I am bored with cooling questions. However, it's quite time-consuming to
argue by email. I don't know how RGB can keep the distance :-)