[Beowulf] Re: Re: Home beowulf - NIC latencies
Vincent Diepeveen
diep at xs4all.nl
Wed Feb 16 05:12:23 PST 2005
At 06:28 16-2-2005 -0500, Patrick Geoffray wrote:
>Rossen,
>
>Rossen Dimitrov wrote:
>>
>>>
>>> So if you run an MPI application and it sucks, this is because the
>>> application is poorly written?
>>
>>
>> Patrick, here the argument is about whether and how you "measure" the
>> "performance of MPI". I guess you may have missed some of the preceding
>> postings.
>
>No, I was pulling your leg :-) The bigger picture is that MPI has no
>performance in itself; it's middleware. You can only measure the way
>an MPI implementation enables a specific application to perform. Only
>benchmarking of applications is meaningful; you can argue that
>everything else is futile and bogus.
A problem of MPI compared to DSM-type forms of parallelism has been
described very well by Chrilly Donninger with respect to his chess program
Hydra, which runs MPI on a few nodes:
For every write:

  MPI_Isend(....);
  MPI_Test(&Reg, &flg, &Stat);
  while (!flg) {
      /* Important: read in messages and process them while waiting for
         completion. Otherwise our own input buffer can overflow and we
         get a deadlock. */
      Hydra_MsgPending();
      MPI_Test(&Reg, &flg, &Stat);
  }
The above is simply dead slow and delays the software.
In a DSM model like Quadrics you don't have all these delays.
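(As an aside, here is a rough sketch of what the same write could look
like with MPI-2 one-sided communication, the closest standard equivalent
to such a DSM-style remote write. This is not Hydra's actual code; the
buffer names, sizes and the assumption that the window was created at
startup are made up for illustration:

  #include <mpi.h>

  double remote_buf[1024];   /* window memory exposed by every rank */
  double my_data[1024];      /* local data to push */

  void dsm_style_write(int target_rank, MPI_Win win)
  {
      /* The window would normally be created once at startup with
         MPI_Win_create(remote_buf, sizeof(remote_buf), sizeof(double),
                        MPI_INFO_NULL, MPI_COMM_WORLD, &win); */
      MPI_Win_fence(0, win);
      /* One-sided put: no matching receive on the target, no MPI_Test
         loop on the sender, no input buffer of our own to overflow. */
      MPI_Put(my_data, 1024, MPI_DOUBLE, target_rank,
              0 /* displacement */, 1024, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);
  }

The fences are collective, so this is still not as cheap as a true
hardware put, but the sending side never has to spin on MPI_Test.)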
Can the memory on the Myrinet card (4 MB, or 8 MB in the $1500 version) be
used to write directly to the RAM on a remote network card?
If so, which library can I download for that for Myrinet cards?
Thanks in advance,
Vincent
>>> You don't want to benchmark an application to evaluate MPI, you want
>>> to benchmark an application to find the best set of resources to get
>>> the job done. If the code stinks, it's not an excuse. Good MPI
>>> implementations are good with poorly written applications, but still
>>> let smart people do smart things if they want.
>>
>>
>> This is exactly my point made in my previous posting - you cannot design
>> a system that is optimal in a single mode for all cases of its use when
>> there are multiple parameters defining the usage and performance
>
>I agree completely, being able to apply different assumptions for the
>whole code and see which one best matches the application's behavior is
>better than nothing. However, I believe that some tradeoffs are just too
>intrusive: you should not have to choose between low latency for small
>messages or progress by interrupt for large ones, especially when you
>can have both at the same time.
>
>> I think it is fairly easy to show that overlapping and polling (or any
>> kind of communication completion synchronization) are not orthogonal. If
>> this was the case, you would see codes that show perfect overlapping
>> running on any MPI implementation/network pair. I am sure there is
>> plenty of evidence this is not the case.
>
>I can show you codes where people sprinkled some MPI_Test()s in some
>loops. They don't poll to death, just a little from time to time to
>improve overlap by improving progression. They poll and they overlap.
>They could just as well block and not overlap. Polling/blocking and
>overlapping/not overlapping are not linked. Interrupts are useful to get
>overlap without help from the application, but they are not required to
>overlap.
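(To make that concrete, here is a minimal sketch of "sprinkling
MPI_Test()s"; buf, count, dest, tag, n_chunks and compute_chunk() are
placeholders, not from any particular application:

  MPI_Request req;
  MPI_Status  stat;
  int         done = 0, i;

  MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);

  for (i = 0; i < n_chunks; i++) {
      compute_chunk(i);               /* useful work hides the transfer */
      if (!done && (i % 16 == 0))     /* poll a little, from time to time */
          MPI_Test(&req, &done, &stat);
  }
  if (!done)
      MPI_Wait(&req, &stat);          /* guarantee completion at the end */

The occasional MPI_Test gives the library a chance to make progress
without spinning on it, so the send overlaps the computation.)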
>
>> There is an important point here that needs to be clarified: when I say
>> "polling" library, I assume that this library does both: polling
>> completion synchronization and polling progress. There is not much room
>> to define these here, but I am sure MPI developers know what they are.
>
>I think this is where we don't understand each other. For me, polling
>means no interrupts. Whether you progress in the context of MPI calls
>or in the context of a progression thread, you pay the same CPU
>cycles. If the application is providing CPU cycles to the MPI lib at the
>right time, you can overlap perfectly without wasting cycles.
>
>> Here is a third one. Writing your code for overlapping with non-blocking
>> MPI calls and segmentation/pipelining, testing the code, and not seeing
>> any benefit of it.
>
>Yes. This is very true. But if it's not worse than with blocking, they
>should stick with non-blocking, even if it's bigger and more confusing.
>
>> stage I with communication in stage I+1. Then, there is the question how
>> many segments you use to break up the message for maximum speedup. The
>> pipelining theory says the more stages you can get the better, provided
>> the stages are of equal duration, there are no inter-stage dependencies,
>> and the stage setup time is low in proportion to the stage execution
>> time. Also,
>
>The more steps, the more overhead. Small pipeline stages decrease your
>startup overhead (when the second stage is empty) but increase the
>number of segments and the total cost of the pipeline. The best is to
>find a piece of computation long enough to hide the communication.
>Pipelining would be overkill in my opinion.
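(A rough sketch of the segmented pipeline being discussed; NSEG, seg_len,
big_buf, dest, tag and produce_segment() are placeholder names, and four
segments is just an example:

  #define NSEG 4
  MPI_Request req[NSEG];
  int s;

  for (s = 0; s < NSEG; s++) {
      produce_segment(&big_buf[s * seg_len], seg_len);    /* compute stage */
      MPI_Isend(&big_buf[s * seg_len], seg_len, MPI_DOUBLE,
                dest, tag + s, MPI_COMM_WORLD, &req[s]);  /* comm stage */
  }
  MPI_Waitall(NSEG, req, MPI_STATUSES_IGNORE);

Each MPI_Isend returns immediately, so sending segment s overlaps with
computing segment s+1; more segments mean finer overlap but more
per-message overhead, which is exactly the tradeoff described above.)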
>
>> The metric I mentioned earlier "degree of overlapping" with some
>> additional analysis can help designers _predict_ whether the design is
>> good or not and whether it will work well or not on a particular system
>> of interest (including the MPI library).
>
>Temporal dependency between buffers and computation is the metric for
>overlapping. The longer you don't need a buffer, the better you can
>overlap a communication to/from it. Compilers could know that.
>
>> This is however too much detail for this forum though, as most of the
>> postings here discuss much more practical issues :)
>
>I am bored with cooling questions. However, it's quite time consuming to
>argue by email. I don't know how RGB can go the distance :-)
>
>Patrick
>--
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf