[Beowulf] Re: Re: Home beowulf - NIC latencies

Wed Feb 16 05:12:23 PST 2005

At 06:28 16-2-2005 -0500, Patrick Geoffray wrote:
>Rossen,
>
>Rossen Dimitrov wrote:
>> 
>>>
>>> So if you run an MPI application and it sucks, this is because the 
>>> application is poorly written ?
>> 
>> 
>> Patrick, here the argument is about whether and how you "measure" the 
>> "performance of MPI". I guess you may have missed some of the preceding 
>> postings.
>
>No, I was pulling your leg :-) The bigger picture is that MPI has no 
>performance in itself; it's middleware. You can only measure the way 
>an MPI implementation enables a specific application to perform. Only 
>benchmarking of applications is meaningful; you can argue that 
>everything else is futile and bogus.

A problem with MPI compared to DSM-style parallelism has been described
very well by Chrilly Donninger with respect to his chess program Hydra, which
runs MPI on a few nodes:

For every write:

MPI_Isend(....);
MPI_Test(&Reg, &flg, &Stat);
while (!flg) {
    // Important: read in messages and process them while waiting for
    // completion. Otherwise our own input buffer can overflow
    // and we get a deadlock.
    Hydra_MsgPending();
    MPI_Test(&Reg, &flg, &Stat);
}

The above is simply dead slow and delays the software.

In a DSM model like Quadrics you don't have all these delays.

Can the memory on a Myri card (4MB and 8MB in the $1500 version) be used to
write directly to the RAM on a remote network card?

If so, which library can I download for that for Myri cards?

Thanks in advance,
Vincent

>>> You don't want to benchmark an application to evaluate MPI, you want 
>>> to benchmark an application to find the best set of resources to get 
>>> the job done. If the code stinks, it's not an excuse. Good MPI 
>>> implementations are good with poorly written applications, but still 
>>> let smart people do smart things if they want.
>> 
>> 
>> This is exactly my point made in my previous posting - you cannot design 
>> a system that is optimal in a single mode for all cases of its use when 
>> there are multiple parameters defining the usage and performance 
>
>I agree completely. Being able to apply different assumptions for the 
>whole code and see which one best matches the application's behavior is 
>better than nothing. However, I believe that some tradeoffs are just too 
>intrusive: you should not have to choose between low latency for small 
>messages or progress by interrupt for large ones, especially when you 
>can have both at the same time.
>
>> I think it is fairly easy to show that overlapping and polling (or any 
>> kind of communication completion synchronization) are not orthogonal. If 
>> this was the case, you would see codes that show perfect overlapping 
>> running on any MPI implementation/network pair. I am sure there is 
>> plenty of evidence this is not the case.
>
>I can show you codes where people sprinkled some MPI_Test()s in some 
>loops. They don't poll to death, just a little from time to time to 
>improve overlap by improving progression. They poll and they overlap. 
>They could as well block and not overlap. Polling/blocking and 
>overlap/no overlap are not linked. Interrupts are useful to get overlap 
>without help from the application, but they are not required to overlap.
>
>> There is an important point here that needs to be clarified: when I say 
>> "polling" library, I assume that this library does both: polling 
>> completion synchronization and polling progress. There is not much room 
>> to define these here, but I am sure MPI developers know what they are.
>
>I think this is where we don't understand each other. For me, polling 
>means no interrupts. Whether you progress in the context of MPI calls 
>or in the context of a progression thread, you pay for the same CPU 
>cycles. If the application is providing CPU cycles to the MPI lib at the 
>right time, you can overlap perfectly without wasting cycles.
>
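As a rough illustration of the point about handing cycles to the library
from inside the compute loop, here is a toy C model. It is not MPI:
toy_request, poll_progress and compute_with_polls are hypothetical names,
and the "transfer" simply advances one unit per poll, standing in for a
library that only makes progress when the application calls into it (as
with an occasional MPI_Test).

```c
#include <assert.h>

/* Hypothetical stand-in for a polling-only messaging library: the transfer
   advances by one unit each time the application polls, and never
   otherwise (no interrupts, no progress thread). */
typedef struct {
    int units_left;  /* work remaining before the transfer completes */
    int polls;       /* polls issued, i.e. calls spent inside the library */
} toy_request;

/* Returns 1 once the transfer is complete, like the flag from MPI_Test. */
static int poll_progress(toy_request *req)
{
    req->polls++;
    if (req->units_left > 0)
        req->units_left--;
    return req->units_left == 0;
}

/* Run `iters` compute iterations, polling once every `poll_every`
   iterations until the request completes. Returns the iteration at which
   the transfer finished, or -1 if it was still pending at the end. */
int compute_with_polls(toy_request *req, int iters, int poll_every)
{
    assert(iters >= 0 && poll_every >= 1);
    int done_at = -1;
    for (int i = 1; i <= iters; i++) {
        /* ... one slice of application work would go here ... */
        if (done_at < 0 && i % poll_every == 0 && poll_progress(req))
            done_at = i;
    }
    return done_at;
}
```

With 4 units of transfer and a poll every 10 iterations, the transfer
completes at iteration 40 of 100 after only 4 polls. Polling every 50
iterations leaves it unfinished when the loop ends, so the remaining
transfer is no longer hidden behind computation.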
>> Here is a third one. Writing your code for overlapping with non-blocking 
>> MPI calls and segmentation/pipelining, testing the code, and not seeing 
>> any benefit of it.
>
>Yes. This is very true. But if it's not worse than with blocking, they 
>should stick with non-blocking, even if it's bigger and more confusing.
>
>> stage I with communication in stage I+1. Then, there is the question of how 
>> many segments you use to break up the message for maximum speedup. 
>> Pipelining theory says the more segments the better, provided they are of 
>> equal duration, there are no inter-stage dependencies, and the 
>> stage setup time is low in proportion to the stage execution time. Also, 
>
>The more steps, the more overhead. Small pipeline stages decrease your 
>startup overhead (when the second stage is empty) but increase the 
>number of segments and the total cost of the pipeline. The best is to 
>find a piece of computation long enough to hide the communication. 
>Pipelining would be overkill in my opinion.
>
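The tradeoff between startup cost and per-segment overhead can be put
into a small cost model. This is only a sketch under assumed conditions:
a two-stage pipeline (transfer, then compute), equal-size segments, and a
fixed setup cost per segment on the communication side. The function name
and parameters are made up for illustration and do not come from any
library.

```c
#include <assert.h>

/* Toy cost model, not a measurement: a message needing t_comm of transfer
   time and t_comp of computation is split into n equal segments, and each
   segment pays a fixed setup cost `overhead` before it is transferred. */
double pipelined_time(double t_comm, double t_comp, double overhead, int n)
{
    assert(n >= 1);
    double comm = t_comm / n + overhead;  /* stage 1: transfer one segment */
    double comp = t_comp / n;             /* stage 2: compute on one segment */
    double slower = comm > comp ? comm : comp;
    /* Fill the pipeline with the first transfer, run n-1 steps at the pace
       of the slower stage, then drain with the last compute. */
    return comm + (n - 1) * slower + comp;
}
```

With t_comm = t_comp = 8 and overhead = 1, the model gives 17 for one
segment, 14 for two or four segments, and 17 again for eight: the shorter
startup stops paying for the added per-segment overhead.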
>> The metric I mentioned earlier "degree of overlapping" with some 
>> additional analysis can help designers _predict_ whether the design is 
>> good or not and whether it will work well or not on a particular system 
>> of interest (including the MPI library).
>
>Temporal dependency between buffers and computation is the metric for 
>overlapping. The longer you don't need a buffer, the better you can 
>overlap a communication to/from it. Compilers could know that.
>
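One hypothetical way to put a number on the buffer-idle-time argument,
not a definition taken from this thread: compare how long the buffer
stays untouched after the transfer is posted with how long the transfer
takes. The sketch assumes the transfer keeps progressing during that
whole window; the function name is invented for illustration.

```c
#include <assert.h>

/* Hypothetical metric: the fraction of a transfer that can be hidden
   behind computation when the buffer is not touched for `idle_window`
   time units after the transfer is posted. Assumes the transfer makes
   progress throughout the window. */
double hidden_fraction(double idle_window, double t_comm)
{
    assert(idle_window >= 0.0 && t_comm > 0.0);
    double hidden = idle_window < t_comm ? idle_window : t_comm;
    return hidden / t_comm;
}
```

A buffer left alone for 2 time units during an 8-unit transfer hides a
quarter of it; any window of 8 units or more hides all of it.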
>> This is, however, too much detail for this forum, as most of the 
>> postings here discuss much more practical issues :)
>
>I am bored with cooling questions. However, it's quite time consuming to 
>argue by email. I don't know how RGB can keep the distance :-)
>
>Patrick
>-- 
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com