[Beowulf] Re: Re: Home beowulf - NIC latencies

Mon Feb 14 13:22:19 PST 2005

On 14 Feb 2005, at 19:49, Rob Ross wrote:
> On Mon, 14 Feb 2005, Ashley Pittman wrote:
>> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
>>
>> Maybe if you were using a channel interface (sockets) and all messages
>> were to the same remote process then it might make sense to coalesce all
>> the sends into a single transaction and just send this in the MPI_Wait
>> call.  The latency for a bigger network transaction *might* be lower
>> than the sum of the issue rates for smaller ones.
>
> This is exactly what MPICH2 does for the one-sided calls; see Thakur et al.
> in EuroPVM/MPI 2004.  It can be a very big win in some situations.

I'll look it up.  Presumably the win is because of higher bandwidth 
achieved by larger messages over a stream.  I guess the MPI_Win_fence call 
copies data out of a receive buffer.
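
A rough sketch of the arithmetic behind that guess, assuming a simple latency-plus-bandwidth cost per transaction and assuming the small sends are issued one after another with no pipelining. The function names and the figures (50 us per transaction, 100 MB/s) are made up for illustration and are not measurements of any network or of MPICH2.

```python
def transfer_time(nbytes, latency_s, bandwidth_Bps):
    """Time for one network transaction: fixed latency plus payload time."""
    return latency_s + nbytes / bandwidth_Bps


def separate_sends(n_msgs, msg_bytes, latency_s, bandwidth_Bps):
    # Each small message pays the per-transaction latency on its own.
    return n_msgs * transfer_time(msg_bytes, latency_s, bandwidth_Bps)


def coalesced_send(n_msgs, msg_bytes, latency_s, bandwidth_Bps):
    # All payloads travel in a single transaction, so latency is paid once.
    return transfer_time(n_msgs * msg_bytes, latency_s, bandwidth_Bps)


# Hypothetical figures, chosen only to make the effect visible.
LATENCY = 50e-6      # seconds per transaction
BANDWIDTH = 100e6    # bytes per second

if __name__ == "__main__":
    for n in (1, 8, 64):
        s = separate_sends(n, 64, LATENCY, BANDWIDTH)
        c = coalesced_send(n, 64, LATENCY, BANDWIDTH)
        print(f"{n:3d} msgs: separate {s * 1e6:8.1f} us, "
              f"coalesced {c * 1e6:8.1f} us")
```

Under this model the two are identical for a single message and the gap grows with the number of small messages, since the payload time is the same and only the latency term is multiplied.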

>> I'd hope, though, that a well-written application would bunch all its sends
>> into a single larger block whenever this optimisation was possible.
>
> We would hope that too, but applications do not always adhere to best
> practice.

As someone who maintains an MPI library I hope people do this; it's up 
to us to provide the functionality and to application writers to actually 
make use of it.  There are often times when it may well not be worth 
doing this, either because of time-to-market demands or simply when 
experimenting with differing algorithms.

>> Given any reasonably fast network, however, not doing anything until the
>> MPI_Wait call would destroy your latency.  It strikes me that this isn't
>> overlapping comms and compute but rather artificially delaying comms
>> to allow compute to finish, which seems rather pointless?
>
> I agree that postponing progress until MPI_Wait for the purposes of
> providing lower CPU utilization would be pointless.  It can be useful for
> coalescing purposes, as mentioned above.  But certainly there will be a
> latency cost.

So potentially there is an optimization choice to be made: do you make 
the "noddy" application run faster at the cost of real performance for 
applications tuned to the particular library?  That sounds like a whole 
can of worms.
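
To put rough numbers on that choice, here is a toy model of the two strategies. It assumes the same latency-plus-bandwidth cost per transaction, assumes the network makes progress without CPU help in the eager case, and uses invented figures throughout; `eager` and `deferred` are hypothetical names, not anything from a real MPI library.

```python
def msg_time(nbytes, latency_s=50e-6, bandwidth_Bps=100e6):
    """Cost of one transaction; the default figures are hypothetical."""
    return latency_s + nbytes / bandwidth_Bps


def eager(compute_s, msg_sizes):
    # Every send goes on the wire as soon as it is posted, one transaction
    # each, and overlaps with the compute phase.
    comm = sum(msg_time(b) for b in msg_sizes)
    return max(compute_s, comm)


def deferred(compute_s, msg_sizes):
    # Nothing is sent until the wait call; all payloads are then coalesced
    # into a single transaction that starts after compute has finished.
    return compute_s + msg_time(sum(msg_sizes))


if __name__ == "__main__":
    tuned = (0.010, [1_000_000])   # one large send overlapped with 10 ms of work
    noddy = (10e-6, [8] * 1000)    # a thousand tiny sends, almost no work
    for name, (compute, sizes) in (("tuned", tuned), ("noddy", noddy)):
        print(f"{name}: eager {eager(compute, sizes) * 1e3:.3f} ms, "
              f"deferred {deferred(compute, sizes) * 1e3:.3f} ms")
```

With these made-up figures the tuned case finishes sooner with eager sends and the noddy case finishes sooner with deferred, coalesced sends, which is the tension described above.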

>>> All of this is moot of course unless the implementation actually has more
>>> than one algorithm that it could employ...
>>
>> In my experience there are often dozens of different algorithms for
>> every situation and each has its trade-offs.  Choosing the right one
>> based on the parameters given is the tricky bit.
>
> Absolutely!  And which few of those dozens are applicable to a wide-enough
> range of situations that you want to actually implement/debug them?

Implement? Most of them. Debug/support? No more than two or three seems 
optimal.  There are some algorithms that just don't work on a given 
network and some that will only be best in corner cases.  Then it's 
just a case of choosing the correct thresholds between the remaining 
few.  For a given call *best* is absolute; for a given application, 
however, tradeoffs have to be made.
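
As a minimal sketch of what choosing thresholds between the remaining few can look like for something like a broadcast: the algorithm labels, the cut-off values and the `choose_broadcast` function are all hypothetical, and real cut-offs would have to come from benchmarking on the network in question.

```python
# Hypothetical thresholds; real values would be tuned per network.
SMALL_MSG_BYTES = 12 * 1024
LARGE_MSG_BYTES = 512 * 1024
MIN_PROCS_FOR_SCATTER = 8


def choose_broadcast(nbytes, nprocs):
    """Pick one of three supported algorithms from the call parameters."""
    # Short messages or few processes: latency dominates, so use a tree.
    if nbytes < SMALL_MSG_BYTES or nprocs < MIN_PROCS_FOR_SCATTER:
        return "binomial_tree"
    # Medium messages: split the payload, then reassemble in log steps.
    if nbytes < LARGE_MSG_BYTES:
        return "scatter_then_doubling_allgather"
    # Large messages: bandwidth dominates, so reassemble around a ring.
    return "scatter_then_ring_allgather"


if __name__ == "__main__":
    for size in (1024, 64 * 1024, 4 * 1024 * 1024):
        print(size, choose_broadcast(size, nprocs=64))
```

Keeping the table this small is deliberate: each extra branch is another algorithm to debug and support.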

Ashley,