[Beowulf] Re: Re: Home beowulf - NIC latencies
ashley at quadrics.com
Mon Feb 14 13:22:19 PST 2005
On 14 Feb 2005, at 19:49, Rob Ross wrote:
> On Mon, 14 Feb 2005, Ashley Pittman wrote:
>> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
>> Maybe if you were using a channel interface (sockets) and all messages
>> were to the same remote process then it might make sense to coalesce
>> the sends into a single transaction and just send this in the MPI_Wait
>> call. The latency for a bigger network transaction *might* be lower
>> than the sum of the issue rates for smaller ones.
> This is exactly what MPICH2 does for the one-sided calls; see Thakur et
> al. in EuroPVM/MPI 2004. It can be a very big win in some situations.
I'll look it up. Presumably the win is because of higher bandwidth
achieved by larger messages over a stream. I guess the MPI_Fence call
copies data out of a receive buffer.
>> I'd hope that a well-written application would bunch all its sends into
>> a single larger block whenever this optimisation was possible, though.
> We would hope that too, but applications do not always adhere to best
> practices.
As someone who maintains an MPI library I hope people do this; it's up
to us to provide the functionality and up to application writers to
actually make use of it. There are often times when it may well not be
worth doing, either because of time-to-market demands or simply when
experimenting with differing algorithms.
>> Given any reasonably fast network, not doing anything until the wait
>> call would destroy your latency, however. It strikes me that this isn't
>> overlapping comms and compute, but rather artificially delaying comms
>> to allow compute to finish, which seems rather pointless?
> I agree that postponing progress until MPI_Wait for the purposes of
> providing lower CPU utilization would be pointless. It can be useful for
> coalescing purposes, as mentioned above. But certainly there will be a
> latency cost.
So potentially there is an optimization choice to be made: do you make
the "noddy" application run faster at the cost of real performance for
applications tuned to the particular library? That sounds like a whole
can of worms.
>>> All of this is moot of course unless the implementation actually has
>>> more than one algorithm that it could employ...
>> In my experience there are often dozens of different algorithms for
>> every situation and each has their trade offs. Choosing the right one
>> based on the parameters given is the tricky bit.
> Absolutely! And for which few of those dozens are there a wide enough
> range of situations that you want to actually implement/debug them?
Implement? Most of them. Debug/support? No more than two or three seems
optimal. There are some algorithms that just don't work on a given
network and some that will only be best in corner cases. Then it's
just a case of choosing the correct thresholds between the remaining
few. For a given call *best* is absolute; for a given application,
however, tradeoffs have to be made.