[Beowulf] Re: Re: Home beowulf - NIC latencies
rross at mcs.anl.gov
Mon Feb 14 11:49:49 PST 2005
On Mon, 14 Feb 2005, Ashley Pittman wrote:
> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
> > If you used the non-blocking send to allow for overlapped communication,
> > then you would like the implementation to play nicely. In this case the
> > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor
> > thereof).
> > If you used the non-blocking sends to post a bunch of communications that
> > you are going to then wait to complete, you probably don't care about the
> > CPU -- you just want the messaging done. In this case the user will call
> > MPI_Wait after posting everything it wants done.
> > One way the implementation *could* behave is to assume the user is trying
> > to overlap comm. and comp. until it sees an MPI_Wait, at which point it
> > could go into this theoretical "burn CPU to make things go faster" mode.
> > That mode could, for example, tweak the interrupt coalescing on an
> > ethernet NIC to process packets more quickly (I don't know off the top of
> > my head if that would work or not; it's just an example).
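As a concrete (hypothetical) illustration of the kind of knob being described, Linux exposes ethernet interrupt coalescing through ethtool; the device name and values below are placeholders, and whether changing them actually lowers message latency depends entirely on the NIC and driver:

```shell
# Inspect the current interrupt-coalescing settings (eth0 is a placeholder).
ethtool -c eth0

# Favour latency over CPU efficiency: interrupt on (nearly) every frame.
# The rx-usecs/rx-frames values are illustrative, not recommendations.
ethtool -C eth0 rx-usecs 0 rx-frames 1
```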
> Maybe if you were using a channel interface (sockets) and all messages
> were to the same remote process then it might make sense to coalesce all
> the sends into a single transaction and just send this in the MPI_Wait
> call. The latency for a bigger network transaction *might* be lower
> than the sum of the per-message issue times for the smaller ones.
This is exactly what MPICH2 does for the one-sided calls; see Thakur et
al. in EuroPVM/MPI 2004. It can be a very big win in some situations.
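To make the coalescing argument concrete, here is a back-of-the-envelope model (the per-message overhead and per-byte costs are invented numbers for illustration, not measurements from MPICH2 or any real network): it compares issuing N small messages separately against packing them into one transaction sent at MPI_Wait time.

```python
# Toy cost model for message coalescing (illustrative numbers only).
# Each separate message pays a fixed per-message issue overhead plus a
# per-byte wire cost; a coalesced message pays the overhead only once.

PER_MSG_OVERHEAD_US = 5.0   # hypothetical per-message software overhead
PER_BYTE_US = 0.001         # hypothetical per-byte transmission cost

def cost_separate(sizes):
    """Total time if each message is sent as its own transaction."""
    return sum(PER_MSG_OVERHEAD_US + n * PER_BYTE_US for n in sizes)

def cost_coalesced(sizes):
    """Total time if all messages are packed into one transaction."""
    return PER_MSG_OVERHEAD_US + sum(sizes) * PER_BYTE_US

if __name__ == "__main__":
    sizes = [64] * 16  # sixteen 64-byte messages to the same peer
    print(f"separate:  {cost_separate(sizes):.2f} us")   # 81.02 us
    print(f"coalesced: {cost_coalesced(sizes):.2f} us")  # 6.02 us
```

Under this model the win grows with the number of small messages to the same peer, since the fixed overhead is amortized; it says nothing about the latency cost of *waiting* to coalesce, which is the trade-off discussed below.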
> I'd hope that a well-written application would bunch all its sends into
> a single larger block when possible, though, if this optimisation were
> available.
We would hope that too, but applications do not always adhere to best
practices.
> Given any reasonably fast network, however, not doing anything until the
> MPI_Wait call would destroy your latency. It strikes me that this isn't
> overlapping comms and compute so much as artificially delaying comms
> until the compute has finished, which seems rather pointless.
I agree that postponing progress until MPI_Wait for the purposes of
providing lower CPU utilization would be pointless. It can be useful for
coalescing purposes, as mentioned above. But certainly there will be a
latency cost to delaying the sends.
> If you had a bunch of sends to do to N remote processes then I'd expect
> you to post them in order (non-blocking) and wait for them all at the
> end. The time taken to do this should be (base_latency + ((N-1) * M)),
> where M is the reciprocal of the "issue rate". You can clearly see
> here that even for a small number of batched sends (even a 2d/3d nearest
> neighbour matrix) the issue rate (that is, how little CPU the send call
> consumes) is at least as important as the raw latency.
Well I wasn't trying to start an argument about the importance of CPU
utilization as it relates to issue rate :). The original question simply
asked if there was generally an advantage to doing what you expect people
to do anyway! And I think that we agree the answer is yes.
> > All of this is moot of course unless the implementation actually has more
> > than one algorithm that it could employ...
> In my experience there are often dozens of different algorithms for
> every situation, and each has its trade-offs. Choosing the right one
> based on the parameters given is the tricky bit.
Absolutely! And which few of those dozens are applicable to a wide-enough
range of situations that you want to actually implement/debug them?