[Beowulf] Re: Re: Home beowulf - NIC latencies

Rob Ross rross at mcs.anl.gov
Mon Feb 14 11:49:49 PST 2005


On Mon, 14 Feb 2005, Ashley Pittman wrote:

> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
> > If you used the non-blocking send to allow for overlapped communication, 
> > then you would like the implementation to play nicely.  In this case the 
> > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor 
> > thereof).
> >
> > If you used the non-blocking sends to post a bunch of communications that
> > you are going to then wait to complete, you probably don't care about the
> > CPU -- you just want the messaging done.  In this case the user will call 
> > MPI_Wait after posting everything it wants done.
> >
> > One way the implementation *could* behave is to assume the user is trying
> > to overlap comm. and comp. until it sees an MPI_Wait, at which point it
> > could go into this theoretical "burn CPU to make things go faster" mode.  
> > That mode could, for example, tweak the interrupt coalescing on an 
> > ethernet NIC to process packets more quickly (I don't know off the top of 
> > my head if that would work or not; it's just an example).
> 
> Maybe if you were using a channel interface (sockets) and all messages
> were to the same remote process then it might make sense to coalesce all
> the sends into a single transaction and just send this in the MPI_Wait
> call.  The latency for a bigger network transaction *might* be lower
> than the sum of the issue times for the smaller ones.

This is exactly what MPICH2 does for the one-sided calls; see Thakur et 
al. in EuroPVM/MPI 2004.  It can be a very big win in some situations.
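
Roughly, the access pattern where that coalescing pays off looks like the
sketch below (hypothetical buffer and function names, plain MPI-2 RMA).
An implementation is free to queue the puts locally and push them out as
one combined transaction at the closing fence:

    #include <mpi.h>

    /* Sketch only: several small MPI_Put operations inside one fence
     * epoch.  Nothing has to hit the wire until the second
     * MPI_Win_fence, so the implementation may batch them. */
    void update_remote(MPI_Win win, int target, double *vals, int n)
    {
        int i;

        MPI_Win_fence(0, win);                /* open the access epoch */
        for (i = 0; i < n; i++) {
            /* each put is tiny; it can simply be queued */
            MPI_Put(&vals[i], 1, MPI_DOUBLE,
                    target, i, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);                /* all puts complete here */
    }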

> I'd hope that a well-written application would bunch all its sends into
> a single larger block when this optimisation is possible, though.

We would hope that too, but applications do not always adhere to best 
practice.

> Given any reasonably fast network, however, not doing anything until the
> MPI_Wait call would destroy your latency.  It strikes me that this isn't
> overlapping comms and compute so much as artificially delaying comms to
> allow compute to finish, which seems rather pointless?

I agree that postponing progress until MPI_Wait for the purposes of 
providing lower CPU utilization would be pointless.  It can be useful for 
coalescing purposes, as mentioned above.  But certainly there will be a 
latency cost.
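
For contrast, the overlap case from my first paragraph above looks
something like the sketch below (the arithmetic is just a stand-in for
real computation): post the send, keep working on local data, and poke
the library with MPI_Test so it has a chance to make progress, blocking
in MPI_Wait only once the work runs out:

    #include <mpi.h>

    void send_with_overlap(double *msg, int count, int dest,
                           MPI_Comm comm, double *local, int n)
    {
        MPI_Request req;
        MPI_Status  status;
        int done = 0, i;

        MPI_Isend(msg, count, MPI_DOUBLE, dest, 0, comm, &req);

        for (i = 0; i < n; i++) {
            local[i] = local[i] * 2.0 + 1.0;    /* stand-in for real work */
            if (!done)
                MPI_Test(&req, &done, &status); /* let the library progress */
        }
        if (!done)
            MPI_Wait(&req, &status);            /* block only once out of work */
    }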

> If you had a bunch of sends to do to N remote processes then I'd expect
> you to post them in order (non-blocking) and wait for them all at the
> end; the time taken to do this should be (base_latency + ((N-1) * M)),
> where M is the reciprocal of the "issue rate".  You can clearly see
> here that even for a small number of batched sends (even a 2d/3d nearest
> neighbour matrix) the issue rate (that is, how little CPU the send call
> consumes) is at least as important as the raw latency.

Well I wasn't trying to start an argument about the importance of CPU
utilization as it relates to issue rate :).  The original question simply
asked if there was generally an advantage to doing what you expect people
to do anyway!  And I think that we agree the answer is yes.
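
For concreteness, the batched pattern you describe is roughly the sketch
below (nbr[] and nnbrs are hypothetical neighbour lists for a 2d/3d
decomposition; the matching receives are omitted).  The elapsed time is
about base_latency + ((N-1) * M), so for small N the per-call CPU cost of
MPI_Isend matters as much as the wire latency:

    #include <mpi.h>

    void exchange(double *buf, int count, int *nbr, int nnbrs,
                  MPI_Comm comm)
    {
        MPI_Request req[26];          /* enough for a 3d 26-neighbour stencil */
        MPI_Status  st[26];
        int i;

        for (i = 0; i < nnbrs; i++)
            MPI_Isend(buf, count, MPI_DOUBLE, nbr[i], 0, comm, &req[i]);

        /* ... matching irecvs and local compute would go here ... */

        MPI_Waitall(nnbrs, req, st);  /* everything completes here */
    }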

> > All of this is moot of course unless the implementation actually has more
> > than one algorithm that it could employ...
> 
> In my experience there are often dozens of different algorithms for
> every situation and each has their trade offs.  Choosing the right one
> based on the parameters given is the tricky bit.

Absolutely!  And which few of those dozens are applicable to a wide-enough 
range of situations that you want to actually implement/debug them?

Rob


