[Beowulf] cluster softwares supporting parallel CFD computing

Thu Sep 7 12:15:01 PDT 2006

Greg Lindahl <greg.lindahl at qlogic.com> writes:

> On Wed, Sep 06, 2006 at 11:10:14AM -0600, Eric W. Biederman wrote:
>
>> There is fundamentally more work to do when you take an interrupt because
>> you need to take a context switch.  But the cost of a context switch is on
>> the order of microseconds, so while measurable, taking an interrupt should
>> not dramatically affect your latency numbers.
>
> Unless, of course, your latency is a microsecond. In fact, our
> *overhead* for a single message is much less than 1 usec, so an
> interrupt per message would kill our message rate. And, finally, all
> the cpus can poll main memory in an embarrassingly parallel fashion,
> whereas interrupts involve OS contention.

I agree.  Taking an interrupt per message is clearly a loss.

Polling main memory is scalable, and if everyone is using a
different cache line and you have not overwhelmed your cache
coherence mechanism it is embarrassingly parallel.  

Cache coherency in general is a problem that scales, but
it is not embarrassingly parallel.  Shared cache lines, especially
with respect to writes, are difficult to handle.  I think I heard that
the worst case number on an Altix for a cache line fill was a
millisecond or so.

>> This is important as polling for new packets has a very significant
>> opportunity cost as it prevents you from getting any other work done
>> at the same time
>
> In most codes, the opportunity cost of polling is zero. But I can only
> speak from my experience, do you have any data which shows different?

Polling is a reasonable approach for short durations, say
<= 1 millisecond, but it is really weird to explain that you can tell an
MPI application has failed to receive a message because its cpu
utilization goes up.  Polling for seconds on end is a very rude thing
to do on a multitasking OS.
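
A minimal sketch of the "poll briefly, then sleep" idea, written in
Python for brevity rather than against any real interconnect or MPI
library.  The name hybrid_wait, the return strings, and the use of a
threading.Event to stand in for "a message has arrived" are all
assumptions made for illustration; only the 1 millisecond budget comes
from the text above.

```python
import threading
import time

SPIN_BUDGET_S = 0.001  # 1 ms polling budget, the threshold suggested above


def hybrid_wait(arrived, spin_budget_s=SPIN_BUDGET_S, block_timeout_s=None):
    """Wait for `arrived` (a threading.Event standing in for a message).

    Polls for at most `spin_budget_s`, then falls back to a blocking wait
    so the waiting thread stops burning cpu.  Returns "polled", "blocked",
    or "timeout" to show which path was taken (hypothetical labels).
    """
    deadline = time.perf_counter() + spin_budget_s
    while True:
        # Polling phase: cheap check, no sleep, lowest latency.
        if arrived.is_set():
            return "polled"
        if time.perf_counter() >= deadline:
            break
    # Blocking phase: the thread sleeps until it is woken or times out.
    if arrived.wait(block_timeout_s):
        return "blocked"
    return "timeout"
```

With something like this, a message that shows up within the polling
budget pays no wakeup cost, while a receiver that waits for seconds
looks idle to the rest of the system instead of busy.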

The problem, from what I can tell, is that latency is fundamental, and mostly
an artifact of the card implementation.  We are quickly reaching the
point where we won't be able to improve latency any more.  Possibly, now
that the cpu frequency ramp has stopped, the ratio of cpu frequency
to latency will stay fixed and it won't matter.

On the other hand, it is my distinct impression that the reason there is no
opportunity cost from polling is that the applications have not been
tuned as well as they could be.  In all other domains of programming
synchronous receives are seriously looked down upon.  I don't know why
that should not apply to MPI codes as well.
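
As a rough illustration of the alternative to a synchronous receive,
here is a sketch that posts the receive first, does local work while the
message is in flight, and only blocks when the data is actually needed.
In MPI terms this is the MPI_Irecv / MPI_Wait pattern; the Python below
only mimics that shape with a queue and a worker thread, and the names
overlapped_step, channel, and local_work are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def overlapped_step(channel, local_work, timeout_s=5.0):
    """Overlap a receive on `channel` (a queue.Queue) with `local_work()`.

    Returns (result_of_local_work, received_message).  Raises queue.Empty
    if nothing arrives within `timeout_s`.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Post the receive; this returns immediately with a future.
        pending = pool.submit(channel.get, True, timeout_s)
        # Useful computation proceeds while the message is in flight.
        local = local_work()
        # Block only at the point where the message is really needed.
        message = pending.result()
    return local, message
```

Whether there is any local work to overlap depends entirely on the
application, which is the tuning question raised above.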

For what it's worth, I do very much appreciate the low latency that ipath
provides.

My basic problem with the original statement was that it said
interrupts kill latency, when in fact I don't believe they make a
high performance interconnect anywhere near as bad as ethernet.
If used judiciously, I believe interrupts could improve
system throughput and avoid confusing everything else in the system
that assumes I/O bound applications sleep.

Eric