[Beowulf] hpcc MPIRandomAccess bug

Håkon Bugge Hakon.Bugge at scali.com
Tue Aug 23 06:50:30 PDT 2005

Running the latest version of hpcc, I have from time to time seen 
missing updates in the MPIRandomAccess benchmark. The benchmark 
tolerates up to 1% missing updates, so it "passes" as such. However, 
in a pure MPI implementation of this benchmark, missing updates are 
unacceptable, so I investigated the issue. It would be interesting 
to know whether anyone else has run into the same problem.

This applies to hpcc version 1.0, sub-benchmark MPIRandomAccess, with 
the define USE_MULTIPLE_RECV set (the default) and the define 
MAX_RECV greater than 1 (the default is 16).

Rudimentary explanation:

Each process sends buckets of updates to randomly chosen other 
processes, using the tag UPDATE_TAG. Once a process has sent all of 
its updates, it sends a message with FINISHED_TAG to every other 
process, to indicate that it has finished.

Each process has N posted MPI_Irecvs with ANY_SOURCE and ANY_TAG. At 
regular intervals, a process will check completion of any of the 
posted MPI_Irecvs by issuing an MPI_Testany. Most messages will 
contain the UPDATE_TAG, and the process will update its part of a 
global array accordingly. If the message selected by MPI_Testany 
contains FINISHED_TAG, the process will decrement a counter, 
initialized to the total number of processes minus one. Hence, when 
this counter becomes zero and this process has sent all updates, it 
has completed its work. It will then cancel its N outstanding 
MPI_Irecvs by calling MPI_Cancel+MPI_Wait N times.

This implementation of the algorithm contains a bug. Assume that 
"our" process has sent all of its updates and has received 
FINISHED_TAG from all but one other process. Assume N is 2. The 
oldest posted MPI_Irecv has matched the last message containing 
updates (i.e., the one using UPDATE_TAG) from this remote process. 
The newest posted MPI_Irecv has matched the last message (i.e., the 
one carrying FINISHED_TAG) sent by this remote process. Hence, the 
ordering between sends and receives is maintained.

Then our process calls MPI_Testany. This MPI call may pick *any* of 
the posted receives that have completed. It *might* pick the message 
carrying the FINISHED_TAG. If it does, our process thinks it is 
finished, since it has now received FINISHED_TAG from all remote 
processes. It will then cancel the posted MPI_Irecvs, including the 
one that matched the UPDATE_TAG message. Hence, an update (or a 
bucket of updates) is lost.

I will make a proposed work-around available shortly.

