[Beowulf] hpcc MPIRandomAccess bug
Håkon Bugge
Hakon.Bugge at scali.com
Tue Aug 23 06:50:30 PDT 2005
Running the latest version of hpcc, I have from time to time seen
missing updates in the MPIRandomAccess benchmark. The benchmark
tolerates up to 1% missing updates, so it still "passes". However,
for a pure MPI implementation of this benchmark, missing updates
are unacceptable, so I investigated the issue. It would be
interesting to know whether anyone else has seen the same.
This applies to hpcc version 1.0, sub-benchmark MPIRandomAccess with
the define USE_MULTIPLE_RECV set (default set) and the define
MAX_RECV greater than 1 (default 16).
Rudimentary explanation:
Each process sends updates (in buckets) to randomly chosen other
processes, using the tag UPDATE_TAG. Once a process has sent all its
updates, it sends a message with the tag FINISHED_TAG to every other
process to indicate that it has finished.
Each process has N posted MPI_Irecvs with ANY_SOURCE and ANY_TAG. At
regular intervals, a process will check completion of any of the
posted MPI_Irecvs by issuing an MPI_Testany. Most messages will
contain the UPDATE_TAG, and the process will update its part of a
global array accordingly. If the message selected by MPI_Testany
contains FINISHED_TAG, the process will decrement a counter,
initialized to the total number of processes minus one. Hence, when
this counter becomes zero and this process has sent all updates, it
has completed its work. It will then cancel its N outstanding
MPI_Irecvs by calling MPI_Cancel+MPI_Wait N times.
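The receive side described above can be modelled without MPI in a few
lines of Python. This is only a sketch of the loop as I have described
it, not the hpcc source: the function name, the (index, value) payload
format and the dict standing in for the local table are made up for
illustration, and it assumes completed receives are always handled
oldest first.

```python
UPDATE_TAG, FINISHED_TAG = 1, 2   # names from the post; numeric values are arbitrary

def receive_loop(arrivals, num_procs, n_posted, table):
    """Model one process that has already sent all of its own updates.

    arrivals : (tag, payload) messages in arrival order
    n_posted : N, the number of receives kept posted
    table    : dict standing in for this process's part of the global array
    """
    arrivals = list(arrivals)
    posted = [None] * n_posted        # oldest first; None = not yet matched
    remaining = num_procs - 1         # FINISHED_TAG messages still expected

    while remaining > 0:
        # Arrived messages match unmatched receives in posting order.
        for i in range(len(posted)):
            if posted[i] is None and arrivals:
                posted[i] = arrivals.pop(0)
        done = [i for i, msg in enumerate(posted) if msg is not None]
        if not done:
            break                     # nothing has completed in this model
        # Simplifying assumption: take the oldest completed receive.
        tag, payload = posted.pop(done[0])
        posted.append(None)           # re-post the receive
        if tag == UPDATE_TAG:
            index, value = payload    # hypothetical payload layout
            table[index] = table.get(index, 0) ^ value   # XOR-style update
        else:
            remaining -= 1
    # Whatever is still in `posted` here is what the real code would
    # clean up with MPI_Cancel + MPI_Wait.
    return table

table = {}
msgs = [(UPDATE_TAG, (3, 0b101)), (UPDATE_TAG, (3, 0b001)), (FINISHED_TAG, None)]
receive_loop(msgs, num_procs=2, n_posted=2, table=table)
```

With the oldest-first assumption, both updates are applied before the
FINISHED_TAG message is seen, so nothing is left behind in the posted
receives when the loop exits.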
This implementation of the algorithm contains a bug. Assume that
"our" process has sent all updates, and has received FINISHED_TAG
from all but one other process. Assume N is 2. The "oldest" posted
MPI_Irecv has matched the last message containing updates (i.e. the
one using UPDATE_TAG) from the remaining remote process. The
"youngest" posted MPI_Irecv has matched the last message sent by that
process (i.e. the one using FINISHED_TAG). Hence, ordering between
sends and receives is maintained.
Then our process calls MPI_Testany. This call may pick *any* of the
posted receives that have completed. It *might* pick the message
with FINISHED_TAG. If that is the case, our process thinks it is
finished, since it has now received FINISHED_TAG from all remote
processes. It will then cancel the posted MPI_Irecvs, including the
one that has already matched the UPDATE_TAG message. Hence, an update
(or a bucket of updates) will be lost.
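To make the scenario concrete, here is a small Python model of just
that final state (N = 2, both receives already completed). It is not
MPI code. The pick argument is a hypothetical stand-in for the choice
MPI_Testany is free to make among completed requests, and the helper
names are invented for this sketch.

```python
UPDATE_TAG, FINISHED_TAG = 1, 2   # names from the post; numeric values are arbitrary

def finish_up(completed, pending_finished, pick):
    """Process completed receives until all FINISHED_TAGs have been seen.

    completed        : (tag, payload) held by completed receives, oldest first
    pending_finished : FINISHED_TAG messages still expected
    pick             : given the number of completed receives, returns the
                       index of the one to handle (stand-in for MPI_Testany)
    Returns (applied, lost) payload lists.
    """
    completed = list(completed)
    applied = []
    while completed and pending_finished > 0:
        tag, payload = completed.pop(pick(len(completed)))
        if tag == UPDATE_TAG:
            applied.append(payload)
        else:
            pending_finished -= 1
    # The process now believes it is done and cancels everything that is
    # still posted, including receives that already hold an update.
    lost = [payload for tag, payload in completed if tag == UPDATE_TAG]
    return applied, lost

def oldest_first(n):
    return 0

def youngest_first(n):
    return n - 1

# Oldest receive holds the last bucket, youngest holds FINISHED_TAG.
state = [(UPDATE_TAG, "last bucket"), (FINISHED_TAG, None)]

applied_ok, lost_ok = finish_up(state, 1, oldest_first)
applied_bad, lost_bad = finish_up(state, 1, youngest_first)
```

Both choices are legal for MPI_Testany. Only the second one, where
the FINISHED_TAG message is handled first, leaves the last bucket in
a receive that is then cancelled.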
A proposed work-around will shortly be available from me.
-h