[Beowulf] Questions regarding interconnects

Greg Lindahl lindahl at pathscale.com
Fri Mar 25 17:58:18 PST 2005

On Fri, Mar 25, 2005 at 11:39:15PM +0100, Vincent Diepeveen wrote:

> What i do know is that you always first must avoid buffer overrun in MPI.
> That means for example:
>   MPI_Isend(....)
>   MPI_Test(&Reg,&flg,&Stat)
>   while(!flg) {
>     Myprogram_MsgPending();  // Important, read in messages and process them
>                              // while waiting on complete. Otherwise the own
>                              // Input-Buffer can overflow and we get a deadlock.
>     MPI_Test(&Reg,&flg,&Stat);
>   }


This code is simply mistaken. When you do an MPI_Isend(), there's no
reason to immediately do MPI_Test(). Just wait until you actually want
to reuse the buffer you used in the Isend(), and then do the
MPI_Test(), which will likely succeed. A true flag from MPI_Test() does
NOT mean the other side has received the message! It just means that
the buffer can be reused. The message may not even be in flight; it
might be queued somewhere on the sender.

In most implementations, for short outgoing messages, the MPI_Test()
will always be true, because the send will be done immediately in
MPI_Isend(). This is called an "eager send".

> All i want to do is get the job modified of a remote running processor as
> fast as possible. So i want to write in one of its arrays say 4-64 bytes at
> most.

Sounds like a job for message passing!

If I do this with message passing, the remote running program
occasionally checks to see if it has data. If so, it puts it in the
right place, and starts using it.

If I do this using the Cray SHMEM method, you seem to think you can
just drop the data in the right place and everything will be faster.
But that's not true. I have to worry about the program attempting to
use the data in the destination location when only part of the message
has arrived. To avoid that, I need some way of indicating that all of
the data has arrived, and atomically using either the old or the new
data. In SHMEM, the way you express this is to have the data arrive
into a buffer (using a PUT) and then do a second PUT to a flag to
indicate that the data has all arrived. Then the recipient checks the
flag, and if it's set, copies the data from the buffer to its final
location. Then it can use it. This is the same work as the MPI case,
just done in a different way.
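
For illustration, the buffer-plus-flag handshake described above can be
sketched in plain C. This is not SHMEM code: two functions in one
address space stand in for the remote PUTs, C11 release/acquire atomics
stand in for whatever ordering the interconnect provides, and the names
landing_zone, put_update and poll_update are made up for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAYLOAD 64  /* the thread talks about updates of 4-64 bytes */

/* Hypothetical landing area on the recipient: staging buffer plus flag. */
typedef struct {
    unsigned char staging[MAX_PAYLOAD];
    size_t        len;
    atomic_bool   ready;  /* set only after the whole payload has landed */
} landing_zone;

void landing_zone_init(landing_zone *lz)
{
    lz->len = 0;
    atomic_init(&lz->ready, false);
}

/* Sender side: the first "PUT" writes the data, the second sets the flag. */
void put_update(landing_zone *lz, const void *src, size_t len)
{
    assert(len <= MAX_PAYLOAD);
    memcpy(lz->staging, src, len);  /* stand-in for the data PUT */
    lz->len = len;
    /* stand-in for the flag PUT; release makes the payload visible first */
    atomic_store_explicit(&lz->ready, true, memory_order_release);
}

/* Recipient side: if a complete update is waiting, copy it to its final
 * location, clear the flag, and return true. Otherwise return false and
 * leave dest untouched, so the caller keeps using the old data. */
bool poll_update(landing_zone *lz, void *dest, size_t dest_cap)
{
    if (!atomic_load_explicit(&lz->ready, memory_order_acquire))
        return false;
    size_t n = lz->len < dest_cap ? lz->len : dest_cap;
    memcpy(dest, lz->staging, n);
    atomic_store_explicit(&lz->ready, false, memory_order_release);
    return true;
}
```

A recipient would call poll_update() from its main loop and switch to
the new data only when it returns true. The sketch assumes the sender
does not start another update until the flag has been cleared;
otherwise the staging buffer could be overwritten in the middle of the
copy. In real SHMEM code the two PUTs also need an ordering guarantee
between them, for which OpenSHMEM provides shmem_fence().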

In short, one-sided messaging does not help synchronization. What it
does do is occasionally get rid of a copy. But people who haven't yet
written a one-sided program always imagine it will make
synchronization easier.

In any case, it's not trivial for a recipient that only occasionally
gets a message to efficiently check for it, in any paradigm. That
seems to be what you're attempting to do. Few MPI programs do that.

-- greg
