Dolphin Wulfkit

Joachim Worringen joachim at lfbs.RWTH-Aachen.DE
Fri May 3 09:55:35 PDT 2002

[ Maybe we should move this to comp.parallel.mpi? Anyway:]

Tony Skjellum wrote:
> Joachim, as I mentioned in Florida, the lazy release of locked
> memory is a tradeoff, not a panacea, in production MPI implementations.

That's true. It can cause problems. See below.

While using the CPU for memory copying may have disadvantages (not
necessarily in all applications of MPI), the nice thing is that you can
get high bandwidth without pinning memory, and for any type of memory
you need to transfer from.

> Giving the impression that this is just a "gimme" is unfair, because
> it breaks otherwise correct MPI applications.  We have this
> feature too, for example, for MPI/Pro 1.6.4GM.

I didn't want to offend you, or to say that optimizing persistent
communication is a bad idea or that MPI/Pro does a bad job in any area
(I have not used it myself). All I said was: optimizing persistent
communication doesn't really show in common applications, because people
almost never use the related MPI constructs. In theory and in special
benchmarks it works o.k., I don't doubt this. But if it is not used, it
*is* a gimme in the sense of "useful but not used"...

Nevertheless, I appreciate these efforts (and make such efforts myself),
as somebody has to advance current programming style - and who, if not
the MPI designers, should make a start?

> Lazy memory unlocking breaks correct programs that use memory
> dynamically.  It means that the programmer must program in a
> restricted way with memory that is subject to send/receive.

This may happen, see above. Using MPI_Alloc_mem/MPI_Free_mem does help.
Sending data from the stack may still cause problems.
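
To make the failure mode concrete, here is a toy model in Python. It is
only a sketch and is not taken from any MPI implementation: the classes
SimulatedOS and LazyPinCache and the function demo are hypothetical
names, and "physical pages" are just integers. It assumes that a freed
and re-allocated buffer can come back at the same virtual address but
backed by different physical memory, which is the case that lazy
unlocking gets wrong.

```python
class SimulatedOS:
    """Toy model of an OS: maps a virtual address to a 'physical page' id."""

    def __init__(self):
        self._next_page = 1
        self.mapping = {}  # virtual address -> physical page id

    def alloc(self, vaddr):
        # Every allocation is backed by a fresh physical page.
        self.mapping[vaddr] = self._next_page
        self._next_page += 1

    def free(self, vaddr):
        del self.mapping[vaddr]


class LazyPinCache:
    """Hypothetical registration cache that never unpins on its own."""

    def __init__(self, os_model):
        self.os = os_model
        self.pinned = {}  # (vaddr, length) -> page id recorded when pinned

    def lookup_or_pin(self, vaddr, length):
        key = (vaddr, length)
        if key not in self.pinned:
            # Cache miss: "pin" the buffer and remember its current page.
            self.pinned[key] = self.os.mapping[vaddr]
        return self.pinned[key]

    def invalidate(self, vaddr):
        # What a hook in a library-controlled free routine could do.
        for key in [k for k in self.pinned if k[0] == vaddr]:
            del self.pinned[key]


def demo(invalidate_on_free):
    """Return (first_send_ok, second_send_ok) for a free/re-alloc cycle."""
    os_model = SimulatedOS()
    cache = LazyPinCache(os_model)

    os_model.alloc(0x1000)
    first = cache.lookup_or_pin(0x1000, 4096) == os_model.mapping[0x1000]

    os_model.free(0x1000)
    if invalidate_on_free:
        cache.invalidate(0x1000)
    os_model.alloc(0x1000)  # same virtual address, different physical page

    second = cache.lookup_or_pin(0x1000, 4096) == os_model.mapping[0x1000]
    return first, second


if __name__ == "__main__":
    print("no invalidation on free:", demo(False))
    print("invalidation on free:   ", demo(True))
```

In this model the second send goes to the stale page unless the cache
is told about the free. Allocation routines controlled by the library
give it a natural place for such a hook; stack buffers do not pass
through any such routine.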

> MPI implementations that are told by the programmer that
> data is persistent can do more than if you don't tell, and ask
> them to guess.  This is because memory pinning is not the only
> interesting issues
> -- very strong datatype optimization

Can you elaborate on this? Do you mean building (potentially long) DMA
descriptor chains? Do you have an example, like for VIA or GM?

> -- limited NIC or DMA other resource that can be assigned
>    statically rather than through expensive statistical
>    processes online
> -- opportunities through profiling with persistent operations
>    to establish slimmer critical path of sender to receiver
>    (quasi-static channels)

These techniques work best if *all* communication is preset, I assume?
This would be nice, but is limited to a subset of problems (MPI/RT, I

> I refute the idea that you can cache the collective
> communication arguments effectively, unless again you assume
> very strong things about the MPI program's structure.   Here's
> why: you don't know if the N-1 other processes are calling
> the collective operation with the same arguments, so it
> is a parallel hash, not a local hash needed.   The parallel
> hash needs to be compared in complexity and effort to the
> original collective to see that it is not attractive, except
> for very long transfers, and that this cost is paid each
> time the operation is used, not once, because you have to
> validate the hash each time.  Remember, if you want to
> hint about that, you are changing the semantics of the
> parallel model.  It is not MPI anymore; it is MPI+
> restricted programming semantics (strong SPMD or such).

Collective communication is based on point-to-point (in most systems),
and only for all_to_all, every process needs to communicate with every
other process (in most implementations). 

Of course, all processes need to check all their buffers for
registration, but each process has two buffers at most. And unless you
send by rendezvous, you don't care about the recv-buffer of the
destination but use pre-allocated ones instead. And if you do deliver by
rendezvous, you always need handshake messages. I don't see the problem
here unless you think of writing zero-copy style into the same buffer as
last time - but this would be a bad idea for any send operation (use
one-sided communications instead...)! Maybe you can give an example?
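
For reference, the "parallel hash" in the quoted objection can be
modelled as follows. This is only a sketch under stated assumptions:
the ranks are simulated in one Python process, the function names
local_hash and all_agree are hypothetical, the number of ranks is
assumed to be a power of two, and the agreement check uses a
recursive-doubling exchange, which is one possible scheme and not
necessarily what any implementation does. It shows that validating
cached arguments across all ranks needs about log2(N) message rounds on
every call.

```python
import hashlib


def local_hash(args):
    """Hash one rank's argument list (purely local, no communication)."""
    return hashlib.sha256(repr(args).encode()).hexdigest()


def all_agree(hashes):
    """Simulate a recursive-doubling agreement check over all ranks.

    hashes[r] is the local hash of rank r. Returns (agree, rounds), where
    rounds is the number of pairwise exchange steps that were needed.
    Assumes len(hashes) is a power of two.
    """
    n = len(hashes)
    # Each rank carries its own hash and an "everything I saw matched" flag.
    state = [(h, True) for h in hashes]
    rounds = 0
    dist = 1
    while dist < n:
        new_state = []
        for r in range(n):
            partner = r ^ dist  # exchange partner in this round
            my_hash, my_ok = state[r]
            their_hash, their_ok = state[partner]
            new_state.append(
                (my_hash, my_ok and their_ok and my_hash == their_hash)
            )
        state = new_state
        dist *= 2
        rounds += 1
    # After the last round every rank holds the same global verdict.
    return state[0][1], rounds


if __name__ == "__main__":
    same = [local_hash(("bcast", 1024, 0))] * 8
    print(all_agree(same))

    mixed = list(same)
    mixed[5] = local_hash(("bcast", 2048, 0))
    print(all_agree(mixed))
```

With 8 simulated ranks the check takes 3 rounds whether or not the
arguments match, so in this model the validation cost is paid on every
call, not only on a cache miss.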

> I agree that (in combination)
> a) restricting the MPI programmer's memory model in regards
> freeing memory used in send/receive;
> b) lazily freeing of pages
> c) hinting that the program does not violate "a" and is willing
> to use "b" and "d" and "e"
> d) profiling, feedback from profiling (to try to reduce critical
>   path of often matching send/receive pairs)
> and
> e) argument-list caching
> are effective for point to point communication, when used together.
> Persistent send/receive has no such disadvantages, it is within the
> standard and the original programming model that allows users to
> work with arbitrary memory.  It works with all MPI implementations,
> and so is fully portable.

As mentioned above: Agreed. It's always easier if the user states every
resource request in advance - like in the nice teaching example, the
"banker's algorithm". But that is not applicable everywhere, and not yet
common practice even where it would be applicable.

 regards, Joachim

|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339

More information about the Beowulf mailing list