Dolphin Wulfkit

Tony Skjellum tony at MPI-SoftTech.Com
Fri May 3 07:52:23 PDT 2002

Joachim, as I mentioned in Florida, the lazy release of locked
memory is a tradeoff, not a panacea, in production MPI implementations.
It is also not everything that persistent communication can do.
Persistent point-to-point is a perfect hash to all the resources
and other state a transfer needs; caching is at best a perfect hash.

Giving the impression that this is just a "gimme" is unfair, because
it breaks otherwise correct MPI applications.  We have this
feature too, for example, for MPI/Pro 1.6.4GM.

Lazy memory unlocking breaks correct programs that use memory
dynamically.  It means that the programmer must program in a
restricted way with memory that is subject to send/receive.
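
To make the failure mode concrete, here is a toy simulation, not code from any MPI library. It assumes a registration cache keyed only by virtual address, and an allocator that hands the freed pages back to the OS and later reuses the same virtual address with a different physical page. All names (ToyMemory, ToyNic, run) are hypothetical.

```python
# Toy model (all names hypothetical): why a lazy pin cache keyed by
# virtual address can break a correct program that frees and reallocates.

class ToyMemory:
    """Maps virtual addresses to physical pages, roughly as an OS would."""
    def __init__(self):
        self.page_table = {}      # virtual address -> physical page id
        self.physical = {}        # physical page id -> contents
        self.next_page = 0

    def malloc(self, vaddr, data):
        page = self.next_page
        self.next_page += 1
        self.page_table[vaddr] = page
        self.physical[page] = data
        return vaddr

    def free(self, vaddr):
        # The virtual address becomes reusable; the old physical page
        # keeps its stale contents because the NIC may still have it pinned.
        del self.page_table[vaddr]


class ToyNic:
    """Sends by DMA from a physical page, optionally caching registrations."""
    def __init__(self, mem, lazy):
        self.mem = mem
        self.lazy = lazy
        self.pin_cache = {}       # virtual address -> pinned physical page

    def send(self, vaddr):
        if self.lazy and vaddr in self.pin_cache:
            page = self.pin_cache[vaddr]          # cache hit: skip translation
        else:
            page = self.mem.page_table[vaddr]     # "pin" = translate now
            if self.lazy:
                self.pin_cache[vaddr] = page      # leave it pinned for later
        return self.mem.physical[page]


def run(lazy):
    """A correct program: send, free, reallocate, send again."""
    mem = ToyMemory()
    nic = ToyNic(mem, lazy)
    buf = mem.malloc(0x1000, "first message")
    sent = [nic.send(buf)]
    mem.free(buf)
    buf = mem.malloc(0x1000, "second message")    # same virtual address, new page
    sent.append(nic.send(buf))
    return sent
```

In this model run(False) delivers both messages, while run(True) sends "first message" twice, because the stale cache entry still points at the old physical page. The program itself did nothing wrong.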

MPI implementations that are told by the programmer that
data is persistent can do more than implementations that are
not told and are asked to guess.  This is because memory pinning
is not the only interesting issue:
-- very strong datatype optimization
-- limited NIC, DMA, or other resources that can be assigned
   statically rather than through expensive statistical
   processes online
-- opportunities through profiling with persistent operations
   to establish a slimmer critical path from sender to receiver
   (quasi-static channels)
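
A rough sketch of the difference between telling and guessing, written as a toy cost model rather than real MPI code. Here _setup stands for whatever per-transfer work a library does (pinning, datatype analysis, NIC resource assignment); the LRU policy, the cache capacity, and all class and function names are assumptions made for illustration.

```python
from collections import OrderedDict

class ToyLibrary:
    """Counts how often the expensive per-transfer setup runs."""
    def __init__(self, cache_capacity):
        self.setups = 0
        self.cache = OrderedDict()            # LRU cache of implicit setups
        self.cache_capacity = cache_capacity

    def _setup(self, args):
        self.setups += 1                      # pinning, datatype analysis, ...
        return ("resources", args)

    def send_init(self, args):
        """Explicit persistence: set up once, hand the result to the user."""
        return self._setup(args)

    def start(self, handle):
        """Reuse a persistent handle: no lookup, no setup."""
        return handle

    def send(self, args):
        """Implicit persistence: guess via an LRU cache of argument lists."""
        if args in self.cache:
            self.cache.move_to_end(args)
            return self.cache[args]
        res = self._setup(args)
        self.cache[args] = res
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)    # evict the least recently used
        return res


def count_setups(n_buffers, n_rounds, cache_capacity, persistent):
    """Send n_buffers distinct transfers per round, for n_rounds rounds."""
    lib = ToyLibrary(cache_capacity)
    patterns = [("buf%d" % i, 1024, "dest0") for i in range(n_buffers)]
    if persistent:
        handles = [lib.send_init(p) for p in patterns]
        for _ in range(n_rounds):
            for h in handles:
                lib.start(h)
    else:
        for _ in range(n_rounds):
            for p in patterns:
                lib.send(p)
    return lib.setups
```

With 4 buffers and 10 rounds, the persistent path runs setup 4 times. The cached path also needs only 4 setups when the cache holds all 4 entries, but with a capacity of 2 the cyclic access pattern misses on every call and setup runs 40 times. The cache matches the persistent handle only when its guess about the working set is right.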

I refute the idea that you can cache the collective
communication arguments effectively, unless again you assume
very strong things about the MPI program's structure.   Here's
why: you don't know if the N-1 other processes are calling
the collective operation with the same arguments, so what is
needed is a parallel hash, not a local hash.   The cost and
complexity of the parallel hash have to be compared with those
of the original collective; doing so shows that it is not
attractive, except for very long transfers, and that this cost
is paid each time the operation is used, not once, because you
have to validate the hash each time.  Remember, if you want to
hint about that, you are changing the semantics of the
parallel model.  It is not MPI anymore; it is MPI+
restricted programming semantics (strong SPMD or such).
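
The point about collectives can be illustrated with a small single-process simulation. It is a sketch under assumptions: each call is given as a list with one argument tuple per rank, each rank remembers only its own previous arguments, and the built-in all() stands in for an agreement step such as a logical-AND reduction across ranks. The function names are hypothetical.

```python
def local_hit(cache, rank, args):
    """True if this rank sees the same arguments as on its previous call."""
    hit = cache.get(rank) == args
    cache[rank] = args
    return hit


def simulate(calls):
    """calls: list of per-call lists, one argument tuple per rank.

    Returns (local_hits, global_hits, agreement_rounds). local_hits[i][r]
    is rank r's local verdict on call i; global_hits[i] says whether every
    rank hit, which costs one agreement round per call.
    """
    cache = {}
    local_hits, global_hits = [], []
    agreement_rounds = 0
    for per_rank_args in calls:
        verdicts = [local_hit(cache, r, a) for r, a in enumerate(per_rank_args)]
        agreement_rounds += 1                 # paid on every call, not once
        local_hits.append(verdicts)
        global_hits.append(all(verdicts))     # stand-in for an AND-reduction
    return local_hits, global_hits, agreement_rounds


# Three ranks call a gatherv-like collective three times; on the last
# call only rank 2 changes its send count, which is legal MPI.
same = [("gatherv", 1024)] * 3
changed = [("gatherv", 1024), ("gatherv", 1024), ("gatherv", 2048)]
local_hits, global_hits, rounds = simulate([same, same, changed])
```

On the third call ranks 0 and 1 see a local hit, yet the cached schedule is invalid because rank 2 changed its arguments. Only the agreement step reveals this, and it runs once per call (three rounds here), which is the recurring cost described above.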

I agree that (in combination)
a) restricting the MPI programmer's memory model with regard to
   freeing memory used in send/receive;
b) lazy freeing of pages;
c) hinting that the program does not violate "a" and is willing
   to use "b", "d", and "e";
d) profiling, and feedback from profiling (to try to reduce the
   critical path of often-matching send/receive pairs);
e) argument-list caching
are effective for point-to-point communication, when used together.

Can your MPI achieve gains by lazy unpinning only?  Yes.  Ours does
too, but we have to write pages of explanations to users as to when
it should work correctly, and when they have to watch out.  It
also globalizes the memory semantics of programs, which is undesirable.

Persistent send/receive has no such disadvantages; it is within the
standard and the original programming model, which allows users to
work with arbitrary memory.  It works with all MPI implementations,
and so is fully portable.


On Fri, 3 May 2002, Joachim Worringen wrote:

> Patrick, I agree with your posting - as I basically say the same in the
> "disclaimer" of the www-page with the PMB results.
> The placement of the processes (or: assignment of ranks) on
> SMP-nodes is part of the performance strategy that an MPI library is
> free to follow. It seems that ScaMPI does a straight round-robin mapping
> of the process-to-node-ranks, while MPICH-GM "groups" processes on the
> same node. Both approaches are, of course, valid.
> The attachment of your results wasn't readable for me, can you post it
> again or send it via Mail? I will be happy adding it to the page. If you
> have "real" results for the P4-i860 platform, please send them, too, and
> I'll replace the other ones. As it says in the disclaimer... ;-)
> Nevertheless, I think it helps to talk about real numbers - without
> saying that these numbers give the complete picture (again, see the
> disclaimer). Unfortunately, I have no access to the mixed
> Myrinet/SCI-cluster to run application benchmarks. This would also cover
> the other performance factors, like the ones Tony stresses so much -
> however, it will be hard finding an application that does persistent
> communication, unfortunately. On the other hand, a quality MPI
> implementation should not show big differences between explicit and
> "implicit" persistent communication due to caching of the required
> resources. This is also my personal experience with SCI-MPICH (which
> *does* optimize persistent communication - but the only benefit from it
> is that the related resources will be the last thrown out of the
> related cache, which usually does not happen too often at all).
> BTW, I have a 8node-800MHz-DualPII-SW_LE cluster with SCI here, which
> seems to nicely match the one you mentioned. Numbers will be up soon.
>  regards, Joachim
> --
> |  _  RWTH|  Joachim Worringen
> |_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
>   | |_)(_`|
>     |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339

Anthony Skjellum PhD, President | MPI Software Technology, Inc.
101 South Lafayette St, Ste. 33 | Starkville, MS 39759, USA
Ph: +1-(662)320-4300 x15        | FAX: +1-(662)320-4301     | tony at

Middleware that's hard at work for you and your enterprise.(SM)
