[Beowulf] Re: Re: Home beowulf - NIC latencies

Thu Feb 17 07:14:48 PST 2005

On Wed, 16 Feb 2005, Greg Lindahl wrote:

> On Wed, Feb 16, 2005 at 03:03:16PM -0500, Robert G. Brown wrote:
> 
> > Optimizing any particular MPI (or PVM) command for either extreme is
> > then like robbing Peter to pay Paul, when Peter and Paul are a single
> > bicephalic individual that has to pay protection money to the mob for
> > every theft transaction (oh how I just LOVE to fold, spindle and
> > mutilate metaphors).  
> 
> Um, most MPI implementations have at least 3 algorithms, for short,
> long, and very long messages. So are they all breaking your rule?

No, as I noted later in the (yes, long:-) message.  That's what there
should be.  That they do so isn't clear to the user, though, and the
user has no control over it (that I can see in the standard).  Users
have to trust the implementation to do the right thing.
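
As a minimal sketch of what "three algorithms" means in practice, here
is size-based protocol selection in Python.  The cutoffs and the names
(EAGER_LIMIT, RENDEZVOUS_LIMIT, choose_protocol) are invented for
illustration.  Real MPI implementations pick their own thresholds and
strategies, and the standard does not specify them.

```python
# Hypothetical cutoffs in bytes; real implementations choose their own.
EAGER_LIMIT = 1024
RENDEZVOUS_LIMIT = 128 * 1024


def choose_protocol(nbytes, eager_limit=EAGER_LIMIT,
                    rendezvous_limit=RENDEZVOUS_LIMIT):
    """Return a label for the algorithm a library might pick for nbytes."""
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    if nbytes <= eager_limit:
        return "short"       # e.g. send at once, let the receiver buffer it
    if nbytes <= rendezvous_limit:
        return "long"        # e.g. handshake first, then transfer
    return "very_long"       # e.g. split into chunks and pipeline them


if __name__ == "__main__":
    for size in (64, 16 * 1024, 4 * 1024 * 1024):
        print(size, choose_protocol(size))
```

The user-visible call is the same in all three cases, which is why the
choice is invisible unless the implementation documents it.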

> It's *unoptimizing* some of the cases that's at question. Most MPIs
> unoptimize compute/communication overlap with long messages, because
> it's hard work to get that right without hurting all short messages.

Again, I think that we are in agreement.  Higher level commands in a
message passing library make optimization decisions (including the
decision not to optimize), and the result may not be optimal for a
significant number of cases.  All I was ultimately suggesting is that
such libraries might benefit from exposing the lower level primitives
from which the complex commands are built, so users can roll their own
within the library without having to resort to raw networking.

You might not agree with this suggestion, but it is as you say the point
in question.  As I also said, I'm not an MPI expert by any means.  I
have to look up commands beyond the MPI 1 standard (which looks not
horribly unlike the PVM command set as far as communication is
concerned) and am probably shaky there.  Looking on the mpi-forum.org
site, though, MPI 2 adds MPI_PUT, MPI_GET, and MPI_ACCUMULATE, which are
just exactly what I was suggesting and what I would have hoped for,
especially if they are indeed the primitives from which at least some of
the higher order commands are built.  If so, users can either use the
optimized/unoptimized higher level commands provided or (if they
understand their problem and hardware) roll their own.

This is the distinction I was talking about.  MPI originally passed
messages at a high level of abstraction to wrap a variety of mechanisms
in use on big supercomputers (not forgetting that it was a consortium of
the vendors of such supercomputers that wrote the standard in response
to pressure from the government and other major consumers who were tired
of rewriting code every time a new supercomputer was released with its
own internals and API for moving data between processors/processes).  It
(I think deliberately) avoided providing any sort of interface that
might be interpreted as a "thin" wrapper to those internals that were
responsible for minimal latency, maximum bandwidth movement of data.
Whether this was to make the government happy (hiding the detail) or to
make themselves happy (leaving a purchaser of a supercomputer with an
incentive to write optimizations in their native API and hence become
"hooked" on the hardware) is a moot point.

PVM has a different, but related, history.  It was built on top of
networking from the beginning, more or less, and was deliberately
designed to hide the networking primitives (specifically) from the
programmer where MPI might have been hiding shared memory primitives and
create a "virtual machine" where MPI was running on REAL machines.  If
anything, it went out of its way to avoid RMA-like message passing
commands that "look" like a wrapper to shared memory, following instead
a fairly simple reliable message transmission model.  In the end (3.x)
it had almost exactly the same range and general form of commands as MPI
1.x for the bulk of what a user was likely to do, with maybe a bit nicer
control interface over the virtual machine and a bit less control over
collective operations.

Looking over (for the first time) the MPI 2 additions, I have to say
that they look very nice, possibly nice enough to finally consider
switching to MPI from PVM.  Alternatively, it is something that should
be cloned in PVM -- PVM would really benefit from PVM_GET, PVM_PUT, and
some synchronization primitives.  Provision of what amount to wrappers
on raw RMA primitive commands (that can be/should be tuned for the
hardware) and the separation of the RMA part and any synchronization
components mean that a serious programmer has a lot of ability to
control and optimize (assuming only that these commands truly are
implemented as primitives as used to develop the higher level commands)
without leaving the library, while others can use the higher level
collectives when those are a good match for their task or when they are
beginners not yet ready to tackle lower level programming.

The only thing I still don't find (on a fairly rapid lookover) is a
discussion on just what e.g. broadcast does or how to make it vary what
it does.  Part of this of course doesn't belong in a standards document
which isn't intended to describe algorithms or implementations at that
level of detail.  However, one part does.  I think it matters a great
deal to the programmer to know whether broadcast (and other commands)
are indeed hardware primitives or are implemented on top of
point-to-point communications primitives that may or may not involve
diverting intermediary processors from their running tasks (and ditto
for scatter/gather type operations).

This seems like it might be a programming decision point for people who
really want to hand-optimize their code.  Again, this is based on my
experiences in PVM, where I've tried using broadcast several times in
master/slave contexts expecting to reduce latency and communications
times only to find that the command was de facto serialized and in fact
took as long or longer than just running a loop over point to point
communications calls.  Perhaps MPI does it better, or differently, but
it doesn't LOOK like it is anything but a black box which can swing from
being good on one network to terrible on another without warning.
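
A rough cost model shows why the implementation matters.  The sketch
below assumes a simple latency-plus-size/bandwidth cost per message and
compares a serialized broadcast (root sends to each rank in turn) with a
binomial-tree broadcast.  The function names and the sample numbers are
illustrative assumptions, not measurements of PVM or any MPI.

```python
def message_time(latency, nbytes, bandwidth):
    """Time for one point-to-point message: latency plus transfer time."""
    return latency + nbytes / bandwidth


def serial_bcast_time(nprocs, latency, nbytes, bandwidth):
    """Root sends to each of the other nprocs - 1 ranks, one after another."""
    return max(nprocs - 1, 0) * message_time(latency, nbytes, bandwidth)


def tree_bcast_time(nprocs, latency, nbytes, bandwidth):
    """Binomial tree: ranks holding the data double every round."""
    if nprocs <= 1:
        return 0.0
    rounds = (nprocs - 1).bit_length()   # exact ceil(log2(nprocs))
    return rounds * message_time(latency, nbytes, bandwidth)


if __name__ == "__main__":
    # Assumed numbers: 16 ranks, 50 us latency, 1000-byte message, 100 MB/s.
    args = (16, 50e-6, 1000, 1e8)
    print("serial:", serial_bcast_time(*args))
    print("tree:  ", tree_bcast_time(*args))
```

In this model the tree needs 4 rounds for 16 ranks where the serialized
version needs 15 sends, but the tree only works because intermediate
ranks stop what they are doing and forward the message.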

How to implement such a thing in a standard is an open question, but
from a programmer interface point of view having a set of commands that
can query and set variables to control the back end behavior of
collectives or determine properties of the hardware in the cluster would
be very useful.  Just one creative idea might be for MPI to provide an
optional initialization command to run on a cluster that builds a table
of quiescent-state and cpu-loaded-state latencies for short, medium, and
long messages both point to point and in collective mode.  The same
table might hold some entries describing the selected hardware device,
such as hw_bcast=TRUE along with the broadcast latency.
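
To make the idea concrete, here is one possible shape for such a table,
sketched in Python.  Nothing like this exists in the MPI standard:
ClusterProfile, its fields, and use_collective_bcast are hypothetical
names, and the rule that a hardware broadcast always wins is an
assumption made only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterProfile:
    """Hypothetical table filled in by a one-time benchmark run."""
    hw_bcast: bool = False        # does the interconnect broadcast in hardware?
    bcast_latency: float = 0.0    # measured broadcast latency in seconds
    # (cpu_state, size_class, mode) -> measured latency in seconds
    latency: dict = field(default_factory=dict)

    def record(self, cpu_state, size_class, mode, seconds):
        self.latency[(cpu_state, size_class, mode)] = seconds

    def lookup(self, cpu_state, size_class, mode):
        return self.latency[(cpu_state, size_class, mode)]


def use_collective_bcast(profile, nprocs, cpu_state, size_class):
    """True if the library broadcast should beat a point-to-point loop."""
    if profile.hw_bcast:
        return True   # assumption: a hardware broadcast is always preferred
    loop_cost = (nprocs - 1) * profile.lookup(cpu_state, size_class, "p2p")
    return profile.lookup(cpu_state, size_class, "collective") < loop_cost


if __name__ == "__main__":
    prof = ClusterProfile()
    prof.record("quiescent", "short", "p2p", 50e-6)          # made-up numbers
    prof.record("quiescent", "short", "collective", 900e-6)
    for n in (8, 32):
        print(n, use_collective_bcast(prof, n, "quiescent", "short"))
```

A program could consult the table once at startup and pick between the
library collective and its own loop, instead of hard-coding a choice
that is right on one network and wrong on another.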

From this one might be able to build portable MPI programs that run
optimally on Myrinet while they still run optimally on gig Ethernet,
with or without e.g. a hardware RDMA command that significantly affects
and redistributes the CPU loading per message.

But maybe this is all too complicated, or doesn't belong in the standard
per se.  It is indeed like the ATLAS thing, but then, I think that ATLAS
is sheer genius although it is also cumbersome and clunky to build...;-)
I just dream of the day that ATLAS-like runtime optimization isn't so
clunky and is based on tools that create tables of microbenchmark
numbers that ARE sufficiently accurate and rich to achieve
near-optimization without running a build loop that sweeps and searches
a high-dimensional space...:-)

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu