[Beowulf] Alternative to MPI ABI

Donald Becker becker at scyld.com
Tue Mar 22 13:35:18 PST 2005


On Tue, 22 Mar 2005, Robert G. Brown wrote:

> Hmmm, looks like the list is about to have Doug's much desired
> discussion of Community Goals online.  I'll definitely play.

Yes, Doug's prep work for the ClusterWorld Summit in May triggered my 
initial response.
 
> The following is a standard rgb thingie, so humans with actual work to
> do might not want to read it all right now... (Jeff L., I do have a
> special message coming to the list "just for you";-)
> 
> On Tue, 22 Mar 2005, Donald Becker wrote:

I sent this just a few minutes ago... how did you write a chapter-long 
reply?  Presumably clones, but how do you synchronize them?

> >   There needs to be new information interfaces which
> >    - report usable nodes (which nodes are up, ready, and will permit us
> >          to start processes)
> >    - report the capability of those nodes (speed, total memory)
> >    - report the availability of those nodes  (current load, available 
> > memory)
> >    Each of these information types is different and may be provided
> >    by a different library and subsystem.  We created 'beostat', a status
> >    and statistics library, to provide most of this information.

> Agreed.  I also have been working on daemons and associated library
> tools to provide some of this information as well on the fully
> GPL/freely distributable side of things for a long time.

There are GPL versions of BeoStat and BeoMap.  (Note: GPL not LGPL.)

Admittedly they are older versions, but they are still valid.  We are 
pretty good about not changing the API unless there is a flaw that can't 
be worked around.  Many other projects seem to take the approach of 
"that's last weeks API".

That said, we are designing a new API for BeoStat and extensions to 
BeoMap.  We have to make significant changes for hyperthreading and 
multi-core, and for how they relate to NUMA.  We are taking this 
opportunity to clean up the ugliness that lingers from back when the 
Alpha was "the" 64-bit processor.

> I've just
> started a new project (xmlbenchd) that should be able to provide really
> detailed capabilities information about nodes via a daemon interface
> from "plug in" benchmarks, both micro and macro (supplemented with data
> snarfed from /proc using xmlsysd code fragments).

The trick is providing useful capability information without 
introducing complexity.  I don't see benchmark results, even microBMs, 
as being directly usable for local schedulers like BeoMap.

> xmlsysd already provides CPU clock, total memory, L2 cache size, total
> and available memory, and PID snapshots of running jobs.

BeoStat provides static info such as the CPU clock speed and total 
memory.  It provides dynamic info on load average (the standard three 
values per node), CPU utilization (per processor! not per node), memory 
used, and network traffic for up to four interfaces.
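
For concreteness, a minimal standalone sketch of collecting that same 
mix of data on a single node.  It reads the local /proc files directly; 
the real beostat library instead serves cached values for every node in 
the cluster, so treat this as an illustration of the data, not the API:

    /* Sketch: the kind of per-node data a status library carries.
     * Reads the local /proc files; beostat itself works differently. */
    #include <stdio.h>

    int main(void)
    {
        float load1, load5, load15;
        long mem_total = 0, mem_free = 0;
        char line[256];
        FILE *f;

        /* Load average: the standard three values. */
        f = fopen("/proc/loadavg", "r");
        if (f) {
            if (fscanf(f, "%f %f %f", &load1, &load5, &load15) == 3)
                printf("load: %.2f %.2f %.2f\n", load1, load5, load15);
            fclose(f);
        }

        /* Total and available memory, in kB. */
        f = fopen("/proc/meminfo", "r");
        if (f) {
            while (fgets(line, sizeof line, f)) {
                sscanf(line, "MemTotal: %ld", &mem_total);
                sscanf(line, "MemFree: %ld", &mem_free);
            }
            fclose(f);
            printf("mem: %ld of %ld kB free\n", mem_free, mem_total);
        }
        return 0;
    }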

I don't see L2 cache size as being useful.  Beostat provides the processor 
type, which is marginally more useful but still largely unused.

Nor is the PID count useful in old-style clusters.  I know that Ganglia 
reports it, but like most Ganglia statistics it's there mostly because 
the number is easy to get, not because anyone knows what to do with it!  
(I wrote the BeoStat->Ganglia translator, and consider most of their 
decisions as being, uhmmm, ad hoc.)  The PID count *is* useful in Scyld 
clusters, since we run mostly applications, not 20 or 30 daemons.

Not that BeoStat doesn't have its share of useless information.  
Like reporting the available disk space -- a number which is best ignored.

> use (or will use in the case of xmlbenchd) xml tags to wrap all output,

Acckkk!  XML!  Shoot that man before he reproduces.  (too late)

> XMLified tags also FORCE one to organize and present the data

My, my, my.
Just like Pascal/Ada/etc. forces you to write structured programs?

> hierarchically and extensibly.  I've tried (unsuccessfully) to convince
> Linus to lean on the maintainers of the /proc interface to achieve some
> small degree of hierarchical organization and uniformity in

Doh!  Don't you dare use one of my own pet peeves against me!  /proc is a 
hodge-podge of formats, and people change them without thinking them 
through.

I still remember the change to /proc/net/dev to add a field.
The fact that the new field was only used for PPP, while breaking every 
existing installation ("who uses 'ifconfig' or 'netstat'?") didn't seem to 
deter the change, and once made it was impossible to change back.
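
To see why, consider a monitoring tool that parses /proc/net/dev by 
field position, the way ifconfig and netstat effectively did.  A sketch 
(the counters chosen are illustrative):

    /* Sketch: positional parsing of /proc/net/dev.  Insert one new
     * field ahead of these and every later counter silently lands
     * in the wrong variable -- no error, just wrong numbers. */
    #include <stdio.h>

    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/net/dev", "r");
        if (!f)
            return 1;
        fgets(line, sizeof line, f);    /* skip the two header lines */
        fgets(line, sizeof line, f);
        while (fgets(line, sizeof line, f)) {
            char name[32];
            unsigned long a, b;         /* fields 1 and 2 after ':' */
            if (sscanf(line, " %31[^:]: %lu %lu", name, &a, &b) == 3)
                printf("%s: %lu %lu\n", name, a, b);
        }
        fclose(f);
        return 0;
    }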

But that still doesn't make XML the right thing.  Having a real design and 
keeping interfaces stable until the next big design change is the answer.

> There are several advantages to using daemons (compared to a kernel

Many people assume that much of what Scyld does is in the kernel.
There are only a few basic mechanisms in the kernel, largely BProc, with 
the rest implemented as user-level subsystems.  And most of those subsystems
are partially usable in a non-Scyld system. 

The reasons to use kernel features are
   - to implement security mechanisms (never policy)
   - to allow applications to run unchanged
We use kernel hooks only for the unified process space, process migration 
and the security mechanism.  Managing process table entries and correctly 
forwarding signals can only work with a kernel interface.

Having the node security mechanism in the kernel allows us to implement 
policy with unprivileged user-level libraries.  That means an application 
can provide its own tuned scheduler function, or use a dynamic library 
provided by the end user.
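
As a sketch of what such an application-supplied policy might look like 
(the function names are mine for illustration, not the actual BeoMap 
entry points):

    /* Sketch: an unprivileged, application-private mapping policy.
     * Rank candidate nodes by one-minute load and take the least
     * loaded.  Nothing here needs root or a daemon. */
    #include <stdlib.h>

    struct candidate { int node; float load1; };

    static int by_load(const void *a, const void *b)
    {
        const struct candidate *x = a, *y = b;
        return (x->load1 > y->load1) - (x->load1 < y->load1);
    }

    /* Fill 'out' with up to 'want' node numbers, least loaded first. */
    static int my_map(struct candidate *c, int n, int *out, int want)
    {
        qsort(c, n, sizeof *c, by_load);
        if (want > n)
            want = n;
        for (int i = 0; i < want; i++)
            out[i] = c[i].node;
        return want;
    }

    int main(void)
    {
        struct candidate c[] = { {0, 1.9f}, {1, 0.1f}, {2, 0.7f} };
        int picked[2];
        return my_map(c, 3, picked, 2) == 2 ? 0 : 1;  /* nodes 1, 2 */
    }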

Otherwise the scheduler must run as a privileged daemon, tunable only by 
the system administrator.  It could be worse: some process migration 
systems put the scheduler policy in the kernel itself!

> Still, I totally agree that this is EXACTLY the kind of information that
> needs to be available via an open standard, universal, extensible,
> interface.

Strike the word "extensible".  That's a sure way to end up with a 
mechanism that is complex and doesn't do anything well.

> >   An application should be able to use only a subset of provided
> >     processors if they will not be useful (e.g. an application that uses
> >     a regular grid might choose to use only 16 of 23 provided nodes.)
> Absolutely.  And this needs to be done in such a way that the programmer
> doesn't have to work too hard to arrange it.  I imagine that this CAN be
> done with e.g. PVM or some MPIs (although I'm not sure about the latter)
> but is it easy?

It is with our BeoMPI, which runs single-threaded until it hits 
MPI_Init().  That means the application can modify its own schedule, or 
read its configuration information before deciding it will use MPI.
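
A sketch of that startup pattern.  Everything before MPI_Init() runs as 
one ordinary local process; the grid-sizing logic and the use of the NP 
environment variable to request a process count are illustrative 
assumptions here, not a documented interface:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Largest perfect square not exceeding n: 16 for n = 23. */
    static int usable_square(int n)
    {
        int k = 0;
        while ((k + 1) * (k + 1) <= n)
            k++;
        return k * k;
    }

    int main(int argc, char **argv)
    {
        /* Still a single local process: read config, query node
         * status, decide how big a regular grid can actually be. */
        int offered = 23;                  /* e.g. from a status API */
        int want = usable_square(offered); /* -> 16 */

        char buf[16];
        snprintf(buf, sizeof buf, "%d", want);
        setenv("NP", buf, 1);              /* assumed launch convention */

        /* Only now do the remote processes come into existence. */
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("using %d of %d offered nodes\n", want, offered);
        MPI_Finalize();
        return 0;
    }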

One aspect of our current approach is that it requires remote_fork().  
Scyld Beowulf already has this, but even with our system a remote fork may 
be more expensive than just a remote exec() if the address space is dirty.
I believe that an MPI-like library can get much of the same benefit, at 
the cost of a little extra programming, by providing flags that are set 
only when this is a remote or slave (non-rank-0) process.  (That 
last line is confusing: consider the case where MPI rank 0 is supposed to 
end up on a remote machine, with no processes left on the originating 
machine.)
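
A sketch of the flag idea.  The environment variable name is invented; 
the point is only that the library, not a remote_fork(), tells each 
process what role it is playing:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Hypothetical flag set by the launcher/library on spawned
         * copies.  Note the master need not be rank 0: rank 0 may
         * itself be mapped to a remote machine. */
        if (getenv("MPI_REMOTE_PROCESS")) {
            printf("spawned copy: skip setup, join the computation\n");
        } else {
            printf("original process: read config, build schedule\n");
        }
        return 0;
    }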
 
> >   There needs to be new process creation primitives.
> >     We already have a well-tested model for this: Unix process
> Agreed, but while dealing with this one also needs to think about
> security.

We have ;->.

> Grids and other distributed parallel computing paradigms are

Grids have a fundamental security problem: how do you know what you are 
running on the remote machine?  Is it the same binary?  With the same 
libraries?  Linked in the same order?  With the same kernel?  Really the 
same kernel, or one with changed semantics like RedHat's 
"2.4-w/2.6-threading".  That's not even covering the malicious angle: 
"thank you for providing me with the credentials for reading all of your 
files".

Operationally, Grids have the problem that they must define both the 
protocols and the semantics before they can even start to work, and then 
there will be a lifetime of backward and forward compatibility issues.
You won't see this at first, just like the first version of 
Perl/Python/Java was "the first portable language".  But version skew 
and semantic compatibility are *the* issues to deal with, not "how can 
we hack it to do something for SCXX". 

> > > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on 
> > > pay-per-view (read: no need for time-consuming, potentially fruitless 
> > > attempts to get MPI implementors to agree on anything)
> > 
> > Hmmmm, does this show up on the sports page?
> 
> It's actually very interesting -- the Head Node article in CWM seemed to
> me to be a prediction that MPI was "finished" in the sense that it is
...
> "finished" in the sense of being complete and needing any further
..
> "finished" in the sense that it needs such a radical rewrite

I fall into the "finished -- let's not break it by piecemeal changes" camp.  

Many "add checkpointing to MPI" and "add fault tolerance to MPI" projects
have been funded for years.  We need a new model that handles dynamic 
growth, with failures being just a minor aspect of the design.  I don't 
see how MPI evolves into that new thing.

> To summarize, I think that the basic argument being advanced (correct me
> if I'm wrong) is that there should be a whole layer of what amount to
> meta-information tools inserted underneath message passing libraries of
> any flavor so that (for example) the pvm console command "conf" returns
> a number that MEANS SOMETHING for "speed", and in fact so that the pvm

Slight disagreement here: I think we need multiple subsystems that work 
well together, rather than a single do-it-all library.  The architecture 
for a status-and-statistics system (BeoStat for Scyld) is different from 
that of the scheduler (e.g. BeoMap), even though one may depend on the 
API of the other.  If we put it all into one big library, it will be 
difficult to evolve and fix.  (I'm assuming we can avoid API creep, 
which may not hold true.)
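
As a sketch of the layering I mean (hypothetical declarations, not our 
real headers), the mapper consumes the status API but the two remain 
separate libraries with separate APIs:

    /* libstat: status and statistics only.  No policy. */
    int   stat_node_count(void);
    int   stat_node_up(int node);
    float stat_node_load(int node);

    /* libmap: scheduling policy, built on the status API above but
     * shipped separately so each library can evolve on its own. */
    int map_select(int *nodes, int want);  /* fill 'nodes', ret count */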

Donald Becker
Scyld Software
Annapolis MD 21403			410-990-9993




