[Beowulf] Alternative to MPI ABI
Donald Becker
becker at scyld.com
Tue Mar 22 13:35:18 PST 2005
On Tue, 22 Mar 2005, Robert G. Brown wrote:
> Hmmm, looks like the list is about to have Doug's much desired
> discussion of Community Goals online. I'll definitely play.
Yes, Doug's prep work for the ClusterWorld Summit in May triggered my
initial response.
> The following is a standard rgb thingie, so humans with actual work to
> do might not want to read it all right now... (Jeff L., I do have a
> special message coming to the list "just for you";-)
>
> On Tue, 22 Mar 2005, Donald Becker wrote:
I sent this just a few minutes ago... how did you write a chapter-long
reply? Presumably clones, but how do you synchronize them?
> > There needs to be new information interfaces which
> > - report usable nodes (which node are up, ready and will permit us
> > to start processes)
> > - report the capability of those nodes (speed, total memory)
> > - report the availability of those nodes (current load, available
> > memory)
> > Each of these information types is different and may be provided
> > by a different library and subsystem. We created 'beostat', a status
> > and statistics library, to provide most of this information.
> Agreed. I also have been working on daemons and associated library
> tools to provide some of this information as well on the fully
> GPL/freely distributable side of things for a long time.
There are GPL versions of BeoStat and BeoMap. (Note: GPL not LGPL.)
Admittedly they are older versions, but they are still valid. We are
pretty good about not changing the API unless there is a flaw that can't
be worked around. Many other projects seem to take the approach of
"that's last weeks API".
That said, we are designing a new API for BeoStat and extensions to
BeoMap. We have to make significant changes for hyperthreading and
multi-core, and for how they relate to NUMA. We are taking this opportunity
to clean up the ugliness that lingers from back when the Alpha was "the"
64-bit processor.
> I've just
> started a new project (xmlbenchd) that should be able to provide really
> detailed capabilities information about nodes via a daemon interface
> from "plug in" benchmarks, both micro and macro (supplemented with data
> snarfed from /proc using xmlsysd code fragments).
The trick is providing useful capability information without
introducing complexity. I don't see benchmark results, even microBMs, as
being directly usable by local schedulers like BeoMap.
> xmlsysd already provides CPU clock, total memory, L2 cache size, total
> and available memory, and PID snapshots of running jobs.
BeoStat provides both static info (the CPU clock speed and total memory)
and dynamic info: load average (the standard three values per node), CPU
utilization (per processor! not per node), memory used, and network
traffic for up to four interfaces.
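For illustration, the kind of per-node record such a status library
exposes might look like the C sketch below. The field names are made up
here; this is not the actual BeoStat structure.

    /* Illustrative only -- not the real BeoStat data structures. */
    #define MAX_CPUS_PER_NODE 4
    #define MAX_NET_IFS       4

    struct node_status {
        /* Static capability info */
        unsigned int  cpu_mhz;                    /* CPU clock speed */
        unsigned long mem_total_kb;               /* total memory */

        /* Dynamic availability info */
        float         loadavg[3];                 /* 1, 5, 15 minute load averages */
        float         cpu_util[MAX_CPUS_PER_NODE];/* per-processor utilization, 0..1 */
        unsigned long mem_used_kb;                /* memory in use */
        unsigned long net_rx_bytes[MAX_NET_IFS];  /* per-interface traffic counters */
        unsigned long net_tx_bytes[MAX_NET_IFS];
    };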
I don't see L2 cache size as being useful. Beostat provides the processor
type, which is marginally more useful but still largely unused.
Nor is the PID count useful in old-style clusters. I know that Ganglia
reports it, but like most Ganglia statistics it's mostly because the
number is easy to get, not because they know what to do with it! (I wrote
the BeoStat->Ganglia translator, and consider most of their decisions as
being, uhmmm, ad hoc.) The PID count *is* useful in Scyld clusters, since
we run mostly applications, not 20 or 30 daemons.
Not that BeoStat doesn't have its share of useless information, like
reporting the available disk space -- a number which is best ignored.
> use (or will use in the case of xmlbenchd) xml tags to wrap all output,
Acckkk! XML! Shoot that man before he reproduces. (too late)
> XMLified tags also FORCE one to organize and present the data
My, my, my.
Just like Pascal/Ada/etc. forces you to write structured programs?
> hierarchically and extensibly. I've tried (unsuccessfully) to convince
> Linus to lean on the maintainers of the /proc interface to achieve some
> small degree of hierarchical organization and uniformity in
Doh! Don't you dare use one of my own pet peeves against me! /proc is a
hodge-podge of formats, and people change them without thinking them
through.
I still remember the change to /proc/net/dev to add a field.
The fact that the new field was only used for PPP, while breaking every
existing installation ("who uses 'ifconfig' or 'netstat'?") didn't seem to
deter the change, and once made it was impossible to change back.
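For illustration, here is the kind of fixed-format parser that change
broke. This is a sketch, not any particular tool's code; it hard-wires
the column layout, which is exactly the assumption that adding a field
invalidates.

    /* Sketch: parsing /proc/net/dev with a fixed field list.
     * A tool written this way breaks, or silently misreads the
     * counters, the moment a field is inserted or reordered. */
    #include <stdio.h>

    int main(void)
    {
        char line[512], ifname[32];
        unsigned long rx_bytes, rx_packets, tx_bytes, tx_packets;
        FILE *fp = fopen("/proc/net/dev", "r");

        if (!fp)
            return 1;
        /* Skip the two header lines. */
        fgets(line, sizeof(line), fp);
        fgets(line, sizeof(line), fp);
        while (fgets(line, sizeof(line), fp)) {
            /* Hard-wired column positions: the fragile assumption. */
            if (sscanf(line,
                       " %31[^:]: %lu %lu %*u %*u %*u %*u %*u %*u %lu %lu",
                       ifname, &rx_bytes, &rx_packets,
                       &tx_bytes, &tx_packets) == 5)
                printf("%s: rx %lu bytes %lu pkts, tx %lu bytes %lu pkts\n",
                       ifname, rx_bytes, rx_packets, tx_bytes, tx_packets);
        }
        fclose(fp);
        return 0;
    }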
But that still doesn't make XML the right thing. Having a real design and
keeping interfaces stable until the next big design change is the answer.
> There are several advantages to using daemons (compared to a kernel
Many people assume that much of what Scyld does is in the kernel.
There are only a few basic mechanisms in the kernel, largely BProc, with
the rest implemented as user-level subsystems. And most of those subsystems
are partially usable in a non-Scyld system.
The reasons to use kernel features are
  - to implement the security mechanism (never policy)
  - to allow applications to run unchanged
We use kernel hooks only for the unified process space, process migration
and the security mechanism. Managing process table entries and correctly
forwarding signals can only work with a kernel interface.
Having the node security mechanism in the kernel allows us to implement
policy with unprivileged user-level libraries. That means an application
can provide its own tuned scheduler function, or use a dynamic library
provided by the end user.
Otherwise the scheduler must run as a privileged daemon, tunable only by
the system administrator. It could be worse: some process migration
systems put the scheduler policy in the kernel itself!
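As a sketch of what that enables, an unprivileged application could
supply something like the function below. The signature and struct are
invented for illustration; this is not the real BeoMap interface.

    /* Illustrative sketch only -- not the actual BeoMap API.
     * The application links (or dlopen()s) its own mapping policy;
     * no privileged daemon is involved. */
    struct node_status {          /* per-node stats, as sketched earlier */
        float loadavg[3];
        /* ... other static and dynamic fields ... */
    };

    /* Given candidate nodes and their current statistics, fill
     * 'chosen' with the node numbers this job wants and return
     * how many were selected. */
    int my_map_policy(const struct node_status *stats,
                      const int *candidates, int ncandidates,
                      int *chosen, int nwanted)
    {
        int i, n = 0;

        /* Trivial policy: take the first sufficiently idle nodes.
         * A tuned policy might weight CPU speed, free memory, or
         * NUMA/core topology instead. */
        for (i = 0; i < ncandidates && n < nwanted; i++) {
            if (stats[candidates[i]].loadavg[0] < 0.5)
                chosen[n++] = candidates[i];
        }
        return n;
    }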
> Still, I totally agree that this is EXACTLY the kind of information that
> needs to be available via an open standard, universal, extensible,
> interface.
Strike the word "extensible". That's a sure way to end up with a
mechanism that is complex and doesn't do anything well.
> > An application should be able to use only a subset of provided
> > processors if they will not be useful (e.g. an application that uses
> > a regular grid might choose to use only 16 of 23 provided nodes.
> Absolutely. And this needs to be done in such a way that the programmer
> doesn't have to work too hard to arrange it. I imagine that this CAN be
> done with e.g. PVM or some MPIs (although I'm not sure about the latter)
> but is it easy?
It is with our BeoMPI, which runs single-threaded until it hits
MPI_Init(). That means the application can modify its own schedule, or
read its configuration information, before deciding whether it will use MPI.
One aspect of our current approach is that it requires remote_fork().
Scyld Beowulf already has this, but even with our system a remote fork may
be more expensive than just a remote exec() if the address space is dirty.
I believe that an MPI-like library can get much of the same benefit, at
the cost of a little extra programming, by providing flags that are only
set when this is a remote or slave (non-rank-0) process. (That
last sentence is confusing: consider the case where MPI rank 0 is supposed
to end up on a remote machine, with no processes left on the originating
machine.)
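A minimal sketch of the single-threaded-until-MPI_Init pattern, using
standard MPI calls; the pre-init decision logic is purely illustrative.

    /* Sketch: ordinary single-process setup before MPI_Init(),
     * then rank-dependent behavior afterwards. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        /* Still a single ordinary process here: read configuration,
         * inspect the offered nodes, and decide that (say) only a
         * regular-grid-sized subset of them is actually useful. */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* The originating process: hand out work. */
            printf("running on %d processes\n", size);
        } else {
            /* Remote/slave processes: the rank (or a flag set only
             * for non-rank-0 processes) tells them to skip the
             * setup-only code paths. */
        }

        MPI_Finalize();
        return 0;
    }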
> > There needs to be new process creation primitives.
> > We already have a well-tested model for this: Unix process
> Agreed, but while dealing with this one also needs to think about
> security.
We have ;->.
> Grids and other distributed parallel computing paradigms are
Grids have a fundamental security problem: how do you know what you are
running on the remote machine? Is it the same binary? With the same
libraries? Linked in the same order? With the same kernel? Really the
same kernel, or one with changed semantics like RedHat's
"2.4-w/2.6-threading". That's not even covering the malicious angle:
"thank you for providing me with the credentials for reading all of your
files".
Operationally, Grids have the problem that they must define both the
protocols and semantics before they can even start to work, and then there
will be a lifetime of backward and forward compatibility issues.
You won't see this at first, just like the first version of
Perl/Python/Java was "the first portable language". But version skew and
semantic compatibility is *the* issue to deal with, not "how can we hack
it to do something for SCXX".
> > > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on
> > > pay-per-view (read: no need for time-consuming, potentially fruitless
> > > attempts to get MPI implementors to agree on anything)
> >
> > Hmmmm, does this show up on the sports page?
>
> It's actually very interesting -- the Head Node article in CWM seemed to
> me to be a prediction that MPI was "finished" in the sense that it is
...
> "finished" in the sense of being complete and needing any further
..
> "finished" in the sense that it needs such a radical rewrite
I fall into the "finished -- lets not break it by piecemeal changes" camp.
Many "add checkpointing to MPI" and "add fault tolerance to MPI" projects
have been funded for years. We need a new model that handles dynamic
growth, with failures being just a minor aspect of the design. I don't
see how MPI evolves into that new thing.
> To summarize, I think that the basic argument being advanced (correct me
> if I'm wrong) is that there should be a whole layer of what amount to
> meta-information tools inserted underneath message passing libraries of
> any flavor so that (for example) the pvm console command "conf" returns
> a number that MEANS SOMETHING for "speed", and in fact so that the pvm
Slight disagreement here: I think we need multiple subsystems that work
well together, rather than a single do-it-all library. The architecture
for a status-and-statistics system (BeoStat for Scyld) is different than
for the scheduler (e.g. BeoMap), even though one may depend on the API of
the other. If we put it all into one big library, it will be difficult to
evolve and fix. (I'm assuming we can avoid API creep, which may not hold
true.)
Donald Becker
Scyld Software
Annapolis MD 21403 410-990-9993