[Beowulf] Alternative to MPI ABI

Wed Mar 23 07:14:56 PST 2005

On Tue, 22 Mar 2005, Donald Becker wrote:

> > On Tue, 22 Mar 2005, Donald Becker wrote:
> 
> I sent this just a few minutes ago... how did you write a chapter-long 
> reply?  Presumably clones, but how do you synchronize them?

Easy.  Spawn a few (don't ask) and send them an MPI_Barrier(:-P)

> > Agreed.  I also have been working on daemons and associated library
> > tools to provide some of this information as well on the fully
> > GPL/freely distributable side of things for a long time.
> 
> There are GPL versions of BeoStat and BeoMap.  (Note: GPL not LGPL.)
> 
> Admittedly they are older versions, but they are still valid.  We are 
> pretty good about not changing the API unless there is a flaw that can't 
> be worked around.  Many other projects seem to take the approach of 
> "that's last week's API".
> 
> That said, we are designing a new API for BeoStat and extensions to 
> BeoMap.  We have to make significant changes for hyperthreading and 
> multi-core, and how they relate to NUMA.  We are taking this opportunity 
> to clean up the ugliness that lingers from back when the Alpha was "the" 
> 64 bit processor.

And I'm not critiquing your corporate efforts in any way -- as I said, I
hardly have time to do the things I want to do in this arena because I
have to make a (meager) living, care for a family, and advance my World
of Warcraft character at the expense of sleep.  Finding a corporate
model that pays you to do what you'd likely want to be doing anyway is
laudable.  I just happen to work on the fully GPL side of things and
think that there are certain wheels that need to be reinvented several
times before they are really gotten right.

> > I've just
> > started a new project (xmlbenchd) that should be able to provide really
> > detailed capabilities information about nodes via a daemon interface
> > from "plug in" benchmarks, both micro and macro (supplemented with data
> > snarfed from /proc using xmlsysd code fragments).
> 
> The trick is providing useful capability information, without 
> introducing complexity.  I don't see benchmark results, even microBMs, as 
> being directly usable for local schedulers like BeoMap.

For local schedulers, perhaps not, because they tend to run in
homogeneous environments anyway.  I am thinking ahead to gridware.  A
grid scheduler really does need to be able to ask for N nodes that have
at least M memory, L L2 cache size (which might e.g. affect whether the
presumed code blocking runs fast or slow), random memory access rates of
at least R, and stream numbers of at least [a,b,c,d].  Or it might ask
for N nodes that can run a particular "fingerprint" macro application in
at most S seconds (plus sundry memory size constraints).  The macro
application to be tested might even be their own.
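
A minimal sketch of the kind of capability request described above,
assuming each node's daemon has already reported a "fingerprint" record.
The NodeFingerprint fields, the select_nodes() helper, the host names,
and every number below are hypothetical and made up for illustration;
none of them come from xmlbenchd or any real scheduler.

```python
from dataclasses import dataclass


@dataclass
class NodeFingerprint:
    # Hypothetical per-node record; all field names are assumptions.
    host: str
    memory_mb: int
    l2_cache_kb: int
    random_access_mb_s: float   # assumed random memory access rate
    stream_mb_s: tuple          # assumed order: (copy, scale, add, triad)


def select_nodes(nodes, n, min_memory_mb, min_l2_kb, min_random, min_stream):
    """Return the first n nodes meeting every minimum, or [] if too few qualify."""
    ok = [
        node for node in nodes
        if node.memory_mb >= min_memory_mb
        and node.l2_cache_kb >= min_l2_kb
        and node.random_access_mb_s >= min_random
        and all(have >= want for have, want in zip(node.stream_mb_s, min_stream))
    ]
    return ok[:n] if len(ok) >= n else []


# Invented sample data, not measurements.
nodes = [
    NodeFingerprint("celeron1", 512, 128, 40.0, (900.0, 880.0, 1000.0, 990.0)),
    NodeFingerprint("opteron1", 4096, 1024, 95.0, (3100.0, 3000.0, 3300.0, 3250.0)),
    NodeFingerprint("xeon1", 2048, 512, 70.0, (2100.0, 2050.0, 2300.0, 2280.0)),
]
chosen = select_nodes(nodes, 2, 1024, 512, 60.0, (2000.0, 2000.0, 2000.0, 2000.0))
print([node.host for node in chosen])
```

The request is all-or-nothing here: if fewer than N nodes qualify, the
caller gets an empty list rather than a partial allocation.  A real grid
scheduler would presumably also rank the qualifying nodes.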

The grid might contain Celerons (small cache) to Xeons and Opterons, 32
bit and 64 bit memory pathways.  Simply knowing CPU clock isn't enough
as e.g. Opterons or AMDs in general might have significantly lower
clocks than P4's or Celeries and still be superior in speed.  A typical
grid user may not have the faintest idea of how all those parameters
contribute to overall performance, but if the daemon is rigged to return
a "fingerprint" vector of results, it becomes at least possible to write
e.g. a nifty front end GUI with little slider bars or the like for users
to set capability requests.

Note well that the complexity is there, like it or not.  One can do what
is usually done and simply ignore it, or provide a tool that can produce
a projective picture of the complexity that is "sufficient" that most
programmers can find within it a metric that can be used to optimize
performance of their code in some way.

This is the other place that I see xmlbenchd as being (more) useful:
inside applications.  ATLAS's optimization design is predicated on the
existence of a hierarchy of memory latencies with close to an order of
magnitude difference between them, as well as certain CPU instructions
that speed certain orderings of operations by as much as a factor of 2-3.
Algorithms and block sizes are switched to take maximum advantage of the
PARTICULAR L2 cache size of a given architecture in its PARTICULAR
latency relationship to both registers and regular memory.

However, ATLAS's autotuning build process is really, really complex.
Because it is prepackaged, it hides all the complexity of the system
inside its gradient-searching build scripts.  I think that a perfectly
legitimate
question in computer science is whether or not this is really necessary
-- whether in particular a suitable set of projective measures exists
that can be extracted by microbenchmarks and used to "tune" a linear
algebra library at runtime rather than at build time.  By extension,
one can ask whether there are programs out there with perhaps simpler
blocking and algorithmic decisions that can be decided on the basis of
one or more of the projective measures, where it may be as simple as
choosing how to manage trigonometry in code (using sqrt() calls instead
of evaluating sin/cos/tan).
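
As a rough illustration of a runtime (rather than build time) blocking
decision, the sketch below picks a square block edge from a reported L2
cache size.  The rule "three blocks of doubles must fit in L2" is a
simplifying assumption for a blocked matrix multiply, and
block_size_for_cache() is a hypothetical helper; this is not how ATLAS
chooses its parameters.

```python
import math


def block_size_for_cache(l2_cache_bytes, element_bytes=8, working_blocks=3):
    """Largest square block edge b such that working_blocks b*b blocks fit in cache.

    Assumes working_blocks square blocks (e.g. one each from A, B and C in
    C += A*B) should be resident at once.  Real tuning must also account
    for associativity, TLB reach and other resident data.
    """
    elements_per_block = l2_cache_bytes // (element_bytes * working_blocks)
    return max(1, math.isqrt(elements_per_block))


for kb in (128, 512, 1024):
    print(kb, "KB L2 -> block edge", block_size_for_cache(kb * 1024))
```

An application could call something like this once at startup with the
cache size reported for the node it landed on, instead of baking a
#define into the build.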

It is quite harmless for a daemon to run a rather large set of
benchmarks -- for example to time all the math routines in libm (not
that I'M going to write the code for this:-) -- so that anybody writing
code can run a GUI, select both host and e.g. sin() from scrolled lists,
and have the function's timings on the PARTICULAR SYSTEM instantly
displayed.  Or (if you prefer) click a couple of things and see not only
stream, but a graph of stream as a function of vector size from vectors
of less than a page through 20 MB in length on a log scale.  I can't
help but think that this information would be really useful to many
programmers, even if they didn't actually write hooks to query the
daemons into their actual code.  My goal is to make this so simple that
it is just plain automatic -- install the rpm, boot the system, wait a
bit (for the daemon to run benchmarks in specified windows of idle time
e.g. during the first boot) and from then on one has access to the
information.
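
A minimal sketch of the "time the math routines" idea, assuming a plain
best-of-several timing loop is good enough for illustration.  The
time_function() helper is hypothetical, and because this is Python the
numbers include interpreter call overhead; a real daemon would
presumably time libm directly from C.

```python
import math
import time


def time_function(func, arg, calls=100_000, repeats=5):
    """Return the best observed time per call, in nanoseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(calls):
            func(arg)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / calls)
    return best * 1e9


# Time a handful of routines on THIS system; results vary by machine.
timings = {name: time_function(getattr(math, name), 0.5)
           for name in ("sin", "cos", "tan", "sqrt")}
for name, ns in sorted(timings.items()):
    print(f"{name:5s} {ns:8.1f} ns/call")
```

Taking the minimum over several repeats is a common way to reduce noise
from other activity on the node, which matters if the benchmarks run in
windows of idle time.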

This is what is NOT true with e.g. stream, lmbench etc today.  Just
getting lmbench is not an exercise for the faint of heart as you have to
install tools to get the tool.  Stream is better, but it certainly isn't
a prepackaged component of all distributions where you can just "yum
install stream" and have not only stream installed but run and its
results placed somewhere permanent from which they can be retrieved in a
heartbeat.  And stream isn't enough -- it only provides four projective
measures of systems performance on a single plane of the primary
relevant dimension!

> > xmlsysd already provides CPU clock, total memory, L2 cache size, total
> > and available memory, and PID snapshots of running jobs.
> 
> BeoStat provides static info: the CPU clock speed and total memory.  
> It provides dynamic info on load average (the standard three values per 
> node), CPU utilization (per processor! not per node), memory used, network 
> traffic for up to four interfaces.

xmlsysd wraps up quite a bit more information.  wulfstat's "memory"
display shows pretty much the same set of information as running "free"
on all the nodes inside a delay loop -- this can be useful when
debugging e.g. a memory leak (including the ones I had when writing
wulfstat itself -- using a tool to debug itself:-).  However, wulfstat's
default display is close to what you list, as I agree that it is the
most important information.

> I don't see L2 cache size as being useful.  Beostat provides the processor 
> type, which is marginally more useful but still largely unused.

Useful or not, it is trivial to provide, and as I argue above it SHOULD
be useful to programmers who can access the information inside
applications seeking to optimize block sizes and strides for certain
vector operations.  Whether it is useful at this moment or not may be
more related to the fact that most programmers don't have the patience
to write the code to parse the information out of /proc/cpuinfo or just
"know" what it is for the architecture of their particular cluster and
do a rebuild after altering a few #defines instead of a dynamic
optimization as a consequence.  This is fine (again) for homogeneous
environments but less good for grids, especially ones that mix
generations of hardware.
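
The parsing in question is short.  The sketch below assumes the x86
Linux /proc/cpuinfo layout, where the kernel emits a line of the form
"cache size : N KB"; other architectures may not report that field at
all, in which case the function returns None.  The helper names and the
sample text are made up for illustration.

```python
import re


def l2_cache_kb(cpuinfo_text):
    """Return the first 'cache size' value in KB, or None if the field is absent."""
    match = re.search(r"^cache size\s*:\s*(\d+)\s*KB", cpuinfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_l2_cache_kb(path="/proc/cpuinfo"):
    """Read the live value; returns None if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return l2_cache_kb(f.read())
    except OSError:
        return None


# Invented sample text in the x86 layout, not output from a real machine.
sample = (
    "model name\t: Example CPU\n"
    "cpu MHz\t\t: 1993.000\n"
    "cache size\t: 1024 KB\n"
)
print(l2_cache_kb(sample))
```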

> Nor is the PID count useful in old-style clusters.  I know that Ganglia 
> reports it, but like most Ganglia statistics it's mostly because the 
> number is easy to get, not because they know what to do with it!  (I wrote 
> the BeoStat->Ganglia translator, and consider most of their decisions as 
> being, uhmmm, ad hoc.)  The PID count *is* useful in Scyld clusters, since 
> we run mostly applications, not 20 or 30 daemons.
> 
> Not that BeoStat doesn't have its share of useless information.  
> Like reporting the available disk space -- a number which is best ignored.

I didn't mean pid count.  xmlsysd/wulfstat provides a top-like view of
running processes, with user-specifiable filters to exclude/include
unwanted/wanted processes.  The default excludes all root processes, for
example.

> > use (or will use in the case of xmlbenchd) xml tags to wrap all output,
> 
> Acckkk!  XML!  Shoot that man before he reproduces.  (too late)

I'll just spawn more clones...;-)

> > XMLified tags also FORCE one to organize and present the data
> 
> My, my, my.
> Just like Pascal/ADA/etc forces you to write structured programs?

Ah, I can see that we'll have to agree to semi-disagree here.  Yes,
Pascal sucks, partly because a structured program isn't what its
designers thought that it was.  However, ANSI C, as opposed to K&R C,
does not suck because it does indeed force one towards more structure
where it counts.

XML can certainly be used correctly or abused, and in my cynical view
the first cut at an xml encapsulation of any given data structure is
likely to be wrong just like the first cut of writing the key structs in
a C or C++ application (the "data objects") is likely to be wrong.

However, FOR ITS INTENDED PURPOSE it incorporates a particular
discipline, simply by its requirement for strict nesting of tags.  Yes,
there is nothing to stop one from loading multiple data objects inside a
single tag and forcing an end user to parse them out the hard way, and
it is sometimes not easy to see what should be a tag by itself and what
should be in an attribute, but still, a good xml encapsulation of the
kind of data I'm talking about is pretty much a 1:1 map onto a data
structure and should precisely mirror that data structure.
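
To make the "1:1 map onto a data structure" point concrete, here is a
small round trip between nested records and nested tags.  The tag names
(host, cpu_mhz, memory, total_kb, free_kb) and the values are
hypothetical and are not the actual xmlsysd output format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Memory:
    total_kb: int
    free_kb: int


@dataclass
class Host:
    name: str
    cpu_mhz: float
    memory: Memory


def host_to_xml(host):
    """Each struct becomes an element; each member becomes a child element."""
    root = ET.Element("host", name=host.name)
    ET.SubElement(root, "cpu_mhz").text = str(host.cpu_mhz)
    mem = ET.SubElement(root, "memory")
    ET.SubElement(mem, "total_kb").text = str(host.memory.total_kb)
    ET.SubElement(mem, "free_kb").text = str(host.memory.free_kb)
    return ET.tostring(root, encoding="unicode")


def host_from_xml(text):
    """Inverse of host_to_xml: the nesting of tags mirrors the nesting of structs."""
    root = ET.fromstring(text)
    mem = root.find("memory")
    return Host(
        name=root.get("name"),
        cpu_mhz=float(root.findtext("cpu_mhz")),
        memory=Memory(int(mem.findtext("total_kb")), int(mem.findtext("free_kb"))),
    )


original = Host("node01", 1993.0, Memory(4194304, 1048576))
document = host_to_xml(original)
print(document)
print(host_from_xml(document) == original)
```

Putting the host name in an attribute and the rest in child tags is one
arbitrary choice among several, which is exactly the tag-versus-attribute
judgment call mentioned above.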

> > hierarchically and extensibly.  I've tried (unsuccessfully) to convince
> > Linus to lean on the maintainers of the /proc interface to achieve some
> > small degree of hierarchical organization and uniformity in
> 
> Doh!  Don't you dare use one of my own pet peeves against me!  /proc is a 
> hodge-podge of formats, and people change them without thinking them 
> through.
> 
> I still remember the change to /proc/net/dev to add a field.
> The fact that the new field was only used for PPP, while breaking every 
> existing installation ("who uses 'ifconfig' or 'netstat'?") didn't seem to 
> deter the change, and once made it was impossible to change back.
> 
> But that still doesn't make XML the right thing.  Having a real design and 
> keeping interfaces stable until the next big design change is the answer.

Oh, I agree, but the "real design" is going to have exactly the same
hierarchical features that xml attempts to enforce or it will be more of
the same old crap.  Does one have to use xml to design a decent data
hierarchy with consistent parsing rules?  No, of course not.
/etc/passwd, /etc/group, /etc/shadow for example are a triplet of files
that are a living counterexample (although they are not autodocumenting,
which I personally think is a useful feature of xml tags).  Do even the
best programmers in the world come CLOSE to achieving hierarchical
consistency as a general rule?  They do not, as /proc clearly
demonstrates, although there are plenty of other data views in /etc that
are equally poignant examples.

The thing that is nice about xml is that it IS, like it or not, a
consistently parseable view of structured data with rules that are
intended to enforce what all programmers should be doing anyway.  It
doesn't guarantee that they will accomplish this intent, and it can be
munged.  But it isn't as EASY to produce garbage as it is with free-form
roll-your-own interfaces.

I also think that you underestimate the importance of extensibility.
One major PITA about /proc is that in the ordinary course of the
evolution of new technologies one eventually adds a new feature or
device (such as a new network) that is clearly in the "network"
hierarchy, but that has new objects that are a part of its essential
description.  How CAN one add the new data without breaking old tools?
With xml the problem doesn't even exist.  Tags that aren't parsed are
ignored, and if one's hierarchical description was halfway decent in the
first place the addition of a new kind of network can "inherit" all the
relevant old features, add new tags for the new features, and permit new
tools to be written or old ones to be modified that can use the new
information without breaking the old ones in any way.
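
A short sketch of that extensibility argument, assuming a reader that
looks tags up by name.  The interface, name, rx_bytes and
link_latency_us tags are invented for the example.

```python
import xml.etree.ElementTree as ET


def read_interface(text):
    """An 'old' tool: it knows only <name> and <rx_bytes> and ignores anything else."""
    iface = ET.fromstring(text)
    return iface.findtext("name"), int(iface.findtext("rx_bytes"))


old_style = "<interface><name>eth0</name><rx_bytes>1024</rx_bytes></interface>"
# A newer kind of network adds a tag the old tool has never heard of.
new_style = (
    "<interface><name>myri0</name><rx_bytes>2048</rx_bytes>"
    "<link_latency_us>7</link_latency_us></interface>"
)

print(read_interface(old_style))
print(read_interface(new_style))
```

The old reader keeps working on the new document because it never
depended on field position or field count, only on the names it cares
about.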

This is a problem with e.g. /etc/passwd.  Suppose one suddenly needed to
add a field.  For example, let's imagine that /etc/passwd is going to be
modified to function across an entire toplevel domain, e.g. a
University, for single sign-in purposes.  In addition to the usual
information, it now will need a field hierarchy to set access
permissions by e.g. department, and may need different shadowed
passwords per department, or different user id's per department.

It is impossible to add these to /etc/passwd now without breaking more
things than one can possibly imagine.  The only ways I can think of to
do it are to overload the one data field that doesn't have a prescribed
function (often done, actually, as a hack) or create another file
cross-referenced by e.g. user id.

If /etc/passwd were laid out in xml (or an EQUIVALENT hierarchy), it
would be trivial and would break nothing.
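
For contrast, here is what a positional reader of the classic
seven-field format looks like, and what happens when an eighth field
shows up.  The account line and the "department" field are made up, and
parse_passwd_line() is a hypothetical strict parser; more forgiving
parsers exist, but anything that counts or indexes fields has the same
basic problem.

```python
def parse_passwd_line(line):
    """Strict positional parse of the seven classic /etc/passwd fields."""
    name, password, uid, gid, gecos, home, shell = line.rstrip("\n").split(":")
    return {"name": name, "uid": int(uid), "gid": int(gid),
            "gecos": gecos, "home": home, "shell": shell}


classic = "alice:x:1000:1000:Alice Example:/home/alice:/bin/bash"
extended = classic + ":physics"   # hypothetical eighth "department" field

print(parse_passwd_line(classic)["shell"])
try:
    parse_passwd_line(extended)
except ValueError as err:
    print("old parser breaks:", err)
```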

This is very similar to the WYSIWYG vs markup debate.  There are those
who swear that WYSIWYG editors are great and permit complete idiots to
produce lovely looking professional documents.  They are, of course,
totally wrong 90% of the time -- what you actually get out of most
WYSIWYG editors is a pile of user-formatted crap that doesn't even
vaguely comply with a unified style (what size and style font should I
use here to start sections, today, hmmm:-).  Markup languages, e.g. LaTeX,
"force" a consistent hierarchical view of plain old text documents the
same way that xml CAN "force" such a view for data objects.

Now I really must go...

   rgb

> > There are several advantages to using daemons (compared to a kernel
> 
> Many people assume that much of what Scyld does is in the kernel.
> There are only a few basic mechanisms in the kernel, largely BProc, with 
> the rest implemented as user-level subsystems.  And most of those subsystems
> are partially usable in a non-Scyld system. 
> 
> The reason to use kernel features is
>    to implement security mechanisms (never policy)
>    to allow applications to run unchanged
> We use kernel hooks only for the unified process space, process migration 
> and the security mechanism.  Managing process table entries and correctly 
> forwarding signals can only work with a kernel interface.
> 
> Having the node security mechanism in the kernel allows us to implement 
> policy with unprivileged user-level libraries.  That means an application 
> can provide its own tuned scheduler function, or use a dynamic library 
> provided by the end user.
> 
> Otherwise the scheduler must run as a privileged daemon, tunable only by 
> the system administrator.  It could be worse: some process migration 
> systems put the scheduler policy in the kernel itself!
> 
> > Still, I totally agree that this is EXACTLY the kind of information that
> > needs to be available via an open standard, universal, extensible,
> > interface.
> 
> Strike the word "extensible".  That's a sure way to end up with a 
> mechanism that is complex and doesn't do anything well.
> 
> > >   An application should be able to use only a subset of provided
> > >     processors if they will not be useful (e.g. an application that uses
> > >     a regular grid might choose to use only 16 of 23 provided nodes).
> > Absolutely.  And this needs to be done in such a way that the programmer
> > doesn't have to work too hard to arrange it.  I imagine that this CAN be
> > done with e.g. PVM or some MPIs (although I'm not sure about the latter)
> > but is it easy?
> 
> It is with our BeoMPI, which runs single threaded until it hits 
> MPI_Init().  That means the application can modify its own schedule, or 
> read its configuration information before deciding it will use MPI.
> 
> One aspect of our current approach is that it requires remote_fork().  
> Scyld Beowulf already has this, but even with our system a remote fork may 
> be more expensive than just a remote exec() if the address space is dirty.
> I believe that an MPI-like library can get much of the same benefit, at 
> the cost of a little extra programming, by providing flags that are only 
> set when this is a remote or slave (non-rank-0) process.  (That 
> last line is confusing: consider the case where MPI rank 0 is supposed to 
> end up on a remote machine, with no processes left on the originating 
> machine.)
>  
> > >   There needs to be new process creation primitives.
> > >     We already have a well-tested model for this: Unix process
> > Agreed, but while dealing with this one also needs to think about
> > security.
> 
> We have ;->.
> 
> > Grids and other distributed parallel computing paradigms are
> 
> Grids have a fundamental security problem: how do you know what you are 
> running on the remote machine?  Is it the same binary?  With the same 
> libraries?  Linked in the same order?  With the same kernel?  Really the 
> same kernel, or one with changed semantics like RedHat's 
> "2.4-w/2.6-threading".  That's not even covering the malicious angle: 
> "thank you for providing me with the credentials for reading all of your 
> files".
> 
> Operationally, Grids have the problem that they must define both the 
> protocols and semantics before they can even start to work, and then there 
> will be a lifetime of backwards and forward compatibility issues.
> You won't see this at first, just like the first version of 
> Perl/Python/Java was "the first portable language".  But version skew and 
> semantic compatibility is *the* issue to deal with, not "how can we hack 
> it to do something for SCXX". 
> 
> > > > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on 
> > > > pay-per-view (read: no need for time-consuming, potentially fruitless 
> > > > attempts to get MPI implementors to agree on anything)
> > > 
> > > Hmmmm, does this show up on the sports page?
> > 
> > It's actually very interesting -- the Head Node article in CWM seemed to
> > me to be a prediction that MPI was "finished" in the sense that it is
> ...
> > "finished" in the sense of being complete and needing any further
> ..
> > "finished" in the sense that it needs such a radical rewrite
> 
> I fall into the "finished -- let's not break it by piecemeal changes" camp.  
> 
> Many "add checkpointing to MPI" and "add fault tolerance to MPI" projects
> have been funded for years.  We need a new model that handles dynamic 
> growth, with failures being just a minor aspect of the design.  I don't 
> see how MPI evolves into that new thing.
> 
> > To summarize, I think that the basic argument being advanced (correct me
> > if I'm wrong) is that there should be a whole layer of what amount to
> > meta-information tools inserted underneath message passing libraries of
> > any flavor so that (for example) the pvm console command "conf" returns
> > a number that MEANS SOMETHING for "speed", and in fact so that the pvm
> 
> Slight disagreement here: I think we need multiple subsystems that work 
> well together, rather than a single do-it-all library.  The architecture 
> for a status-and-statistics system (BeoStat for Scyld) is different than 
> for the scheduler (e.g. BeoMap), even though one may depend on the API of 
> the other.  If we put it all into one big library, it will be difficult to 
> evolve and fix.  (I'm assuming we can avoid API creep, which may not hold 
> true.)
> 
> Donald Becker
> Scyld Software
> Annapolis MD 21403			410-990-9993

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu