Load balancing and other issues

Erik Arjan Hendriks hendriks@cesdis1.gsfc.nasa.gov
Wed Aug 4 13:19:43 1999


On Wed, 4 Aug 1999, Mike Perry wrote:

> Hello! I just recently set up a 6 node beowulf cluster here at the NCSA at
> UIUC and I have a few questions/suggestions for bproc.
> 
> First off, is there any sort of API documentation or FAQ available? notes.txt
> is nice from a theoretical viewpoint, but it doesn't help much if you wanna
> write remote forking programs which take advantage of all the bells and 
> whistles, and avoid pitfalls and limitations.

There isn't much of an API at this point.  It's basically just the 4
rfork-like functions.  I'd be happy to take any questions about them.

A real-quick description of the API is as follows:
bproc_init() - reads /var/run/bproc to get info about the machine
               (should only be used on the front end, where the master is running)
bproc_numnodes() - number of nodes (not including front end)
bproc_nodeup()   - returns true if node is alive.
bproc_nodeaddr() - get IP for node number (fills in the sockaddr arg)
bproc_rexec()    - works like exec(rsh, node, cmd)
bproc_move()     - move myself to a remote node
bproc_rfork()    - ditto, except make a new child and move it.
bproc_execmove() - exec something and then move it.

The flags on all of these are the VMADUMP flags.  They control what
parts of the memory space get moved.  (safe bet = VMAD_DUMP_ALL)  A
quick sketch of how these fit together is below.
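
Something like this (hedged: the header name, node numbering from 0,
and fork()-style return values for bproc_rfork() are my assumptions
here, so check them against the source):

  /* rfork_demo.c - sketch only, not gospel. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include "bproc.h"                /* assumed header name */

  int main(void)
  {
      int node, pid;

      bproc_init();                 /* reads /var/run/bproc (front end only) */

      for (node = 0; node < bproc_numnodes(); node++) {
          if (!bproc_nodeup(node))  /* skip dead nodes */
              continue;

          /* Assuming fork()-like returns: 0 in the child, which is
             now running on node `node'.  VMAD_DUMP_ALL moves the
             whole memory space. */
          pid = bproc_rfork(node, VMAD_DUMP_ALL);
          if (pid == 0) {
              printf("hello from node %d\n", node);
              exit(0);
          } else if (pid < 0) {
              perror("bproc_rfork");
          }
      }

      while (wait(NULL) > 0)        /* reap the remote children */
          ;
      return 0;
  }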

I'm working on some nicer documentation of what is there.  (API and
other stuff.)

> Second, how hard would it be to add load balancing and a real global process
> space to bproc? This is what I was thinking:
> 
> Edit the syscall table to replace (v)fork() with our own implementation. We 
> can store the old fork with a function pointer to its entry in
> sys_call_table[] before we stick our fork in its place.

*snip*
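
For reference, the hook you're describing would look something like
this as a 2.2-era module (i386).  Treat it as a sketch: exporting
sys_call_table and fork's pt_regs calling convention are the usual
tricks, but I haven't checked the details here.

  /* forkhook.c - sketch of hooking fork in sys_call_table. */
  #include <linux/module.h>
  #include <linux/unistd.h>
  #include <asm/ptrace.h>

  extern void *sys_call_table[];

  typedef int (*fork_call_t)(struct pt_regs);
  static fork_call_t real_fork;

  asmlinkage int our_fork(struct pt_regs regs)
  {
      /* The load-balancing decision would go here: fork locally,
         or create the child on a remote node instead. */
      return real_fork(regs);
  }

  int init_module(void)
  {
      /* Save the original handler, then slide ours into its slot. */
      real_fork = (fork_call_t) sys_call_table[__NR_fork];
      sys_call_table[__NR_fork] = (void *) our_fork;
      return 0;
  }

  void cleanup_module(void)
  {
      sys_call_table[__NR_fork] = (void *) real_fork;
  }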

> As I see it, the potential gotchas are X11 forwarding, and socket & pipe
> forwarding. In an initial implementation, if the process has more than 3 fds
> (or if the first three have been dup'ed sockets or files) we could simply fork
> locally regardless of load. We'll prolly have to end up doing a check for
> local UNIX domain sockets even in a final version anyway, even if we manage
> to forward or reopen AF_INET sockets or open files.

These file issues (transparency issues, really) are a big deal.
(Bigger than they seem at first, to me anyway.)  I think what you're
proposing here is essentially Mosix.  Mosix is cool because it handles
all these things.  The reason we don't use Mosix here is that it
provides these features at an unacceptably high (IMHO) cost.

Still, if you want to replace regular fork, Mosix is the only way to
go.  Mosix (in a nutshell) runs processes remotely but performs all of
the process's syscalls on the node where the process originated
(including file IO) so that it appears nothing has changed.

If that's what you're looking for, you should give it a try.  It's
even GPL'ed now.

<Philosophical Rambling>

The way I see it, we'd like the cluster to look like a single machine,
kind of like Mosix provides, but we don't want to take the performance
hit that Mosix will incur for many (if not most) applications.

Here's one way to look at the whole thing that I like:

There are basically two extremes here.  Mosix is at one end.  PVM/MPI
distributing processes with rsh is on the other.  I believe there's a
happy medium somewhere in between.

Mosix is working its way towards the middle.  They're talking about
providing temp files, etc. on local file systems (i.e. where the
process is really running, not where it thinks it's running).  They
also talk about "migratable sockets".  (I'd really like to see how
they're going to do that.  That'll be a really good trick if they pull
it off.)

bproc starts at the PVM/MPI+rsh end and is working its way in the
other direction.  It gives you a way to move and create new processes
but makes no promises about sockets or files.  The move is not
transparent.  On the other hand, there's NO performance penalty on
file IO or network IO.

Being the author of bproc, I'm obviously biased towards my approach.
You're all entitled to make your own decisions as to whether or not
I'm some kind of nut-job.

</Philosophical Rambling>

That being said, bproc's IO forwarding does need serious work.  I've
actually cut back on it a bit in my latest development version because
it was too sketchy.  You may have noticed goofiness if you've tried to
pipe stuff around.  A replacement is in the works there.

- Erik
------------------------------------------------------------
Erik Hendriks
hendriks@cesdis.gsfc.nasa.gov