Load balancing and other issues

Mike Perry mikepery@mikepery.linuxos.org
Wed Aug 4 12:00:28 1999


Hello! I recently set up a 6-node Beowulf cluster here at the NCSA at
UIUC and I have a few questions/suggestions for bproc.

First off, is there any sort of API documentation or FAQ available? notes.txt
is nice from a theoretical viewpoint, but it doesn't help much if you wanna
write remote forking programs which take advantage of all the bells and 
whistles, and avoid pitfalls and limitations.

Second, how hard would it be to add load balancing and a real global process
space to bproc? This is what I was thinking:

Edit the syscall table to replace (v)fork() with our own implementation. We
can save the old fork by keeping a function pointer to its entry in
sys_call_table[] before we stick our fork in its place.
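
The save-and-replace idea above can be sketched as a user-space toy. This is
not kernel code: toy_call_table[] only stands in for sys_call_table[], and
the names original_fork, balanced_fork, use_remote, install_hook and
remove_hook are all made up for illustration. A real module would have to
patch the kernel's own table and deal with locking and module unload.

```c
#include <assert.h>

#define NR_FORK    2   /* fork's syscall number on i386 Linux */
#define TABLE_SIZE 8   /* toy table size, hypothetical */

typedef int (*syscall_fn)(void);

/* Stand-in for the kernel's sys_call_table[] (illustration only). */
syscall_fn toy_call_table[TABLE_SIZE];

/* Saved pointer to the original entry, so we can still call it. */
syscall_fn old_fork;

/* Hypothetical policy flag: nonzero means "send this fork elsewhere". */
int use_remote;

/* Pretend local fork: returns 1 so callers can tell which path ran. */
int original_fork(void)
{
    return 1;
}

/* Replacement entry: go remote if the policy says so, else fall
 * through to the saved original. Returns 2 for the remote path. */
int balanced_fork(void)
{
    if (use_remote)
        return 2;
    return old_fork();
}

/* Save the old entry, then put ours in its place. */
void install_hook(void)
{
    assert(toy_call_table[NR_FORK] != 0);
    old_fork = toy_call_table[NR_FORK];
    toy_call_table[NR_FORK] = balanced_fork;
}

/* Put the original entry back. */
void remove_hook(void)
{
    toy_call_table[NR_FORK] = old_fork;
}
```

Saving the old pointer before overwriting the slot is what lets the new
fork fall back to the stock behaviour, and what makes it possible to undo
the hook later.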

The new fork() would first check the local system load (via loadavg, IO
activity, and free physical memory). If the load is below a certain threshold
(say a load avg of 1.0 for UP machines and 2.0 for SMP, more than X MB of
physical memory free including buffers, and fewer than X disk interrupts/sec),
the normal old fork() is done.
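
A minimal sketch of that threshold test, assuming the statistics have
already been collected into a struct. The struct and function names are
hypothetical, and the memory and interrupt limits are left as parameters
because no concrete values are proposed above.

```c
#include <assert.h>

/* One snapshot of local load (hypothetical layout). */
struct load_sample {
    double loadavg;            /* 1-minute load average */
    long   free_mem_mb;        /* free physical memory incl. buffers */
    long   disk_irqs_per_sec;  /* disk interrupts per second */
    int    ncpus;              /* 1 = UP, >1 = SMP */
};

/* Tunable limits; the memory and interrupt values are placeholders. */
struct load_limits {
    double max_loadavg_up;     /* e.g. 1.0 */
    double max_loadavg_smp;    /* e.g. 2.0 */
    long   min_free_mem_mb;
    long   max_disk_irqs_per_sec;
};

/* Returns 1 if every measurement is under its limit, so the normal
 * local fork() should be used; returns 0 otherwise. */
int should_fork_locally(const struct load_sample *s,
                        const struct load_limits *lim)
{
    double max_load;

    assert(s != 0 && lim != 0);
    max_load = (s->ncpus > 1) ? lim->max_loadavg_smp
                              : lim->max_loadavg_up;

    return s->loadavg < max_load
        && s->free_mem_mb > lim->min_free_mem_mb
        && s->disk_irqs_per_sec < lim->max_disk_irqs_per_sec;
}
```

All three conditions have to hold for the local path; tripping any single
limit is enough to consider a remote node.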

If we exceed the threshold, we then query the master daemon for statistics on
all nodes (or even better, every X seconds the master daemon gathers stats and
distributes them to all nodes, which store them locally and perhaps make them
available via /proc?), and then do a bproc_rfork() to the lowest loaded
node.
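
Picking the target could look roughly like the following, assuming each
node keeps a local array of the stats the master daemon hands out. The
node_stat struct and pick_least_loaded() are hypothetical names, and only
the load average is compared here to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Per-node entry in the locally cached stats table (hypothetical). */
struct node_stat {
    int    node;     /* node number */
    double loadavg;  /* last load average reported for that node */
};

/* Returns the node number with the lowest load average, or -1 if the
 * table is empty. On a tie the earliest entry wins. */
int pick_least_loaded(const struct node_stat *stats, size_t n)
{
    size_t i, best = 0;

    if (n == 0)
        return -1;
    assert(stats != 0);

    for (i = 1; i < n; i++)
        if (stats[i].loadavg < stats[best].loadavg)
            best = i;

    /* The caller would hand this node to bproc_rfork(); that call is
     * left out because its exact signature is not assumed here. */
    return stats[best].node;
}
```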

If we keep the load info local, and only check it when our own load is higher
than a threshold, I think we can make distributed forking almost as fast as
standard fork().

As I see it, the potential gotchas are X11 forwarding, and socket and pipe
forwarding. In an initial implementation, if the process has more than 3 fds
open (or if any of the first three is a dup'ed socket or file), we could
simply fork locally regardless of load. We'll probably end up having to check
for local UNIX domain sockets even in a final version, even if we manage to
forward or reopen AF_INET sockets or open files.
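
The descriptor check can be approximated from user space with fcntl() and
fstat(). This is only a sketch of the rule described above: an in-kernel
version would walk the process's file table instead, and the function names
and the scan limit parameter are made up for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count open descriptors in the range [0, limit). */
int count_open_fds(int limit)
{
    int fd, n = 0;

    for (fd = 0; fd < limit; fd++)
        if (fcntl(fd, F_GETFD) != -1)  /* fails with EBADF if closed */
            n++;
    return n;
}

/* Returns 1 if fd is open and refers to a socket or a regular file. */
int fd_is_socket_or_file(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return 0;
    return S_ISSOCK(st.st_mode) || S_ISREG(st.st_mode);
}

/* Conservative rule: stay local if more than 3 descriptors are open,
 * or if stdin, stdout or stderr points at a socket or regular file. */
int must_fork_locally(int limit)
{
    int fd;

    if (count_open_fds(limit) > 3)
        return 1;
    for (fd = 0; fd < 3; fd++)
        if (fd_is_socket_or_file(fd))
            return 1;
    return 0;
}
```

Note that fstat() alone cannot tell a UNIX domain socket from an AF_INET
one, so the later check for local UNIX domain sockets would need something
more, such as getsockname().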

What do you guys think of such a system? Is something like this already
planned and/or rejected? It didn't seem to be listed in the TODO file...

In addition, the way I've described it, the whole new fork system could
conceivably be a single optional module for the master node.

P.S. I've said 'we' because I am willing to implement these systems. I'm 
not just talking out of my ass here. Well, maybe a little :)