Robert G. Brown rgb at
Fri Oct 5 08:38:06 PDT 2001

On Fri, 5 Oct 2001, Steven Timm wrote:

> > On Fri, Oct 05, 2001 at 08:28:56AM -0500, Steven Timm wrote:
> >
> > > The rsync script is a good idea and something we are thinking
> > > of implementing--only problem: how do you handle the
> > > situation when a node happens to be down during a push?
> >
> > FSL uses the same generic mechanism that I use to keep all files in
> > sync. This means that when a node boots, it syncs before it returns to
> > service. There are many files that you want to maintain in synch (like
> > /etc/hosts.allow) which don't go in NIS. I would assume that systems
> > like "cfengine" (which the sysadmin community uses to keep
> > workstations configured) also do that.
> >
> Is there some way to inhibit the sync if for some reason all
> workstations end up rebooting at once?  Also, any way to force it
> manually?

There are lots of ways.  This is a common enough administrative problem
in any reasonably large domain.  We used to use nightly cron scripts to
do a variety of maintenance on systems in the department, and some of
the script tasks would load server-shared resources (e.g. NFS, NIS and
so forth).  This was back in the 10Base days with slow disks and 4 MIPS
servers (if you were lucky).  Out of necessity, we developed ways of
distributing the times of the cron hits to avoid the logjam.

One can easily do the same thing here: put host-specific delays into
the boot scripts; put random delays into the boot scripts (which works
well enough for a few hosts, but remember that Poisson random doesn't
mean antibunched, and you want antibunched); or institute a low-overhead
antibunching handshake (ask nicely for the transfer via e.g. a simple
xinetd daemon and, if the server says no, sleep a bit and ask again).
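
As a rough illustration of the difference between the first two options,
here is a minimal Python sketch.  The function names, the node names and
the 5 second spacing are all made up for the example; nothing here is an
existing tool.  slot_delay() gives each host its own evenly spaced slot
(antibunched by construction), while hashed_delay() scatters hosts
pseudo-randomly over a window and so can still bunch.

```python
import hashlib


def slot_delay(hostname, hosts, spacing_s=5.0):
    """Host-specific delay: the i-th host (in sorted order) waits i * spacing_s.

    No two hosts share a slot, so the requests are evenly spread out.
    """
    ordered = sorted(hosts)
    return ordered.index(hostname) * spacing_s


def hashed_delay(hostname, window_s=300.0):
    """Pseudo-random delay in [0, window_s), fixed per hostname.

    Needs no host list, but two hosts can land close together (bunching).
    """
    digest = hashlib.sha256(hostname.encode()).digest()
    fraction = int.from_bytes(digest[:4], "big") / 2**32  # in [0, 1)
    return fraction * window_s


if __name__ == "__main__":
    nodes = ["node03", "node01", "node02"]  # hypothetical cluster
    for name in sorted(nodes):
        print(name, slot_delay(name, nodes), round(hashed_delay(name), 1))
```

A boot script would sleep for the returned number of seconds before
starting its rsync.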

The problem is that you have to ask and write your own scripting or
daemons after you hear the answers (all of which will work well enough
with a bit of effort) because it isn't a standard tool or method.  This
is really a re-lamentation of a longstanding problem that has often been
lamented on this list -- we still lack a lot of "standard" tools for
cluster management and this is one of them.  What we all really want is
an RPM with documents; what we've got is somewhat kludgy recipes.

I wish I could help with the former but I'm up to my ears in aquatic
reptiles and offput projects.  I'd strongly urge anyone who DOES tackle
this problem to consider doing it "right" after really thinking it out,
and turning their solution into a stable toolset.

I personally think that a GPL antibunching etc_file_xfer daemon would be
a gangbusters solution -- have the master daemon either fork a server
to service each request OR return the requester a delay (computed
from the number of pending requests and running measurements of the time
of completion); have the clients respect the delay.  That way each
server can literally service requests as fast as possible (less the
overhead of the original single queuing handshake).  It could run on top
of ssh/rsync or rsh/rsync as your security and cluster require, and
would be pretty trivial to write in either C (lowest server load) or
perl (perhaps easier to code).
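
To make the queuing handshake concrete, here is a minimal Python sketch
of just the decision logic (no sockets, no forking, no rsync).  It is not
the prototype C daemon mentioned in this post; the class name, the
max_active limit, the exponential running average and the delay formula
are assumptions chosen for illustration.  A request either gets a go-ahead
(delay 0) or a suggested delay computed from the number of deferred
requests and the running mean of transfer completion times.

```python
class AntibunchingQueue:
    """Admission control for file-transfer requests (illustrative only)."""

    def __init__(self, max_active=4, initial_estimate_s=2.0, smoothing=0.2):
        self.max_active = max_active      # transfers allowed to run at once
        self.active = 0                   # transfers currently running
        self.waiting = 0                  # clients told to come back later
        self.mean_s = initial_estimate_s  # running mean of completion time
        self.smoothing = smoothing        # weight given to each new sample

    def request(self, retry=False):
        """Return 0.0 to start now, else a delay in seconds to sleep."""
        if retry and self.waiting > 0:
            self.waiting -= 1             # this client was already counted
        if self.active < self.max_active:
            self.active += 1
            return 0.0
        self.waiting += 1
        # Rough time until a slot frees up for this client.
        return self.mean_s * self.waiting / self.max_active

    def finished(self, elapsed_s):
        """Record a completed transfer and update the running mean."""
        self.active -= 1
        self.mean_s += self.smoothing * (elapsed_s - self.mean_s)
```

A client would call request(), sleep for any nonzero delay it is handed,
and then ask again with retry=True; the server side calls finished() with
the measured transfer time when each transfer completes.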

I've got prototype code for the C daemon that could probably be hacked
into this if anybody is interested.


Robert G. Brown	             
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at
