Minimal, fast single node queue software
David Mathog
mathog at mendel.bio.caltech.edu
Mon Dec 16 12:47:02 PST 2002
I have an application which is very simple and efficient
so that the parallel compute node part can run in much less
than 1 second. And therein lies the rub - nothing I've tried
so far can handle job queues across 20 nodes nearly that fast.
Essentially all it needs to do is:
1. start the same script on all nodes for job K in queue=NAME
where each queue is a simple first in, first out scheduler.
2. Notify the master (run a script there) when all nodes
complete job K
3. Allow job K+1 to start on a node when K completes (even though
K might still be running on other nodes).
There is no load balancing required, nor time limits.
SGE had an overhead of approximately 10 seconds to handle all the
setup/tear down, for the compute and and clean up jobs. I
messed with the SGE code but couldn't drop it down below
that, mostly because some of the time functions used in the
code only kept time to the second. PVM had about 2.5 seconds
of overhead plus it required a major kludge involving lock
files and usleep in the scripts to restrict them to run sequentially
on the remote nodes. Although it was a kludge it let
job K+1 start within .02 second of job K terminating -
at the cost of a large number of processes spinning in
usleep and checking and rechecking the lock files.
Surely there's an extant queue program around that can
handle this many very short jobs well? If so, what it is it?
The best I've been able to do so far to speed up job launch
is to use a hacked version of the linux rsh command where
rsh -z node1,node2...nodeN command
runs the same command on all listed nodes and ignores all IO.
This launches commands at a rate of about 1 node/.011 second.
That's better than anything else I've tried so far - it's limited
by rcmd() and the network, and not much else. For a cluster
with a lot more nodes than ours it would probably be worth
it to modify rsh further so that it could automatically split
the target list and cause a binary tree distribution of the command.
For our 20 nodes the .055 vs. .220 seconds isn't very much,
But for 200 nodes it would be .088 vs 2.2, which is a pretty
big difference.
The fast rsh only starts the jobs quickly so I still need
at least a local queue system on each node where:
submit -q name command
will serialize the jobs.
The down side of
the -z flag is that there's no way to tell when the remote
job has completed. So a job termination integrator is also needed.
Preferably something a bit more elegant than having all
the jobs end with
rsh -z master.node touch >/tmp/jobname/`hostname`_done
and having a shell script sleep / count the jobs until the expected
number report termination.
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
More information about the Beowulf
mailing list