[Beowulf] Reliable Job Queueing and Notification
Sean Ward
SeanWard at msn.com
Tue Oct 16 07:08:14 PDT 2007
I've started work on a web service that includes several potentially
long-running processing steps (molecular dynamics), which are ideal to
farm out to the fairly large (90-node) Beowulf I have access to. The
primary issue is translating requests from the event-driven web service
into queued jobs, and back again upon completion. Specifically, the major
queuing systems I have immediate access to (Sun Grid Engine and Condor)
only support e-mail-based notification of job completion. Starting jobs
isn't an issue, as my service can simply ssh over and execute shell
scripts as needed to start things up. The problem is being reliably
informed when the jobs fail or complete, via any programmatic method
(such as executing a shell script, calling a web service via SOAP etc.,
or an asynchronous messaging library). My other problem, ensuring that
these web service requests don't starve in-house jobs on the Beowulf, is
easily handled via the priority levels built into all the various job
managers, although being able to checkpoint a long-running job would be
a plus (as Condor supports).
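
To make the kind of programmatic notification meant here concrete, a
minimal sketch follows. It assumes the submitted job script can be
wrapped, so that the wrapper runs the real command and then reports the
exit status itself. The names run_and_report and post_json, the record
fields, and the callback URL are all hypothetical; nothing here is an
API of Sun Grid Engine, Condor, or Ruby Queue.

```python
import json
import subprocess
import sys
import urllib.request


def run_and_report(command, notify):
    """Run `command` (a list of strings), then hand a record to `notify`."""
    try:
        exit_code = subprocess.run(command).returncode
    except OSError:
        # The command could not be started at all (e.g. not found).
        exit_code = 127
    record = {
        "command": command,
        "exit_code": exit_code,
        "status": "completed" if exit_code == 0 else "failed",
    }
    notify(record)
    return record


def post_json(url):
    """Build a notifier that POSTs the record as JSON to `url`.

    The URL is an assumed callback endpoint on the web service side.
    """
    def notify(record):
        request = urllib.request.Request(
            url,
            data=json.dumps(record).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=30)
    return notify


# Intended invocation: wrapper.py CALLBACK_URL COMMAND [ARGS...]
if __name__ == "__main__" and len(sys.argv) > 2:
    result = run_and_report(sys.argv[2:], post_json(sys.argv[1]))
    sys.exit(0 if result["status"] == "completed" else 1)
```

A wrapper like this only reports when it gets the chance to run to the
end. If the node dies or the wrapper itself is killed, no notification
is sent, so it would not by itself give the reliability asked about
above.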
I am currently investigating modifications to either Condor (more
complex to update, but checkpointing is useful) or Ruby Queue (very easy
to update for reliable notification) to solve this issue, but wanted to
be sure I wasn't overlooking any existing solutions for programmatically
queuing jobs and receiving notifications about them in a Beowulf
environment.
-Sean