[Beowulf] Reliable Job Queueing and Notification

Sean Ward SeanWard at msn.com
Tue Oct 16 07:08:14 PDT 2007


I've started work on a web service that contains several potentially 
long-running processing steps (molecular dynamics), which are perfect to 
farm out to the fairly large (90-node) Beowulf I have access to. The 
primary issue is translating requests from the event-driven web service 
into queued jobs, and back again upon completion. Specifically, the major 
queuing systems I have immediate access to (Sun Grid Engine and Condor) 
only support e-mail-based notification of job completion. Starting jobs 
isn't an issue, as my service can simply ssh over and execute shell 
scripts as needed to start things up. The problem is reliably being 
informed when the jobs fail or complete, via any programmatic method 
(such as executing a shell script, calling a web service via SOAP etc., 
or an asynchronous messaging library). My other problem, ensuring that 
these web service requests don't starve in-house jobs on the Beowulf, is 
easily handled via the priority levels built into all the various job 
managers, although being able to checkpoint a long-running job would be 
a plus (as Condor supports).
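
To make the kind of programmatic notification described above concrete, 
here is a minimal sketch of a wrapper that the queue would run in place 
of the real command. It runs the job, records the exit code, and hands a 
status record to a callback. Everything named here (run_and_report, 
post_json, the JSON fields, the callback URL) is hypothetical and not 
part of Sun Grid Engine, Condor, or Ruby Queue.

```python
import json
import subprocess
import sys
import urllib.request


def run_and_report(command, job_id, notify):
    """Run `command`, then pass a status record to `notify` on success or failure."""
    try:
        returncode = subprocess.run(command).returncode
    except OSError:
        # The command could not be started (missing or not executable).
        returncode = 127
    status = {
        "job_id": job_id,
        "exit_code": returncode,
        "state": "completed" if returncode == 0 else "failed",
    }
    notify(status)
    return status


def post_json(url):
    """Build a notifier that POSTs the status record as JSON to `url` (assumed endpoint)."""
    def notify(status):
        request = urllib.request.Request(
            url,
            data=json.dumps(status).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(request, timeout=30).close()
    return notify


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Usage: notify_wrapper.py JOB_ID CALLBACK_URL COMMAND [ARGS...]
    job_id, callback_url, *command = sys.argv[1:]
    result = run_and_report(command, job_id, post_json(callback_url))
    sys.exit(result["exit_code"])
```

A wrapper like this only reports when the wrapper itself gets to finish. 
If the node dies or the scheduler kills the whole process group, no 
notification is sent, so the web service would still need a timeout or 
an occasional poll of the queue as a fallback.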

I am currently investigating modifications to either Condor (more 
complex to update, but checkpointing is useful) or Ruby Queue (very easy 
to update for reliable notification) to solve this issue, but wanted to 
be sure I wasn't overlooking any existing solutions for programmatically 
queuing jobs and receiving notifications about them in a Beowulf 
environment...

-Sean


