[Beowulf] Reliable Job Queueing and Notification

Chris Dagdigian dag at sonsorol.org
Wed Oct 17 11:47:00 PDT 2007


Sean,

For what it's worth, Grid Engine (SGE) has a utility binary called
"qevent" that is not part of the official binary distribution but can
be built from the source distribution
(http://gridengine.sunsource.net). Do a Google search for
"sge + qevent" and you'll at least hit a few SGE mailing list messages
that cover what it does.

You might also want to check out the DRMAA stuff
(http://drmaa.org/wiki/) -- it is supposed to be a DRM-neutral way of
submitting jobs to a queuing system. I'm not very familiar with DRMAA
so I can't tell you offhand if the current spec includes notification
of completed events or not.

Another option that would work with SGE would be the use of
queue-level epilog scripts that execute each time a job leaves the
system for whatever reason. You can put a heck of a lot of logic and
programmable activities/notifications into a custom epilog script.
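
As a rough illustration of the epilog idea, here is a minimal Python
sketch of a script that could be configured as a queue's epilog. It
assumes that SGE exposes JOB_ID and JOB_NAME in the epilog's
environment, and it simply appends one JSON line per finished job to a
spool file that the web service could poll. The NOTIFY_SPOOL variable,
the spool file name, and the function names are all made up for this
example; a real epilog might call a web service instead.

```python
#!/usr/bin/env python
import json
import os
import sys
import tempfile
import time


def build_notification(env):
    """Collect whatever job details are present in the environment.

    JOB_ID and JOB_NAME are assumed to be set by the scheduler;
    missing values fall back to "unknown" rather than failing.
    """
    return {
        "job_id": env.get("JOB_ID", "unknown"),
        "job_name": env.get("JOB_NAME", "unknown"),
        "host": env.get("HOSTNAME", "unknown"),
        "finished_at": int(time.time()),
    }


def record(event, path):
    """Append the event to the spool file as a single JSON line."""
    with open(path, "a") as fh:
        fh.write(json.dumps(event) + "\n")


if __name__ == "__main__":
    # NOTIFY_SPOOL is a hypothetical setting, not an SGE variable.
    default_spool = os.path.join(tempfile.gettempdir(), "job_events.jsonl")
    spool = os.environ.get("NOTIFY_SPOOL", default_spool)
    try:
        record(build_notification(os.environ), spool)
    except (IOError, OSError):
        # A failed notification is ignored here so that the epilog
        # itself still exits cleanly.
        pass
    sys.exit(0)
```

The script always exits 0, on the assumption that a notification
problem should not change how the scheduler treats the job or the
queue. Note that the event only says the job left the system; it does
not say whether the job succeeded.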

A third option is the use of job dependency syntax within Grid
Engine. For each of your web service initiated tasks you would submit
two jobs -- the first job is your "worker" job. The second job is your
"notifier" job and it is submitted to SGE with a flag that says "this
job is dependent on the worker job". Once your notifier job is fired
up it can do whatever sort of results checking and notification would
be required.
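
A minimal Python sketch of the worker/notifier pairing follows. It
uses qsub's -N option to name the worker and -hold_jid to make the
notifier wait on that name. The script names, the "worker_" and
"notify_" prefixes, and the function names are placeholders invented
for this example, and submit_pair assumes qsub is on the PATH of
whatever account your web service uses.

```python
import subprocess


def build_submit_commands(worker_script, notifier_script, tag):
    """Return the two qsub command lines for one web service request.

    `tag` is a caller-chosen identifier (hypothetical) used to give the
    pair of jobs unique, related names.
    """
    worker_name = "worker_" + tag
    notifier_name = "notify_" + tag
    worker_cmd = ["qsub", "-N", worker_name, worker_script]
    # -hold_jid keeps the notifier pending until the named job is gone.
    notifier_cmd = [
        "qsub", "-N", notifier_name,
        "-hold_jid", worker_name,
        notifier_script,
    ]
    return worker_cmd, notifier_cmd


def submit_pair(worker_script, notifier_script, tag):
    """Submit the worker first, then the notifier that depends on it."""
    for cmd in build_submit_commands(worker_script, notifier_script, tag):
        subprocess.check_call(cmd)
```

Something like submit_pair("run_md.sh", "notify.sh", "req17") would
then queue both jobs. The notifier should not assume the worker
succeeded; as noted above, it has to do its own results checking
before it reports back.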

Regards,
Chris



On Oct 16, 2007, at 10:08 AM, Sean Ward wrote:

> I've started work on a web service which contains several  
> potentially long running processing steps (molecular dynamics),  
> which are perfect to farm out to the fairly large (90 node) Beowulf  
> I have access to. The primary issue is translating requests from  
> the event driven web service, to job queues, and back again upon  
> completion. Specifically, the major queuing systems I have
> immediate access to (Sun Grid Engine and Condor) only support
> e-mail based notification of job completion. Starting jobs isn't an
> issue, as my service can simply ssh over and execute shell scripts
> as needed to start things up; the problem is reliably being
> informed when the jobs fail or complete, via any programmatic
> method (such as executing a shell script, calling a web service via  
> SOAP/etc, or an asynchronous message library). My other problem,
> ensuring that these web service requests don't starve in-house jobs
> on the Beowulf, is easily handled via the priority levels built into
> all the various job managers, although being able to checkpoint a
> long running job would be a plus (such as is supported by Condor).
>
> I am currently investigating modifications to either Condor (more  
> complex to update, but checkpoint is useful) or Ruby Queue (very  
> easy to update for reliable notification) to solve this issue, but  
> wanted to be sure I wasn't overlooking any existing solutions to  
> programmatic based queuing and receiving notifications on jobs in a  
> Beowulf environment...


