[Beowulf] Reliable Job Queueing and Notification

Bernard Li bernard at vanhpc.org
Wed Oct 17 11:31:26 PDT 2007


Hi Sean:

On 10/16/07, Sean Ward <SeanWard at msn.com> wrote:

> I've started work on a web service which contains several potentially
> long running processing steps (molecular dynamics), which are perfect to
> farm out to the fairly large (90 node) Beowulf I have access to. The
> primary issue is translating requests from the event driven web service,
> to job queues, and back again upon completion. Specifically, the major
> queuing systems I have immediate access to (Sun Grid Engine and Condor)
> only support e-mail based notification of job completion. Starting jobs
> isn't an issue, as my service can simply ssh over and execute shell
> scripts as needed to start things up, the problem is reliably being
> informed when the jobs fail or complete, via any programmatic method
> (such as executing a shell script, calling a web service via SOAP/etc,
> or an asynchronous message library). My other problem, ensuring that
> these web service requests don't starve in house jobs on the Beowulf is
> easily handled via the priority levels built into all the various job
> managers, although being able to checkpoint a long running job would be
> a plus (such as is supported by Condor).
>
> I am currently investigating modifications to either Condor (more
> complex to update, but checkpoint is useful) or Ruby Queue (very easy to
> update for reliable notification) to solve this issue, but wanted to be
> sure I wasn't overlooking any existing solutions to programmatic based
> queuing and receiving notifications on jobs in a Beowulf environment...

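If you plan to stay with the SGE/Condor route, you should take a look at
DRMAA (Distributed Resource Management Application API), a standard API for
submitting jobs to a queuing system and waiting on their completion from
within a program, rather than relying on e-mail notification: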

http://drmaa.org/wiki/

Perhaps you will find something useful there.
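
For what it's worth, here is a rough sketch of how the submit-then-notify
pattern could look from Python. It assumes the drmaa-python binding and a
DRMAA library provided by your queuing system; I have not run it against
your setup. The function name run_and_notify, the on_done callback, and the
script path in the comments are made up for illustration. The session
methods used (createJobTemplate, runJob, wait, deleteJobTemplate) and the
JobInfo fields (hasExited, exitStatus) are the ones drmaa-python exposes.

```python
from typing import Any, Callable, Sequence


def run_and_notify(session: Any, command: str, args: Sequence[str],
                   on_done: Callable[[str, bool, int], None]) -> str:
    """Submit one job, block until it ends, then call on_done(job_id, ok, status).

    `session` is assumed to be an initialized drmaa.Session (or any object
    with the same methods). `on_done` is a hypothetical hook where the web
    service would be told about the result.
    """
    jt = session.createJobTemplate()
    jt.remoteCommand = command
    jt.args = list(args)
    try:
        job_id = session.runJob(jt)
    finally:
        # The template is no longer needed once the job has been submitted.
        session.deleteJobTemplate(jt)

    # DRMAA defines "wait forever" as -1; use the session's constant if present.
    forever = getattr(session, "TIMEOUT_WAIT_FOREVER", -1)
    info = session.wait(job_id, forever)

    # A job that was killed or aborted never "exited", so report it as failed.
    ok = bool(info.hasExited) and int(info.exitStatus) == 0
    status = int(info.exitStatus) if info.hasExited else -1
    on_done(job_id, ok, status)
    return job_id


# Usage against a real cluster (assumes drmaa-python is installed and the
# DRMAA shared library from the queuing system can be found):
#
#   import drmaa
#   s = drmaa.Session()
#   s.initialize()
#   run_and_notify(s, "/path/to/md_step.sh", ["input.pdb"],
#                  lambda job_id, ok, status: print(job_id, ok, status))
#   s.exit()
```

Since wait() blocks, you would probably call this from a worker thread or
a separate process per job and have on_done post back to the web service.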

Cheers,

Bernard


