[Beowulf] cluster scheduler for dynamic tree-structured jobs?

Andrew Piskorski atp at piskorski.com
Sat May 15 03:24:54 PDT 2010


Folks, I could use some advice on which cluster job scheduler (batch
queuing system) would be most appropriate for my particular needs.
I've looked through docs for SGE, Slurm, etc., but without first-hand
experience with each one it's not at all clear to me which I should
choose...

I've used Sun Grid Engine for this in the past, but the result was
very clunky and hard to maintain.  SGE seems to have all the necessary
features underneath, but no good programming API, and its command-line
tools often behave in ways that make them a poor substitute.

Here's my current list of needs/wants, starting with the ones that
probably make my use case more unusual:

1. I have lots of embarrassingly parallel tree-structured jobs which I
dynamically generate and submit from top-level user code (which
happens to be written in R).  E.g., my user code generates 10 or 100
or 1000 jobs, and each of those jobs might itself generate N jobs.
Any given job cannot complete until all its children complete.

Also, multiple users may be submitting unrelated jobs at the same
time, some of their jobs should have higher priority than others, etc.
(The usual reasons for wanting to use a cluster scheduler in the first
place, I think.)

Thus, merely assigning the individual jobs to compute nodes is not
enough; I need the cluster scheduler to also understand the tree
relationships between the jobs.  Without that, it'd be too easy to get
into a deadlock situation, where all the nodes are tied up with jobs,
none of which can complete because they are waiting for child jobs
which cannot be scheduled.
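
For concreteness, here's a minimal sketch of the pattern I'm after,
assuming a Slurm-style "afterok" dependency and using made-up script
names (child.sh, combine.sh).  The parent declares its combine step as
a separate job that the scheduler holds back until every child has
finished, so nothing sits on a node just waiting:

    # Sketch only, assuming Slurm's sbatch and its --dependency=afterok
    # syntax; child.sh and combine.sh are hypothetical job scripts.
    import subprocess

    def submit(*args):
        """Run sbatch and return the new job ID."""
        out = subprocess.run(["sbatch", *args], check=True,
                             capture_output=True, text=True)
        return out.stdout.strip().split()[-1]  # "Submitted batch job NNN"

    # Submit N independent children, then a combine step that the
    # scheduler holds until all of them have exited successfully.  The
    # "parent" just submits and exits, so it never ties up a slot while
    # waiting for its children.
    child_ids = [submit("child.sh", str(i)) for i in range(10)]
    submit("--dependency=afterok:" + ":".join(child_ids), "combine.sh")

(SGE can express the same dependencies via qsub -hold_jid; my
complaint is how awkward that is to drive from code.)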

2. Sometimes I can statically figure out the full tree structure of my
jobs ahead of time, but other times I can't or won't, so I definitely
need a scheduler that lets me submit new sub-jobs on the fly, from any
node in the cluster.
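
To make that concrete too: the same kind of submit call has to work
from inside a job that is already running on a compute node, roughly
like this (again assuming Slurm, with compute nodes allowed to call
sbatch; grandchild.sh and finish.sh are made up, and the environment
variable stands in for whatever run-time logic decides the fan-out):

    # Sketch: code like this would run *inside* a child job on a
    # compute node.
    import os
    import subprocess

    def submit(*args):
        out = subprocess.run(["sbatch", *args], check=True,
                             capture_output=True, text=True)
        return out.stdout.strip().split()[-1]

    n = int(os.environ.get("N_SUBJOBS", "4"))  # only known at run time
    ids = [submit("grandchild.sh", str(i)) for i in range(n)]

    # Queue this job's own finishing step behind the new sub-jobs, then
    # exit so the node's slot is freed while they run.
    submit("--dependency=afterok:" + ":".join(ids), "finish.sh")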

3. The jobs are ultimately all submitted by a small group of people
who talk to each other, so I don't really care about any fancy
security, cost accounting, "grid" support, or other such features
aimed at large and/or loosely coupled organizations.

4. I really, really want a good API for programmatically interacting with
the cluster scheduler and ALL of its features.  I don't care too much
what language the API is in as long as it's reasonably sane and I can
readily write glue code to interface it to my language of choice.
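
To illustrate the level of control I mean, here's a rough sketch using
the DRMAA 1.0 Python bindings (the "drmaa" module), which SGE at least
can speak; job.sh is made up, and the nativeSpecification line uses
SGE's resource syntax purely as an example.  DRMAA only covers
submission, control, and waiting, though, which is part of why I'm
asking here:

    import drmaa

    s = drmaa.Session()
    s.initialize()

    jt = s.createJobTemplate()
    jt.remoteCommand = "./job.sh"
    jt.args = ["some-input"]
    jt.nativeSpecification = "-l h_vmem=2G"  # scheduler-specific requests

    jobid = s.runJob(jt)
    info = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(jobid, "exit status:", info.exitStatus,
          "resource usage:", info.resourceUsage)

    s.deleteJobTemplate(jt)
    s.exit()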

5. Although I don't currently do any MPI programming, I would very
much like the option to do so in the future, and integrate it smoothly
with the cluster scheduler.  I assume pretty much all cluster
schedulers have that, though.  (Erlang integration might also be nice.)

6. Each of my individual leaf-node jobs will typically take roughly 3
to 30 minutes to complete, so my use shouldn't stress the scheduler's
own performance too much.  However, sometimes I screw that up and
submit tons of jobs that each run for only a small amount of time, say
2 minutes or less, so it'd be nice if the scheduler were efficient and
low-latency enough to keep up with that.

7. When I submit a job, I should be able to easily (and optionally)
give the scheduler my estimates of how much RAM and CPU time the job
will need.  The scheduler should track what resources the job ACTUALLY
uses and make it easy for me to monitor job status for both running
and completed jobs, so that I can use that information to improve my
resource estimates for future jobs.  (AKA good APIs, yet again.)
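
Roughly what I'd like to be able to write, assuming Slurm purely for
the sake of the example (job.sh is made up): pass my estimates in at
submit time, then pull the actual usage back out of the accounting
records afterwards:

    import subprocess

    # Submit with my estimates: 4096 MB of RAM, 30 minutes of run time.
    out = subprocess.run(
        ["sbatch", "--mem=4096", "--time=00:30:00", "job.sh"],
        check=True, capture_output=True, text=True)
    jobid = out.stdout.strip().split()[-1]  # "Submitted batch job NNN"

    # ... later, once the job has run, ask what it actually used ...
    usage = subprocess.run(
        ["sacct", "-j", jobid, "--parsable2",
         "--format=JobID,State,Elapsed,MaxRSS,ReqMem"],
        check=True, capture_output=True, text=True)
    print(usage.stdout)  # feed this back into the next round of estimates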

8. Of course the scheduler must have a good way to track all the basic
information about my nodes:  CPU sockets and cores, RAM, etc.  Ideally
it'd also be straightforward for me to extend the database of node
properties as I see fit.  Bonus points if it uses a good database
(e.g. SQLite, PostgreSQL) and a reasonable data model for that stuff.
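
For example, something along these lines, again assuming Slurm: pull
each node's name, CPU count, memory, and free-form "feature" tags
(which is roughly how Slurm lets you extend its node properties) into
my own code, where I could stash them in SQLite or wherever myself:

    import subprocess

    # One line per node: name|cpus|memory(MB)|features.
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N|%c|%m|%f"],
        check=True, capture_output=True, text=True)

    nodes = [dict(zip(("node", "cpus", "mem_mb", "features"),
                      line.split("|")))
             for line in out.stdout.splitlines()]
    print(nodes[:3])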

Thanks in advance for your help and advice!

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


