[Beowulf] SGE + policy

Patrice Seyed apseyed at bu.edu
Thu May 27 08:16:05 PDT 2004

Dr. Brown,

The convention I've used, as suggested by the gurus on the SGE mailing
list, is the concept of express queues, which combines a resource
assignment for "express" with subordinate queuing.

SGE is usually set up with one queue per host (if the host is dual-CPU this
needs to be modified slightly). Say your first node has one CPU and is called
node-1. Set up the usual queue called "node-1.q" with one job slot. Then
create a queue called "express-1.q" that also has one job slot, make
"node-1.q" subordinate to "express-1.q" at the 1 job level, create a
resource called "express" for the express queue, and set its soft/hard rt
(run time) limit to 2:00 (two hours).

This addresses the scenario where a user wants to submit a job that takes
less than 2 hours and all the regular queues are full. They can submit the
job with the "-l express=1" option and it will go into an express queue
belonging to one of the hosts, which suspends the long job in the regular
queue on that host until the express job is complete. What makes this work
is the hard limit of 2 hours on the express queue, which bounds how long a
regular job can stay suspended. I hope this helps.
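
In case it helps to see the mechanism spelled out, here is a toy sketch in
Python of one host with a regular queue and an express queue. It is not SGE
code and makes no SGE calls: the names Job, Host, submit_regular,
submit_express and finish_express are made up for illustration, and the
model assumes one slot per queue and that an express job exceeding the hard
limit is simply ended at the limit.

```python
from dataclasses import dataclass
from typing import Optional

EXPRESS_LIMIT_MIN = 120  # the 2 hour hard run-time limit on the express queue


@dataclass
class Job:
    name: str
    minutes: int             # how long the job wants to run
    state: str = "pending"   # pending, running, suspended, done, killed


@dataclass
class Host:
    """Hypothetical model of one node: node-N.q plus express-N.q, one slot each."""
    name: str
    regular: Optional[Job] = None
    express: Optional[Job] = None

    def submit_regular(self, job: Job) -> bool:
        if self.regular is not None:
            return False  # regular slot is full
        self.regular = job
        # The regular queue is subordinate: it is held while express is busy.
        job.state = "suspended" if self.express else "running"
        return True

    def submit_express(self, job: Job) -> bool:
        if self.express is not None:
            return False  # express slot is full
        self.express = job
        job.state = "running"
        if self.regular is not None:
            self.regular.state = "suspended"
        return True

    def finish_express(self) -> int:
        """End the express job; return the minutes the regular job was held."""
        job = self.express
        if job is None:
            return 0
        held = min(job.minutes, EXPRESS_LIMIT_MIN)  # the hard limit caps the delay
        job.state = "done" if job.minutes <= EXPRESS_LIMIT_MIN else "killed"
        self.express = None
        if self.regular is not None:
            self.regular.state = "running"
        return held


host = Host("node-1")
long_job = Job("long-run", minutes=3000)
homework = Job("homework", minutes=20)

host.submit_regular(long_job)
host.submit_express(homework)
state_during = long_job.state   # "suspended" while the express job runs
held = host.finish_express()
state_after = long_job.state    # "running" again
print(state_during, held, state_after)  # suspended 20 running
```

The point of the sketch is the return value of finish_express: the regular
job can never be held longer than the express limit, which is why the hard
limit matters.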

Regarding license management, you could do something with consumable
resources/tracking. I also think that you can use FlexLM.
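
To make the consumable resource idea concrete, here is a rough Python
sketch of the bookkeeping involved: a fixed pool of licenses is decremented
when a job starts and incremented when it ends, and a job only starts if
both a CPU and a license are free. This is only a model under my own
assumptions, not how SGE implements it; ConsumablePool and dispatch are
hypothetical names, and the strict first-in-first-out dispatch is a
simplification.

```python
from collections import deque


class ConsumablePool:
    """Hypothetical counter for a consumable resource such as matlab licenses."""

    def __init__(self, name, total):
        self.name = name
        self.total = total
        self.in_use = 0

    @property
    def free(self):
        return self.total - self.in_use

    def acquire(self, n=1):
        if n > self.free:
            return False  # not enough licenses left
        self.in_use += n
        return True

    def release(self, n=1):
        self.in_use = max(0, self.in_use - n)


def dispatch(pending, pool, free_cpus):
    """Start pending jobs in order while both a CPU and a license are free."""
    started = []
    while pending and free_cpus > 0:
        name, licenses = pending[0]
        if not pool.acquire(licenses):
            break  # head job waits for a license (strict FIFO in this toy model)
        pending.popleft()
        free_cpus -= 1
        started.append(name)
    return started


pool = ConsumablePool("matlab", total=2)       # fewer licenses than CPUs
pending = deque([("a", 1), ("b", 1), ("c", 1)])

first = dispatch(pending, pool, free_cpus=4)   # ["a", "b"]; "c" waits
pool.release(1)                                # job "a" finishes
second = dispatch(pending, pool, free_cpus=2)  # ["c"]
print(first, second, pool.free)                # ['a', 'b'] ['c'] 0
```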



-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
Behalf Of Robert G. Brown
Sent: Thursday, May 27, 2004 10:19 AM
To: Beowulf Mailing List
Subject: [Beowulf] SGE + policy

Dear Perfect Masters of Grid Computing:

Economics is preparing to set up a small pilot cluster at Duke and the
following question has come up.

Primary tasks:  matlab and stata jobs, run either interactively/remote
or (more likely) in batch mode.  Jobs include both "short" jobs that
might take 10-30 minutes run by e.g. 1st-2nd year graduate students as
part of their coursework and "long" jobs that might take hours to days
run by more advanced students, postdocs, faculty.

Constraint:  matlab requires a license managed by a license manager.
There are a finite number of licenses (currently less than the number of
CPUs) spread out across the pool of CPUs.

Concern:  That long running jobs will get into the queue (probably SGE
managed queue) and starve the short running jobs for either licenses or
CPUs or both.  Students won't be able to finish their homework in a
timely way because long running jobs de facto hog the resource once they
are given a license/CPU.

I am NOT an SGE expert, although I've played with it a bit and read a
fair bit of the documentation.  SGE appears to run either in FIFO mode,
which of course would lead to precisely the sort of resource starvation
feared, or in equal share mode.  Equal share mode appears to solve a
different
resource starvation problem -- that produced by a single user or group
saturating the queue with lots of jobs, little or big, so that others
submitting after they've loaded the queue have to wait days or weeks to
get on.  However, it doesn't seem to have anything to do with job
>>control<< according to a policy -- stopping a long running job so that
a short running job can pass through.

It seems like this would be a common problem in shared environments with
a highly mixed workload and lots of users (and indeed is the problem
addressed by e.g. the kernel scheduler in almost precisely the same
context on SMP or UP machines).  Recognizing that the license management
problem will almost certainly be beyond the scope of any solution
without some hacking and human-level policy, are there any well known
solutions to this well known problem?  Can SGE actually automagically
control jobs (stopping and starting jobs as a sort of coarse-grained
scheduler to permit high priority jobs to pass through long running low
priority jobs)?  Is there a way to solve this with job classes or
wrapper scripts that is in common use?

At your feet, your humble student waits, oh masters of SGE and Grids...


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
