[Beowulf] SGE + policy

Robert G. Brown rgb at phy.duke.edu
Thu May 27 08:03:09 PDT 2004


On Thu, 27 May 2004, Gerry Creager N5JXS wrote:

> This is really a first-cut response, with 2 visible possibilities...
> 
> 1.  Use 2 license servers, one with 'i' licenses available for short 
> jobs, and one with 'j' licenses available for longer jobs.  For i < j, 
> starvation of the short jobs shouldn't occur too often, save when 
> there's too many masters' students trying to get their projects done in 
> time to graduate and the deadline's tomorrow.
> 
> 2.  Priority queuing where short jobs have the nod, and longer jobs are 
> put aside and required to temporarily relinquish licenses.  Likely to 
> require programming resources to accomplish this one.
> 
> Good question.

Thanks for the suggestions.

The lack of even coarse grained kernel-style job control for a cluster
continues to be a source of frustration.  Load balancing queueing systems
are getting to be pretty good, but this isn't a problem in load
balanced queueing, and a kernel that used load balanced queueing as a
scheduler algorithm would be terrible.  No, wait!  It would be DOS (for
a single CPU).

With xmlsysd I have access to the data required to implement a queueing
system WITH a crude scheduler algorithm with a granularity of (say)
order minutes.  I've actually hacked out a couple of tries at a simple
script-level control system in perl (before perl got threads).  One would
expect that with threads it would be pretty easy to write a script based
scheduler that issues STOP and CONT signals to tasks on some sort of
RR/priority basis every minute.  It wouldn't deal with license
starvation, since I don't know how a running matlab task can
"temporarily relinquish a license" while it is stopped, but it would
manage the problem of being able to use a cluster for a mix of
prioritized long and short running jobs without resource-starving the
short ones.
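
A minimal sketch of the sort of script-level STOP/CONT scheduler described
above, written in Python rather than perl.  Everything named here (Job,
pick_running, apply_signals, the priority convention) is hypothetical and
made up for illustration.  It assumes the jobs are local processes known by
pid; on a real cluster the signals would have to be delivered on each node
by some agent, and the job list would come from something like the
monitoring data rather than being hard coded.

```python
import os
import signal
from dataclasses import dataclass


@dataclass
class Job:
    pid: int
    priority: int        # assumption: larger number means more important
    last_ran: int = -1   # tick at which the job last held a CPU slot


def pick_running(jobs, slots, tick):
    """Choose which jobs hold the `slots` CPUs during this tick.

    Higher priority wins.  Among jobs of equal priority, the one that
    has waited longest goes first, which gives round-robin behaviour
    within a priority level.
    """
    ranked = sorted(jobs, key=lambda j: (-j.priority, j.last_ran, j.pid))
    chosen = ranked[:slots]
    for job in chosen:
        job.last_ran = tick
    return {job.pid for job in chosen}


def apply_signals(jobs, running, kill=os.kill):
    """Send SIGCONT to the chosen jobs and SIGSTOP to all the others."""
    for job in jobs:
        sig = signal.SIGCONT if job.pid in running else signal.SIGSTOP
        try:
            kill(job.pid, sig)
        except ProcessLookupError:
            pass  # the job exited since the last tick
```

A driver would call pick_running and then apply_signals once a minute (for
example in a loop around time.sleep(60)), incrementing tick each pass.  As
noted above, a stopped job still holds whatever license it checked out, so
this only addresses CPU starvation.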

I have a personal interest in this outside of econ because I am, after
all, a bottom feeder in the cluster world.  If I could ever arrange it
so that my jobs just "got out of the way" when competing jobs were
queued on a cluster according to policy, priority, ownership etc. I
might be able to wheedle more cycles out of my friends...;-)

   rgb

> 
> Gerry
> 
> Robert G. Brown wrote:
> > Dear Perfect Masters of Grid Computing:
> > 
> > Economics is preparing to set up a small pilot cluster at Duke and the
> > following question has come up.
> > 
> > Primary tasks:  matlab and stata jobs, run either interactively/remote
> > or (more likely) in batch mode.  Jobs include both "short" jobs that
> > might take 10-30 minutes run by e.g. 1-2nd year graduate students as
> > part of their coursework and "long" jobs that might take hours to days
> > run by more advanced students, postdocs, faculty.
> > 
> > Constraint:  matlab requires a license managed by a license manager.
> > There are a finite number of licenses (currently less than the number of
> > CPUs) spread out across the pool of CPUs.
> > 
> > Concern:  That long running jobs will get into the queue (probably SGE
> > managed queue) and starve the short running jobs for either licenses or
> > CPUs or both.  Students won't be able to finish their homework in a
> > timely way because long running jobs de facto hog the resource once they
> > are given a license/CPU.
> > 
> > I am NOT an SGE expert, although I've played with it a bit and read a
> > fair bit of the documentation.  SGE appears to run in FIFO mode, which of
> > course would lead to precisely the sort of resource starvation feared, or
> > in equal share mode.  Equal share mode appears to solve a different
> > resource starvation problem -- that produced by a single user or group
> > saturating the queue with lots of jobs, little or big, so that others
> > submitting after they've loaded the queue have to wait days or weeks to
> > get on.  However, it doesn't seem to have anything to do with
> > job >>control<< according to a policy -- stopping a long running job
> > so that a short running job can pass through.
> > 
> > It seems like this would be a common problem in shared environments with
> > a highly mixed workload and lots of users (and indeed is the problem
> > addressed by e.g. the kernel scheduler in almost precisely the same
> > context on SMP or UP machines).  Recognizing that the license management
> > problem will almost certainly be beyond the scope of any solution
> > without some hacking and human-level policy, are there any well known
> > solutions to this well known problem?  Can SGE actually automagically
> > control jobs (stopping and starting jobs as a sort of coarse-grained
> > scheduler to permit high priority jobs to pass through long running low
> > priority jobs)?  Is there a way to solve this with job classes or
> > wrapper scripts that is in common use?
> > 
> > At your feet, your humble student waits, oh masters of SGE and Grids...
> > 
> >     rgb
> > 
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





