[Beowulf] scheduler policy design

Wed Apr 25 06:15:18 PDT 2007

On 25 Apr 2007, at 8:42 am, Toon Knapen wrote:

> Interesting. However this approach requires that the IO profile of  
> the application is known.

Absolutely.

> Additionally it requires the users of the application (which are  
> generally not IT guys) to know and understand this info and pass it  
> on to the scheduler when they launch their app.

Absolutely.

> In your experience, do you manage to convince real-life users to  
> provide this info?

Not easily.  :-)

And this is the problem with getting scheduling right, and exactly  
what we were saying at the beginning of this discussion.  You can't  
hope to schedule optimally if the scheduler doesn't know the profile  
of the application; the more information it has, the better the job  
it will do.  But if your users, like mine, can't or won't supply this  
information, then you're very limited in what you can achieve.  Your  
system will be vulnerable to denial of service, because strange mixes  
of jobs starting on the same machines will cause them to run out of  
various resources, and there is basically nothing you will be able to  
do about it.

The compromise we ended up with is this set of LSF queues on our  
system (a cluster with about 1500 job slots):

QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
yesterday       500  Open:Active     200   10    -    -     1     0     1     0
normal           30  Open:Active       -    -    -    -   281   110   171     0
hugemem          30  Open:Active       -    -    -    -     3     0     3     0
long              3  Open:Active       -    -    -    -  4022  2987  1035     0
basement          1  Open:Active     300  200    -    -   127     0   127     0

yesterday:

a special purpose high priority queue for the "I need it yesterday"  
crowd.  No run length limits, but very limited in terms of how many  
slots the user can use.

normal:

queue intended for shortish jobs (around 1 hour).  Absolute wall  
clock limit of 8 hours, after which jobs are killed.

long:

queue for longer jobs with an absolute wall clock limit of 24 hours.

hugemem:

special purpose queue for the two large memory SGI Altix nodes.   
Users submitting jobs to this queue *must* supply memory  
requirements; the submission is rejected if they do not.

basement:

queue for long running or low priority jobs.  No time limits, but  
can't use more than a small fraction of the total cluster.

All the queues except hugemem also have a default memory limit of 1.9  
GB; any job exceeding this limit is killed.  If the user wants to  
raise this limit they can, up to 7.9 GB, but they are then forced by  
the same mechanism as the hugemem queue to supply proper memory  
resource requirements.
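
To make the rules above concrete, here is a minimal Python sketch of
what happens to a job in each of the general-purpose queues.  It is
only a model of the policy as described in this mail, not LSF
configuration.  The function name job_outcome, the result strings,
the order of the two checks, and the rejection of limits above 7.9 GB
are all assumptions made for illustration.  The hugemem queue is left
out because it follows different rules.

```python
# Hypothetical model of the queue policy described above (not LSF config).
# Wall clock limits in hours; None means the queue has no run length limit.
WALL_CLOCK_LIMIT_HOURS = {
    "yesterday": None,
    "normal": 8,
    "long": 24,
    "basement": None,
}

DEFAULT_MEM_LIMIT_GB = 1.9  # default memory limit on these queues
MAX_MEM_LIMIT_GB = 7.9      # highest limit a user may ask for


def job_outcome(queue, run_hours, peak_mem_gb,
                mem_limit_gb=DEFAULT_MEM_LIMIT_GB):
    """Say whether a job would complete or be killed under the limits."""
    if queue not in WALL_CLOCK_LIMIT_HOURS:
        raise ValueError(f"queue not modelled here: {queue}")
    # Assumption: a request above the maximum is simply refused.
    if mem_limit_gb > MAX_MEM_LIMIT_GB:
        raise ValueError("memory limit cannot be raised above 7.9 GB")
    # Assumption: the memory check is applied before the time check.
    if peak_mem_gb > mem_limit_gb:
        return "killed: memory limit exceeded"
    limit = WALL_CLOCK_LIMIT_HOURS[queue]
    if limit is not None and run_hours > limit:
        return "killed: wall clock limit exceeded"
    return "completes"
```

For instance, job_outcome("normal", 9, 1.0) reports a wall clock kill,
while the same job submitted to "long" completes.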

Here's an example of what happens if they don't:

--- EXAMPLE ---
14:07:31 tjrc at bc-9-1-03:~$ bsub -M 6000000 uname -a
Job submission rejected.

You are specifying your own memory limit, so you must also supply
select[mem] and rusage[mem] resource requirement parameters.  For
example:

    -M2000000 -R'select[mem>2000] rusage[mem=2000]'

Remember that memory limits are set in KB, resource memory in MB.
Sorry about that.  Blame Platform.

If you do not understand what this means, read the lsfintro manpage and
the following web page:

http://www.wtgc.org/IT/ISG/lsf/lsf_intro.shtml#resources

If you still don't understand after that, contact ssg-isg(at)sanger.ac.uk

Request aborted by esub. Job not submitted.
--- EXAMPLE ---
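
The check behind that rejection can be sketched in a few lines of
Python.  This is not our actual esub script, just an illustration of
the rule it enforces.  The names memory_flags and check_submission
are hypothetical, and a real esub would have to read the submission
parameters from LSF rather than take them as function arguments.  The
helper uses the same factor of 1000 between MB and KB as the example
in the rejection message.

```python
import re


def memory_flags(mem_mb):
    """Build bsub flags for a job needing mem_mb megabytes.

    The -M limit is given in KB and the resource requirement in MB,
    matching the 2000 MB / 2000000 KB example in the message above.
    """
    mem_kb = mem_mb * 1000
    return f"-M{mem_kb} -R'select[mem>{mem_mb}] rusage[mem={mem_mb}]'"


def check_submission(mem_limit_kb, res_req):
    """Return (accepted, reason) for a submission.

    mem_limit_kb is the value given to -M, or None if the user did not
    set one.  res_req is the string given to -R, or None.
    """
    if mem_limit_kb is None:
        # No user-supplied limit, so the queue default applies.
        return True, "default memory limit applies"
    res_req = res_req or ""
    # Look for a mem term inside both the select[] and rusage[] sections.
    has_select = re.search(r"select\[[^\]]*\bmem\b", res_req) is not None
    has_rusage = re.search(r"rusage\[[^\]]*\bmem\b", res_req) is not None
    if has_select and has_rusage:
        return True, "memory limit and resource requirements supplied"
    return False, "must also supply select[mem] and rusage[mem]"
```

With this sketch, check_submission(6000000, None) is refused, as in
the transcript above, and a submission built from memory_flags(2000)
is accepted.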

All this is designed so that users who can't or won't supply detailed  
parameters to LSF can still submit work, but they are either limited  
in how many jobs they can run at once (in the yesterday and basement  
queues) or they run the risk of their job being killed if it goes  
astray and uses too much time or memory (in the normal and long  
queues).

Thus, it gives the users an incentive to understand their code and  
use the cluster carefully and responsibly.  Until we put the hard run  
limits in place, the cluster was being brought to its knees at least  
once a week by users just being careless, and that was why we  
eventually had to be somewhat more draconian.  It's worked though;  
the cluster has not had a similar DoS event since we put these rules  
in place.

Regards,

Tim