[Beowulf] scheduler policy design

Tue Apr 24 06:47:01 PDT 2007

On 24 Apr 2007, at 1:30 pm, Toon Knapen wrote:

> Tim Cutts wrote:
>
>>> but what if you have a bi-cpu bi-core machine to which you assign  
>>> 4 slots. Now one slot is being used by a process which performs  
>>> heavy IO. Suppose another process is launched that performs heavy  
>>> IO. In that case the latter process should wait until the first  
>>> one is done to avoid slowing down the efficiency of the system.  
>>> Generally however, clusters take only time and memory  
>>> requirements into account.
>> I think that varies.  LSF records the current I/O of a node as one  
>> of its load indices, so you can request a node which is doing less  
>> than a certain amount of I/O.  I imagine the same is true of SGE,  
>> but I wouldn't know.
>
>
> Indeed, using SGE you could also take this into account. However if  
> someone submits 4 jobs, the jobs do not directly start to generate  
> heavy I/O. So the scheduler might think that the 4 jobs can easily  
> coexist on this same node. However, after a few minutes all 4 jobs  
> start eating disk BW and slow the node down horribly. What would  
> your suggestion be to solve this ?

With LSF, you use resource reservation, using an rusage[] statement.
Let's say, for example, that you want to keep I/O on the node no
higher than 15 MB/sec (just for argument's sake) and you know that
your code performs I/O at 5 MB/sec.  Let's also assume that the node
can only sustain 15 MB/sec total (which is pathetic, I know, but
serves to illustrate the example).  This means you only want to start
a job if the current I/O load is no more than 10 MB/sec.  So, you
tell LSF the following:

bsub -R"select[io <= 10000] rusage[io=5000]" ...

Here is what LSF does in this case on a single machine with four
processors.  Given the conditions above, this machine would become
overloaded if LSF started four jobs on it, but it can cope with
three.  This is what happens:

Initial state:  0 jobs running, io load is 0.  reserved io is 0.

load+reserved is <= 10000, so LSF starts a job.

State:  1 job running, io load is 0, reserved io is 5000

load+reserved still <= 10000, so LSF starts another job

State:  2 jobs running, io load is 0, reserved io is 10000

load+reserved is still <= 10000, so LSF starts another job

State:  3 jobs running, io load is 0, reserved io is 15000

load+reserved is now >10000, so LSF will not start the fourth job,  
even though a processor is available, and the three currently running  
jobs haven't started performing their massive I/O yet.
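
The walkthrough above can be reproduced with a toy model.  This is
only a sketch of the admission check as described in this mail, not
LSF's actual implementation; the names (simulate, SELECT_LIMIT_KB,
RESERVE_KB, CPU_SLOTS) are made up for illustration, and the measured
I/O load is assumed to stay at 0 because the jobs have not started
their heavy I/O yet.

```python
# Toy model of the check described above; not LSF's real scheduler code.
SELECT_LIMIT_KB = 10000  # threshold from select[io <= 10000]
RESERVE_KB = 5000        # per-job reservation from rusage[io=5000]
CPU_SLOTS = 4            # four processors on the node


def simulate(num_jobs, measured_io_kb=0):
    """Try to start num_jobs jobs one after another on a single node.

    Returns (jobs_started, reserved_io, states), where states holds the
    (running, measured_io, reserved_io) tuple seen before each attempt.
    """
    running = 0
    reserved = 0
    states = []
    for _ in range(num_jobs):
        states.append((running, measured_io_kb, reserved))
        if running >= CPU_SLOTS:
            break  # no free processor
        if measured_io_kb + reserved > SELECT_LIMIT_KB:
            break  # load + reserved exceeds the select[] threshold
        running += 1
        reserved += RESERVE_KB
    return running, reserved, states


if __name__ == "__main__":
    started, reserved, states = simulate(4)
    for running, load, res in states:
        print(f"{running} jobs running, io load is {load}, reserved io is {res}")
    print(f"started {started} jobs, reserved io is {reserved}")
```

With four submitted jobs the model starts three and refuses the
fourth, matching the states listed above.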

This scheme works quite well, but has some caveats:

1)  It is still vulnerable to someone submitting an I/O intensive job  
without appropriate resource requirements (but that's back to my  
original point; if you don't give the scheduler the right  
information, it can't possibly schedule optimally).  You can always  
implement an esub rule to force people to add the appropriate  
resources (I do precisely that for memory intensive jobs, using  
exactly this technique).

2)  The syntax Platform use only works well for jobs which use a
resource throughout their life, or for a limited period at the
beginning.  If a job only uses the resource for a limited period at
the end, you *have* to reserve the resource for the entire lifetime
of the job.  This isn't optimal, but without a time machine it's hard
to do it any other way.
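
For caveat 1, the core of a submission check is just deciding whether
the resource requirement string reserves any io.  The sketch below
shows only that test; has_io_reservation is a hypothetical helper, and
how an esub obtains the string and rejects the job is site-specific
and deliberately left out.

```python
import re

# Looks for an io reservation inside an rusage[...] section,
# for example rusage[io=5000].  Hypothetical helper, not part of LSF.
_IO_RUSAGE = re.compile(r"rusage\[[^\]]*\bio\s*=\s*\d+")


def has_io_reservation(res_req):
    """Return True if the resource requirement string reserves some io."""
    return bool(_IO_RUSAGE.search(res_req))


if __name__ == "__main__":
    for req in ("select[io <= 10000] rusage[io=5000]", "select[io <= 10000]"):
        print(req, "->", has_io_reservation(req))
```

A select[] clause on its own does not count, since it only filters
nodes and reserves nothing.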

Tim.