[Beowulf] Question about fair share
skylar.thompson at gmail.com
Tue Jan 25 01:55:23 UTC 2022
On Mon, Jan 24, 2022 at 01:17:30PM -0600, Tom Harvill wrote:
> We use a 'fair share' feature of our scheduler (SLURM) and have our decay
> half-life (the time needed for priority penalty to halve) set to 30 days.
> Our maximum job runtime is 7 days. I'm wondering what others use, please
> let me know if you can spare a minute. Thank you!
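For concreteness, with a 30-day half-life a job's usage penalty still carries most of its weight across a 7-day runtime window. A minimal Python sketch of the exponential decay (illustrative only, not Slurm's exact accounting):

```python
# Exponential decay as used conceptually by fair-share schedulers:
# accumulated usage halves every `half_life` days.
def decayed_usage(usage, days_elapsed, half_life=30.0):
    """Return the usage value after `days_elapsed` days of decay."""
    return usage * 0.5 ** (days_elapsed / half_life)

# With a 30-day half-life, a penalty retains roughly 85% of its weight
# one maximum job runtime (7 days) after it was incurred:
remaining = decayed_usage(1.0, 7)  # ~0.85
```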
We're a Grid Engine shop, not SLURM, but a few years ago we significantly
reduced the weight of the fair-share policy and boosted the relative weight
of the functional policy. The problem we were having was that the
fair-share policy took a long time to adjust to sudden changes in
usage, and that determining what someone's priority would be (or should
be) based on prior usage could be pretty challenging. The functional
policy adjusts immediately based on the current workload and is much
easier for our users to comprehend.
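To give a sense of scale, the relevant knobs live in the GE scheduler configuration (edited with `qconf -msconf`); the values below are illustrative, not our exact numbers:

```
# sched_conf ticket weights (illustrative values):
weight_tickets_share      1000     # share-tree (fair-share) policy, de-emphasized
weight_tickets_functional 100000   # functional policy, dominant
```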
I'm not sure what the SLURM equivalent of the functional policy is, but
in GE it's ticket-based: accounts, projects, and "departments" (labs, in
our context) are each given some number of tickets, which are consumed
by running jobs and returned when a job finishes. By default, every job
from a single source has an equal share of tickets, but that share is
adjustable at submission time, so a user can assign a relative
importance to their own jobs.
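For example, a user can raise one job's weight relative to their other jobs with the job-share flag at submission time (the default job share is 0, i.e. equal; the value and script name here are placeholders):

```
qsub -js 10 big_analysis.sh   # give this job a higher functional share
```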
We also use the urgency policy heavily, where the resource requests
of a job influence its final priority. This lets us boost the priority for
jobs requesting hard-to-satisfy resources (lots of memory on one node,
GPUs, etc.) to avoid starving them amongst a swarm of tiny jobs.
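In GE terms, the urgency contribution comes from the urgency column of the complex attribute definitions (edited with `qconf -mc`); a sketch with an illustrative value:

```
# complex attribute definition (qconf -mc), illustrative urgency value:
#name  shortcut  type  relop  requestable  consumable  default  urgency
gpu    gpu       INT   <=     YES          YES         0        10000
```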
Scheduling policy is a really iterative process, and it took us a long
time to tweak ours to everyone's (mostly) satisfaction.