[Beowulf] Interactive vs batch, and schedulers [EXT]

Skylar Thompson skylar.thompson at gmail.com
Fri Jan 17 06:16:32 PST 2020

In the Grid Engine world, we've worked around some of the resource
fragmentation issues by assigning static sequence numbers to queue
instances (a node publishing resources to a queue) and then having the
scheduler fill nodes by sequence number rather than spreading jobs across
the cluster. This leaves some nodes free of jobs unless a really big job
comes in that requires entire nodes.
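For concreteness, the two Grid Engine knobs involved can be sketched like this (queue and host names are invented here; see the sched_conf and queue_conf man pages for the real syntax):

```shell
# In the scheduler configuration (qconf -msconf): sort queue instances
# by their sequence number instead of by load average:
#
#   queue_sort_method    seqno
#
# In the queue configuration (qconf -mq batch.q): give each queue
# instance a static per-host sequence number, so jobs pack onto the
# low-numbered hosts first and the high-numbered hosts stay drained:
#
#   seq_no    0,[node001=1],[node002=2],[node003=3]
```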

Since we're a bioinformatics shop, most of our jobs aren't parallel, though
a few job types require lots of memory (we have a handful of nodes in the
1TB-4TB RAM range). Grid Engine lets us isolate jobs from each other using
cgroups: each job's resource request (memory, CPU, etc.) is translated
directly into the corresponding limits of a per-job cgroup.
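A hand-rolled illustration of that translation (cgroup v2, which needs root; the path and job naming here are invented rather than Grid Engine's actual layout). For a job requesting 4 GiB of RAM and 2 cores:

```shell
JOB=/sys/fs/cgroup/job_1234
mkdir "$JOB"
# Hard memory cap: the OOM killer fires inside the job, not on its neighbours.
echo $((4 * 1024 * 1024 * 1024)) > "$JOB/memory.max"
# CPU quota: 200 ms of runtime per 100 ms period, i.e. two cores' worth.
echo "200000 100000" > "$JOB/cpu.max"
# Move the job's shell (and all its descendants) into the cgroup.
echo "$$" > "$JOB/cgroup.procs"
```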

On Fri, Jan 17, 2020 at 08:44:14AM +0000, Tim Cutts wrote:
>    Indeed, and you can quite easily get into a “boulders and sand”
>    scheduling problem; if you allow the small interactive jobs (the sand)
>    free access to everything, the scheduler tends to find them easy to
>    place, partially fills nodes with them, and then can’t find
>    contiguous resources large enough for the big parallel jobs (the
>    boulders), and you end up with the large batch jobs pending forever.
>    I’ve tried various approaches to this in the past, for example
>    pre-emption of large long-running jobs, but that causes resource
>    starvation (suspended jobs are still consuming virtual memory) and
>    then all sorts of issues with timeouts on TCP connections and the
>    like, these being genomics jobs with lots of not-normal-HPC
>    activities such as talking to relational databases.
>    I think you always end up having to ring-fence hardware for the large
>    parallel batch jobs, and not allow the interactive stuff on it.
>    This of course is what leads some users to favour the cloud, because it
>    appears to be infinite, and so the problem appears to go away.  But
>    let's not get into that argument here.
>    Tim
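Tim's ring-fencing can be made concrete with scheduler partitions; a hedged sketch in Slurm terms (Slurm comes up later in the thread; the node names and limits here are invented):

```shell
# Illustrative slurm.conf fragment: fence off 48 nodes for big parallel
# batch work and keep interactive jobs on the remaining 16:
#
#   PartitionName=parallel    Nodes=node[001-048] MaxTime=7-00:00:00 Default=NO
#   PartitionName=interactive Nodes=node[049-064] MaxTime=04:00:00   Default=YES
```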
>    On 16 Jan 2020, at 23:50, Alex Chekholko via Beowulf
>    <beowulf at beowulf.org> wrote:
>    Hey Jim,
>    There is an inverse relationship between latency and throughput.  Most
>    supercomputing centers aim to keep their overall utilization high, so
>    the queue always needs to be full of jobs.
>    If you can have 1000 nodes always idle and available, then your 1000
>    node jobs will usually take 10 seconds.  But your overall utilization
>    will be in the low single digit percent or worse.
>    Regards,
>    Alex
>    On Thu, Jan 16, 2020 at 3:25 PM Lux, Jim (US 337K) via Beowulf
>    <beowulf at beowulf.org> wrote:
>    Are there any references out there that discuss the tradeoffs between
>    interactive and batch scheduling (perhaps some from the 60s and 70s?) –
>    Most big HPC systems have a mix of giant jobs and smaller ones managed
>    by some process like PBS or SLURM, with queues of various sized jobs.
>    What I’m interested in is the idea of jobs that, if spread across many
>    nodes (dozens), can complete in seconds (<1 minute), providing
>    essentially “interactive” access, in the context of large jobs taking
>    days to complete.   It’s not clear to me that the current schedulers
>    can actually do this – rather, they allocate M of N nodes to a
>    particular job pulled out of a series of queues, and that job “owns”
>    the nodes until it completes.  Smaller jobs get run on (M-1) of the N
>    nodes, and presumably complete faster, so it works down through the
>    queue quicker, but ultimately, if you have a job that would take, say,
>    10 seconds on 1000 nodes, it’s going to take 20 minutes on 10 nodes.
>    Jim
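As a sanity check on Jim's last figure: 10 seconds on 1000 nodes is 10,000 node-seconds of work, which serializes onto 10 nodes as roughly a quarter of an hour ("20 minutes" in round numbers):

```shell
node_seconds=$((1000 * 10))   # 10 s of work on each of 1000 nodes
secs=$((node_seconds / 10))   # the same work spread over only 10 nodes
echo "$secs seconds, about $((secs / 60)) minutes"
# -> 1000 seconds, about 16 minutes
```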
>    --
>      _______________________________________________
>      Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>      Computing
>      To change your subscription (digest mode or unsubscribe) visit
>      https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>    -- The Wellcome Sanger Institute is operated by Genome Research
>    Limited, a charity registered in England with number 1021457 and a
>    company registered in England with number 2742969, whose registered
>    office is 215 Euston Road, London, NW1 2BE.
