[Beowulf] Interactive vs batch, and schedulers [EXT]
Skylar Thompson
skylar.thompson at gmail.com
Fri Jan 17 06:16:32 PST 2020
In the Grid Engine world, we've worked around some of the resource
fragmentation issues by assigning static sequence numbers to queue
instances (a node publishing resources to a queue) and then having the
scheduler fill nodes by sequence number rather than spreading jobs across
the cluster. This leaves some nodes free of jobs unless a really big job
comes in that requires entire nodes.
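For anyone who wants to try the same trick, the relevant knobs are the
scheduler's queue_sort_method and the per-queue-instance seq_no values.
Roughly (the host names and numbers here are illustrative, not our actual
config):

    # qconf -msconf   (scheduler configuration)
    queue_sort_method        seqno

    # qconf -mq all.q  (queue configuration, per-host sequence numbers)
    seq_no                   0,[node001=10],[node002=20],[node003=30]

With queue_sort_method set to seqno, the scheduler fills node001 first,
then node002, and so on, rather than balancing new jobs by load.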
Since we're a bioinformatics shop, most of our jobs aren't parallel, though
a few job types require lots of memory (we have a handful of nodes in the
1TB-4TB RAM range). Grid Engine lets us isolate jobs from each other using
cgroups, where a job resource request is translated directly to the
resource (memory, CPU, etc.) limits of a cgroup.
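As a rough sketch of what that mapping looks like in practice (cgroup v1
memory controller assumed; the cgroup path, script name, and the exact
resource name -- m_mem_free vs. h_vmem -- depend on the setup), a job's
memory request becomes the hard limit on its cgroup:

    # user requests 64 GB for the job (run_assembly.sh is just an example)
    qsub -l m_mem_free=64G run_assembly.sh

    # the execution host then does, in effect:
    mkdir -p /sys/fs/cgroup/memory/sge/$JOB_ID
    echo $((64 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/sge/$JOB_ID/memory.limit_in_bytes
    echo $JOB_PID > /sys/fs/cgroup/memory/sge/$JOB_ID/tasks

If a job exceeds its request it hits its own cgroup limit instead of
pushing the other jobs on the node into swap.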
On Fri, Jan 17, 2020 at 08:44:14AM +0000, Tim Cutts wrote:
> Indeed, and you can quite easily get into a "boulders and sand"
> scheduling problem; if you allow the small interactive jobs (the sand)
> free access to everything, the scheduler tends to find them easy to
> schedule, partially fills nodes with them, and then finds it can't find
> contiguous resources large enough for the big parallel jobs (the
> boulders), and you end up with the large batch jobs pending forever.
>
> I've tried various approaches to this in the past, for example
> pre-empting large, long-running jobs, but that causes resource
> starvation (suspended jobs are still consuming virtual memory) and then
> all sorts of issues with timeouts on TCP connections and so on and so
> forth, these being genomics jobs with lots of not-normal-HPC activities
> like talking to relational databases etc.
>
> I think you always end up having to ring-fence hardware for the large
> parallel batch jobs, and not allow the interactive stuff on it.
>
> This of course is what leads some users to favour the cloud, because it
> appears to be infinite, and so the problem appears to go away. But
> let's not get into that argument here.
>
> Tim
>
> On 16 Jan 2020, at 23:50, Alex Chekholko via Beowulf
> <beowulf at beowulf.org> wrote:
>
> Hey Jim,
> There is an inverse relationship between latency and throughput. Most
> supercomputing centers aim to keep their overall utilization high, so
> the queue always needs to be full of jobs.
> If you can have 1000 nodes always idle and available, then your 1000
> node jobs will usually take 10 seconds. But your overall utilization
> will be in the low single digit percent or worse.
> Regards,
> Alex
> On Thu, Jan 16, 2020 at 3:25 PM Lux, Jim (US 337K) via Beowulf
> <beowulf at beowulf.org> wrote:
>
> Are there any references out there that discuss the tradeoffs between
> interactive and batch scheduling (perhaps some from the 60s and 70s?)
>
> Most big HPC systems have a mix of giant jobs and smaller ones managed
> by some process like PBS or SLURM, with queues of various sized jobs.
>
>
> What I'm interested in is the idea of jobs that, if spread across many
> nodes (dozens), can complete in seconds (<1 minute), providing
> essentially "interactive" access, in the context of large jobs taking
> days to complete. It's not clear to me that the current schedulers
> can actually do this - rather, they allocate M of N nodes to a
> particular job pulled out of a series of queues, and that job "owns"
> the nodes until it completes. Smaller jobs get run on (M-1) of the N
> nodes, and presumably complete faster, so it works down through the
> queue quicker, but ultimately, if you have a job that would take, say,
> 10 seconds on 1000 nodes, it's going to take 20 minutes on 10 nodes.
>
>
> Jim
>
>
>
> --
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
--
Skylar