<div dir="ltr"><div>Hey Jim,</div><div><br></div>There is a trade-off between latency and throughput: the shorter the wait you want before a job starts, the more idle capacity you have to keep on hand. Most supercomputing centers aim to keep their overall utilization high, so the queue always needs to be full of jobs.<div><br></div><div>If you can keep 1000 nodes always idle and available, then your 1000-node jobs will usually start right away and finish in about 10 seconds. But the overall utilization of those nodes will be in the low single-digit percent range or worse.</div><div><br></div><div>Regards,</div><div>Alex</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 16, 2020 at 3:25 PM Lux, Jim (US 337K) via Beowulf &lt;<a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-US">
<div class="gmail-m_-7849349986852910975WordSection1">
<p class="MsoNormal"><span style="font-size:11pt">Are there any references out there that discuss the tradeoffs between interactive and batch scheduling (perhaps some from the 60s and 70s)?</span></p>
<p class="MsoNormal"><span style="font-size:11pt">Most big HPC systems have a mix of giant jobs and smaller ones managed by a scheduler like PBS or SLURM, with queues for jobs of various sizes.</span></p>
<p class="MsoNormal"><span style="font-size:11pt">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11pt">What I’m interested in is the idea of jobs that, if spread across many nodes (dozens), can complete in seconds (&lt;1 minute), providing essentially “interactive” access, in the context of large jobs taking days
to complete. It’s not clear to me that the current schedulers can actually do this – rather, they allocate M of N nodes to a particular job pulled out of a series of queues, and that job “owns” the nodes until it completes. Smaller jobs get run on (M-1)
of the N nodes, and presumably complete faster, so it works down through the queue quicker, but ultimately, if you have a job that would take, say, 10 seconds on 1000 nodes, it’s going to take 20 minutes on 10 nodes.</span></p>
<p class="MsoNormal"><span style="font-size:11pt">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11pt">Jim</span></p>
</div>
</div>
</blockquote></div>
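<div>To put rough numbers on the trade-off discussed in this thread, here is a minimal sketch. It assumes perfect linear scaling (which real jobs rarely achieve) and a made-up arrival rate of six interactive jobs per hour; the function names are hypothetical and not part of any scheduler. Under those assumptions, the 10-second, 1000-node job is 10,000 node-seconds of work, which comes to 1000 seconds (about 17 minutes, in line with the rough 20 minutes quoted above) on 10 nodes, while a 1000-node pool held idle for such jobs stays under 2% utilized.</div>

```python
# Back-of-the-envelope model of the latency/utilization trade-off.
# Assumptions (not from the thread): perfect linear scaling, and
# six interactive jobs per hour arriving at a reserved pool.


def runtime_seconds(node_seconds: float, nodes: int) -> float:
    """Wall-clock time for a job, assuming perfect linear scaling."""
    return node_seconds / nodes


def utilization(busy_node_seconds: float, nodes: int, window_seconds: float) -> float:
    """Fraction of available node-seconds actually used in a time window."""
    return busy_node_seconds / (nodes * window_seconds)


# The job from the thread: 10 seconds on 1000 nodes.
work = 10 * 1000  # node-seconds

t_1000 = runtime_seconds(work, 1000)  # wall-clock time on the full pool
t_10 = runtime_seconds(work, 10)      # wall-clock time on only 10 nodes

# Hypothetical load: six such jobs per hour on 1000 reserved nodes.
jobs_per_hour = 6
util = utilization(jobs_per_hour * work, 1000, 3600)

print(f"1000 nodes: {t_1000:.0f} s")
print(f"  10 nodes: {t_10:.0f} s ({t_10 / 60:.1f} min)")
print(f"reserved-pool utilization: {util:.1%}")
```

<div>Raising the assumed arrival rate raises utilization, but it also raises the chance that a new job finds the pool busy and has to wait, which is the same trade-off seen from the other side.</div>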