<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Yeah, this one is tricky. In general we take the wild-west
approach here, but I've had users use --contiguous and their jobs
take forever to run.</p>
<p>I suppose one method would be to enforce that each job takes a
full node and that parallel jobs always request --contiguous. As I
recall, Slurm will preferentially fill up nodes so as to leave
contiguous blocks that are as large as possible.</p>
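<p>A rough sketch of how that could look in a job script (the node
count, core count and solver name are only placeholders), combined
with OverSubscribe=EXCLUSIVE on the partition so every job gets
whole nodes:</p>
<pre>#!/bin/bash
# Whole-node, contiguous allocation for a parallel job (sketch).
#SBATCH --nodes=8              # always request full nodes
#SBATCH --exclusive            # no sharing with other jobs
#SBATCH --contiguous           # ask for a contiguous block of nodes
#SBATCH --ntasks-per-node=32   # match the cores per node

srun ./my_solver               # placeholder application
</pre>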
<p>The other option would be to use requeue to your advantage:
have a high-priority partition only for large contiguous jobs, and
let it requeue whatever other jobs it needs to in order to run.
Whether that works would depend on your single-node/single-core
users' tolerance for being requeued.</p>
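<p>Roughly, in slurm.conf terms (node and partition names are made
up, and jobs in the general partition would need to be requeueable):</p>
<pre># Sketch: a high-priority partition for large contiguous jobs that
# requeues whatever is running in the general partition.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

PartitionName=general Nodes=node[001-128] Default=YES PriorityTier=1  PreemptMode=REQUEUE
PartitionName=bigjobs Nodes=node[001-128] Default=NO  PriorityTier=10 PreemptMode=OFF
</pre>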
<p>-Paul Edmon-<br>
</p>
<br>
<div class="moz-cite-prefix">On 06/08/2018 03:55 AM, John Hearns via
Beowulf wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAPqNE2WWmffswL6VBtf1awzOjJssY0fAn7N2NvPCav=2_7Y54g@mail.gmail.com">
<div dir="ltr">
<div>Chris, good question. I can't give a direct answer there,
but let me share my experiences.</div>
<div><br>
</div>
<div>In the past I managed SGI ICE clusters and a large-memory
UV system with PBSPro queuing.</div>
<div>The engineers submitted CFD solver jobs using scripts, and we
only allowed them to use a multiple of N CPUs; in fact, there were
queues named after, let's say, 2N or 4N CPU cores. The number of
cores was cunningly arranged to fit into what SGI terms an IRU, or
what everyone else would call a blade chassis.</div>
<div>We had job exclusivity, and engineers were not allowed to
choose how many CPUs they used.</div>
<div>This is a very efficient way to run HPC, as you have a
clear view of how many jobs fit on a cluster.</div>
<div><br>
</div>
<div>Yes, before you say it: this does not cater for a mixed
workload with lots of single-CPU jobs, Matlab, Python, etc.</div>
<div><br>
</div>
<div>When the UV arrived I configured bladesets (placement sets)
so that the scheduler tried to allocate CPUs and memory from
blades adjacent to each other. Again, much better efficiency. If
I'm not wrong, you do that in Slurm by defining switches.</div>
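<div>A sketch of the Slurm equivalent (switch and node names are
made up): enable the tree topology plugin and describe the switch
hierarchy in topology.conf, so the scheduler prefers nodes that
hang off the same leaf switch.</div>
<pre># slurm.conf
TopologyPlugin=topology/tree

# topology.conf
SwitchName=leaf1 Nodes=node[001-016]
SwitchName=leaf2 Nodes=node[017-032]
SwitchName=spine Switches=leaf[1-2]
</pre>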
<div><br>
</div>
<div>When the high-core-count AMDs came along, I again configured
blade sets, and the number of CPUs per job was increased to cope
with the larger core counts, but again cunningly arranged to equal
the number of cores in a placement set (placement sets were
configured to be half, one, or two IRUs).</div>
<div><br>
</div>
<div>At another place of employment recently we had a hugely
mixed workload, ranging from interactive graphics, to Matlab-type
jobs, to multi-node CFD jobs. In addition to that, we had different
CPU generations and GPUs in the mix.</div>
<div>That setup was a lot harder to manage while keeping the
efficiency of use up, as you can imagine.</div>
<div><br>
</div>
<div>I agree with you about the overlapping partitions. If I were
to arrange things in my ideal world, I would have a set of the
latest-generation CPUs, using the latest-generation interconnect,
and reserve them for 'job exclusive' jobs, i.e. parallel jobs, and
leave the other nodes exclusively for one-node or one-core jobs.</div>
<div>Then have some mechanism to grow/shrink the partitions.</div>
<div><br>
</div>
<div>One thing again which I found difficult in my last job was
users 'hard-wiring' the number of CPUs they use. In fact I have
seen that quite often on other projects.</div>
<div>What happens is that a new PhD student, postdoc or new
engineer is gifted a job submission script by someone who is
leaving or moving on.</div>
<div>The new person doesn't really understand why (say) six nodes
with eight CPU cores are requested.</div>
<div>But (a) they just want to get on and do the job, and (b) they
are scared of breaking things by altering the script.</div>
<div>So the number of CPUs doesn't change, and with the latest
generation of 20-plus cores on a node you get wasted cores.</div>
<div>Also, having mixed generations of CPUs with different core
counts does not help here.<br>
</div>
<div><br>
</div>
<div>Yes, I know we as HPC admins can easily adjust job scripts to
mpirun with N equal to the number of cores on a node (etc.).</div>
<div>In fact, when I have worked with users and shown them how
to do this, it has been a source of satisfaction to me.<br>
</div>
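<div>For instance, a sketch of that kind of fix (the solver name is
a placeholder, and it assumes the allocated nodes are homogeneous):
request whole nodes and derive the MPI rank count from what Slurm
actually granted, instead of hard-wiring "six nodes of eight
cores".</div>
<pre>#!/bin/bash
#SBATCH --nodes=4
#SBATCH --exclusive
# Work out the rank count from the allocation rather than using a
# hard-coded number, so the script survives moving to fatter nodes.
NP=$(( SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE ))
mpirun -np "$NP" ./solver
</pre>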
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 8 June 2018 at 09:21, Chris Samuel <span
dir="ltr"><<a href="mailto:chris@csamuel.org"
target="_blank" moz-do-not-send="true">chris@csamuel.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
I'm curious to know what/how/where/if sites do to try and
reduce the impact of <br>
fragmentation of resources by small/narrow jobs on systems
where you also have <br>
to cope with large/wide parallel jobs?<br>
<br>
For my purposes a small/narrow job is anything that will fit
on one node <br>
(whether a single core job, multi-threaded or MPI).<br>
<br>
One thing we're considering is to use overlapping partitions
in Slurm to have <br>
a subset of nodes that are available to these types of jobs
and then have <br>
large parallel jobs use a partition that can access any
node.<br>
<br>
This has the added benefit of letting us set a higher
priority on that <br>
partition to let Slurm try and place those jobs first,
before smaller ones.<br>
<br>
We're already using a similar scheme for GPU jobs where they
get put into a <br>
partition that can access all 36 cores on a node whereas
non-GPU jobs get put <br>
into a partition that can only access 32 cores on a node, so
effectively we <br>
reserve 4 cores a node for GPU jobs.<br>
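<br>
(A sketch of that scheme in slurm.conf, with placeholder node
names; MaxCPUsPerNode on the non-GPU partition is what keeps 4
cores per node free for the GPU partition:)<br>
<pre>PartitionName=gpu Nodes=gpunode[01-10] MaxCPUsPerNode=36
PartitionName=cpu Nodes=gpunode[01-10] MaxCPUsPerNode=32 Default=YES
</pre>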
<br>
But really I'm curious to know what people do about this, or
do you not worry <br>
about it at all and just let the scheduler do its best?<br>
<br>
All the best,<br>
Chris<br>
<span class="HOEnZb"><font color="#888888">-- <br>
Chris Samuel : <a href="http://www.csamuel.org/"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://www.csamuel.org/</a> :
Melbourne, VIC<br>
<br>
______________________________<wbr>_________________<br>
Beowulf mailing list, <a
href="mailto:Beowulf@beowulf.org"
moz-do-not-send="true">Beowulf@beowulf.org</a>
sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe)
visit <a
href="http://www.beowulf.org/mailman/listinfo/beowulf"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://www.beowulf.org/<wbr>mailman/listinfo/beowulf</a><br>
</font></span></blockquote>
</div>
<br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Beowulf mailing list, <a class="moz-txt-link-abbreviated" href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit <a class="moz-txt-link-freetext" href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</a>
</pre>
</blockquote>
<br>
</body>
</html>