<div dir="ltr"><div>Chris, good question. I can't give a direct asnwer there, but let me share my experiences.</div><div><br></div><div>In the past I managed SGI ICE clusters and a large memory UV system with PBSPro queuing.</div><div>The engineers submitted CFD solver jobs using scripts, and we only allowed them to use a multiple of N cpus,</div><div>in fact there were queues named after lets say 2N or 4N cpu cores. The number of cores were cunningly arranged to fit into <br></div><div>what SGI term an IRU, or everyone else would call a blade chassis.</div><div>We had job exclusivity, and engineers were not allowed to choose how many CPUs they used.</div><div>This is a very efficient way to run HPC - as you have a clear view of how many jobs fit on a cluster.</div><div><br></div><div>Yes, before you say it this does not cater for the mixed workload with lots of single CPU jobs, Matlab, Python etc....</div><div><br></div><div>When the UV arrived I configured bladesets (placement sets) such that the scheduler tried to allocate CPUs and memory from blades</div><div>adjacent to each other. Again much better efficiency. If I'm not wrong you do that in Slurm by defining switches.</div><div><br></div><div>When the high core count AMDs came along again I configured blade sets and the number of CPUs per job was increased to cope with</div><div>larger core count CPUs but again cunningly arranged to equal the number of cores in a placement set (placement sets were configured to be</div><div>half, full or two IRUs)</div><div><br></div><div>At another place of employment recently we had a hugely mixed workload, ranging from interactive graphics, to the Matlab type jobs,</div><div>to multinode CFD jobs. In addition to that we had different CPU generations and GPUs in the mix.</div><div>That setup was a lot harder to manage and keep up the efficiency of use, as you can imagine.</div><div><br></div><div>I agree with you about the overlapping partitions. If I was to arrange things in my ideal worls, I would have a set of the latest generation CPUs</div><div>using the latest generation interconnect and reserve them for 'job exclusive' jobs - ie parallel jobs, and leave other nodes exclusively for</div><div>one node or one core jobs.</div><div>Then have some mechanism to grow/shrink the partitions.</div><div><br></div><div><br></div><div><br></div><div><br></div><div>Ont thing again which I found difficult in my last job was users 'hard wiring' the number of CPUs they use. 
One thing again which I found difficult in my last job was users 'hard-wiring' the number of CPUs they use. In fact I have seen that quite often on other projects. What happens is that a new PhD student or postdoc or new engineer is gifted a job submission script from someone who is leaving, or moving on. The new person doesn't really understand why (say) six nodes with eight CPU cores are requested. But (a) they just want to get on and do the job, and (b) they are scared of breaking things by altering the script. So the number of CPUs doesn't change, and with the latest generation of 20-plus cores on a node you get wasted cores. Also, having mixed generations of CPUs with different core counts does not help here.

Yes, I know we as HPC admins can easily adjust job scripts to mpirun with N equal to the number of cores on a node (etc.). In fact, when I have worked with users and shown them how to do this it has been a source of satisfaction to me.
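To make that concrete, here is a minimal sketch of the sort of batch script I'd hand them, assuming Slurm and an MPI solver (the solver name and resource numbers are placeholders):

    #!/bin/bash
    #SBATCH --job-name=cfd_case
    #SBATCH --nodes=6            # ask for whole nodes...
    #SBATCH --exclusive          # ...not a hard-wired core count
    #SBATCH --time=24:00:00

    # One MPI rank per CPU actually present on the allocated nodes.
    # (SLURM_CPUS_ON_NODE counts hardware threads if hyperthreading is on,
    #  and this assumes all nodes in the job are of the same type.)
    NP=$(( SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE ))
    mpirun -np "${NP}" ./my_cfd_solver input.case

The point is that the script only encodes "how many nodes", not "how many cores per node", so it keeps the cores busy when it lands on a fatter node.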
On 8 June 2018 at 09:21, Chris Samuel <chris@csamuel.org> wrote:

> Hi all,
>
> I'm curious to know what/how/where/if sites do to try and reduce the impact of
> fragmentation of resources by small/narrow jobs on systems where you also have
> to cope with large/wide parallel jobs?
>
> For my purposes a small/narrow job is anything that will fit on one node
> (whether a single core job, multi-threaded or MPI).
>
> One thing we're considering is to use overlapping partitions in Slurm to have
> a subset of nodes that are available to these types of jobs and then have
> large parallel jobs use a partition that can access any node.
>
> This has the added benefit of letting us set a higher priority on that
> partition to let Slurm try and place those jobs first, before smaller ones.
>
> We're already using a similar scheme for GPU jobs where they get put into a
> partition that can access all 36 cores on a node whereas non-GPU jobs get put
> into a partition that can only access 32 cores on a node, so effectively we
> reserve 4 cores a node for GPU jobs.
>
> But really I'm curious to know what people do about this, or do you not worry
> about it at all and just let the scheduler do its best?
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf