[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

John Hearns hearnsj at googlemail.com
Fri Jun 8 00:55:19 PDT 2018


Chris, good question. I can't give a direct answer there, but let me share
my experiences.

In the past I managed SGI ICE clusters and a large-memory UV system with
PBSPro queueing.
The engineers submitted CFD solver jobs using scripts, and we only allowed
them to use a multiple of N CPUs; in fact there were queues named after,
let's say, 2N or 4N CPU cores. The number of cores was cunningly arranged
to fit into what SGI terms an IRU, or what everyone else would call a
blade chassis.
We had job exclusivity, and engineers were not allowed to choose how many
CPUs they used.
This is a very efficient way to run HPC, as you have a clear view of how
many jobs fit on the cluster.
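
For illustration, a submission in that scheme might look roughly like the
sketch below (the queue name, node size and solver are all made up; it
assumes 16-core blades and a chassis of four blades):

    #!/bin/bash
    # Hypothetical fixed-size CFD job: fills one chassis (4 blades x 16
    # cores), exclusive use, no choice of CPU count. Queue name and solver
    # binary are invented for illustration.
    #PBS -q cfd_64
    #PBS -l select=4:ncpus=16:mpiprocs=16
    #PBS -l place=scatter:excl

    cd $PBS_O_WORKDIR
    mpirun -np 64 ./cfd_solver case.def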

Yes, before you say it: this does not cater for the mixed workload with
lots of single-CPU jobs, Matlab, Python etc....

When the UV arrived I configured bladesets (placement sets) such that the
scheduler tried to allocate CPUs and memory from blades adjacent to each
other. Again, much better efficiency. If I'm not wrong, you do the
equivalent in Slurm by defining switches.
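
A minimal sketch of that in Slurm, using the topology/tree plugin, might
look like the following (switch and node names are invented):

    # slurm.conf
    TopologyPlugin=topology/tree

    # topology.conf -- roughly one leaf switch per chassis
    SwitchName=leaf1 Nodes=node[001-018]
    SwitchName=leaf2 Nodes=node[019-036]
    SwitchName=spine Switches=leaf[1-2]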

When the high-core-count AMDs came along I again configured blade sets,
and the number of CPUs per job was increased to cope with the larger core
counts, but again cunningly arranged to equal the number of cores in a
placement set (placement sets were configured to be half, one or two
IRUs).

At another place of employment recently we had a hugely mixed workload,
ranging from interactive graphics, to the Matlab-type jobs, to multi-node
CFD jobs. In addition to that we had different CPU generations and GPUs in
the mix.
That setup was a lot harder to manage while keeping utilisation high, as
you can imagine.

I agree with you about the overlapping partitions. If I were to arrange
things in my ideal world, I would have a set of the latest-generation CPUs
using the latest-generation interconnect and reserve them for 'job
exclusive' jobs - i.e. parallel jobs - and leave the other nodes
exclusively for single-node or single-core jobs.
Then have some mechanism to grow/shrink the partitions.
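
In Slurm terms that might be sketched roughly as below (node and partition
names are invented, and the grow/shrink step is just one way it could be
done):

    # slurm.conf sketch: newest nodes reserved for whole-node parallel
    # work, older nodes for single-node / single-core jobs.
    PartitionName=parallel Nodes=gen9[001-064] OverSubscribe=EXCLUSIVE PriorityTier=10
    PartitionName=serial   Nodes=gen7[001-128] Default=YES

    # Growing/shrinking could then be scripted with something like:
    #   scontrol update PartitionName=parallel Nodes=gen9[001-064],gen7[101-128]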




One thing again which I found difficult in my last job was users 'hard
wiring' the number of CPUs they use. In fact I have seen that quite often
on other projects.
What happens is that a new PhD, postdoc or new engineer is gifted a job
submission script from someone who is leaving, or moving on.
The new person doesn't really understand why (say) six nodes with eight
CPU cores are requested.
But (a) they just want to get on and do the job, and (b) they are scared
of breaking things by altering the script.
So the number of CPUs doesn't change, and with the latest generation of
20-plus cores on a node you get wasted cores.
Also, having mixed generations of CPUs with different core counts does not
help here.

Yes, I know we as HPC admins can easily adjust job scripts to mpirun with
N equal to the number of cores on a node (etc.).
In fact, when I have worked with users and shown them how to do this it
has been a source of satisfaction to me.
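
For example, a Slurm batch script along these lines (a sketch; the solver
name is a placeholder, and it assumes the allocated nodes are identical)
picks the task count up from the allocation instead of hard-wiring it:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --exclusive

    # Size the MPI run from what was actually allocated, rather than a
    # number copied from an old script. Assumes homogeneous nodes.
    NP=$(( SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE ))
    mpirun -np $NP ./solver input.dat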

On 8 June 2018 at 09:21, Chris Samuel <chris at csamuel.org> wrote:

> Hi all,
>
> I'm curious to know what/how/where/if sites do to try and reduce the
> impact of
> fragmentation of resources by small/narrow jobs on systems where you also
> have
> to cope with large/wide parallel jobs?
>
> For my purposes a small/narrow job is anything that will fit on one node
> (whether a single core job, multi-threaded or MPI).
>
> One thing we're considering is to use overlapping partitions in Slurm to
> have
> a subset of nodes that are available to these types of jobs and then have
> large parallel jobs use a partition that can access any node.
>
> This has the added benefit of letting us set a higher priority on that
> partition to let Slurm try and place those jobs first, before smaller ones.
>
> We're already using a similar scheme for GPU jobs where they get put into
> a
> partition that can access all 36 cores on a node whereas non-GPU jobs get
> put
> into a partition that can only access 32 cores on a node, so effectively
> we
> reserve 4 cores a node for GPU jobs.
>
> But really I'm curious to know what people do about this, or do you not
> worry
> about it at all and just let the scheduler do its best?
>
> All the best,
> Chris
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

