<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Yeah, this one is tricky. In general we take the wild-west
approach here, but I've had users use --contiguous and their jobs
take forever to run.</p>
<p>I suppose one method would be to enforce that each job takes a
full node and that parallel jobs always request --contiguous. As I
recall, Slurm will preferentially fill up nodes so as to leave
contiguous blocks that are as large as possible.</p>
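<p>A rough sketch of how that could look in a job script (the node
count, core count and solver name are only placeholders), combined
with OverSubscribe=EXCLUSIVE on the partition so every job gets
whole nodes:</p>
<pre>#!/bin/bash
# Whole-node, contiguous allocation for a parallel job (sketch).
#SBATCH --nodes=8              # always request full nodes
#SBATCH --exclusive            # no sharing with other jobs
#SBATCH --contiguous           # ask for a contiguous block of nodes
#SBATCH --ntasks-per-node=32   # match the cores per node

srun ./my_solver               # placeholder application
</pre>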
<p>The other option would be to use requeue to your advantage:
have a high-priority partition only for large contiguous jobs, and
let it requeue whatever other jobs it needs to in order to run.
Whether that works would depend on your single-node/single-core
users' tolerance for being requeued.</p>
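<p>Roughly, in slurm.conf terms (node and partition names are made
up, and jobs in the general partition would need to be requeueable):</p>
<pre># Sketch: a high-priority partition for large contiguous jobs that
# requeues whatever is running in the general partition.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

PartitionName=general Nodes=node[001-128] Default=YES PriorityTier=1  PreemptMode=REQUEUE
PartitionName=bigjobs Nodes=node[001-128] Default=NO  PriorityTier=10 PreemptMode=OFF
</pre>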
<p>-Paul Edmon-<br>
</p>
<br>
<div class="moz-cite-prefix">On 06/08/2018 03:55 AM, John Hearns via
Beowulf wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAPqNE2WWmffswL6VBtf1awzOjJssY0fAn7N2NvPCav=2_7Y54g@mail.gmail.com">
<div dir="ltr">
<div>Chris, good question. I can't give a direct answer there,
but let me share my experiences.</div>
<div><br>
</div>
<div>In the past I managed SGI ICE clusters and a large-memory
UV system with PBSPro queuing.</div>
<div>The engineers submitted CFD solver jobs using scripts, and we
only allowed them to use a multiple of N CPUs; in fact, there were
queues named after, let's say, 2N or 4N CPU cores. The number of
cores was cunningly arranged to fit into what SGI terms an IRU, or
what everyone else would call a blade chassis.</div>
<div>We had job exclusivity, and engineers were not allowed to
choose how many CPUs they used.</div>
<div>This is a very efficient way to run HPC, as you have a
clear view of how many jobs fit on a cluster.</div>
<div><br>
</div>
<div>Yes, before you say it: this does not cater for a mixed
workload with lots of single-CPU jobs, Matlab, Python, etc.</div>
<div><br>
</div>
<div>When the UV arrived I configured bladesets (placement sets)
so that the scheduler tried to allocate CPUs and memory from
blades adjacent to each other. Again, much better efficiency. If
I'm not wrong, you do that in Slurm by defining switches.</div>
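<div>A sketch of the Slurm equivalent (switch and node names are
made up): enable the tree topology plugin and describe the switch
hierarchy in topology.conf, so the scheduler prefers nodes that
hang off the same leaf switch.</div>
<pre># slurm.conf
TopologyPlugin=topology/tree

# topology.conf
SwitchName=leaf1 Nodes=node[001-016]
SwitchName=leaf2 Nodes=node[017-032]
SwitchName=spine Switches=leaf[1-2]
</pre>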
<div><br>
</div>
<div>When the high-core-count AMDs came along, I again configured
blade sets, and the number of CPUs per job was increased to cope
with the larger core counts, but again cunningly arranged to equal
the number of cores in a placement set (placement sets were
configured to be half, one, or two IRUs).</div>
<div><br>
</div>
<div>At another place of employment recently we had a hugely
mixed workload, ranging from interactive graphics, to Matlab-type
jobs, to multi-node CFD jobs. In addition to that, we had different
CPU generations and GPUs in the mix.</div>
<div>That setup was a lot harder to manage while keeping the
efficiency of use up, as you can imagine.</div>
<div><br>
</div>
<div>I agree with you about the overlapping partitions. If I were
to arrange things in my ideal world, I would have a set of the
latest-generation CPUs, using the latest-generation interconnect,
and reserve them for 'job exclusive' jobs, i.e. parallel jobs, and
leave the other nodes exclusively for one-node or one-core jobs.</div>
<div>Then have some mechanism to grow/shrink the partitions.</div>
<div><br>
</div>
<div>One thing again which I found difficult in my last job was
users 'hard-wiring' the number of CPUs they use. In fact I have
seen that quite often on other projects.</div>
<div>What happens is that a new PhD student, postdoc or new
engineer is gifted a job submission script by someone who is
leaving or moving on.</div>
<div>The new person doesn't really understand why (say) six nodes
with eight CPU cores are requested.</div>
<div>But (a) they just want to get on and do the job, and (b) they
are scared of breaking things by altering the script.</div>
<div>So the number of CPUs doesn't change, and with the latest
generation of 20-plus cores on a node you get wasted cores.</div>
<div>Also, having mixed generations of CPUs with different core
counts does not help here.<br>
</div>
<div><br>
</div>
<div>Yes, I know we as HPC admins can easily adjust job scripts to
mpirun with N equal to the number of cores on a node (etc.).</div>
<div>In fact, when I have worked with users and shown them how
to do this, it has been a source of satisfaction to me.<br>
</div>
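<div>For instance, a sketch of that kind of fix (the solver name is
a placeholder, and it assumes the allocated nodes are homogeneous):
request whole nodes and derive the MPI rank count from what Slurm
actually granted, instead of hard-wiring "six nodes of eight
cores".</div>
<pre>#!/bin/bash
#SBATCH --nodes=4
#SBATCH --exclusive
# Work out the rank count from the allocation rather than using a
# hard-coded number, so the script survives moving to fatter nodes.
NP=$(( SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE ))
mpirun -np "$NP" ./solver
</pre>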
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 8 June 2018 at 09:21, Chris Samuel <span
dir="ltr"><<a href="mailto:chris@csamuel.org"
target="_blank" moz-do-not-send="true">chris@csamuel.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
I'm curious to know what/how/where/if sites do to try and
reduce the impact of <br>
fragmentation of resources by small/narrow jobs on systems
where you also have <br>
to cope with large/wide parallel jobs?<br>
<br>
For my purposes a small/narrow job is anything that will fit
on one node <br>
(whether a single core job, multi-threaded or MPI).<br>
<br>
One thing we're considering is to use overlapping partitions
in Slurm to have <br>
a subset of nodes that are available to these types of jobs
and then have <br>
large parallel jobs use a partition that can access any
node.<br>
<br>
This has the added benefit of letting us set a higher
priority on that <br>
partition to let Slurm try and place those jobs first,
before smaller ones.<br>
<br>
We're already using a similar scheme for GPU jobs where they
get put into a <br>
partition that can access all 36 cores on a node whereas
non-GPU jobs get put <br>
into a partition that can only access 32 cores on a node, so
effectively we <br>
reserve 4 cores a node for GPU jobs.<br>
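<br>
(A sketch of that scheme in slurm.conf, with placeholder node
names; MaxCPUsPerNode on the non-GPU partition is what keeps 4
cores per node free for the GPU partition:)<br>
<pre>PartitionName=gpu Nodes=gpunode[01-10] MaxCPUsPerNode=36
PartitionName=cpu Nodes=gpunode[01-10] MaxCPUsPerNode=32 Default=YES
</pre>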
<br>
But really I'm curious to know what people do about this, or
do you not worry <br>
about it at all and just let the scheduler do its best?<br>
<br>
All the best,<br>
Chris<br>
<span class="HOEnZb"><font color="#888888">-- <br>
Chris Samuel : <a href="http://www.csamuel.org/"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://www.csamuel.org/</a> :
Melbourne, VIC<br>
<br>
______________________________<wbr>_________________<br>
Beowulf mailing list, <a
href="mailto:Beowulf@beowulf.org"
moz-do-not-send="true">Beowulf@beowulf.org</a>
sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe)
visit <a
href="http://www.beowulf.org/mailman/listinfo/beowulf"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://www.beowulf.org/<wbr>mailman/listinfo/beowulf</a><br>
</font></span></blockquote>
</div>
<br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Beowulf mailing list, <a class="moz-txt-link-abbreviated" href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit <a class="moz-txt-link-freetext" href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</a>
</pre>
</blockquote>
<br>
</body>
</html>