[Beowulf] Hyperthreading and 'OS jitter'

Christopher Samuel samuel at unimelb.edu.au
Tue Aug 1 22:36:49 PDT 2017


On 02/08/17 13:37, Evan Burness wrote:

> Thanks for the history lessons, Chris! Very interesting indeed.

My pleasure. To add to the history, here's a paper from the APAC'05
conference 12 years ago that details how the then APAC (now NCI) set up
their SGI Altix cluster, including a discussion of cpusets.

http://www.kev.pulo.com.au/publications/apac05/apac05-apacnf-altix.pdf

Also includes an interesting section on dealing with SGI's proprietary
MPI stack and the problems it caused them.

> Would be interesting to take it a step further and measure what the
> impacts (good, bad, or otherwise) of picking a specific core on a given
> CPU uArch layout for the OS.

Sadly I was hoping that document would give some indication of the
benefits of reducing jitter via cpusets, but it does not.

I'd be very interested to hear what people have found there - I do know
that Slurm allows you to reserve cores to generic resources like GPUs so
that an administrator can enforce that only certain cores can access
that resource (say the cores closest to a GPU).

https://slurm.schedmd.com/gres.html
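
For working out which cores are "closest" to a device in the first place,
here is a minimal Python sketch. It assumes a Linux node where sysfs
exposes a local_cpulist file for each PCI device. The function name
local_cores and the sysfs_root parameter are made up for illustration, and
this is not a description of how Slurm itself determines locality.

```python
import os


def local_cores(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the set of core ids the kernel reports as local to a PCI device.

    Assumption: <sysfs_root>/<pci_addr>/local_cpulist exists and holds a
    list such as "0-3,8". sysfs_root is a parameter only so the function
    can be pointed at a test directory.
    """
    path = os.path.join(sysfs_root, pci_addr, "local_cpulist")
    with open(path) as f:
        text = f.read().strip()

    cores = set()
    for part in text.split(","):
        if not part:
            continue
        # "4" is a single core, "0-3" is an inclusive range.
        lo, _, hi = part.partition("-")
        cores.update(range(int(lo), int(hi or lo) + 1))
    return cores
```

Called with the PCI address of a GPU (as shown by lspci), this returns the
cores you might then list against that device in gres.conf.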

It also supports "core specialisation" which is nebulously explained as:

https://slurm.schedmd.com/core_spec.html

# Core specialization is a feature designed to isolate system overhead
# (system interrupts, etc.) to designated cores on a compute node. This
# can reduce applications interrupts ranks to improve completion time.
# The job will be charged for all allocated cores, but will not be able
# to directly use the specialized cores.

Usefully, there is a PDF from the 2014 Slurm User Group which goes into
more detail about it, and includes references to work done by Cray and
others on the problems caused by jitter and the benefits of reducing it.

https://slurm.schedmd.com/SUG14/process_isolation.pdf

From that description it appears to only put the Slurm daemons for jobs
into the group, but of course there would be nothing to stop you having
a start up script that moved any other existing processes onto that core
first via their own cgroup.
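
A minimal sketch of such a start-up script in Python, under some loud
assumptions. It uses per-task CPU affinity (os.sched_setaffinity, Linux
only) instead of a cgroup, so it is a simpler stand-in for the cgroup
approach: unlike a cpuset, a process can change its own affinity again
later. The helper names list_pids and confine are hypothetical, and which
core is the specialised one is up to you.

```python
import os


def list_pids(proc_root="/proc"):
    """Return the PIDs currently visible under proc_root (numeric dirs)."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


def confine(pids, cores):
    """Try to pin each PID to `cores`; return the PIDs that were moved.

    Note: sched_setaffinity acts on a single task, so this only moves the
    main thread of each process; existing extra threads are not touched.
    """
    moved = []
    for pid in pids:
        try:
            os.sched_setaffinity(pid, cores)
        except OSError:
            # Covers processes that have exited, lack of permission, and
            # per-CPU kernel threads that refuse a new mask.
            continue
        moved.append(pid)
    return moved
```

Run as root before any jobs start, confine(list_pids(), {0}) would attempt
to push everything already running onto core 0, assuming core 0 is the one
you have set aside.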

Shame that Bull's test was too small to show any benefit!

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545


