[Beowulf] Most common cluster management software, job schedulers, etc?
olli-pekka.lehto at csc.fi
Thu Mar 10 00:33:51 PST 2016
Really interesting to see what stacks people are using!
----- Original Message -----
> From: "Jeff Friedman" <jeff.friedman at siliconmechanics.com>
> To: beowulf at beowulf.org
> Sent: Tuesday, 8 March, 2016 06:43:59
> Subject: [Beowulf] Most common cluster management software, job schedulers, etc?
> Hello all. I am just entering the HPC Sales Engineering role, and would like to
> focus my learning on the most relevant stuff. I have searched near and far for
> a current survey of some sort listing the top used “stacks”, but cannot seem to
> find one that is free. I was breaking things down similar to this:
> OS distro : CentOS, Debian, TOSS, etc? I know some come trimmed down, and also
> include specific HPC libraries, like CNL, CNK, INK?
CentOS. It (along with RHEL) has the best coverage for driver support (InfiniBand,
Lustre/GPFS, GPU, Xeon Phi) and ISV code compatibility. If that were not an issue,
I'd go with Debian.
> MPI options : MPICH2, MVAPICH2, Open MPI, Intel MPI, ?
Intel MPI, Open MPI, MVAPICH2.
It's good to have at least two stacks installed: if one flakes out with a bug,
it's straightforward to try the secondary one.
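As a sketch of what that fallback looks like in practice (module names here are
hypothetical and assume an environment-modules setup with both stacks installed):

```shell
# Build and run against the primary MPI stack.
module purge
module load openmpi        # hypothetical module name
mpicc -o hello hello.c
mpirun -np 4 ./hello

# If the run trips an MPI bug, rebuild and retry against the
# secondary stack -- same source, different module.
module purge
module load mvapich2       # hypothetical module name
mpicc -o hello hello.c
mpirun -np 4 ./hello
```

Rebuilding matters: you generally can't mix a binary linked against one MPI with
the runtime of another.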
> Provisioning software : Cobbler, Warewulf, xCAT, Openstack, Platform HPC, ?
> Configuration management : Warewulf, Puppet, Chef, Ansible, ?
We're using Warewulf, but moving towards having it as a simple provisioner and
using Ansible for configuration management. We're piloting this in a new project.
Lots of playbooks are available from
our GitHub: https://github.com/CSC-IT-Center-for-Science/fgci-ansible (YMMV)
We have a pretty big "general IT" server and cloud infrastructure, so using
non-HPC-specific config management will hopefully create some synergies.
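To give a flavor of the approach, here is a minimal playbook fragment for a
compute-node role (group, package, and service names are illustrative, not taken
from the fgci-ansible repo):

```yaml
---
- hosts: compute            # hypothetical inventory group of compute nodes
  become: yes
  tasks:
    - name: Ensure chrony is installed
      yum:
        name: chrony
        state: present
    - name: Ensure chronyd is running and enabled at boot
      service:
        name: chronyd
        state: started
        enabled: yes
```

Running `ansible-playbook -i inventory site.yml --check` first gives a dry run,
which is handy before touching a few hundred nodes.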
> Resource and job schedulers : I think these are basically the same thing?
> Torque, Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid Engine, Univa,
> Platform LSF, etc… others?
Moved everything to SLURM a few years back and haven't looked back :) Support
from SchedMD has been good.
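For anyone evaluating it, a minimal SLURM batch script is pleasantly terse
(the partition name below is made up; sites define their own):

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute      # hypothetical partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:10:00

# srun launches the MPI ranks; no separate machinefile needed.
srun ./hello
```

Submit with `sbatch hello.sh` and watch it with `squeue -u $USER`.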
> Shared filesystems : NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ?
BeeGFS seems to be gaining a lot of traction in the small-to-medium cluster
space, and it was recently open-sourced. We use self-supported Lustre, and Ceph
for cloud. It will be interesting to see how Ceph evolves in the
high-performance space.
> Library management : Lmod, ?
> Performance monitoring : Ganglia, Nagios, ?
- Collectd/Graphite/Grafana for system/infra metrics
- Nagios, with OpsView as the Nagios GUI (might move to Icinga or Sensu at some point)
- ELK for log analytics
- OpenXDMoD for queue monitoring (looking at using SupreMM at the moment)
- Allinea Performance Reports for per-job analysis
> Cluster management toolkits : I believe these perform many of the functions
> above, all wrapped up in one tool? Rocks, Oscar, Scyld, Bright, ?
> Does anyone have any observations as to which of the above are the most common?
> Or is that too broad? I believe most of the clusters I will be involved with will
> be in the 128 - 2000 core range, all on commodity hardware.
> Thank you!
> - Jeff
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing