[Beowulf] Most common cluster management software, job schedulers, etc?

Olli-Pekka Lehto olli-pekka.lehto at csc.fi
Thu Mar 10 00:33:51 PST 2016

Really interesting to see what stacks people are using! 

----- Original Message -----
> From: "Jeff Friedman" <jeff.friedman at siliconmechanics.com>
> To: beowulf at beowulf.org
> Sent: Tuesday, 8 March, 2016 06:43:59
> Subject: [Beowulf] Most common cluster management software, job schedulers, etc?

> Hello all. I am just entering the HPC Sales Engineering role, and would like to
> focus my learning on the most relevant stuff. I have searched near and far for
> a current survey of some sort listing the top used “stacks”, but cannot seem to
> find one that is free. I was breaking things down similar to this:

> OS distro : CentOS, Debian, TOSS, etc? I know some come trimmed down, and also
> include specific HPC libraries, like CNL, CNK, INK?

CentOS. It (and RHEL) has the best coverage for driver support (InfiniBand, 
Lustre/GPFS, GPU, Xeon Phi) and ISV code compatibility. If that were not an issue,
I'd go with Debian. 

> MPI options : MPICH2, MVAPICH2, Open MPI, Intel MPI, ?


It's good to have at least two stacks installed: if one flakes out with a 
bug, it's straightforward to try the "secondary" one. 
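As a rough sketch of what that fallback looks like in practice (module and application names here are hypothetical and site-specific, not from any particular cluster), switching stacks can be as simple as swapping environment modules in the batch script:

```shell
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# Primary stack (module names are site-specific; adjust to your cluster)
module purge
module load openmpi

# If the primary stack hits a bug, comment the lines above and
# uncomment the secondary stack instead:
# module purge
# module load mvapich2

# Rebuild or relink the application against the loaded stack, then launch
srun ./my_mpi_app
```

Since both stacks implement the same MPI standard, the application source is unchanged; only the module load (and a rebuild) differs.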

> Provisioning software : Cobbler, Warewulf, xCAT, Openstack, Platform HPC, ?

> Configuration management : Warewulf, Puppet, Chef, Ansible, ?

We're using Warewulf, but are moving towards keeping it as a simple provisioner and 
doing configuration management with Ansible. We're piloting this in a new project. 
Lots of playbooks are available from our GitHub: 
https://github.com/CSC-IT-Center-for-Science/fgci-ansible (YMMV)
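As a minimal sketch of the approach (the host group, package, and service names below are hypothetical, not taken from the FGCI playbooks), an Ansible playbook for compute nodes looks like:

```yaml
# site.yml -- hypothetical example, not from the FGCI repository
- hosts: compute
  become: true
  tasks:
    - name: Install the munge authentication daemon
      yum:
        name: munge
        state: present

    - name: Ensure munge is running and enabled at boot
      service:
        name: munge
        state: started
        enabled: true
```

The appeal for HPC is that the same playbook runs against a freshly provisioned node or an already-running one, so provisioning (Warewulf) and ongoing configuration (Ansible) stay cleanly separated.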

We have a pretty big "general IT" server and cloud infrastructure, so using a
non-HPC-specific config management tool will hopefully create some synergies. 

> Resource and job schedulers : I think these are basically the same thing?
> Torque, Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid Engine, Univa,
> Platform LSF, etc… others?

We moved everything to SLURM a few years back and are not looking back :) Support
from SchedMD has been good. 

> Shared filesystems : NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ?

BeeGFS seems to be gaining a lot of traction in the small-to-medium cluster space; 
it was also recently open-sourced. We use self-supported Lustre, and Ceph for cloud storage.

It will be interesting to see how Ceph evolves in the high-performance space.  

> Library management : Lmod, ?


> Performance monitoring : Ganglia, Nagios, ?

- Collectd/Graphite/Grafana for system/infra metrics
- Nagios, with OpsView as the Nagios GUI (might move to Icinga or Sensu at some point)
- ELK for log analytics
- Open XDMoD for queue monitoring (looking at using SUPReMM at the moment)
- Allinea Performance Reports for per-job analysis

> Cluster management toolkits : I believe these perform many of the functions
> above, all wrapped up in one tool? Rocks, Oscar, Scyld, Bright, ?

> Does anyone have any observations as to which of the above are the most common?
> Or is that too broad? I believe most of the clusters I will be involved with will
> be in the 128 - 2000 core range, all on commodity hardware.

> Thank you!

> - Jeff

> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
