[Beowulf] Most common cluster management software, job schedulers, etc?
jeff.white at wsu.edu
Wed Mar 9 09:18:21 PST 2016
I'll throw in my $0.02 since I might be an oddball with how I build
On 03/07/2016 08:43 PM, Jeff Friedman wrote:
> Hello all. I am just entering the HPC Sales Engineering role, and
> would like to focus my learning on the most relevant stuff. I have
> searched near and far for a current survey of some sort listing the
> top used “stacks”, but cannot seem to find one that is free. I was
> breaking things down similar to this:
> _OS distro_: CentOS, Debian, TOSS, etc? I know some come trimmed
> down, and also include specific HPC libraries, like CNL, CNK, INK?
CentOS 7. In fact, the base OS for each of my nodes is created with just:
yum groups install "Compute Node" --releasever=7
... which is currently in ZFS and exported via NFSv4.
> _MPI options_: MPICH2, MVAPICH2, Open MPI, Intel MPI, ?
All of the above (pretty much whatever our users want us to install).
> _Provisioning software_: Cobbler, Warewulf, xCAT, Openstack, Platform
> HPC, ?
We started with xCAT but moved away for various reasons. Provisioning is
done without this type of management software in my cluster. I have a
simple Python script to configure a new node's DHCP, PXE boot file, and
NFS export (each node has its own writable root filesystem served to it
via NFS). It's designed to be as simple an answer to "how can I PXE
boot CentOS?" as I could get.
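
For illustration only, here is a rough sketch of what a script like that
could generate. This is not the actual script: the function names, paths,
addresses, and NFS export options are all made up, and it assumes ISC dhcpd,
PXELINUX, and a dracut-style root=nfs4: kernel argument.

```python
from ipaddress import ip_address


def dhcp_host_stanza(name, mac, ip):
    """Return an ISC dhcpd 'host' entry pinning a MAC to a fixed address."""
    ip_address(ip)  # raises ValueError if the address is malformed
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f'  filename "pxelinux.0";\n'
        f"}}\n"
    )


def pxelinux_filename(mac):
    """PXELINUX looks up pxelinux.cfg/01-<mac>, lowercase, dash-separated."""
    return "01-" + mac.lower().replace(":", "-")


def pxelinux_config(nfs_server, root_path):
    """Return a boot entry that mounts the node's root filesystem over NFS."""
    # Assumption: kernel and initrd file names, and the dracut nfs4 root syntax.
    return (
        "DEFAULT centos\n"
        "LABEL centos\n"
        "  KERNEL vmlinuz\n"
        f"  APPEND initrd=initrd.img root=nfs4:{nfs_server}:{root_path} rw ip=dhcp\n"
    )


def nfs_export_line(root_path, ip):
    """Return an /etc/exports line giving one node its own writable root."""
    # Assumption: these export options are illustrative, not from the post.
    return f"{root_path} {ip}(rw,no_root_squash,sync)\n"


def render_node(name, mac, ip, nfs_server, export_base="/export/nodes"):
    """Collect the three config fragments for one new node, keyed by purpose."""
    root_path = f"{export_base}/{name}"
    return {
        "dhcp": dhcp_host_stanza(name, mac, ip),
        "pxe_file": "pxelinux.cfg/" + pxelinux_filename(mac),
        "pxe": pxelinux_config(nfs_server, root_path),
        "export": nfs_export_line(root_path, ip),
    }
```

Calling render_node("n01", "AA:BB:CC:00:11:22", "10.0.0.5", "10.0.0.1")
returns the three fragments as strings; writing them into dhcpd.conf, the
TFTP directory, and /etc/exports (and reloading the services) is left out.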
> _Configuration management_: Warewulf, Puppet, Chef, Ansible, ?
SaltStack! This is what does the heavy lifting. Nodes boot with a very
generic CentOS image which only has 1 significant change from stock: a
Salt minion is installed. After a node boots, Salt takes over and
installs software, mounts remote filesystems, cooks dinner, starts
daemons, brings each node into the scheduler, etc. I don't maintain
"node images"; I maintain Salt states that do all the work after a node
boots.
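
To give a flavor of that approach, below is a minimal sketch of a Salt state
written for Salt's "py" renderer, which lets a state file be plain Python
that returns the state data. The state IDs, the package name, the mount
point, and the NFS device are hypothetical and not taken from the cluster
described here; most Salt states are written in YAML, and Python is used
only to keep the examples in one language.

```python
#!py
# Salt "py" renderer sketch: Salt calls run() and applies the returned states.


def run():
    """Return state data: mount a share, install a package, start a daemon."""
    return {
        # Hypothetical remote filesystem mounted after the node boots.
        "scratch_mount": {
            "mount.mounted": [
                {"name": "/scratch"},
                {"device": "nfs-server:/export/scratch"},
                {"fstype": "nfs"},
                {"mkmnt": True},
            ]
        },
        # Hypothetical package name for the scheduler's node daemon.
        "slurm_pkg": {"pkg.installed": [{"name": "slurm"}]},
        # Start the daemon only once the package and the mount are in place.
        "slurmd": {
            "service.running": [
                {"enable": True},
                {"require": [{"pkg": "slurm_pkg"}, {"mount": "scratch_mount"}]},
            ]
        },
    }
```

The require entries are what let one generic image converge in the right
order: the daemon is not started until its package and filesystem exist.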
> _Resource and job schedulers_: I think these are basically the same
> thing? Torque, Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid
> Engine, Univa, Platform LSF, etc… others?
We briefly used Torque+MOAB before running away crying. We now use SLURM.
> _Shared filesystems_: NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ?
NFS (others in the future, we're looking at Ceph at the moment).
> _Library management_: Lmod, ?
> _Performance monitoring_: Ganglia, Nagios, ?
Ganglia and in the near future, Zabbix.
> _Cluster management toolkits_: I believe these perform many of the
> functions above, all wrapped up in one tool? Rocks, Oscar, Scyld,
> Bright, ?
> Does anyone have any observations as to which of the above are the
> most common? Or is that too broad? I believe most of the clusters I
> will be involved with will be in the 128 - 2000 core range, all on
> commodity hardware.
> Thank you!
> - Jeff