[Beowulf] cluster deployment and config management

Christopher Samuel samuel at unimelb.edu.au
Tue Sep 5 16:51:46 PDT 2017


On 05/09/17 15:24, Stu Midgley wrote:

> I am in the process of redeveloping our cluster deployment and config
> management environment and wondered what others are doing?

xCAT here for all HPC-related infrastructure.  Stateful installs for
GPFS NSD servers and TSM servers; compute nodes are all statelite, so
an immutable RAMdisk image is built on the management node for the
compute cluster, and on boot the nodes mount various items over NFS
(including the GPFS state directory).
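
For the curious, the build-and-boot cycle looks roughly like this;
the image name, node group and litefile entry below are illustrative,
not our actual configuration:

    # litefile table: keep the GPFS state directory writable and
    # persistent over NFS (one entry among several)
    #   image,file,options
    #   "compute","/var/mmfs/","persistent"

    # Build the RAMdisk image on the management node, apply the
    # statelite overlays, then point the nodes at it and reboot:
    genimage centos7-x86_64-statelite-compute
    liteimg centos7-x86_64-statelite-compute
    nodeset compute osimage=centos7-x86_64-statelite-compute
    rpower compute reset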

Nothing like your scale, of course, but it works, and we know that if
a node has booted a particular image it will be identical to any other
node set to boot the same image.

Healthcheck scripts mark nodes offline if they aren't running the
current production kernel and GPFS versions (among other checks, of
course), and Slurm's "scontrol reboot" lets us do rolling reboots
without needing to spot when nodes have become idle.
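
The healthcheck side is nothing fancy; conceptually it's along these
lines (the expected kernel and GPFS package strings are placeholders,
not our real values):

    #!/bin/bash
    # Drain this node in Slurm if it isn't running the expected
    # production kernel or GPFS release.
    EXPECTED_KERNEL="3.10.0-514.26.2.el7.x86_64"   # placeholder
    EXPECTED_GPFS="gpfs.base-4.2.3-4"              # placeholder

    fail() {
        scontrol update nodename="$(hostname -s)" state=drain reason="$1"
        exit 1
    }

    [ "$(uname -r)" = "$EXPECTED_KERNEL" ] \
        || fail "healthcheck: wrong kernel $(uname -r)"

    rpm -q "$EXPECTED_GPFS" >/dev/null \
        || fail "healthcheck: wrong GPFS version"

and a rolling reboot of a whole set of nodes is then just:

    # nodes reboot as they become idle
    scontrol reboot compute[001-100]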

I've got to say I really prefer this to systems like Puppet, Salt,
etc., where you need to go and tweak an image after installation.

For our VM infrastructure (web servers, etc.) we do use Salt. We used
to use Puppet, but we switched when the only person who understood it
left.  Don't miss it at all...

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545


