[Beowulf] cluster deployment and config management

Bill Broadley bill at cse.ucdavis.edu
Wed Sep 6 02:46:19 PDT 2017


On 09/05/2017 07:14 PM, Stu Midgley wrote:
> I'm not feeling much love for puppet.

I'm pretty fond of puppet for managing clusters.  We use cobbler to go from PXE
boot -> installed, then puppet takes over.
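
A rough sketch of that handoff, assuming a cobbler kickstart template and a
puppet master at the placeholder name puppet.example.com; the %post section
bootstraps the agent so the node checks in on first boot:

  %post
  # Install the agent and point it at the (hypothetical) puppet master.
  yum -y install puppet-agent
  /opt/puppetlabs/bin/puppet config set server puppet.example.com --section agent
  # Enable the agent service; the first catalog run happens after boot.
  systemctl enable puppet
  %end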

Some of my favorite features:
* Inheritance is handy: node -> node for a particular cluster -> compute
  node -> head node.
* Tags for handling users are handy: with 1200 users, a dozen clusters, and
  various other bits of infrastructure, they make it really easy to manage
  who gets access to what.
* I like the self-healing aspect of defining the system state, not how to
  get there.  That way, if I repurpose, patch, or mistakenly make a node
  unique in some way, the next puppet run fixes it.
* It definitely helps with re-use across clusters, which makes for a higher
  incentive to do it right the first time.
* Using facts to make decisions is really useful: things like detecting
  whether you are a virtual machine, or updating autofs maps if IB is down
  (see the sketch after this list).
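
A minimal sketch of the facts and self-healing points (not our production
manifests; the openibd service and the user alice are hypothetical names):

  # demo.pp: try it standalone with "puppet apply demo.pp"
  # Skip hardware-only setup when facter reports a virtual machine.
  if $facts['is_virtual'] {
    notice('virtual machine: skipping InfiniBand setup')
  } else {
    # 'openibd' is the usual OFED InfiniBand service name.
    service { 'openibd':
      ensure => running,
      enable => true,
    }
  }

  # Declarative state: puppet converges on this every run, so a
  # hand-edited or repurposed node heals itself on the next run.
  user { 'alice':
    ensure => present,
    uid    => 4201,
    shell  => '/bin/bash',
  }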

> 
> On Wed, Sep 6, 2017 at 7:51 AM, Christopher Samuel
> <samuel at unimelb.edu.au> wrote:
> 
>     On 05/09/17 15:24, Stu Midgley wrote:
> 
>     > I am in the process of redeveloping our cluster deployment and config
>     > management environment and wondered what others are doing?
> 
>     xCAT here for all HPC-related infrastructure.  Stateful installs for
>     GPFS NSD servers and TSM servers; compute nodes are all statelite, so
>     an immutable RAMdisk image is built on the management node for the
>     compute cluster, and on boot the nodes mount various items over NFS
>     (including the GPFS state directory).
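
For anyone not familiar with xCAT, the statelite build-and-boot flow looks
roughly like this (image and group names are placeholders; the litefile
table is where you list what gets mounted writable over NFS):

  # On the management node: build the ramdisk image, then mark which
  # files stay writable/persistent according to the litefile table.
  genimage rhels7.4-x86_64-statelite-compute
  liteimg  rhels7.4-x86_64-statelite-compute
  # Point the "compute" group at the image and network boot it.
  nodeset compute osimage=rhels7.4-x86_64-statelite-compute
  rpower  compute boot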
> 
>     Nothing like your scale, of course, but it works and we know if a node
>     has booted a particular image it will be identical to any other node
>     that's set to boot the same image.
> 
>     Healthcheck scripts mark nodes offline if they don't have the current
>     production kernel and GPFS versions (and other checks too, of course),
>     and Slurm's "scontrol reboot" lets us do rolling reboots without
>     needing to spot when nodes have become idle.
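
That pattern fits in a few lines of shell (a sketch of the idea, not Chris's
actual script; the kernel version and node list are placeholders):

  # Healthcheck fragment: drain this node if it is not running the
  # production kernel, so the scheduler stops placing jobs on it.
  WANT="3.10.0-693.el7.x86_64"
  if [ "$(uname -r)" != "$WANT" ]; then
      scontrol update nodename="$(hostname -s)" state=drain reason="wrong kernel"
  fi

  # Queue a rolling reboot; each node reboots once it goes idle.
  scontrol reboot compute[001-100]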
> 
>     I've got to say I really prefer this to systems like Puppet, Salt,
>     etc., where you need to go and tweak an image after installation.
> 
>     For our VM infrastructure (web servers, etc) we do use Salt for that. We
>     used to use Puppet but we switched when the only person who understood
>     it left.  Don't miss it at all...
> 
>     cheers,
>     Chris
>     --
>      Christopher Samuel        Senior Systems Administrator
>      Melbourne Bioinformatics - The University of Melbourne
>      Email: samuel at unimelb.edu.au  Phone: +61 (0)3 903 55545
> 
> 
> 
> 
> 
> -- 
> Dr Stuart Midgley
> sdm900 at gmail.com
> 
> 
> 


