[Beowulf] cluster deployment and config management

Tue Sep 12 09:16:25 PDT 2017

As Gavin mentioned, there are options you can configure in Ansible et al.
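
For reference, the knobs Gavin lists in the quoted message below (forks,
pipelining, fact caching) are ordinary ansible.cfg settings. Here is a rough
sketch that renders such a file with Python's configparser. The values
(100 forks, a one day cache timeout, the /tmp/ansible_facts path) and the
function name are my own placeholders, not anything from Gavin's setup, so
tune them for your site.

```python
import configparser
import io


def render_ansible_cfg(forks=100, cache_dir="/tmp/ansible_facts"):
    """Return ansible.cfg text with the scaling options from the thread.

    The default values are illustrative assumptions, not recommendations.
    """
    cfg = configparser.ConfigParser()
    cfg["defaults"] = {
        # Number of hosts Ansible works on in parallel.
        "forks": str(forks),
        # Only gather facts when the cache has nothing fresh for a host.
        "gathering": "smart",
        # Keep gathered facts on disk as JSON between runs.
        "fact_caching": "jsonfile",
        "fact_caching_connection": cache_dir,
        "fact_caching_timeout": "86400",
    }
    # Pipelining cuts the number of SSH operations per task.
    cfg["ssh_connection"] = {"pipelining": "True"}

    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_ansible_cfg())
```

Pipelining generally needs requiretty turned off in sudoers on the nodes.
The tagging approach is separate from this file: you pass --tags to
ansible-playbook at run time.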

With any configuration management tool du jour you can configure your DHCP,
TFTP, and HTTP for PXE and then enforce a desired state on the nodes. I am
a developer on several configuration management tools and, judging from the
mailing list, the speed issues come from people doing things the hard way.
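
To make the PXE part concrete, here is a minimal sketch of generating a
per-node PXELINUX entry that points the installer at a kickstart file served
over HTTP. The function names, kernel and initrd paths, and the URL are
hypothetical. The boot argument is an assumption too: newer Anaconda
installers take inst.ks=, older ones take ks=.

```python
def pxelinux_filename(mac):
    """Per-host PXELINUX config name: '01-' (Ethernet) + MAC, lower case, dashes."""
    return "01-" + mac.lower().replace(":", "-")


def render_pxe_entry(kernel, initrd, kickstart_url):
    """Return the text of a one-label PXELINUX config that starts a kickstart install."""
    lines = [
        "DEFAULT install",
        "LABEL install",
        "  KERNEL " + kernel,
        # inst.ks= is assumed here; older installers expect ks= instead.
        "  APPEND initrd={} inst.ks={}".format(initrd, kickstart_url),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node and paths, for illustration only.
    print(pxelinux_filename("AA:BB:CC:00:11:22"))
    print(render_pxe_entry("vmlinuz", "initrd.img",
                           "http://deploy.example/ks/node.cfg"))
```

Your config management tool would drop that file under pxelinux.cfg/ on the
TFTP server, then take over the node once the minimal install is done.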

If you need an API, try OpenStack Ironic.

SaltStack has a GUI if you need warm and fuzzy.

Python Fabric + expect can do anything at speed but will require some
thinking.
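
As a rough idea of what "at speed" can look like, here is a sketch that fans
one command out to many nodes in parallel. It uses only the standard library
and the system ssh client rather than Fabric itself, and the runner is
pluggable so a Fabric connection's run call could be swapped in. The names
ssh_run and fan_out and the worker count of 64 are my own assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command):
    """Run command on host via the system ssh client; return (exit code, output)."""
    proc = subprocess.run(
        # BatchMode makes ssh fail instead of prompting for a password.
        ["ssh", "-o", "BatchMode=yes", host, command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def fan_out(hosts, command, runner=ssh_run, workers=64):
    """Run command on every host in parallel; return {host: runner result}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {host: pool.submit(runner, host, command) for host in hosts}
        return {host: fut.result() for host, fut in futures.items()}


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for host, (code, output) in fan_out(["node001", "node002"], "uptime").items():
        print(host, code, output.strip())
```

The thinking part is everything this leaves out: retries, timeouts, and
deciding what a failed node means for the rest of the run.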

On Tue, Sep 12, 2017 at 11:07 AM, Gavin W. Burris <bug at wharton.upenn.edu>
wrote:

> Hi, All.
>
> We use a minimal pxe kickstart for hardware nodes, then Ansible after
> that.  It is a thing of beauty to have the entire cluster defined in one
> git repo.  This also lends itself to configuring cloud node images with the
> exact same code.  Reusable roles and conditionals, FTW!
>
> With regards to scaling, Ansible will by default fork only 5 parallel
> processes.  This can be scaled way up, maybe hundreds at a time.  If there
> are thousands of states / idempotent plays to run on a single host, those
> are going to take some time regardless of the configuration language,
> correct?  A solution would be to tag up the plays and only run required
> tags during an update, versus a full run on fresh installs.  The fact
> caching feature may help here.  SSH accelerated mode or pipelining are
> newer features, too, which will reduce the number of new connections
> required, a big time saver.
>
> Cheers.
>
> On Tue 09/05/17 02:57AM EDT, Carsten Aulbert wrote:
> > Hi
> >
> > On 09/05/17 08:43, Stu Midgley wrote:
> > > Interesting.  Ansible has come up a few times.
> > >
> > > Our largest cluster is 2000 KNL nodes and we are looking towards 10k...
> > > so it needs to scale well :)
> > >
> > We went with ansible at the end of 2015 until, a few months in, we hit a
> > road block with it not using a client daemon. When having a few 1000
> > states to perform on each client, the lag for initiating the next state
> > centrally from the server was quite noticeable - in the end a single run
> > took more than half an hour without any changes (for a single host!).
> >
> > After that we re-evaluated, with salt stack being the outcome; it scales
> > well enough for our O(2500) clients.
> >
> > Note, I have not tracked if and how ansible progressed over the past
> > ~2yrs which may or may not exhibit the same problems today.
> >
> > Cheers
> >
> > Carsten
> >
> > --
> > Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
> > Callinstraße 38, 30167 Hannover, Germany
> > Phone: +49 511 762 17185
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Gavin W. Burris
> Senior Project Leader for Research Computing
> The Wharton School
> University of Pennsylvania
> Search our documentation: http://research-it.wharton.upenn.edu/about/
> Subscribe to the Newsletter: http://whr.tn/ResearchNewsletterSubscribe
>

-- 
- Andrew "lathama" Latham lathama at gmail.com http://lathama.com
<http://lathama.org> -