[Beowulf] Cluster consistency checks

Peter Kjellström cap at nsc.liu.se
Tue Mar 22 10:08:16 PDT 2016


On Tue, 22 Mar 2016 17:32:40 +0200 (EET)
Olli-Pekka Lehto <olli-pekka.lehto at csc.fi> wrote:

> Hi, 
> 
> I finally got around to writing down my cluster-consistency checklist
> that I've been planning for a long time: 
> 
> https://github.com/oplehto/cluster-checks/ 

Looks quite close to what we do. A few additions (randomly floating to
the top):

* use dshbak / pshbak / dbuck to get an overview of pdsh output (the
   latter two are from https://www.nsc.liu.se/~kent/python-hostlist/)
* use conrep to read out BIOS settings from HP servers
* dmidecode -t memory can show DIMM details
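
To illustrate the first point, here is a minimal sketch of the idea behind
collapsing pdsh output: nodes that printed identical text are grouped
together so the odd ones stand out. It is not the code of dshbak, pshbak or
dbuck. The function name group_pdsh_output is made up for this example, and
it assumes the usual "host: output" line format that pdsh prints.

```python
from collections import defaultdict


def group_pdsh_output(text):
    """Group 'host: output' lines by identical per-host output.

    Returns a list of (sorted_hosts, output) pairs, largest group first.
    Hypothetical helper, written only to illustrate the idea.
    """
    per_host = defaultdict(list)
    for line in text.splitlines():
        host, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'host: output' line, skip it
        # Drop the single space that follows the colon, keep the rest as-is.
        per_host[host.strip()].append(rest[1:] if rest.startswith(" ") else rest)

    # Invert the mapping: identical output -> list of hosts that produced it.
    groups = defaultdict(list)
    for host, lines in per_host.items():
        groups["\n".join(lines)].append(host)

    return sorted(
        ((sorted(hosts), output) for output, hosts in groups.items()),
        key=lambda pair: (-len(pair[0]), pair[0]),
    )
```

Fed the captured output of a command run across the cluster (for example a
BIOS version query), the first group is the majority and anything after it
is a candidate for a closer look. The real tools additionally compress the
host names into ranges, which this sketch leaves out.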

We also do most of this automatically in production with our
node-health-check suite (will catch BIOS settings, firmware, CPU and
memory performance, ...).
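
A check of this kind could look roughly like the sketch below. It is not
taken from the suite mentioned above. It only assumes the text layout of
"dmidecode -t memory" (blank-line separated "Memory Device" blocks with
"Locator:" and "Size:" fields); the names parse_dimms and
diff_against_baseline and the idea of a per-slot baseline dictionary are
assumptions made for the example.

```python
import re


def parse_dimms(dmidecode_text):
    """Return {locator: size} for each 'Memory Device' block.

    Sizes are kept as the strings dmidecode printed, e.g. '16384 MB'
    or 'No Module Installed'.
    """
    dimms = {}
    # dmidecode separates its records with blank lines.
    for block in re.split(r"\n\s*\n", dmidecode_text):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue
        fields = dict(line.split(": ", 1) for line in lines if ": " in line)
        locator = fields.get("Locator")
        if locator is not None:
            dimms[locator] = fields.get("Size", "unknown")
    return dimms


def diff_against_baseline(dimms, baseline):
    """Return {locator: (expected, found)} for every slot that differs."""
    return {
        slot: (baseline.get(slot), dimms.get(slot))
        for slot in sorted(set(baseline) | set(dimms))
        if baseline.get(slot) != dimms.get(slot)
    }
```

An empty result from diff_against_baseline means the node matches the
expected DIMM population. Anything else names the slot along with what was
expected and what was found.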

/Peter K

> The goal is to try to make the baseline installation of a cluster as
> consistent as possible and make vendors work for their money. :) Of
> course hopefully publishing this will help vendors capture some of
> the issues that slip through the cracks even before clusters are
> handed over. It's also a good idea to run these types of checks
> during the lifetime of the system as there's always some consistency
> creep as hardware gets replaced. 
> 
> If someone is interested in contributing, pull requests or comments
> on the list are welcome. I'm sure that there's something missing as
> well. Right now it's just a text-file but making some nicer scripts
> and postprocessing for the output might happen as well at some point.
> All the examples are very HP oriented as well at this point. 
> 
> Best regards, 
> Olli-Pekka 


