[Beowulf] Cluster consistency checks
Peter Kjellström
cap at nsc.liu.se
Tue Mar 22 10:08:16 PDT 2016
On Tue, 22 Mar 2016 17:32:40 +0200 (EET)
Olli-Pekka Lehto <olli-pekka.lehto at csc.fi> wrote:
> Hi,
>
> I finally got around to writing down my cluster-consistency checklist
> that I've been planning for a long time:
>
> https://github.com/oplehto/cluster-checks/
Looks quite close to what we do. A few additions (randomly floating to
the top):
* use dshbak / pshbak / dbuck to get an overview of pdsh output (the
  latter two are from https://www.nsc.liu.se/~kent/python-hostlist/)
* use conrep to read out BIOS settings from HP servers
* dmidecode -t memory can show DIMM details
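
To illustrate the idea behind the first bullet, here is a minimal Python
sketch of collapsing per-node output into groups of nodes that reported
the same thing. It is not how dshbak, pshbak or dbuck are implemented;
it only assumes the "host: text" prefix that pdsh puts on each output
line. The function name group_by_output and the sample lines are made up
for illustration.

```python
from collections import defaultdict


def group_by_output(lines):
    """Group hosts whose collected output is identical.

    `lines` is an iterable of "host: text" strings (the pdsh line prefix
    format). Returns a list of (hosts, output) tuples, largest group first.
    """
    per_host = defaultdict(list)
    for line in lines:
        host, sep, text = line.rstrip("\n").partition(": ")
        if not sep:
            continue  # skip lines that carry no host prefix
        per_host[host].append(text)

    # Hosts with exactly the same joined output land in the same group.
    groups = defaultdict(list)
    for host, texts in per_host.items():
        groups["\n".join(texts)].append(host)

    # Sort by group size (descending) so outliers end up at the bottom.
    return sorted(
        ((sorted(hosts), output) for output, hosts in groups.items()),
        key=lambda group: (-len(group[0]), group[0]),
    )


# Invented sample input: four nodes, one of which reports a different value.
SAMPLE = [
    "n1: Size: 16384 MB",
    "n2: Size: 16384 MB",
    "n3: Size: 8192 MB",
    "n4: Size: 16384 MB",
]

RESULT = group_by_output(SAMPLE)
```

With real data the input would be the captured output of a pdsh run
rather than SAMPLE. The node that differs from the majority shows up as
its own small group at the end of the list.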
We also do most of this automatically in production with our
node-health-check suite (it will catch BIOS settings, firmware, CPU and
memory performance, ...).
/Peter K
> The goal is to try to make the baseline installation of a cluster as
> consistent as possible and make vendors work for their money. :) Of
> course hopefully publishing this will help vendors capture some of
> the issues that slip through the cracks even before clusters are
> handed over. It's also a good idea to run these types of checks
> during the lifetime of the system as there's always some consistency
> creep as hardware gets replaced.
>
> If someone is interested in contributing, pull requests or comments
> on the list are welcome. I'm sure that there's something missing as
> well. Right now it's just a text-file but making some nicer scripts
> and postprocessing for the output might happen as well at some point.
> All the examples are very HP oriented as well at this point.
>
> Best regards,
> Olli-Pekka