[Beowulf] Cluster consistency checks

Paul McIntosh paul.mcintosh at monash.edu
Tue Mar 22 18:43:26 PDT 2016


For checking GPUs, we use the following to determine whether ECC errors
exist. Things seem a lot better now, but in the past an ECC error was, 99%
of the time, a hardware issue with the GPU or how it was plugged in...

nvidia-smi -a --xml-format | grep -A 33 "<ecc_errors>" | grep "<total>" |
grep -v "<total>0</total>"

Obviously it could be done more nicely, and there is other information you
can get at the same way (e.g. driver versions).
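
One way it could be done more nicely, as a minimal sketch only: the function
below keeps the idea of the grep pipeline but tracks the <ecc_errors> block
explicitly instead of relying on a fixed 33-line window. The name
ecc_nonzero_totals is made up for illustration, and it assumes the XML has
one tag per line with <total> counters inside <ecc_errors>, which may differ
between driver versions.

```shell
# Print every <total> counter inside an <ecc_errors> block that is not 0.
# Reads nvidia-smi XML on stdin; like the grep version, entries such as
# N/A are also printed because they are not literally 0.
ecc_nonzero_totals() {
    awk '
        /<ecc_errors>/   { in_ecc = 1 }   # entering an ECC block
        /<\/ecc_errors>/ { in_ecc = 0 }   # leaving it again
        in_ecc && /<total>/ && !/<total>0<\/total>/ {
            sub(/^[ \t]+/, "")            # drop the XML indentation
            print
        }
    '
}

# Usage on a GPU node (-q -x is the short form of query + XML output):
#   nvidia-smi -q -x | ecc_nonzero_totals
```

No output means no non-zero counters were found, so the result is easy to
collect over many nodes.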

Paul

-----Original Message-----
From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Peter
Kjellström
Sent: Wednesday, 23 March 2016 4:08 AM
To: Olli-Pekka Lehto <olli-pekka.lehto at csc.fi>
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Cluster consistency checks

On Tue, 22 Mar 2016 17:32:40 +0200 (EET) Olli-Pekka Lehto
<olli-pekka.lehto at csc.fi> wrote:

> Hi,
> 
> I finally got around to writing down my cluster-consistency checklist 
> that I've been planning for a long time:
> 
> https://github.com/oplehto/cluster-checks/

Looks quite close to what we do. A few additions (randomly floating to the
top):

* use dshbak / pshbak / dbuck to summarize pdsh output (the latter two are
   from https://www.nsc.liu.se/~kent/python-hostlist/)
* use conrep to read out BIOS settings from HP servers
* dmidecode -t memory shows DIMM details
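
As a rough sketch of how the dmidecode output could be boiled down so that
it coalesces well across nodes: the hypothetical helper dimm_summary below
counts the "Size:" fields of dmidecode -t memory read on stdin. It assumes
the usual "Size: ..." lines in the Memory Device records; the exact wording
varies between dmidecode versions.

```shell
# Summarize DIMM population as "<count> x <size>" lines, sorted.
# Only the plain "Size:" field is counted, so lines such as
# "Maximum Capacity:" are ignored.
dimm_summary() {
    awk -F': ' '
        /^[ \t]*Size:/ { count[$2]++ }    # $2 is e.g. "16384 MB"
        END { for (s in count) print count[s] " x " s }
    ' | sort
}

# Usage on a node (dmidecode normally needs root):
#   dmidecode -t memory | dimm_summary
```

Because every healthy node should print the same few lines, identical
summaries collapse into one group when the per-node output is run through
dshbak -c, and an odd DIMM stands out as its own group.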

We also do most of this automatically in production with our
node-health-check suite (it catches BIOS settings, firmware, CPU and memory
performance, ...).

/Peter K

> The goal is to try to make the baseline installation of a cluster as 
> consistent as possible and make vendors work for their money. :) Of 
> course hopefully publishing this will help vendors capture some of the 
> issues that slip through the cracks even before clusters are handed 
> over. It's also a good idea to run these types of checks during the 
> lifetime of the system as there's always some consistency creep as 
> hardware gets replaced.
> 
> If someone is interested in contributing, pull requests or comments on 
> the list are welcome. I'm sure that there's something missing as well. 
> Right now it's just a text file, but nicer scripts and postprocessing 
> for the output might follow at some point. All the examples are also 
> very HP-oriented at this point. 
> 
> Best regards,
> Olli-Pekka



