[Beowulf] Cluster consistency checks
paul.mcintosh at monash.edu
Tue Mar 22 18:43:26 PDT 2016
For checking GPUs, we use the following to determine whether ECC errors
exist. Things seem a lot better now, but in the past an ECC error was 99% of
the time a hardware issue with the GPU or how it was plugged in...
nvidia-smi -a --xml-format | grep -A 33 "<ecc_errors>" | grep "<total>" |
grep -v "<total>0</total>"
Obviously it could be done more nicely, and there are other bits of info you
can get at from the same output (e.g. driver versions).
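
As one possible way of doing it "more nicely", here is a rough sketch that
tracks the <ecc_errors> section instead of relying on a fixed "-A 33" line
count, and prefixes each hit with the GPU id. The helper name ecc_nonzero is
made up for illustration, and it assumes the XML layout has one tag per line
with <total> counters inside <ecc_errors>, as in the one-liner above; other
driver versions may lay this out differently. Like the original grep, it
reports any value other than 0, so "N/A" entries will show up too.

```shell
# Hypothetical helper: reads nvidia-smi XML on stdin and prints every
# <total> counter inside an <ecc_errors> section whose value is not 0.
ecc_nonzero() {
    awk '
        # Remember the id of the GPU we are currently inside.
        /<gpu id=/       { gpu = $0; sub(/.*<gpu id="/, "", gpu); sub(/".*/, "", gpu) }
        # Track whether we are inside an <ecc_errors> section.
        /<ecc_errors>/   { in_ecc = 1 }
        /<\/ecc_errors>/ { in_ecc = 0 }
        # Report totals that are anything other than 0.
        in_ecc && /<total>/ && !/<total>0<\/total>/ {
            val = $0; sub(/.*<total>/, "", val); sub(/<\/total>.*/, "", val)
            print gpu ": total=" val
        }
    '
}

# Usage on a node with the NVIDIA driver installed:
#   nvidia-smi -q -x | ecc_nonzero
```

Because it reads stdin, it can be tried against a saved XML dump without
needing a GPU in the machine.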
From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Peter
Sent: Wednesday, 23 March 2016 4:08 AM
To: Olli-Pekka Lehto <olli-pekka.lehto at csc.fi>
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Cluster consistency checks
On Tue, 22 Mar 2016 17:32:40 +0200 (EET) Olli-Pekka Lehto
<olli-pekka.lehto at csc.fi> wrote:
> I finally got around to writing down my cluster-consistency checklist
> that I've been planning for a long time:
Looks quite close to what we do. A few additions (randomly floating to the
* use dshbak / pshbak / dbuck to overview pdsh output (latter two from
* use conrep to read out bios settings from hp servers
* dmidecode -t memory can show dimm details
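
To make the dmidecode output easier to compare between nodes, something
along these lines could condense it to one line per DIMM size/speed
combination. This is only a sketch: dimm_summary is a made-up helper name,
and it assumes the usual "Size:" and "Speed:" fields of the Memory Device
records, with empty slots reported as "No Module Installed".

```shell
# Hypothetical helper: reads `dmidecode -t memory` output on stdin and
# prints a count of populated DIMMs per size/speed combination.
dimm_summary() {
    awk -F': *' '
        # Size comes before Speed within each Memory Device record.
        /^[[:space:]]*Size:/  { size = $2 }
        # Count the DIMM once its Speed line is seen; skip empty slots.
        /^[[:space:]]*Speed:/ { if (size != "No Module Installed") count[size " @ " $2]++ }
        END { for (k in count) print count[k] " x " k }
    ' | sort
}

# Usage (dmidecode needs root):
#   dmidecode -t memory | dimm_summary
```

If the helper is available on every node, running it under pdsh and piping
the result through dshbak -c should collapse nodes with identical DIMM
layouts into one group, so the odd ones stand out.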
We also do most of this automatically in production with our
node-health-check suite (will catch bios settings, firmware, cpu and memory
> The goal is to try to make the baseline installation of a cluster as
> consistent as possible and make vendors work for their money. :) Of
> course hopefully publishing this will help vendors capture some of the
> issues that slip through the cracks even before clusters are handed
> over. It's also a good idea to run these types of checks during the
> lifetime of the system as there's always some consistency creep as
> hardware gets replaced.
> If someone is interested in contributing, pull requests or comments on
> the list are welcome. I'm sure that there's something missing as well.
> Right now it's just a text-file but making some nicer scripts and
> postprocessing for the output might happen as well at some point.
> All the examples are very HP oriented as well at this point.
> Best regards,