[Beowulf] Cluster consistency checks
deadline at eadline.org
Wed Mar 23 06:55:27 PDT 2016
> Thanks for the kind words and comments! Good catch with HPL. It's
> definitely part of the test regime. I typically run 3 tests:
> - Separate instance of STREAM2 on each node
> - Separate instance of HPL on each node
> - Simple MPI latency / bandwidth test called mpisweep that tests every
> link (I'll put this up on github later as well)
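mpisweep isn't public yet, but the core idea of testing every link can be sketched: enumerate all node pairs, then run a two-rank ping-pong over each pair. A minimal, hypothetical schedule generator in Python (node names are made up; the mpirun invocation in the comment is only illustrative):

```python
from itertools import combinations

def sweep_schedule(nodes):
    """Every unordered node pair, so each point-to-point link gets
    exercised exactly once by a two-rank latency/bandwidth test."""
    return list(combinations(nodes, 2))

# Hypothetical 4-node cluster; each pair would then get something like
#   mpirun -np 2 -host <a>,<b> ./pingpong
pairs = sweep_schedule(["n01", "n02", "n03", "n04"])
print(len(pairs))  # 6 pairs for 4 nodes
```

For N nodes this is N*(N-1)/2 pairs, so on large systems the pairs are usually batched so non-overlapping pairs run concurrently.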
> I now made the changes to the document.
> After this set of tests I'm not completely sure if NPB will add any
> further information. Those 3 benchmarks combined with the other checks
> should pretty much expose all the possible issues. However, I could be
> missing something again :)
NAS will verify the results. On several occasions I have
found NAS gave good numbers but the results did not verify.
This allowed me to look at lower-level issues until I found
the problem (in one case a cable, IIRC).
BTW, I run NAS all the time to test performance and make sure
things are running properly on my deskside clusters. I have done
it so often I can tell which test is running by watching wwtop
(a Warewulf-based cluster top that shows loads, net, and memory).
> Best regards,
> Olli-Pekka Lehto
> Development Manager
> Computing Platforms
> CSC - IT Center for Science Ltd.
> E-Mail: olli-pekka.lehto at csc.fi
> Tel: +358 50 381 8604
> skype: oplehto // twitter: ople
>> From: "Jeffrey Layton" <laytonjb at gmail.com>
>> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
>> Cc: beowulf at beowulf.org
>> Sent: Tuesday, 22 March, 2016 16:45:20
>> Subject: Re: [Beowulf] Cluster consistency checks
>> Very nice - I'm glad you put a list down. Many of the things that I do
>> are based
>> on experience.
>> A long time ago, in one of my previous jobs, we used to run NAS Parallel
>> Benchmark (NPB) on single nodes to get a baseline of performance. We
>> would look
>> for outliers and triage and debug them based on these results. We weren't
>> running the test for performance but to make sure the cluster was as
>> consistent as possible. Have you done this before?
>> I've also seen people run HPL on single nodes and look for outliers. After
>> triaging these, HPL is run on smaller groups of nodes, again looking for
>> outliers and triaging them. This continues up to the entire system. The
>> point is not to get a great HPL number to submit to the Top500 but rather
>> to find potential network issues, particularly bad network links.
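The outlier hunt described above is easy to automate. A small, hypothetical sketch (not from the checklist itself): collect a per-node HPL or NPB score, then flag nodes that deviate from the cluster median by more than a few median absolute deviations, which is more robust to a single bad node than mean/stddev:

```python
import statistics

def flag_outliers(results, k=3.0):
    """Flag nodes whose benchmark score deviates from the cluster median
    by more than k times the median absolute deviation (MAD).
    `results` maps node name -> score (e.g. GFLOPS)."""
    values = list(results.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        mad = 1e-9  # all results identical: nothing gets flagged
    return sorted(n for n, v in results.items()
                  if abs(v - med) / mad > k)

# Made-up per-node HPL numbers; n04 is well below the rest
scores = {"n01": 512.0, "n02": 509.5, "n03": 511.2, "n04": 430.0}
print(flag_outliers(scores))  # ['n04']
```

The same function works unchanged at each level of the scale-up (single nodes, groups of nodes, whole system), only the `results` dict changes.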
>> Thanks for the good work!
>> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto <
>> olli-pekka.lehto at csc.fi > wrote:
>>> I finally got around to writing down my cluster-consistency checklist
>>> that I've
>>> been planning for a long time:
>>> The goal is to try to make the baseline installation of a cluster as
>>> consistent as possible and make vendors work for their money. :) Of course
>>> publishing this will help vendors capture some of the issues that slip
>>> through the cracks even before clusters are handed over. It's also a good
>>> idea to run
>>> these types of checks during the lifetime of the system as there's
>>> always some
>>> consistency creep as hardware gets replaced.
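One way to catch that consistency creep (a hypothetical sketch, not part of the published checklist): fingerprint each node with the attributes that tend to drift after hardware swaps (BIOS version, kernel, DIMM speed, firmware) and group nodes by fingerprint, so a minority group immediately points at the odd nodes:

```python
from collections import defaultdict

def group_by_fingerprint(node_info):
    """Group nodes by a fingerprint of their attributes so hardware
    that drifted from the baseline stands out as a minority group.
    `node_info` maps node name -> dict of attributes."""
    groups = defaultdict(list)
    for node, attrs in node_info.items():
        groups[tuple(sorted(attrs.items()))].append(node)
    return dict(groups)

# Made-up inventory; n03 got a replacement board with an older BIOS
inventory = {
    "n01": {"bios": "2.4", "kernel": "3.10.0-327"},
    "n02": {"bios": "2.4", "kernel": "3.10.0-327"},
    "n03": {"bios": "2.1", "kernel": "3.10.0-327"},
}
groups = group_by_fingerprint(inventory)
```

In practice the attribute dicts would come from something like pdsh-collected dmidecode/uname output rather than being typed in by hand.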
>>> If someone is interested in contributing, pull requests or comments on
>>> the list
>>> are welcome. I'm sure that there's something missing as well. Right now
>>> it's just a text file, but nicer scripts and postprocessing for the
>>> output might happen at some point. All the examples are very HP-oriented
>>> at this point as well.
>>> Best regards,
>>> Olli-Pekka Lehto
>>> Development Manager
>>> Computing Platforms
>>> CSC - IT Center for Science Ltd.
>>> E-Mail: olli-pekka.lehto at csc.fi
>>> Tel: +358 50 381 8604
>>> skype: oplehto // twitter: ople
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>> To change your subscription (digest mode or unsubscribe) visit