[Beowulf] Cluster consistency checks

Kenneth Hoste kenneth.hoste at ugent.be
Fri Mar 25 15:15:05 PDT 2016



On 23/03/16 14:55, Douglas Eadline wrote:
>> Thanks for the kind words and comments! Good catch with HPL. It's
>> definitely part of the test regime. I typically run 3 tests for
>> consistency:
>>
>> - Separate instance of STREAM2 on each node
>> - Separate instance of HPL on each node
>> - Simple MPI latency / bandwidth test called mpisweep that tests every
>> link (I'll put this up on github later as well)

Any reference to mpisweep yet?

Google didn't give me much...

>>
>> I now made the changes to the document.
>>
>> After this set of tests I'm not completely sure if NPB will add any
>> further information. Those 3 benchmarks combined with the other checks
>> should pretty much expose all the possible issues. However, I could be
>> missing something again :)
> NAS will verify the results. On several occasions I have
> found NAS gave good numbers but the results did not verify.
> This allowed me to look at lower level issues until I found
> the problem (in one case a cable, IIRC).
>
> BTW, I run NAS all the time to test performance and make sure
> things are running properly on my deskside clusters. I have done
> it so often I can tell which test is running by watching wwtop
> (Warewulf cluster based top that shows loads, net, memory but no
> application names).
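
To make the "good numbers but no verification" case concrete, here is a
minimal sketch (mine, not taken from anyone's actual test harness) that
checks both the rate and the verification status of an NPB run. It
assumes the usual NPB summary lines "Mop/s total = ..." and
"Verification = SUCCESSFUL"; the function name and the sample text are
made up for illustration.

```python
import re

def parse_npb_summary(text):
    """Return (mops_total, verified) extracted from NPB output text.

    Assumes the summary contains 'Mop/s total = <number>' and
    'Verification = SUCCESSFUL' lines; adjust the patterns if your
    NPB version prints something different.
    """
    mops = re.search(r"Mop/s total\s*=\s*([0-9.]+)", text)
    verif = re.search(r"Verification\s*=\s*(\S+)", text)
    mops_total = float(mops.group(1)) if mops else None
    # A missing verification line is treated as a failure.
    verified = bool(verif) and verif.group(1) == "SUCCESSFUL"
    return mops_total, verified

# Made-up sample: a plausible rate, but the run did not verify.
sample = """
 Mop/s total     =                  1234.56
 Verification    =             UNSUCCESSFUL
"""

mops_total, verified = parse_npb_summary(sample)
if not verified:
    print("rate looks fine (%.2f Mop/s) but results did NOT verify" % mops_total)
```

The point is just that a wrapper script should never report the rate
without also reporting whether the run verified.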

Isn't it time someone put all of these nice tests together in a GitHub
repo, or at least wrote some scripts/framework around each of them to
build/install/run/verify them with as little effort as possible?

I already know your answer: "why don't you?".
Well, I may, some day, but who wants to help out? Any brave souls?
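
Something as small as the sketch below would already cover the "run on
every node and look for outliers" step mentioned in this thread. It is
only an illustration: the node names, the numbers and the 5% tolerance
are made up, and comparing against the median is just one simple choice
of baseline.

```python
from statistics import median

def find_outliers(results, tolerance=0.05):
    """Return {node: value} for nodes whose result deviates from the
    median of all nodes by more than `tolerance` (relative)."""
    mid = median(results.values())
    return {node: value for node, value in results.items()
            if abs(value - mid) > tolerance * mid}

# Hypothetical per-node STREAM triad results in GB/s.
stream_triad = {"n001": 98.1, "n002": 97.6, "n003": 81.2, "n004": 98.4}

print(find_outliers(stream_triad))  # only n003 is flagged
```

The same function works for per-node HPL or NPB rates; only the
tolerance needs tuning per benchmark.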


K.
>
> --
> Doug
>
>> Best regards,
>> O-P
>> --
>> Olli-Pekka Lehto
>> Development Manager
>> Computing Platforms
>> CSC - IT Center for Science Ltd.
>> E-Mail: olli-pekka.lehto at csc.fi
>> Tel: +358 50 381 8604
>> skype: oplehto // twitter: ople
>>
>>> From: "Jeffrey Layton" <laytonjb at gmail.com>
>>> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
>>> Cc: beowulf at beowulf.org
>>> Sent: Tuesday, 22 March, 2016 16:45:20
>>> Subject: Re: [Beowulf] Cluster consistency checks
>>> Olli-Pekka,
>>> Very nice - I'm glad you put a list down. Many of the things that I do
>>> are based on experience.
>>> A long time ago, in one of my previous jobs, we used to run NAS Parallel
>>> Benchmark (NPB) on single nodes to get a baseline of performance. We
>>> would look for outliers and triage and debug them based on these
>>> results. We weren't running the test for performance but to make sure
>>> the cluster was as homogeneous as possible. Have you done this before?
>>> I've also seen people run HPL on single nodes and look for outliers.
>>> After triaging these, HPL is run on smaller groups of nodes within a
>>> single switch, again looking for outliers and triaging them. This
>>> continues up to the entire system. The point is not to get a great HPL
>>> number to submit to the Top500 but rather to find potential network
>>> issues, particularly network links.
>>> Thanks for the good work!
>>> Jeff
>>> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto <olli-pekka.lehto at csc.fi> wrote:
>>>> Hi,
>>>> I finally got around to writing down my cluster-consistency checklist
>>>> that I've been planning for a long time:
>>>> https://github.com/oplehto/cluster-checks/
>>>> The goal is to try to make the baseline installation of a cluster as
>>>> consistent as possible and make vendors work for their money. :) Of
>>>> course, hopefully publishing this will help vendors capture some of the
>>>> issues that slip through the cracks even before clusters are handed
>>>> over. It's also a good idea to run these types of checks during the
>>>> lifetime of the system, as there's always some consistency creep as
>>>> hardware gets replaced.
>>>> If someone is interested in contributing, pull requests or comments on
>>>> the list are welcome. I'm sure that there's something missing as well.
>>>> Right now it's just a text file, but making some nicer scripts and
>>>> postprocessing for the output might happen as well at some point. All
>>>> the examples are very HP oriented as well at this point.
>>>> Best regards,
>>>> Olli-Pekka
>>>> --
>>>> Olli-Pekka Lehto
>>>> Development Manager
>>>> Computing Platforms
>>>> CSC - IT Center for Science Ltd.
>>>> E-Mail: olli-pekka.lehto at csc.fi
>>>> Tel: +358 50 381 8604
>>>> skype: oplehto // twitter: ople
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Doug
>


