[Beowulf] Cluster consistency checks
Douglas Eadline
deadline at eadline.org
Sat Mar 26 06:47:33 PDT 2016
>
>
> On 23/03/16 14:55, Douglas Eadline wrote:
>>> Thanks for the kind words and comments! Good catch with HPL. It's
>>> definitely part of the test regime. I typically run 3 tests for
>>> consistency:
>>>
>>> - Separate instance of STREAM2 on each node
>>> - Separate instance of HPL on each node
>>> - Simple MPI latency / bandwidth test called mpisweep that tests every
>>> link (I'll put this up on github later as well)
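
For anyone who wants to script that kind of per-node sweep, a rough sketch in
Python might look like the following (the node names, benchmark path, and
output parsing are placeholders, and I'm assuming passwordless ssh; substitute
srun/pdsh as appropriate):

  #!/usr/bin/env python
  # Rough sketch: run a single-node benchmark (STREAM here) on every node
  # over ssh and collect one number per node for later comparison.
  # Node names, the binary path, and the output parsing are placeholders.
  import subprocess

  NODES = ["node%03d" % i for i in range(1, 17)]   # hypothetical node names
  CMD = "/opt/bench/stream"                        # hypothetical benchmark path

  results = {}
  for node in NODES:
      try:
          out = subprocess.check_output(["ssh", node, CMD],
                                        universal_newlines=True)
      except subprocess.CalledProcessError:
          results[node] = None          # run failed outright; flag for triage
          continue
      for line in out.splitlines():
          # Assuming STREAM prints a line starting with "Triad:" followed by MB/s.
          if line.startswith("Triad:"):
              results[node] = float(line.split()[1])

  for node, val in sorted(results.items()):
      print("%-10s %s" % (node, val))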
>
> Any reference to mpisweep yet?
>
> Google didn't give me much...
>
>>>
>>> I now made the changes to the document.
>>>
>>> After this set of tests I'm not completely sure if NPB will add any
>>> further information. Those 3 benchmarks combined with the other checks
>>> should pretty much expose all the possible issues. However, I could be
>>> missing something again :)
>> NAS will verify the results. On several occasions I have
>> found that NAS gave good numbers but the results did not verify.
>> This allowed me to look at lower-level issues until I found
>> the problem (in one case a cable, IIRC).
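
Since NPB prints its own verification status, a wrapper only has to scan the
output for it. A quick sketch, assuming the usual "Verification = SUCCESSFUL"
line and one log file per run (paths and patterns may need adjusting for your
NPB version):

  # Sketch: scan saved NPB output files and flag any run whose results did
  # not verify, even when the Mop/s number looks fine.
  # Assumes one log file per run and the usual "Verification = SUCCESSFUL"
  # line; adjust the pattern and paths for your NPB version and layout.
  import glob
  import re

  for path in sorted(glob.glob("npb-out/*.log")):       # hypothetical location
      text = open(path).read()
      verified = re.search(r"Verification\s*=\s*SUCCESSFUL", text, re.IGNORECASE)
      mops = re.search(r"Mop/s total\s*=\s*([\d.]+)", text)
      status = "OK" if verified else "FAILED VERIFICATION"
      print("%-30s %-15s %s" % (path, mops.group(1) if mops else "n/a", status))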
>>
>> BTW, I run NAS all the time to test performance and make sure
>> things are running properly on my deskside clusters. I have done
>> it so often that I can tell which test is running by watching wwtop
>> (the Warewulf cluster-based top that shows load, network, and memory,
>> but no application names).
>
> Isn't it time someone put together all of these nice tests in a GitHub
> repo, or at least some scripts/framework around each of them to
> build/install/run/verify them with as little effort as possible?
>
> I already know your answer: "why don't you?".
> Well, I may, some day, but who wants to help out? Any brave souls?
>
>
A long time ago on a cluster far, far away...
http://www.clustermonkey.net/Benchmarking-Methods/a-tool-for-cluster-performance-tuning-and-optimization.html
I have not given this any attention in recent years.
I wanted to add more tests and clean up the code.
I do use my NAS script (which also needs cleaning up)
to run multiple MPI/compiler/node-count/size combinations
all the time. It contains a lot of historical cruft.
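
If someone does want to build something like that from scratch, the core is
just a loop over the combinations; roughly along these lines (all the names,
binary layout, and core counts below are placeholders, not my actual script):

  # Rough outline of the sweep: loop over MPI stack / compiler / node count /
  # test combinations, pick the matching prebuilt binary, run it, and keep
  # the log for later verification. Every name below is a placeholder.
  # Assumes the logs/ directory already exists.
  import itertools
  import subprocess

  MPIS      = ["openmpi", "mvapich2"]
  COMPILERS = ["gcc", "intel"]
  NODES     = [1, 2, 4, 8]
  TESTS     = ["cg", "ft", "lu"]
  NPB_CLASS = "C"
  PPN       = 16                     # assumed cores per node

  for mpi, cc, nodes, test in itertools.product(MPIS, COMPILERS, NODES, TESTS):
      binary = "bin/%s-%s/%s.%s.x" % (mpi, cc, test, NPB_CLASS)  # hypothetical layout
      log = "logs/%s-%s-%s-%dn.log" % (test, mpi, cc, nodes)
      with open(log, "w") as f:
          subprocess.call(["mpirun", "-np", str(nodes * PPN), binary],
                          stdout=f, stderr=subprocess.STDOUT)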
--
Doug
> K.
>>
>> --
>> Doug
>>
>>> Best regards,
>>> O-P
>>> --
>>> Olli-Pekka Lehto
>>> Development Manager
>>> Computing Platforms
>>> CSC - IT Center for Science Ltd.
>>> E-Mail: olli-pekka.lehto at csc.fi
>>> Tel: +358 50 381 8604
>>> skype: oplehto // twitter: ople
>>>
>>>> From: "Jeffrey Layton" <laytonjb at gmail.com>
>>>> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
>>>> Cc: beowulf at beowulf.org
>>>> Sent: Tuesday, 22 March, 2016 16:45:20
>>>> Subject: Re: [Beowulf] Cluster consistency checks
>>>> Olli-Pekka,
>>>> Very nice - I'm glad you put a list down. Many of the things that I do
>>>> are based on experience.
>>>> A long time ago, in one of my previous jobs, we used to run the NAS
>>>> Parallel Benchmark (NPB) on single nodes to get a baseline of
>>>> performance. We would look for outliers and triage and debug them based
>>>> on these results. We were not running the test for performance but to
>>>> make sure the cluster was as homogeneous as possible. Have you done
>>>> this before?
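
One simple way to do that outlier pass, assuming you already have one number
per node from the single-node runs, is to compare each node against the
cluster median; the 5% tolerance below is arbitrary:

  # Sketch: flag nodes whose single-node result deviates from the cluster
  # median by more than a chosen tolerance (5% here, picked arbitrarily).
  def find_outliers(results, tolerance=0.05):
      """results: dict mapping node name -> measured performance number."""
      values = sorted(v for v in results.values() if v is not None)
      median = values[len(values) // 2]
      outliers = {node: val for node, val in results.items()
                  if val is None or abs(val - median) / median > tolerance}
      return median, outliers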
>>>> I've also seen people run HPL on single nodes and look for outliers.
>>>> After triaging those, HPL is run on small groups of nodes within a
>>>> single switch, again looking for outliers and triaging them. This
>>>> continues up to the entire system. The point is not to get a great HPL
>>>> number to submit to the Top500 but rather to find potential network
>>>> issues, particularly network links.
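
A sketch of that bottom-up sweep might look like this, assuming you can map
nodes to their leaf switches (the map, mpirun options, and HPL path below are
made up; the HPL output parsing is left out):

  # Sketch: group nodes by leaf switch and run HPL on each group so that a
  # slow group (often a marginal link or cable) stands out.
  # The switch map, mpirun options and HPL path are assumptions about a
  # hypothetical setup; result parsing is left out.
  import subprocess

  SWITCH_MAP = {
      "leaf01": ["node001", "node002", "node003", "node004"],
      "leaf02": ["node005", "node006", "node007", "node008"],
  }

  def run_hpl(nodes):
      """Launch HPL across the given nodes and return its raw output."""
      hostlist = ",".join(nodes)
      return subprocess.check_output(
          ["mpirun", "-H", hostlist, "-np", str(len(nodes)), "/opt/bench/xhpl"],
          universal_newlines=True)

  for switch, nodes in sorted(SWITCH_MAP.items()):
      out = run_hpl(nodes)
      # ... grep the HPL summary line out of 'out' and compare across groups ...
      print("%s: %d nodes run" % (switch, len(nodes)))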
>>>> Thanks for the good work!
>>>> Jeff
>>>> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto <
>>>> olli-pekka.lehto at csc.fi >
>>>> wrote:
>>>>> Hi,
>>>>> I finally got around to writing down my cluster-consistency checklist
>>>>> that I've
>>>>> been planning for a long time:
>>>>> https://github.com/oplehto/cluster-checks/
>>>>> The goal is to try to make the baseline installation of a cluster as
>>>>> consistent as possible and make vendors work for their money. :) Of
>>>>> course, hopefully publishing this will help vendors capture some of
>>>>> the issues that slip through the cracks even before clusters are
>>>>> handed over. It's also a good idea to run these types of checks during
>>>>> the lifetime of the system, as there's always some consistency creep
>>>>> as hardware gets replaced.
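
For the software side of that, a quick way to spot consistency creep is to run
the same probe commands on every node and compare against a reference node;
a small sketch, with example commands and node names only:

  # Sketch: run the same probe commands on every node and report any node
  # whose output differs from the first ("reference") node.
  # Node names and probe commands are just examples.
  import subprocess

  NODES  = ["node001", "node002", "node003"]
  PROBES = ["uname -r",
            "dmidecode -s bios-version",
            "ibstat | grep -i rate"]

  for probe in PROBES:
      reference = None
      for node in NODES:
          out = subprocess.check_output(["ssh", node, probe],
                                        universal_newlines=True).strip()
          if reference is None:
              reference = out
          elif out != reference:
              print("MISMATCH on %s for '%s': %r != %r"
                    % (node, probe, out, reference))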
>>>>> If someone is interested in contributing, pull requests or comments on
>>>>> the list are welcome. I'm sure that there's something missing as well.
>>>>> Right now it's just a text file, but some nicer scripts and
>>>>> postprocessing for the output might happen at some point. All the
>>>>> examples are also very HP-oriented at this point.
>>>>> Best regards,
>>>>> Olli-Pekka
>>>>> --
>>>>> Olli-Pekka Lehto
>>>>> Development Manager
>>>>> Computing Platforms
>>>>> CSC - IT Center for Science Ltd.
>>>>> E-Mail: olli-pekka.lehto at csc.fi
>>>>> Tel: +358 50 381 8604
>>>>> skype: oplehto // twitter: ople
>>>
>>
>> --
>> Doug
>>
>
>
--
Doug