[Beowulf] Cluster consistency checks

Olli-Pekka Lehto olli-pekka.lehto at csc.fi
Sat Mar 26 12:16:59 PDT 2016


----- Original Message -----
> From: "Kenneth Hoste" <kenneth.hoste at ugent.be>
> To: "Douglas Eadline" <deadline at eadline.org>, "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
> Cc: beowulf at beowulf.org
> Sent: Saturday, 26 March, 2016 00:15:05
> Subject: Re: [Beowulf] Cluster consistency checks

> On 23/03/16 14:55, Douglas Eadline wrote:
>>> Thanks for the kind words and comments! Good catch with HPL. It's
>>> definitely part of the test regime. I typically run 3 tests for
>>> consistency:
>>>
>>> - Separate instance of STREAM2 on each node
>>> - Separate instance of HPL on each node
>>> - Simple MPI latency / bandwidth test called mpisweep that tests every
>>> link (I'll put this up on github later as well)
> 
> Any reference to mpisweep yet?
> 
> Google didn't give me much...
> 

That's an internal code I whipped up at some point. Pretty much the minimum viable program to do a sweep of all the connections. I'll try to clean it up a bit and put it up in the next few days. 
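
The idea is roughly along these lines (a rough mpi4py sketch, not the actual code; the message size, repetition count and the one-rank-per-node assumption are just illustrative):

#!/usr/bin/env python
# Rough sketch of an all-pairs ping-pong sweep (not the actual mpisweep code).
# Assumes one MPI rank per node, e.g. with Open MPI: mpirun --map-by node ...
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
host = MPI.Get_processor_name()

NBYTES = 1 << 20                      # 1 MiB payload for a rough bandwidth figure;
REPS = 20                             # rerun with a tiny NBYTES for a latency figure
buf = np.zeros(NBYTES, dtype=np.uint8)

for src in range(size):
    for dst in range(size):
        if src == dst:
            continue
        comm.Barrier()                # only one pair talks at a time
        if rank == src:
            t0 = MPI.Wtime()
            for _ in range(REPS):
                comm.Send(buf, dest=dst, tag=0)
                comm.Recv(buf, source=dst, tag=1)
            dt = (MPI.Wtime() - t0) / REPS
            print("%s -> rank %d: rtt %.3f ms, %.0f MB/s" %
                  (host, dst, dt * 1e3, 2 * NBYTES / dt / 1e6))
        elif rank == dst:
            for _ in range(REPS):
                comm.Recv(buf, source=src, tag=0)
                comm.Send(buf, dest=src, tag=1)

Any pair that stands out from the rest in either number usually points straight at a cable or switch port worth looking at.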

>>>
>>> I now made the changes to the document.
>>>
>>> After this set of tests I'm not completely sure if NPB will add any
>>> further information. Those 3 benchmarks combined with the other checks
>>> should pretty much expose all the possible issues. However, I could be
>>> missing something again :)
>> NAS will verify the results. On several occasions I have found that NAS
>> gave good numbers but the results did not verify. This allowed me to
>> look at lower-level issues until I found the problem (in one case a
>> cable, IIRC).
>>
>> BTW, I run NAS all the time to test performance and make sure things are
>> running properly on my deskside clusters. I have done it so often that I
>> can tell which test is running by watching wwtop (a Warewulf-based
>> cluster top that shows load, network and memory, but no application
>> names).
> 
> Isn't it time someone put together all of these nice tests in a GitHub
> repo, or at least some scripts/framework around each of them to
> build/install/run/verify them with as little effort as possible?
> 
> I already know your answer: "why don't you?".
> Well, I may, some day, but who wants to help out? Any brave souls?

One of my not-so-secret motivations for putting this up is to perhaps get someone to do exactly that. :)

I should look at the current state of Intel Cluster Checker as well. A few years back it was lacking a lot of the checks I wanted, so I stuck with my set of one-liners.
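
On Doug's point about NPB verification and Jeff's per-node outlier runs: the kind of scripting Kenneth is asking for could start out as small as this (a sketch only; the node names, NPB install path and output parsing below are made-up placeholders, and it assumes passwordless ssh to the nodes):

#!/usr/bin/env python
# Sketch of a per-node run-and-verify wrapper, along the lines Doug and Jeff
# describe. Node names, the NPB install path and the exact output strings are
# placeholders; assumes passwordless ssh and a serial/OpenMP NPB build.
import subprocess

NODES = ["node%03d" % i for i in range(1, 17)]   # hypothetical node names
NPB_BIN = "/opt/npb/bin/ft.C.x"                  # hypothetical install path

for node in NODES:
    out = subprocess.run(["ssh", node, NPB_BIN],
                         capture_output=True, text=True).stdout
    # NPB's summary has lines like "Verification = SUCCESSFUL" and
    # "Mop/s total = ..."; the exact formatting varies by version.
    verified = "SUCCESSFUL" in out
    rate = [l.split("=")[1].strip() for l in out.splitlines()
            if l.strip().startswith("Mop/s total")]
    print("%-10s verified=%-5s Mop/s=%s" % (node, verified,
                                            rate[0] if rate else "n/a"))

The same wrapper idea extends directly to single-node HPL, and then to HPL on groups of nodes behind one switch as Jeff describes; only the command and the outlier criterion change.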


> 
> K.
>>
>> --
>> Doug
>>
>>> Best regards,
>>> O-P
>>> --
>>> Olli-Pekka Lehto
>>> Development Manager
>>> Computing Platforms
>>> CSC - IT Center for Science Ltd.
>>> E-Mail: olli-pekka.lehto at csc.fi
>>> Tel: +358 50 381 8604
>>> skype: oplehto // twitter: ople
>>>
>>>> From: "Jeffrey Layton" <laytonjb at gmail.com>
>>>> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
>>>> Cc: beowulf at beowulf.org
>>>> Sent: Tuesday, 22 March, 2016 16:45:20
>>>> Subject: Re: [Beowulf] Cluster consistency checks
>>>> Olli-Pekka,
>>>> Very nice - I'm glad you put a list down. Many of the things that I do
>>>> are based on experience.
>>>> A long time ago, in one of my previous jobs, we used to run the NAS
>>>> Parallel Benchmark (NPB) on single nodes to get a baseline of
>>>> performance. We would look for outliers and triage and debug them
>>>> based on these results. We weren't running the test for performance
>>>> but to make sure the cluster was as homogeneous as possible. Have you
>>>> done this before?
>>>> I've also seen people run HPL on single nodes and look for outliers.
>>>> After triaging these, HPL is run on smaller groups of nodes within a
>>>> single switch, looking for outliers and triaging them. This continues
>>>> up to the entire system. The point is not to get a great HPL number to
>>>> submit to the Top500 but rather to find potential network issues,
>>>> particularly with individual links.
>>>> Thanks for the good work!
>>>> Jeff
>>>> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto
>>>> <olli-pekka.lehto at csc.fi> wrote:
>>>>> Hi,
>>>>> I finally got around to writing down my cluster-consistency checklist
>>>>> that I've been planning for a long time:
>>>>> https://github.com/oplehto/cluster-checks/
>>>>> The goal is to try to make the baseline installation of a cluster as
>>>>> consistent as possible and to make vendors work for their money. :)
>>>>> Hopefully publishing this will also help vendors catch some of the
>>>>> issues that slip through the cracks even before clusters are handed
>>>>> over. It's also a good idea to run these types of checks during the
>>>>> lifetime of the system, as there's always some consistency creep as
>>>>> hardware gets replaced.
>>>>> If someone is interested in contributing, pull requests or comments
>>>>> on the list are welcome. I'm sure there's something missing as well.
>>>>> Right now it's just a text file, but nicer scripts and postprocessing
>>>>> for the output might happen at some point. The examples are also very
>>>>> HP-oriented at this point.
>>>>> Best regards,
>>>>> Olli-Pekka
>>>>> --
>>>>> Olli-Pekka Lehto
>>>>> Development Manager
>>>>> Computing Platforms
>>>>> CSC - IT Center for Science Ltd.
>>>>> E-Mail: olli-pekka.lehto at csc.fi
>>>>> Tel: +358 50 381 8604
>>>>> skype: oplehto // twitter: ople
>>>
>>
>> --
>> Doug

