[Beowulf] Cluster consistency checks
olli-pekka.lehto at csc.fi
Fri Mar 25 14:37:17 PDT 2016
Thanks for all the feedback. I'll incorporate the other suggestion over the long weekend as well.
My method for looking at outliers has been mostly "pipe through sort and look for the obvious problem nodes". The outliers tend to be pretty clear, as they deviate 20-50% from the worst and typically there's just a handful. Based on experience, I'd say ~0.1-1% of the node population on a fresh system.
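As a rough illustration of that sort-and-inspect step, here is a minimal Python sketch. It assumes the per-node results have already been collected into a dictionary of node name to STREAM TRIAD bandwidth. The function name find_slow_nodes, the node names and the numbers are all made up for the example, and the 20% cut-off is simply the lower end of the range mentioned above. It compares against the median rather than the mean so that the outliers themselves do not drag the reference value down; that is one possible choice, not the only reasonable one.

```python
import statistics


def find_slow_nodes(results, threshold=0.2):
    """Return (node, value, shortfall) tuples, worst first, for nodes whose
    value is more than `threshold` (as a fraction) below the median."""
    median = statistics.median(results.values())
    flagged = []
    for node, value in results.items():
        shortfall = (median - value) / median
        if shortfall > threshold:
            flagged.append((node, value, shortfall))
    # Sort ascending by measured value so the slowest node comes first.
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical TRIAD bandwidths in MB/s, invented for illustration only.
sample = {
    "node001": 98500.0,
    "node002": 99100.0,
    "node003": 61200.0,
    "node004": 98900.0,
    "node005": 97800.0,
}

for node, value, shortfall in find_slow_nodes(sample):
    print(f"{node}: {value:.0f} MB/s, {shortfall:.0%} below median")
```

With the made-up numbers above, only node003 is flagged. On a real system the threshold would need tuning against the normal run-to-run spread of the benchmark.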
I don't have them at hand, but I could try to dig out some of the reference runs from our systems. It would be interesting to hear whether others have similar results as well.
CSC - IT Center for Science Ltd.
E-Mail: olli-pekka.lehto at csc.fi
Tel: +358 50 381 8604
skype: oplehto // twitter: ople
> From: "Jeffrey Layton" <laytonjb at gmail.com>
> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>, beowulf at beowulf.org
> Sent: Friday, 25 March, 2016 20:39:12
> Subject: Re: [Beowulf] Cluster consistency checks
> Olli-Pekka, et al,
> I took a look at your updated website - it looks very good. One thing I wanted
> to ask, and this question is probably one for the entire list, when you run a
> test across all of the nodes in the cluster, what process do you use to
> determine if nodes are "outliers" and need attention?
> For example, one test you mention is to run stream and look at the TRIAD results
> for all of the nodes. If you run it across an entire cluster you end up with a
> collection of results. What do you do with those results? Do you look for nodes
> that are a certain percentage outside of the mean? Or do you look for nodes
> that are outside one standard deviation from the mean?
> P.S. I have my own ideas but I'm really curious what other people do.
> On Wed, Mar 23, 2016 at 9:55 AM, Douglas Eadline < deadline at eadline.org > wrote:
>> > Thanks for the kind words and comments! Good catch with HPL. It's
>> > definitely part of the test regime. I typically run 3 tests for
>> > consistency:
>> > - Separate instance of STREAM2 on each node
>> > - Separate instance of HPL on each node
>> > - Simple MPI latency / bandwidth test called mpisweep that tests every
>> > link (I'll put this up on github later as well)
>> > I now made the changes to the document.
>> > After this set of tests I'm not completely sure if NPB will add any
>> > further information. Those 3 benchmarks combined with the other checks
>> > should pretty much expose all the possible issues. However, I could be
>> > missing something again :)
>> NAS will verify the results. On several occasions I have
>> found NAS gave good numbers but the results did not verify.
>> This allowed me to look at lower level issues until I found
>> the problem (in one case a cable IIRC).
>> BTW, I run NAS all the time to test performance and make sure
>> things are running properly on my deskside clusters. I have done
>> it so often I can tell which test is running by watching wwtop
>> (Warewulf cluster based top that shows loads, net, memory but no
>> application names).
>> > Best regards,
>> > O-P
>> > --
>> > Olli-Pekka Lehto
>> > Development Manager
>> > Computing Platforms
>> > CSC - IT Center for Science Ltd.
>> > E-Mail: olli-pekka.lehto at csc.fi
>> > Tel: +358 50 381 8604
>> > skype: oplehto // twitter: ople
>> >> From: "Jeffrey Layton" < laytonjb at gmail.com >
>> >> To: "Olli-Pekka Lehto" < olli-pekka.lehto at csc.fi >
>> >> Cc: beowulf at beowulf.org
>> >> Sent: Tuesday, 22 March, 2016 16:45:20
>> >> Subject: Re: [Beowulf] Cluster consistency checks
>> >> Olli-Pekka,
>> >> Very nice - I'm glad you put a list down. Many of the things that I do
>> >> are based on experience.
>> >> A long time ago, in one of my previous jobs, we used to run NAS Parallel
>> >> Benchmark (NPB) on single nodes to get a baseline of performance. We
>> >> would look for outliers and triage and debug them based on these results.
>> >> We weren't running the test for performance but to make sure the cluster
>> >> was as homogeneous as possible. Have you done this before?
>> >> I've also seen people run HPL on single nodes and look for outliers.
>> >> After triaging these, HPL is run on smaller groups of nodes within a
>> >> single switch, look for outliers and triage them. This continues up to
>> >> the entire system. The point is not to get a great HPL number to submit
>> >> to the Top500 but rather to find potential network issues, particularly
>> >> network links.
>> >> Thanks for the good work!
>> >> Jeff
>> >> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto < olli-pekka.lehto at csc.fi > wrote:
>> >>> Hi,
>> >>> I finally got around to writing down my cluster-consistency checklist
>> >>> that I've been planning for a long time:
>> >>> https://github.com/oplehto/cluster-checks/
>> >>> The goal is to try to make the baseline installation of a cluster as
>> >>> consistent as possible and make vendors work for their money. :) Of
>> >>> course hopefully publishing this will help vendors capture some of the
>> >>> issues that slip through the cracks even before clusters are handed
>> >>> over. It's also a good idea to run these types of checks during the
>> >>> lifetime of the system as there's always some consistency creep as
>> >>> hardware gets replaced.
>> >>> If someone is interested in contributing, pull requests or comments on
>> >>> the list are welcome. I'm sure that there's something missing as well.
>> >>> Right now it's just a text-file but making some nicer scripts and
>> >>> postprocessing for the output might happen as well at some point. All
>> >>> the examples are very HP oriented as well at this point.
>> >>> Best regards,
>> >>> Olli-Pekka
>> >>> _______________________________________________
>> >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> >>> To change your subscription (digest mode or unsubscribe) visit
>> >>> http://www.beowulf.org/mailman/listinfo/beowulf