[Beowulf] Cluster consistency checks

Eliot Eshelman eliote at microway.com
Fri Mar 25 11:56:17 PDT 2016


Someone on the MVAPICH mailing list posted this image a few years back. 
I find the style of visualization compelling:

[Attached image: Color-Coded_Benchmark.png, archived at
http://www.beowulf.org/pipermail/beowulf/attachments/20160325/53025556/attachment-0001.png]
On 03/25/2016 02:39 PM, Jeffrey Layton wrote:
>
> Olli-Pekka, et al,
>
> I took a look at your updated website - it looks very good. One thing 
> I wanted to ask, and this question is probably one for the entire 
> list: when you run a test across all of the nodes in the cluster, what 
> process do you use to determine whether nodes are "outliers" and need 
> attention?
>
>
> For example, one test you mention is to run STREAM and look at the 
> TRIAD results for all of the nodes. If you run it across an entire 
> cluster you end up with a collection of results. What do you do with 
> those results? Do you look for nodes that are a certain percentage 
> outside of the mean? Or do you look for nodes that are more than one 
> standard deviation from the mean?
>
> Thanks!
>
> Jeff
>
> P.S. I have my own ideas but I'm really curious what other people do.
>
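Either rule Jeff mentions is a few lines of scripting once the per-node
TRIAD numbers are collected in one place. A minimal sketch; the input
format (one "hostname MB/s" pair per line) and the thresholds are my own
assumptions, not anything from this thread:

    #!/usr/bin/env python3
    """Flag outlier nodes from per-node STREAM TRIAD results."""
    import statistics
    import sys

    PCT_THRESHOLD = 5.0    # flag nodes more than 5% below the mean
    SIGMA_THRESHOLD = 1.0  # or more than 1 standard deviation away

    results = {}
    with open(sys.argv[1]) as fh:      # lines of: hostname MB/s
        for line in fh:
            if not line.strip():
                continue
            host, triad = line.split()
            results[host] = float(triad)

    mean = statistics.mean(results.values())
    stdev = statistics.stdev(results.values())

    for host, triad in sorted(results.items()):
        pct = 100.0 * (triad - mean) / mean
        sigmas = (triad - mean) / stdev
        if pct < -PCT_THRESHOLD or abs(sigmas) > SIGMA_THRESHOLD:
            print(f"{host}: {triad:.0f} MB/s "
                  f"({pct:+.1f}%, {sigmas:+.1f} sigma)")

One note on the choice of statistic: with a handful of badly sick nodes,
the median is more robust than the mean, since the outliers drag the
mean (and inflate the standard deviation) toward themselves.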
>
> On Wed, Mar 23, 2016 at 9:55 AM, Douglas Eadline
> <deadline at eadline.org> wrote:
>
>
>     > Thanks for the kind words and comments! Good catch with HPL. It's
>     > definitely part of the test regime. I typically run 3 tests for
>     > consistency:
>     >
>     > - Separate instance of STREAM2 on each node
>     > - Separate instance of HPL on each node
>     > - Simple MPI latency / bandwidth test called mpisweep that tests
>     > every link (I'll put this up on GitHub later as well; a sketch of
>     > the idea follows below)
>     >
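mpisweep itself isn't on GitHub yet, so purely as a sketch of the
general idea (my own illustration, not O-P's code), here is a ping-pong
bandwidth sweep with mpi4py that pairs rank 0 with each other rank in
turn:

    #!/usr/bin/env python3
    """Ping-pong bandwidth sweep: rank 0 exchanges a buffer with every
    other rank in turn. Illustrative only; the real mpisweep may work
    quite differently."""
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nbytes = 8 * 1024 * 1024           # 8 MiB message
    reps = 10
    buf = np.zeros(nbytes, dtype=np.uint8)

    for peer in range(1, comm.Get_size()):
        comm.Barrier()
        if rank == 0:
            t0 = MPI.Wtime()
            for _ in range(reps):
                comm.Send(buf, dest=peer)
                comm.Recv(buf, source=peer)
            dt = MPI.Wtime() - t0
            # Two transfers of nbytes per repetition (there and back).
            mbps = 2 * reps * nbytes / dt / 1e6
            name = comm.recv(source=peer)  # peer's hostname for the report
            print(f"rank 0 <-> rank {peer} ({name}): {mbps:.0f} MB/s")
        elif rank == peer:
            for _ in range(reps):
                comm.Recv(buf, source=0)
                comm.Send(buf, dest=0)
            comm.send(MPI.Get_processor_name(), dest=0)

Launch with one rank per node, e.g. something like "mpirun -np <nodes>
--map-by node python3 sweep.py". A sweep that really exercises every
link would iterate over all rank pairs rather than only pairs involving
rank 0.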
>     > I've now made the changes to the document.
>     >
>     > After this set of tests I'm not completely sure whether NPB will
>     > add any further information. Those three benchmarks combined with
>     > the other checks should expose pretty much all the possible
>     > issues. However, I could be missing something again :)
>
>     NAS will verify the results. On several occasions I have
>     found that NAS gave good numbers but the results did not verify.
>     This allowed me to look at lower-level issues until I found
>     the problem (in one case a cable, IIRC).
>
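The verification check is mechanical once the runs are done. A minimal
sketch that scans per-node NPB logs for runs whose verification failed;
the log layout is an assumption, and the exact wording of the
Verification and Mop/s lines can vary between NPB versions:

    #!/usr/bin/env python3
    """Report NPB runs that produced numbers but failed verification."""
    import glob
    import re

    for path in sorted(glob.glob("npb-logs/*.out")):  # assumed layout
        text = open(path).read()
        verified = re.search(r"Verification\s*=\s*SUCCESSFUL", text)
        mops = re.search(r"Mop/s total\s*=\s*([\d.]+)", text)
        if not verified:
            rate = mops.group(1) if mops else "n/a"
            print(f"{path}: verification FAILED "
                  f"(Mop/s reported: {rate})")

As Doug's cable example shows, this is exactly the case where the
performance number alone would not have raised a flag.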
>     BTW, I run NAS all the time to test performance and make sure
>     things are running properly on my deskside clusters. I have done
>     it so often that I can tell which test is running by watching wwtop
>     (a Warewulf-based cluster top that shows load, network, and memory,
>     but no application names).
>
>     --
>     Doug
>
>     >
>     > Best regards,
>     > O-P
>     > --
>     > Olli-Pekka Lehto
>     > Development Manager
>     > Computing Platforms
>     > CSC - IT Center for Science Ltd.
>     > E-Mail: olli-pekka.lehto at csc.fi
>     > Tel: +358 50 381 8604
>     > skype: oplehto // twitter: ople
>     >
>
>     >> From: "Jeffrey Layton" <laytonjb at gmail.com>
>     >> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
>     >> Cc: beowulf at beowulf.org
>     >> Sent: Tuesday, 22 March, 2016 16:45:20
>     >> Subject: Re: [Beowulf] Cluster consistency checks
>     >
>     >> Olli-Pekka,
>     >
>     >> Very nice - I'm glad you put a list down. Many of the things
>     >> that I do are based on experience.
>     >
>     >> A long time ago, in one of my previous jobs, we used to run the
>     >> NAS Parallel Benchmark (NPB) on single nodes to get a baseline
>     >> of performance. We would look for outliers and triage and debug
>     >> them based on these results. We weren't running the test for
>     >> performance but to make sure the cluster was as homogeneous as
>     >> possible. Have you done this before?
>     >
>     >> I've also seen people run HPL on single nodes and look for
>     >> outliers. After triaging these, HPL is run on smaller groups of
>     >> nodes within a single switch, again looking for outliers and
>     >> triaging them. This continues up to the entire system. The point
>     >> is not to get a great HPL number to submit to the Top500 but
>     >> rather to find potential network issues, particularly bad links.
>     >
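The bottom-up triage Jeff describes is easy to drive from a script. A
sketch only: run_hpl() below is a hypothetical wrapper around whatever
HPL launcher you actually use, and SWITCH_MAP stands in for a real
node-to-switch map from your fabric management tools.

    #!/usr/bin/env python3
    """Run HPL bottom-up (single nodes, per-switch groups, full system)
    and flag outliers at each level. run_hpl() and SWITCH_MAP are
    hypothetical stand-ins, not part of anything in this thread."""
    import statistics
    import subprocess

    SWITCH_MAP = {                 # node -> leaf switch (assumed layout)
        "node001": "leaf1", "node002": "leaf1",
        "node003": "leaf2", "node004": "leaf2",
    }

    def run_hpl(nodes):
        """Launch HPL on the given nodes; assumes the launcher prints
        a single GFLOPS figure on stdout."""
        out = subprocess.run(["./run_hpl.sh", ",".join(nodes)],
                             capture_output=True, text=True, check=True)
        return float(out.stdout.strip())

    def flag_outliers(scores, label, pct=5.0):
        """Report entries more than pct percent below the group mean."""
        mean = statistics.mean(scores.values())
        for name, gflops in sorted(scores.items()):
            below = 100.0 * (mean - gflops) / mean
            if below > pct:
                print(f"[{label}] {name}: {gflops:.1f} GFLOPS "
                      f"({below:.1f}% below mean)")

    # Level 1: every node by itself.
    flag_outliers({n: run_hpl([n]) for n in SWITCH_MAP}, "single-node")

    # Level 2: all nodes behind each leaf switch together.
    groups = {}
    for node, switch in SWITCH_MAP.items():
        groups.setdefault(switch, []).append(node)
    flag_outliers({s: run_hpl(ns) for s, ns in groups.items()},
                  "per-switch")

    # Level 3: the whole system at once.
    print("full system:", run_hpl(sorted(SWITCH_MAP)), "GFLOPS")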
>     >> Thanks for the good work!
>     >
>     >> Jeff
>     >
>     >> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto
>     >> <olli-pekka.lehto at csc.fi> wrote:
>     >
>     >>> Hi,
>     >
>     >>> I finally got around to writing down my cluster-consistency
>     >>> checklist that I've been planning for a long time:
>     >
>     >>> https://github.com/oplehto/cluster-checks/
>     >>> The goal is to try to make the baseline installation of a
>     >>> cluster as consistent as possible and make vendors work for
>     >>> their money. :) Hopefully publishing this will also help
>     >>> vendors catch some of the issues that slip through the cracks
>     >>> even before clusters are handed over. It's also a good idea to
>     >>> run these types of checks during the lifetime of the system, as
>     >>> there's always some consistency creep as hardware gets replaced.
>     >
>     >>> If someone is interested in contributing, pull requests or
>     >>> comments on the list are welcome. I'm sure that there's
>     >>> something missing as well. Right now it's just a text file, but
>     >>> nicer scripts and postprocessing for the output might happen at
>     >>> some point. At the moment all the examples are also very
>     >>> HP-oriented.
>     >
>     >>> Best regards,
>     >>> Olli-Pekka
>     >>> --
>     >>> Olli-Pekka Lehto
>     >>> Development Manager
>     >>> Computing Platforms
>     >>> CSC - IT Center for Science Ltd.
>     >>> E-Mail: olli-pekka.lehto at csc.fi
>     >>> Tel: +358 50 381 8604
>     >>> skype: oplehto // twitter: ople
>     >
>     >
>
>
>
>     --
>     Doug
>
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


-- 
Eliot Eshelman
Microway, Inc.

