[Beowulf] Cluster consistency checks
Eliot Eshelman
eliote at microway.com
Fri Mar 25 11:56:17 PDT 2016
Someone on the MVAPICH mailing list posted this image a few years back
(attached as Color-Coded_Benchmark.png). I find the style of
visualization compelling.
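
On the outlier question quoted below (percentage from the mean versus
standard deviations), here is a minimal sketch of one possible approach,
not a description of anyone's actual process on this list. It assumes
per-node results have already been collected into a dict; the function
names, thresholds, node names, and TRIAD numbers are all made up for
illustration. It compares each node to the cluster median rather than
the mean, since a single bad node pulls the mean toward itself and
inflates the standard deviation. The color buckets are an assumption of
this sketch and do not claim to reproduce the attached image.

```python
import statistics


def flag_outliers(results, pct_threshold=5.0):
    """Compare each node's result against the cluster median.

    results: dict of node name -> benchmark value (e.g. STREAM TRIAD MB/s).
    Values are assumed to be positive. Returns a dict of
    node name -> {"pct", "robust_z", "flagged"}.
    """
    values = list(results.values())
    median = statistics.median(values)
    # Median absolute deviation (MAD), scaled by 1.4826 so that it
    # approximates a standard deviation for normally distributed data.
    mad = statistics.median([abs(v - median) for v in values])
    scale = 1.4826 * mad
    report = {}
    for node, value in results.items():
        pct = 100.0 * (value - median) / median
        # robust_z is informational only; None if all deviations are zero.
        robust_z = (value - median) / scale if scale > 0 else None
        report[node] = {
            "pct": pct,
            "robust_z": robust_z,
            "flagged": abs(pct) > pct_threshold,
        }
    return report


def color_for(pct, warn=2.0, bad=5.0):
    """Map a percent deviation to a color bucket (hypothetical scheme)."""
    if abs(pct) > bad:
        return "red"
    if abs(pct) > warn:
        return "yellow"
    return "green"


# Made-up TRIAD numbers (MB/s), for illustration only.
sample = {
    "node01": 95000.0,
    "node02": 95200.0,
    "node03": 94800.0,
    "node04": 81000.0,
    "node05": 95100.0,
}
for node, r in sorted(flag_outliers(sample).items()):
    print(f"{node}  {r['pct']:+6.1f}%  {color_for(r['pct'])}")
```

The percentage threshold is the part to tune per benchmark: memory
bandwidth tests tend to be tightly grouped on identical hardware, while
HPL results may need a wider band.
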
On 03/25/2016 02:39 PM, Jeffrey Layton wrote:
>
> Olli-Pekka, et al,
>
> I took a look at your updated website - it looks very good. One thing
> I wanted to ask, and this question is probably one for the entire
> list, when you run a test across all of the nodes in the cluster, what
> process do you use to determine if nodes are "outliers" and need
> attention?
>
>
> For example, one test you mention is to run stream and look at the
> TRIAD results for all of the nodes. If you run it across an entire
> cluster you end up with a collection of results. What do you do with
> those results? Do you look for nodes that are a certain percentage
> outside of the mean? Or do you look for nodes that are outside one
> standard deviation from the mean?
>
> Thanks!
>
> Jeff
>
> P.S. I have my own ideas but I'm really curious what other people do.
>
>
> On Wed, Mar 23, 2016 at 9:55 AM, Douglas Eadline <deadline at eadline.org> wrote:
>
>
> > Thanks for the kind words and comments! Good catch with HPL. It's
> > definitely part of the test regime. I typically run 3 tests for
> > consistency:
> >
> > - Separate instance of STREAM2 on each node
> > - Separate instance of HPL on each node
> > - Simple MPI latency / bandwidth test called mpisweep that tests
> >   every link (I'll put this up on github later as well)
> >
> > I now made the changes to the document.
> >
> > After this set of tests I'm not completely sure if NPB will add any
> > further information. Those 3 benchmarks combined with the other
> > checks should pretty much expose all the possible issues. However, I
> > could be missing something again :)
>
> NAS will verify the results. On several occasions I have
> found NAS gave good numbers but the results did not verify.
> This allowed me to look at lower level issues until I found
> the problem (in one case a cable, IIRC).
>
> BTW, I run NAS all the time to test performance and make sure
> things are running properly on my deskside clusters. I have done
> it so often I can tell which test is running by watching wwtop
> (Warewulf cluster based top that shows loads, net, memory but no
> application names).
>
> --
> Doug
>
> >
> > Best regards,
> > O-P
> > --
> > Olli-Pekka Lehto
> > Development Manager
> > Computing Platforms
> > CSC - IT Center for Science Ltd.
> > E-Mail: olli-pekka.lehto at csc.fi
> > Tel: +358 50 381 8604
> > skype: oplehto // twitter: ople
> >
>
> >> From: "Jeffrey Layton" <laytonjb at gmail.com>
> >> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
> >> Cc: beowulf at beowulf.org
> >> Sent: Tuesday, 22 March, 2016 16:45:20
> >> Subject: Re: [Beowulf] Cluster consistency checks
> >
> >> Olli-Pekka,
> >
> >> Very nice - I'm glad you put a list down. Many of the things that I
> >> do are based on experience.
> >
> >> A long time ago, in one of my previous jobs, we used to run NAS
> >> Parallel Benchmark (NPB) on single nodes to get a baseline of
> >> performance. We would look for outliers and triage and debug them
> >> based on these results. We weren't running the test for performance
> >> but to make sure the cluster was as homogeneous as possible. Have
> >> you done this before?
> >
> >> I've also seen people run HPL on single nodes and look for
> >> outliers. After triaging these, HPL is run on smaller groups of
> >> nodes within a single switch, again looking for outliers and
> >> triaging them. This continues up to the entire system. The point is
> >> not to get a great HPL number to submit to the Top500 but rather to
> >> find potential network issues, particularly with network links.
> >
> >> Thanks for the good work!
> >
> >> Jeff
> >
> >> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto
> >> <olli-pekka.lehto at csc.fi> wrote:
> >
> >>> Hi,
> >
> >>> I finally got around to writing down my cluster-consistency
> >>> checklist that I've been planning for a long time:
> >
> >>> https://github.com/oplehto/cluster-checks/
> >
> >>> The goal is to try to make the baseline installation of a cluster
> >>> as consistent as possible and make vendors work for their money. :)
> >>> Of course hopefully publishing this will help vendors capture some
> >>> of the issues that slip through the cracks even before clusters are
> >>> handed over. It's also a good idea to run these types of checks
> >>> during the lifetime of the system as there's always some consistency
> >>> creep as hardware gets replaced.
> >
> >>> If someone is interested in contributing, pull requests or comments
> >>> on the list are welcome. I'm sure that there's something missing as
> >>> well. Right now it's just a text file but making some nicer scripts
> >>> and postprocessing for the output might happen as well at some
> >>> point. All the examples are very HP oriented as well at this point.
> >
> >>> Best regards,
> >>> Olli-Pekka
> >>> --
> >>> Olli-Pekka Lehto
> >>> Development Manager
> >>> Computing Platforms
> >>> CSC - IT Center for Science Ltd.
> >>> E-Mail: olli-pekka.lehto at csc.fi
> >>> Tel: +358 50 381 8604
> >>> skype: oplehto // twitter: ople
> >
> >>> _______________________________________________
> >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> >>> Computing
> >>> To change your subscription (digest mode or unsubscribe) visit
> >>> http://www.beowulf.org/mailman/listinfo/beowulf
> >
>
> >
>
>
--
Eliot Eshelman
Microway, Inc.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Color-Coded_Benchmark.png
Type: image/png
Size: 25431 bytes
Desc: not available
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20160325/53025556/attachment-0001.png>