<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Someone on the MVAPICH mailing list
posted this image a few years back. I find the style of
visualization compelling:<br>
<br>
<img alt="" src="cid:part1.07070706.07030507@microway.com"
height="480" width="588"><br>
<br>
<br>
On 03/25/2016 02:39 PM, Jeffrey Layton wrote:<br>
</div>
<blockquote
cite="mid:CAJfzO5SFxrxc=NqQx5846r1sRSMPewnHY-kBx5OOJgTJF3gVcg@mail.gmail.com"
type="cite">
<p><defanged_div>Olli-Pekka, et al,<br>
<br>
</defanged_div></p>
<defanged_div>I took a look at your updated website - it looks
very good. One thing I wanted to ask, and this is
probably a question for the entire list: when you run a test across all
of the nodes in the cluster, what process do you use to
determine whether nodes are "outliers" and need attention?<br>
<defanged_div>
<p><defanged_div><br>
</defanged_div></p>
<defanged_div>For example, one test you mention is to run
STREAM and look at the TRIAD results for all of the nodes.
If you run it across an entire cluster you end up with a
collection of results. What do you do with those results? Do
you look for nodes that are more than a certain percentage away
from the mean? Or do you look for nodes that are more than one
standard deviation from the mean?<br>
<br>
<defanged_div>Thanks!<br>
<br>
<defanged_div>Jeff<br>
<br>
<defanged_div>P.S. I have my own ideas but I'm really
curious what other people do.<br>
<br>
<defanged_div>
<p><defanged_div class="gmail_extra"><br>
</defanged_div></p>
<p><defanged_div class="gmail_quote">On Wed, Mar 23,
2016 at 9:55 AM, Douglas Eadline <defanged_span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:deadline@eadline.org"
target="_blank">deadline@eadline.org</a>></defanged_span>
wrote:<br>
</defanged_div></p>
<blockquote class="gmail_quote"
defanged_style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex"><defanged_span
class=""><br>
> Thanks for the kind words and comments!
Good catch with HPL. It's<br>
> definitely part of the test regime. I
typically run 3 tests for<br>
> consistency:<br>
><br>
> - Separate instance of STREAM2 on each node<br>
> - Separate instance of HPL on each node<br>
> - Simple MPI latency / bandwidth test
called mpisweep that tests every<br>
> link (I'll put this up on github later as
well)<br>
><br>
> I have now made the changes to the document.<br>
><br>
> After this set of tests I'm not completely
sure if NPB will add any<br>
> further information. Those 3 benchmarks
combined with the other checks<br>
> should pretty much expose all the possible
issues. However, I could be<br>
> missing something again :)<br>
<br>
</defanged_span>NAS will verify the results. On
several occasions I have<br>
found that NAS gave good numbers but the results did
not verify.<br>
This allowed me to look at lower-level issues
until I found<br>
the problem (in one case a cable, IIRC).<br>
<br>
BTW, I run NAS all the time to test performance
and make sure<br>
things are running properly on my deskside
clusters. I have done<br>
it so often I can tell which test is running by
watching wwtop<br>
(Warewulf cluster based top that shows loads, net,
memory but no<br>
application names).<br>
<br>
--<br>
Doug<br>
<defanged_span class=""><br>
><br>
> Best regards,<br>
> O-P<br>
> --<br>
> Olli-Pekka Lehto<br>
> Development Manager<br>
> Computing Platforms<br>
> CSC - IT Center for Science Ltd.<br>
> E-Mail: <a moz-do-not-send="true"
href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a><br>
> Tel: <a moz-do-not-send="true"
href="tel:%2B358%2050%20381%208604"
defanged_value="+358503818604">+358 50 381
8604</a><br>
> skype: oplehto // twitter: ople<br>
><br>
</defanged_span>
<p><defanged_div class="h5">>> From:
"Jeffrey Layton" <<a moz-do-not-send="true"
href="mailto:laytonjb@gmail.com">laytonjb@gmail.com</a>><br>
>> To: "Olli-Pekka Lehto" <<a
moz-do-not-send="true"
href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a>><br>
>> Cc: <a moz-do-not-send="true"
href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a><br>
>> Sent: Tuesday, 22 March, 2016
16:45:20<br>
>> Subject: Re: [Beowulf] Cluster
consistency checks<br>
><br>
>> Olli-Pekka,<br>
><br>
>> Very nice - I'm glad you put a list
down. Many of the things that I do<br>
>> are based<br>
>> on experience.<br>
><br>
>> A long time ago, in one of my
previous jobs, we used to run the NAS Parallel<br>
>> Benchmarks (NPB) on single nodes to
get a baseline of performance. We<br>
>> would look<br>
>> for outliers and triage and debug
them based on these results. We weren't<br>
>> running the test for performance but
to make sure the cluster was as<br>
>> homogeneous<br>
>> as possible. Have you done this
before?<br>
><br>
>> I've also seen people run HPL on
single nodes and look for outliers.<br>
>> After<br>
>> triaging these, HPL is run on smaller
groups of nodes within a single<br>
>> switch,<br>
>> and outliers are again found and triaged.
This continues up to the entire<br>
>> system. The<br>
>> point is not to get a great HPL
number to submit to the Top500 but<br>
>> rather to<br>
>> find potential network issues,
particularly with network links.<br>
><br>
>> Thanks for the good work!<br>
><br>
>> Jeff<br>
><br>
>> On Tue, Mar 22, 2016 at 11:32 AM,
Olli-Pekka Lehto <<br>
>> <a moz-do-not-send="true"
href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a>
><br>
>> wrote:<br>
><br>
>>> Hi,<br>
><br>
>>> I finally got around to writing
down my cluster-consistency checklist<br>
>>> that I've<br>
>>> been planning for a long time:<br>
><br>
>>> <a moz-do-not-send="true"
href="https://github.com/oplehto/cluster-checks/"
defanged_rel="noreferrer" target="_blank">https://github.com/oplehto/cluster-checks/</a><br>
>>> The goal is to try to make the
baseline installation of a cluster as<br>
>>> consistent<br>
>>> as possible and make vendors work
for their money. :) Of course<br>
>>> hopefully<br>
>>> publishing this will help vendors
capture some of the issues that slip<br>
>>> through<br>
>>> the cracks even before clusters
are handed over. It's also a good idea<br>
>>> to run<br>
>>> these types of checks during the
lifetime of the system as there's<br>
>>> always some<br>
>>> consistency creep as hardware
gets replaced.<br>
><br>
>>> If someone is interested in
contributing, pull requests or comments on<br>
>>> the list<br>
>>> are welcome. I'm sure that
there's something missing as well. Right now<br>
>>> it's<br>
>>> just a text file, but making some
nicer scripts and postprocessing for<br>
>>> the<br>
>>> output might happen as well at
some point. All the examples are very HP<br>
>>> oriented as well at this point.<br>
><br>
>>> Best regards,<br>
>>> Olli-Pekka<br>
>>> --<br>
>>> Olli-Pekka Lehto<br>
>>> Development Manager<br>
>>> Computing Platforms<br>
>>> CSC - IT Center for Science Ltd.<br>
>>> E-Mail: <a
moz-do-not-send="true"
href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a><br>
>>> Tel: <a moz-do-not-send="true"
href="tel:%2B358%2050%20381%208604"
defanged_value="+358503818604">+358 50 381
8604</a><br>
>>> skype: oplehto // twitter: ople<br>
><br>
>>>
_______________________________________________<br>
>>> Beowulf mailing list, <a
moz-do-not-send="true"
href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a>
sponsored by Penguin<br>
>>> Computing<br>
>>> To change your subscription
(digest mode or unsubscribe) visit<br>
>>> <a moz-do-not-send="true"
href="http://www.beowulf.org/mailman/listinfo/beowulf"
defanged_rel="noreferrer" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
><br>
</defanged_div></p>
<defanged_div><defanged_div>> --<br>
> Mailscanner: Clean<br>
<defanged_span class="">><br>
>
<br>
<br>
</defanged_span>--<br>
Doug<br>
<defanged_span class="HOEnZb"><font
color="#888888"><br>
<br>
</font></defanged_span></defanged_div></defanged_div></blockquote>
<defanged_div><br>
<defanged_div>
<br>
</defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></blockquote>
<br>
<br>
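On the TRIAD question quoted above, here is a minimal sketch of one
way to flag outlier nodes. It is not anyone's actual procedure from
this thread. It swaps the mean and standard deviation for the median
and the median absolute deviation (MAD), on the assumption that a few
bad nodes should not be able to shift the baseline they are compared
against. The function name, the 5% tolerance, the 3.5 cutoff, the node
names and the rates are all made up for illustration.<br>
<br>
<pre wrap="">
```python
import statistics


def flag_outliers(triad_by_node, pct_tol=0.05, z_cutoff=3.5):
    """Return the nodes whose TRIAD rate is suspiciously far from the median."""
    rates = list(triad_by_node.values())
    median = statistics.median(rates)
    # Median absolute deviation: a spread estimate that a few bad nodes cannot inflate.
    mad = statistics.median([abs(r - median) for r in rates])
    flagged = []
    for node, rate in sorted(triad_by_node.items()):
        deviation = abs(rate - median)
        # Rule 1: more than pct_tol away from the median, as a fraction of the median.
        off_by_percent = deviation > pct_tol * median
        # Rule 2: modified z-score above the cutoff (skipped when the MAD is zero).
        off_by_zscore = mad > 0 and 0.6745 * deviation > z_cutoff * mad
        if off_by_percent or off_by_zscore:
            flagged.append(node)
    return flagged


# Hypothetical TRIAD rates in MB/s, one value per node.
sample = {
    "n001": 81250.0,
    "n002": 80990.0,
    "n003": 81410.0,
    "n004": 67300.0,
    "n005": 81120.0,
    "n006": 80870.0,
}
print(flag_outliers(sample))
```
</pre>
With these made-up numbers only n004 is flagged. The percentage rule
is the easier one to explain to a vendor, while the MAD rule adapts to
how tightly the healthy nodes cluster, so in practice the two
thresholds would need tuning per benchmark and per machine.<br>
<br>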
<div class="moz-signature">-- <br>
Eliot Eshelman<br>
Microway, Inc.
</div>
</body>
</html>