[Beowulf] Cluster consistency checks

Tue Mar 22 09:01:49 PDT 2016

Thanks for the kind words and comments! Good catch with HPL. It's definitely part of the test regime. I typically run 3 tests for consistency: 

- Separate instance of STREAM2 on each node 
- Separate instance of HPL on each node 
- Simple MPI latency / bandwidth test called mpisweep that tests every link (I'll put this up on github later as well) 
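
To give a rough idea of how the per-node STREAM2 and HPL results can be compared afterwards, here is a minimal sketch. It is not part of the checklist: the function name, the 5% tolerance, and the sample numbers are all made up for illustration, and it assumes one positive result per node has already been collected.

```python
from statistics import median


def find_outliers(results, tolerance=0.05):
    """Return {node: fractional deviation} for nodes far from the median.

    `results` maps a node name to one positive benchmark figure
    (e.g. HPL GFLOPS or STREAM2 bandwidth). The 5% default tolerance
    is an arbitrary assumption; tune it to the noise of your system.
    """
    if not results:
        return {}
    mid = median(results.values())
    outliers = {}
    for node, value in results.items():
        deviation = (value - mid) / mid
        if abs(deviation) > tolerance:
            outliers[node] = deviation
    return outliers


# Hypothetical per-node HPL results in GFLOPS (invented numbers).
hpl_gflops = {
    "node01": 402.1,
    "node02": 399.8,
    "node03": 351.0,
    "node04": 401.5,
}

for node, dev in sorted(find_outliers(hpl_gflops).items()):
    print(f"{node}: {dev:+.1%} from median")
```

The median is used instead of the mean so that a few badly behaving nodes do not shift the baseline they are compared against.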

I have now made the changes to the document. 

After this set of tests I'm not completely sure if NPB will add any further information. Those 3 benchmarks combined with the other checks should pretty much expose all the possible issues. However, I could be missing something again :) 

Best regards, 
O-P 
-- 
Olli-Pekka Lehto 
Development Manager 
Computing Platforms 
CSC - IT Center for Science Ltd. 
E-Mail: olli-pekka.lehto at csc.fi 
Tel: +358 50 381 8604 
skype: oplehto // twitter: ople 

> From: "Jeffrey Layton" <laytonjb at gmail.com>
> To: "Olli-Pekka Lehto" <olli-pekka.lehto at csc.fi>
> Cc: beowulf at beowulf.org
> Sent: Tuesday, 22 March, 2016 16:45:20
> Subject: Re: [Beowulf] Cluster consistency checks

> Olli-Pekka,

> Very nice - I'm glad you put a list down. Many of the things that I do are based
> on experience.

> A long time ago, in one of my previous jobs, we used to run NAS Parallel
> Benchmark (NPB) on single nodes to get a baseline of performance. We would look
> for outliers and triage and debug them based on these results. We weren't
> running the test for performance but to make sure the cluster was as homogeneous
> as possible. Have you done this before?

> I've also seen people run HPL on single nodes and look for outliers. After
> triaging these, they run HPL on smaller groups of nodes within a single switch,
> look for outliers and triage them. This continues up to the entire system. The
> point is not to get a great HPL number to submit to the Top500 but rather to
> find potential network issues, particularly network links.

> Thanks for the good work!

> Jeff

> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto < olli-pekka.lehto at csc.fi >
> wrote:

>> Hi,

>> I finally got around to writing down my cluster-consistency checklist that I've
>> been planning for a long time:

>> https://github.com/oplehto/cluster-checks/

>> The goal is to try to make the baseline installation of a cluster as consistent
>> as possible and make vendors work for their money. :) Of course hopefully
>> publishing this will help vendors capture some of the issues that slip through
>> the cracks even before clusters are handed over. It's also a good idea to run
>> these types of checks during the lifetime of the system as there's always some
>> consistency creep as hardware gets replaced.

>> If someone is interested in contributing, pull requests or comments on the list
>> are welcome. I'm sure that there's something missing as well. Right now it's
>> just a text file, but nicer scripts and postprocessing for the output might
>> follow at some point. All the examples are also very HP-oriented at this
>> point.

>> Best regards,
>> Olli-Pekka
