[Beowulf] Cluster consistency checks

Jeffrey Layton laytonjb at gmail.com
Tue Mar 22 08:45:20 PDT 2016


Olli-Pekka,

Very nice - I'm glad you put a list down. Many of the things that I do are
based on experience.

A long time ago, in one of my previous jobs, we used to run the NAS Parallel
Benchmarks (NPB) on single nodes to get a baseline of performance. We would
look for outliers, then triage and debug them based on these results. We
weren't running the tests for performance but to make sure the cluster was as
homogeneous as possible. Have you done this before?
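
A minimal sketch of that kind of outlier check, assuming you have already
collected one score per node (for example the Mop/s figure from a single NPB
kernel). The function name, the 5% tolerance and the node scores are made up
for illustration; they are not from any real cluster or from the checklist.

```python
from statistics import median


def find_outliers(results, tolerance=0.05):
    """Return nodes whose score differs from the median by more than
    `tolerance` (a fraction of the median). Hypothetical helper."""
    mid = median(results.values())
    return {
        node: score
        for node, score in results.items()
        if abs(score - mid) / mid > tolerance
    }


# Made-up per-node scores, purely illustrative.
scores = {"n001": 1502.0, "n002": 1498.5, "n003": 1310.2, "n004": 1505.1}

for node, score in sorted(find_outliers(scores).items()):
    print(f"{node}: {score} (triage this node)")
```

Comparing against the median rather than the mean keeps one very slow node
from dragging the baseline down with it.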

I've also seen people run HPL on single nodes and look for outliers. After
triaging these, they run HPL on smaller groups of nodes within a single
switch, again looking for outliers and triaging them. This continues up to
the entire system. The point is not to get a great HPL number to submit to
the Top500 but rather to find potential network issues, particularly
problems with network links.
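
The staged approach can be sketched roughly as follows. This only builds the
list of node groups to test at each stage; it does not launch HPL. The switch
layout is invented, and the choice to use one whole switch per group and then
merge neighbouring groups pairwise is my assumption, since sites pick their
own group sizes.

```python
def hpl_stages(switches):
    """Yield (label, groups) pairs, from single nodes up to the full system.

    `switches` maps a switch name to the list of nodes behind it.
    Hypothetical helper for planning runs only.
    """
    # Stage 1: every node on its own.
    yield "single nodes", [[n] for nodes in switches.values() for n in nodes]

    # Stage 2: all nodes behind one switch form a group.
    groups = [list(nodes) for nodes in switches.values()]
    yield "per switch", groups

    # Later stages: merge neighbouring groups pairwise until one remains.
    while len(groups) > 1:
        groups = [sum(groups[i:i + 2], []) for i in range(0, len(groups), 2)]
        yield f"{len(groups)} group(s)", groups


# Invented layout, purely illustrative.
layout = {"sw1": ["n1", "n2"], "sw2": ["n3", "n4"], "sw3": ["n5"]}

for label, groups in hpl_stages(layout):
    print(label, groups)
```

At each stage you would run HPL on every group, compare the results between
groups of the same size, and triage before moving up a level.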

Thanks for the good work!

Jeff


On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto <olli-pekka.lehto at csc.fi>
wrote:

> Hi,
>
> I finally got around to writing down my cluster-consistency checklist that
> I've been planning for a long time:
>
> https://github.com/oplehto/cluster-checks/
>
> The goal is to try to make the baseline installation of a cluster as
> consistent as possible and make vendors work for their money. :) Of course
> hopefully publishing this will help vendors capture some of the issues that
> slip through the cracks even before clusters are handed over. It's also a
> good idea to run these types of checks during the lifetime of the system as
> there's always some consistency creep as hardware gets replaced.
>
> If someone is interested in contributing, pull requests or comments on the
> list are welcome. I'm sure that there's something missing as well. Right
> now it's just a text file, but nicer scripts and postprocessing for the
> output might follow at some point. All the examples are also very
> HP-oriented at this point.
>
> Best regards,
> Olli-Pekka
> --
> Olli-Pekka Lehto
> Development Manager
> Computing Platforms
> CSC - IT Center for Science Ltd.
> E-Mail: olli-pekka.lehto at csc.fi
> Tel: +358 50 381 8604
> skype: oplehto // twitter: ople