[Beowulf] Varying performance across identical cluster nodes.

Fri Sep 8 12:42:08 PDT 2017

I would also suspect a thermal issue. To verify a temperature problem, you
might try setting up lm_sensors or scraping "ipmitool sdr" output (whichever
is easier) at regular intervals, then make a performance-vs-temperature plot
for each node. As Andrew mentioned, it could also be firmware/CPU microcode.
We recently tracked down a problem with some of our nodes that ended up being
microcode-related: the CPUs would start in a high-power state but end up
stuck in a low-power state, regardless of the power management settings we
had set in the BIOS.
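
A rough sketch of the scraping step, in case it helps. It assumes "ipmitool
sdr" prints rows like "CPU1 Temp | 45 degrees C | ok" (sensor names differ
between boards, so treat the names as placeholders) and that /proc/cpuinfo
has "cpu MHz" lines, which also lets you spot cores stuck at a low clock.
The function names and the 60-second interval are my own choices, not
anything standard.

```python
import re
import subprocess
import time

# Assumed row format from "ipmitool sdr": "<name> | <value> degrees C | <status>"
TEMP_ROW = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<value>-?\d+(?:\.\d+)?)\s+degrees C\s*\|"
)


def parse_sdr_temps(sdr_text):
    """Return {sensor_name: degrees_C} for rows that report a temperature."""
    temps = {}
    for line in sdr_text.splitlines():
        match = TEMP_ROW.match(line)
        if match:  # rows such as fans or "no reading" are skipped
            temps[match.group("name")] = float(match.group("value"))
    return temps


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core clock speeds (MHz) listed in /proc/cpuinfo text."""
    return [
        float(line.split(":", 1)[1])
        for line in cpuinfo_text.splitlines()
        if line.startswith("cpu MHz")
    ]


def sample():
    """Take one sample: (unix_time, lowest core MHz, temperature dict)."""
    sdr = subprocess.run(
        ["ipmitool", "sdr"], capture_output=True, text=True, check=True
    ).stdout
    with open("/proc/cpuinfo") as f:
        mhz = parse_cpu_mhz(f.read())
    return time.time(), min(mhz), parse_sdr_temps(sdr)


if __name__ == "__main__":
    # Log one line per minute while LINPACK runs; plot it per node afterwards.
    while True:
        stamp, min_mhz, temps = sample()
        readings = " ".join(f"{k}={v}" for k, v in sorted(temps.items()))
        print(f"{stamp:.0f} min_mhz={min_mhz:.0f} {readings}", flush=True)
        time.sleep(60)
```

Run it on each node alongside LINPACK and redirect the output to a per-node
file. If the clock drops while temperatures stay flat, that points away from
thermal throttling and toward firmware or microcode.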

Skylar

On Fri, Sep 8, 2017 at 7:41 PM, Prentice Bisbal <pbisbal at pppl.gov> wrote:

> Beowulfers,
>
> I need your assistance debugging a problem:
>
> I have a dozen servers that are all identical hardware: SuperMicro servers
> with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the
> users have been complaining of wildly inconsistent performance across these
> 12 nodes. I ran LINPACK on these nodes, and was able to duplicate the
> problem, with performance varying from ~14 GFLOPS to 64 GFLOPS.
>
> I've identified that performance on the slower nodes starts off fine, and
> then slowly degrades throughout the LINPACK run. For example, on a node
> with this problem, during the first LINPACK test, I can see the performance
> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
> continues throughout the remaining tests. At the start of subsequent tests,
> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
> at the end of the test.
>
> Because of the nature of this problem, I suspect this might be a thermal
> issue. My guess is that the processor speed is being throttled to prevent
> overheating on the "bad" nodes.
>
> But here's the thing: this wasn't a problem until we upgraded to CentOS 6.
> Where I work, we use a read-only NFSroot filesystem for our cluster nodes,
> so all nodes are mounting and using the same exact read-only image of the
> operating system. This only happens with these SuperMicro nodes, and only
> with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
> installed CentOS 6 on a local disk, the nodes worked fine.
>
> Any ideas where to look or what to tweak to fix this? Any idea why this is
> only occurring with RHEL 6 w/ NFS root OS?
>
> --
> Prentice
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>