[Beowulf] Varying performance across identical cluster nodes.
lathama at gmail.com
Fri Sep 8 11:56:23 PDT 2017
Shooting from the hip:
1. Identical BIOS version and settings
2. Firmware on devices (I assume nothing, just thinking out loud)
3. Re-seat or replace fans (oxidized contacts; silly, but why not)
4. Verify the power supplies are identical (different wattages, etc.; maybe
swap one out and test)
5. Memory cooling heat sinks? (I have seen identical orders arrive with
different memory, some with heat sinks)
6. Thermal paste
7. Blanking panels on empty drive bays
8. Location in rack/room
9. Blanking panels in the rack
Shared to promote thought
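
To help confirm or rule out the throttling theory in the quoted message below, here is a minimal sketch that samples per-core clock speeds while LINPACK runs. It assumes a Linux node where /proc/cpuinfo reports "cpu MHz" lines; the function names (parse_cpu_mhz, summarize, watch) and the sampling defaults are illustrative, not from any existing tool. Note that on an idle node the frequency governor may legitimately lower clocks, so the comparison only means something under load.

```python
import re
import time


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core clock speeds (MHz) found in /proc/cpuinfo text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text,
                         flags=re.MULTILINE)
    return [float(m) for m in matches]


def summarize(mhz):
    """Summarize a list of clock speeds; returns None if the list is empty."""
    if not mhz:
        return None
    return {
        "cores": len(mhz),
        "min": min(mhz),
        "max": max(mhz),
        "mean": sum(mhz) / len(mhz),
    }


def watch(path="/proc/cpuinfo", interval=5.0, samples=12):
    """Print a timestamped clock summary every `interval` seconds.

    The defaults (5 s, 12 samples) are arbitrary; lengthen them to cover
    a full LINPACK run.
    """
    for _ in range(samples):
        with open(path) as f:
            summary = summarize(parse_cpu_mhz(f.read()))
        print(time.strftime("%H:%M:%S"), summary)
        time.sleep(interval)
```

Calling watch() on a "good" node and a "bad" node during the same test and comparing the output should show whether the clocks on the slow node sag as the run progresses. If they stay flat while GFLOPS drop, throttling is less likely to be the cause.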
On Fri, Sep 8, 2017 at 1:41 PM, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> I need your assistance debugging a problem:
> I have a dozen servers that are all identical hardware: SuperMicro servers
> with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the
> users have been complaining of wildly inconsistent performance across these
> 12 nodes. I ran LINPACK on these nodes, and was able to duplicate the
> problem, with performance varying from ~14 GFLOPS to 64 GFLOPS.
> I've identified that performance on the slower nodes starts off fine, and
> then slowly degrades throughout the LINPACK run. For example, on a node
> with this problem, during the first LINPACK test, I can see the performance
> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
> continues throughout the remaining tests. At the start of subsequent tests,
> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
> at the end of the test.
> Because of the nature of this problem, I suspect this might be a thermal
> issue. My guess is that the processor speed is being throttled to prevent
> overheating on the "bad" nodes.
> But here's the thing: this wasn't a problem until we upgraded to CentOS 6.
> Where I work, we use a read-only NFSroot filesystem for our cluster nodes,
> so all nodes are mounting and using the same exact read-only image of the
> operating system. This only happens with these SuperMicro nodes, and only
> with CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
> installed CentOS 6 on a local disk, the nodes worked fine.
> Any ideas where to look or what to tweak to fix this? Any idea why this is
> only occurring with RHEL 6 w/ NFS root OS?
- Andrew "lathama" Latham lathama at gmail.com http://lathama.com