[Beowulf] Varying performance across identical cluster nodes.

Prentice Bisbal pbisbal at pppl.gov
Fri Sep 8 11:41:14 PDT 2017


Beowulfers,

I need your assistance debugging a problem:

I have a dozen servers that are all identical hardware: SuperMicro 
servers with AMD Opteron 6320 processors. Ever since we upgraded to 
CentOS 6, the users have been complaining of wildly inconsistent 
performance across these 12 nodes. I ran LINPACK on these nodes, and was 
able to duplicate the problem, with performance varying from ~14 GFLOPS 
to 64 GFLOPS.

I've identified that performance on the slower nodes starts off fine, 
and then slowly degrades throughout the LINPACK run. For example, on a 
node with this problem, during the first LINPACK test, I can see the 
performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, 
downward trend continues throughout the remaining tests. At the start of 
subsequent tests, performance will jump up to about 9-10 GFLOPS, but 
then drop to 5-6 GFLOPS at the end of the test.

Because of the nature of this problem, I suspect this might be a thermal 
issue. My guess is that the processor speed is being throttled to 
prevent overheating on the "bad" nodes.
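
One way to test the throttling guess is to sample per-core clock speeds 
while LINPACK runs. The sketch below is a minimal example, assuming the 
nodes expose the standard Linux cpufreq files (scaling_cur_freq and 
cpuinfo_max_freq) under /sys/devices/system/cpu. The function names and 
the 90% threshold are illustrative choices, not part of any existing 
tool, and the values reflect only what the cpufreq driver reports, so 
they may not capture every form of hardware throttling.

```python
import glob
import os


def read_cpu_freqs(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: (current_khz, max_khz)} for cores with cpufreq data.

    Assumes the usual sysfs cpufreq layout; cores without readable
    cpufreq files are skipped rather than treated as errors.
    """
    freqs = {}
    for cpu_dir in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*")):
        cur_path = os.path.join(cpu_dir, "cpufreq", "scaling_cur_freq")
        max_path = os.path.join(cpu_dir, "cpufreq", "cpuinfo_max_freq")
        try:
            with open(cur_path) as f:
                cur_khz = int(f.read().strip())
            with open(max_path) as f:
                max_khz = int(f.read().strip())
        except (OSError, ValueError):
            continue
        freqs[os.path.basename(cpu_dir)] = (cur_khz, max_khz)
    return freqs


def throttled_cores(freqs, threshold=0.9):
    """Return names of cores running below threshold * max frequency.

    The 0.9 default is an arbitrary illustrative cutoff.
    """
    slow = [name for name, (cur, mx) in freqs.items() if cur < threshold * mx]
    return sorted(slow, key=lambda name: int(name[3:]))


if __name__ == "__main__":
    # Take a single sample; rerun periodically during a LINPACK test.
    sample = read_cpu_freqs()
    print("cores sampled: %d, below threshold: %s"
          % (len(sample), throttled_cores(sample)))
```

Running this every few seconds on a "good" and a "bad" node during the 
same LINPACK test would show whether the reported clock speed falls as 
the GFLOPS number does. If the frequencies stay at maximum on a slow 
node, the cause is probably something other than frequency scaling.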

But here's the thing: this wasn't a problem until we upgraded to CentOS 
6. Where I work, we use a read-only NFSroot filesystem for our cluster 
nodes, so all nodes are mounting and using the same exact read-only 
image of the operating system. This only happens with these SuperMicro 
nodes, and only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked 
fine, and when I installed CentOS 6 on a local disk, the nodes worked fine.

Any ideas where to look or what to tweak to fix this? Any idea why this 
is only occurring with RHEL 6 w/ NFS root OS?

-- 
Prentice