[Beowulf] Varying performance across identical cluster nodes.
pbisbal at pppl.gov
Wed Sep 13 10:48:24 PDT 2017
Okay, based on the various responses I've gotten here and on other
lists, I feel I need to clarify things:
This problem only occurs when I'm running our NFSroot-based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I
do not have this problem, using the same exact server(s). For testing
purposes, I'm using LINPACK, and running the same executable with the
same HPL.dat file in both instances.
Because I'm testing the same hardware using different OS installations,
this should rule out the BIOS and faulty hardware as the cause.
This leads me to believe it's most likely a software configuration
issue, like a kernel tuning parameter or some other software setting.
These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual
CPUs. While I agree that should be the first thing I look at, it's not
an option for me. Other tools like FLIR and infrared thermometers aren't
really an option for me, either.
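Without per-CPU temperature readings, one indirect check is to log the
clock speed the kernel reports while LINPACK runs. A minimal sketch,
assuming the "cpu MHz" field in /proc/cpuinfo tracks frequency changes
on these nodes (it may not show hardware-level thermal throttling); the
function names here are made up for illustration:

```python
import re
import time


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every logical CPU in /proc/cpuinfo text."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def summarize(mhz):
    """Return (min, max, mean) of a non-empty list of frequencies."""
    return (min(mhz), max(mhz), sum(mhz) / len(mhz))


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Read and parse the current per-CPU frequencies (Linux only)."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())


def monitor(samples=6, interval=10.0):
    """Print a timestamped min/max/mean frequency line every `interval` seconds."""
    for i in range(samples):
        mhz = read_cpu_mhz()
        if mhz:
            lo, hi, avg = summarize(mhz)
            stamp = time.strftime("%H:%M:%S")
            print("%s min=%.0f max=%.0f avg=%.0f MHz" % (stamp, lo, hi, avg))
        if i < samples - 1:
            time.sleep(interval)
```

Calling monitor() on a "good" and a "bad" node during the same test
would show whether the reported clock speed falls as performance drops.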
What software configuration, whether a kernel parameter, the configuration
of numad or cpuspeed, or some other setting, could affect this?
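One way to narrow this down is to compare the kernel parameters of the
two installs directly. A minimal sketch, assuming the output of
"sysctl -a" has been saved to a file on each install (the file names and
function names below are hypothetical):

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by `sysctl -a`, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key missing from one side is reported with None on that side.
    """
    diffs = {}
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            diffs[key] = (a.get(key), b.get(key))
    return diffs


def compare_files(path_a, path_b):
    """Print every differing setting between two saved sysctl dumps."""
    with open(path_a) as fa, open(path_b) as fb:
        diffs = diff_settings(parse_sysctl(fa.read()), parse_sysctl(fb.read()))
    for key, (va, vb) in diffs.items():
        print("%s: %s -> %s" % (key, va, vb))
```

For example, compare_files("sysctl-nfsroot.txt", "sysctl-localdisk.txt")
would list only the settings that differ between the two boots. The same
approach works for other "key = value" style dumps.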
On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
> I need your assistance debugging a problem:
> I have a dozen servers that are all identical hardware: SuperMicro
> servers with AMD Opteron 6320 processors. Ever since we upgraded to
> CentOS 6, the users have been complaining of wildly inconsistent
> performance across these 12 nodes. I ran LINPACK on these nodes, and
> was able to duplicate the problem, with performance varying from ~14
> GFLOPS to 64 GFLOPS.
> I've identified that performance on the slower nodes starts off fine,
> and then slowly degrades throughout the LINPACK run. For example, on a
> node with this problem, during the first LINPACK test, I can see the
> performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant,
> downward trend continues throughout the remaining tests. At the start
> of subsequent tests, performance will jump up to about 9-10 GFLOPS,
> but then drop to 5-6 GFLOPS at the end of the test.
> Because of the nature of this problem, I suspect this might be a
> thermal issue. My guess is that the processor speed is being throttled
> to prevent overheating on the "bad" nodes.
> But here's the thing: this wasn't a problem until we upgraded to
> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
> cluster nodes, so all nodes are mounting and using the same exact
> read-only image of the operating system. This only happens with these
> SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
> NFSroot worked fine, and when I installed CentOS 6 on a local disk,
> the nodes worked fine.
> Any ideas where to look or what to tweak to fix this? Any idea why
> this is only occurring with RHEL 6 w/ NFS root OS?