[Beowulf] Varying performance across identical cluster nodes.
joe.landman at gmail.com
Fri Sep 8 20:56:28 PDT 2017
On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
> But here's the thing: this wasn't a problem until we upgraded to
> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
> cluster nodes, so all nodes are mounting and using the same exact
> read-only image of the operating system. This only happens with these
> SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
> NFSroot worked fine, and when I installed CentOS 6 on a local disk,
> the nodes worked fine.
> Any ideas where to look or what to tweak to fix this? Any idea why
> this is only occurring with RHEL 6 w/ NFS root OS?
Sounds suspiciously like a network (or other) driver running hard in a
tight polling mode, driving up the number of context switches and
interrupts (CSW/ints) over time. Since these are Opterons (really? still
in use?), chances are you might have a firmware issue on the set of slower
nodes that has already been corrected on the other nodes. Also, with NFS
root, if one node holds a lock on a particular file that other nodes want
to write to, a waiting node can appear slow while it blocks on the IO.
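If you want a quick look at the interrupt and context-switch rates without installing anything, the kernel exposes cumulative counters in /proc/stat (the "intr" and "ctxt" lines). What follows is only a rough sketch; the function names read_counters and sample_rates are made up for illustration.

```shell
#!/bin/sh
# Print the cumulative interrupt and context-switch counters from a
# /proc/stat-style file. On the "intr" line the first number is the total.
read_counters() {
    awk '$1 == "intr" { i = $2 } $1 == "ctxt" { c = $2 } END { print i, c }' \
        "${1:-/proc/stat}"
}

# Print per-second interrupt and context-switch rates over an interval.
# Arguments: interval in seconds (default 5), stat file (default /proc/stat).
sample_rates() {
    interval="${1:-5}"
    statfile="${2:-/proc/stat}"
    set -- $(read_counters "$statfile")
    i0=$1; c0=$2
    sleep "$interval"
    set -- $(read_counters "$statfile")
    echo "int/s=$(( ($1 - i0) / interval )) csw/s=$(( ($2 - c0) / interval ))"
}
```

Calling sample_rates 5 on a slow node and on a fast node, under the same load, should show whether one of them is taking far more interrupts or context switches than the other.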
You might try running dstat and saving its output to a file from boot
onwards. Then run the tests and see whether the int or csw columns are
being driven very high. Pay attention to the usr/idl and other CPU
percentages as well.
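As a sketch of how the saved dstat log could be checked afterwards: the function below (dstat_peaks is a made-up name) reports the peak int and csw values from a CSV file written with dstat's --output option. It assumes the CSV contains a header row with columns named "int" and "csw" and unscaled numeric values, so check that against what your dstat version actually writes.

```shell
#!/bin/sh
# Example logging command (the log path is a placeholder):
#   dstat -tcy --output /var/tmp/dstat.csv 5
#
# Report the peak interrupt and context-switch values in a dstat CSV log.
# Columns are located by header name ("int", "csw"), not by position.
dstat_peaks() {
    awk -F, '
        { gsub(/"/, "") }                 # drop CSV quoting, re-split fields
        !found {
            for (n = 1; n <= NF; n++) {
                if ($n == "int") ic = n
                if ($n == "csw") cc = n
            }
            if (ic && cc) found = 1
            next                          # skip preamble and header rows
        }
        {
            if ($ic + 0 > imax) imax = $ic + 0
            if ($cc + 0 > cmax) cmax = $cc + 0
        }
        END { printf "max int=%d max csw=%d\n", imax, cmax }
    ' "$1"
}
```

Running dstat_peaks on the logs from a slow node and a fast node gives two numbers per node that are easy to compare side by side.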
You can also grab temperature stats. It helps if you have IPMI:
ipmitool sdr | grep Temp
CPU1 Temp | 35 degrees C | ok
CPU2 Temp | 35 degrees C | ok
System Temp | 35 degrees C | ok
Peripheral Temp | 38 degrees C | ok
PCH Temp | 43 degrees C | ok
If you don't have IPMI, use sensors (from lm_sensors):
Package id 1: +35.0°C (high = +82.0°C, crit = +92.0°C)
Core 0: +35.0°C (high = +82.0°C, crit = +92.0°C)
Core 1: +35.0°C (high = +82.0°C, crit = +92.0°C)
Core 2: +33.0°C (high = +82.0°C, crit = +92.0°C)
Core 3: +34.0°C (high = +82.0°C, crit = +92.0°C)
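To compare temperatures across many nodes it can be easier to print only the sensors that are running warm. Here is a minimal sketch that filters ipmitool sdr output in the format shown above; hot_sensors is a made-up name, and the default limit of 70 degrees C is an arbitrary illustrative value, not a vendor threshold.

```shell
#!/bin/sh
# Read "ipmitool sdr" output on stdin and print every temperature sensor
# whose reading is at or above a limit in degrees C (first argument).
hot_sensors() {
    awk -F'|' -v limit="${1:-70}" '
        /degrees C/ {
            name = $1
            sub(/[ \t]+$/, "", name)      # trim trailing blanks from the name
            temp = $2 + 0                 # " 35 degrees C " -> 35
            if (temp >= limit) print name ": " temp " C"
        }
    '
}
```

On a node with IPMI this would be used as ipmitool sdr | hot_sensors 60. No output means nothing is at or above the limit.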
e: joe.landman at gmail.com