[Beowulf] Varying performance across identical cluster nodes.
hearnsj at googlemail.com
Thu Sep 14 06:25:58 PDT 2017
Prentice, as I understand it the problem here is that with the same OS
and IB drivers, there is a big difference in performance between stateful
and NFS root nodes.
Throwing my hat into the ring, try looking to see if there is an
excessive rate of interrupts in the nfsroot case, coming from the network:
watch cat /proc/interrupts
You will probably need a large terminal window for this (or there is
probably a way to filter the output).
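One possible way to do that filtering is to sample /proc/interrupts twice and
print only the per-second rates for the network-looking IRQs. What follows is
a minimal sketch, not a tested tool: the function names, the 5 second
interval and the keyword list ("eth", "mlx", "ib") are all my own assumptions,
so adjust the keywords to whatever device names show up in the last column on
your nodes.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} from /proc/interrupts text."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; rows such as ERR may have fewer columns.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table


def interrupt_rates(before, after, seconds, keywords=()):
    """Return [(rate_per_second, label, description)], highest rate first.

    If keywords is non-empty, keep only rows whose description contains
    at least one of them (hypothetical filter, e.g. NIC or HCA names).
    """
    rates = []
    for label, (total, description) in after.items():
        if keywords and not any(k in description for k in keywords):
            continue
        previous = before.get(label, (0, ""))[0]
        rates.append(((total - previous) / seconds, label, description))
    return sorted(rates, reverse=True)


def print_network_interrupt_rates(interval=5.0, path="/proc/interrupts",
                                  keywords=("eth", "mlx", "ib")):
    """Sample the file twice, `interval` seconds apart, and print the rates."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    for rate, label, description in interrupt_rates(first, second, interval, keywords):
        print(f"{rate:10.1f}/s  IRQ {label:>4}  {description}")
```

Calling print_network_interrupt_rates() on a stateful node and on an nfsroot
node while xhpl is running should make any large difference in network
interrupt rate easy to spot.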
On 14 September 2017 at 15:14, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> Good question. I just checked using vmstat. When running xhpl on both
> systems, vmstat shows only zeros for si and so, even long after the
> performance degrades on the nfsroot instance. Just to be sure, I
> double-checked with top, which shows 0k of swap being used.
> On 09/13/2017 02:15 PM, Scott Atchley wrote:
> Are you swapping?
> On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lathama at gmail.com> wrote:
>> ack, so maybe validate you can reproduce with another nfs root. Maybe a
>> lab setup where a single server is serving nfs root to the node. If you
>> could reproduce in that way then it would give some direction. Beyond that
>> it sounds like an interesting problem.
>> On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal <pbisbal at pppl.gov>
>>> Okay, based on the various responses I've gotten here and on other
>>> lists, I feel I need to clarify things:
>>> This problem only occurs when I'm running our NFSroot based version of
>>> the OS (CentOS 6). When I run the same OS installed on a local disk, I do
>>> not have this problem, using the same exact server(s). For testing
>>> purposes, I'm using LINPACK, and running the same executable with the same
>>> HPL.dat file in both instances.
>>> Because I'm testing the same hardware using different OSes, this
>>> (should) eliminate the problem being in the BIOS, and faulty hardware. This
>>> leads me to believe it's most likely a software configuration issue, like a
>>> kernel tuning parameter, or some other software configuration issue.
>>> These are Supermicro servers, and it seems they do not provide CPU
>>> temps. I do see a chassis temp, but not the temps of the individual CPUs.
>>> While I agree that should be the first thing I look at, it's not an option
>>> for me. Other tools like FLIR and Infrared thermometers aren't really an
>>> option for me, either.
>>> What software configuration, either a kernel parameter, configuration
>>> of numad or cpuspeed, or some other setting, could affect this?
>>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>>> I need your assistance debugging a problem:
>>>> I have a dozen servers that are all identical hardware: SuperMicro
>>>> servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS
>>>> 6, the users have been complaining of wildly inconsistent performance
>>>> across these 12 nodes. I ran LINPACK on these nodes, and was able to
>>>> duplicate the problem, with performance varying from ~14 GFLOPS to 64 GFLOPS.
>>>> I've identified that performance on the slower nodes starts off fine,
>>>> and then slowly degrades throughout the LINPACK run. For example, on a node
>>>> with this problem, during first LINPACK test, I can see the performance
>>>> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
>>>> continues throughout the remaining tests. At the start of subsequent tests,
>>>> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
>>>> at the end of the test.
>>>> Because of the nature of this problem, I suspect this might be a
>>>> thermal issue. My guess is that the processor speed is being throttled to
>>>> prevent overheating on the "bad" nodes.
>>>> But here's the thing: this wasn't a problem until we upgraded to CentOS
>>>> 6. Where I work, we use a read-only NFSroot filesystem for our cluster
>>>> nodes, so all nodes are mounting and using the same exact read-only image
>>>> of the operating system. This only happens with these SuperMicro nodes, and
>>>> only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
>>>> installed CentOS 6 on a local disk, the nodes worked fine.
>>>> Any ideas where to look or what to tweak to fix this? Any idea why this
>>>> is only occurring with RHEL 6 w/ NFS root OS?
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>> - Andrew "lathama" Latham lathama at gmail.com http://lathama.com
>> <http://lathama.org> -