[Beowulf] Varying performance across identical cluster nodes.

Wed Sep 13 11:15:52 PDT 2017

Are you swapping?
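
One rough way to check, assuming the nodes run Linux and expose /proc/vmstat: sample the pswpin and pswpout counters before and after a LINPACK run and compare them. The sketch below is only an illustration; the function names are made up here and are not part of any existing tool.

```python
import time


def parse_swap_counters(vmstat_text):
    """Extract the pswpin/pswpout counters from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two samples (missing keys count as 0)."""
    return {
        key: after.get(key, 0) - before.get(key, 0)
        for key in ("pswpin", "pswpout")
    }


def sample_swap_activity(interval_seconds=10, path="/proc/vmstat"):
    """Read the counters twice, interval_seconds apart, and return the delta.

    The default path assumes a Linux node; adjust it if yours differs.
    """
    with open(path) as handle:
        first = parse_swap_counters(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        second = parse_swap_counters(handle.read())
    return swap_delta(first, second)
```

Calling sample_swap_activity() on a slow node while LINPACK is running and seeing nonzero values would suggest the node is swapping; zeros on both counters would point elsewhere.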

On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lathama at gmail.com> wrote:

> Ack. Maybe validate that you can reproduce this with another NFS root,
> for example a lab setup where a single server serves the NFS root to the
> node. If you can reproduce it that way, that would give some direction.
> Beyond that, it sounds like an interesting problem.
>
> On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal <pbisbal at pppl.gov>
> wrote:
>
>> Okay, based on the various responses I've gotten here and on other lists,
>> I feel I need to clarify things:
>>
>> This problem only occurs when I'm running our NFSroot-based version of
>> the OS (CentOS 6). When I run the same OS installed on a local disk, I do
>> not have this problem, using the exact same server(s). For testing
>> purposes, I'm using LINPACK, and running the same executable with the same
>> HPL.dat file in both instances.
>>
>> Because I'm testing the same hardware using different OSes, this should
>> rule out the BIOS and faulty hardware as the cause. This leads me to
>> believe it's most likely a software configuration issue, such as a kernel
>> tuning parameter.
>>
>> These are Supermicro servers, and it seems they do not provide CPU temps.
>> I do see a chassis temp, but not the temps of the individual CPUs. While I
>> agree that should be the first thing I look at, it's not an option for me.
>> Other tools like FLIR and Infrared thermometers aren't really an option for
>> me, either.
>>
>> What software configuration, whether a kernel parameter, the configuration
>> of numad or cpuspeed, or some other setting, could affect this?
>>
>> Prentice
>>
>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>
>>> Beowulfers,
>>>
>>> I need your assistance debugging a problem:
>>>
>>> I have a dozen servers that are all identical hardware: SuperMicro
>>> servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS
>>> 6, the users have been complaining of wildly inconsistent performance
>>> across these 12 nodes. I ran LINPACK on these nodes, and was able to
>>> duplicate the problem, with performance varying from ~14 GFLOPS to 64
>>> GFLOPS.
>>>
>>> I've identified that performance on the slower nodes starts off fine,
>>> and then slowly degrades throughout the LINPACK run. For example, on a node
>>> with this problem, during the first LINPACK test, I can see the performance
>>> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
>>> continues throughout the remaining tests. At the start of subsequent tests,
>>> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
>>> at the end of the test.
>>>
>>> Because of the nature of this problem, I suspect this might be a thermal
>>> issue. My guess is that the processor speed is being throttled to prevent
>>> overheating on the "bad" nodes.
>>>
>>> But here's the thing: this wasn't a problem until we upgraded to CentOS
>>> 6. Where I work, we use a read-only NFSroot filesystem for our cluster
>>> nodes, so all nodes are mounting and using the same exact read-only image
>>> of the operating system. This only happens with these SuperMicro nodes, and
>>> only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
>>> installed CentOS 6 on a local disk, the nodes worked fine.
>>>
>>> Any ideas where to look or what to tweak to fix this? Any idea why this
>>> is only occurring with RHEL 6 w/ NFS root OS?
>>>
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
>
> --
> - Andrew "lathama" Latham lathama at gmail.com http://lathama.com
> <http://lathama.org> -
>