[Beowulf] Varying performance across identical cluster nodes.

Thu Sep 14 06:26:21 PDT 2017

Switching away from NFS root is not an option for me right now.

Prentice

On 09/13/2017 02:45 PM, Joe Landman wrote:
> FWIW: I gave up on NFS boot a while ago, due in part to performance
> problems that were hard to track down. The environment I created to
> do complete ramboot boots at scale allows me to pivot to NFS if
> desired (a boot-time switch), but I rarely use that. Pure ramboot has
> been a joy to work with compared to NFS.
>
>
> On 09/13/2017 01:48 PM, Prentice Bisbal wrote:
>> Okay, based on the various responses I've gotten here and on other 
>> lists, I feel I need to clarify things:
>>
>> This problem only occurs when I'm running our NFSroot-based version 
>> of the OS (CentOS 6). When I run the same OS installed on a local 
>> disk, I do not have this problem, using the exact same server(s). 
>> For testing purposes, I'm using LINPACK, and running the same 
>> executable with the same HPL.dat file in both instances.
>>
>> Because I'm testing the same hardware with different OS installs, 
>> this should rule out the BIOS and faulty hardware as the cause. 
>> This leads me to believe it's most likely a software configuration 
>> issue, such as a kernel tuning parameter.
>>
>> These are Supermicro servers, and it seems they do not provide CPU 
>> temps. I do see a chassis temp, but not the temps of the individual 
>> CPUs. While I agree that should be the first thing I look at, it's 
>> not an option for me. Other tools like FLIR and infrared thermometers 
>> aren't really an option for me, either.
>>
>> What software configuration, whether a kernel parameter, the 
>> configuration of numad or cpuspeed, or some other setting, could 
>> affect this?
>>
>> Prentice
>>
>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>> Beowulfers,
>>>
>>> I need your assistance debugging a problem:
>>>
>>> I have a dozen servers that are all identical hardware: SuperMicro 
>>> servers with AMD Opteron 6320 processors. Ever since we upgraded to 
>>> CentOS 6, the users have been complaining of wildly inconsistent 
>>> performance across these 12 nodes. I ran LINPACK on these nodes, and 
>>> was able to duplicate the problem, with performance varying from ~14 
>>> GFLOPS to 64 GFLOPS.
>>>
>>> I've identified that performance on the slower nodes starts off 
>>> fine, and then slowly degrades throughout the LINPACK run. For 
>>> example, on a node with this problem, during the first LINPACK test, I 
>>> can see the performance drop from 115 GFLOPS down to 11.3 GFLOPS. 
>>> That constant, downward trend continues throughout the remaining 
>>> tests. At the start of subsequent tests, performance will jump up to 
>>> about 9-10 GFLOPS, but then drop to 5-6 GFLOPS at the end of the test.
>>>
>>> Because of the nature of this problem, I suspect this might be a 
>>> thermal issue. My guess is that the processor speed is being 
>>> throttled to prevent overheating on the "bad" nodes.
>>>
>>> But here's the thing: this wasn't a problem until we upgraded to 
>>> CentOS 6. Where I work, we use a read-only NFSroot filesystem for 
>>> our cluster nodes, so all nodes are mounting and using the same 
>>> exact read-only image of the operating system. This only happens 
>>> with these SuperMicro nodes, and only with the CentOS 6 on NFSroot. 
>>> RHEL5 on NFSroot worked fine, and when I installed CentOS 6 on a 
>>> local disk, the nodes worked fine.
>>>
>>> Any ideas where to look or what to tweak to fix this? Any idea why 
>>> this is only occurring with CentOS 6 on an NFSroot OS?
>>>
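
Since the CPU temperatures are not available, one indirect way to test the
throttling guess in the thread is to log the core clocks while LINPACK runs
and see whether they fall as the GFLOPS numbers fall. The following is only a
minimal sketch. It assumes an x86 Linux node whose /proc/cpuinfo has "cpu MHz"
lines that reflect the current clock, which may not hold for every kernel or
cpufreq driver. The function names, sample count, and interval are made up
for illustration.

```python
import re
import time

# Matches lines such as "cpu MHz\t\t: 1400.000" in /proc/cpuinfo.
# Assumption: the kernel updates this field with the current core clock.
_MHZ_LINE = re.compile(r"^cpu MHz\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE)


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core clock values (MHz) found in /proc/cpuinfo text."""
    return [float(value) for value in _MHZ_LINE.findall(cpuinfo_text)]


def summarize(mhz):
    """Return (min, mean, max) of a non-empty list of clock values."""
    if not mhz:
        raise ValueError("no 'cpu MHz' lines found")
    return min(mhz), sum(mhz) / len(mhz), max(mhz)


def log_clocks(samples=60, interval=5.0, path="/proc/cpuinfo"):
    """Print one timestamped min/mean/max line per sample (hypothetical helper)."""
    for _ in range(samples):
        with open(path) as handle:
            low, mean, high = summarize(parse_cpu_mhz(handle.read()))
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} min={low:.0f} mean={mean:.0f} max={high:.0f} MHz", flush=True)
        time.sleep(interval)
```

Calling log_clocks() in a second shell on a "good" node and a "bad" node
during the same HPL run would give two logs to compare. If the clocks stay
flat on both while performance still drops, throttling through cpufreq is
less likely and other settings deserve a look.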