[Beowulf] Varying performance across identical cluster nodes.

Wed Sep 13 11:45:41 PDT 2017

FWIW:  I gave up on NFS boot a while ago, due in part to performance 
problems that were hard to track down.  The environment I created to 
do pure ramboot at scale lets me pivot to NFS if desired (a boot-time 
switch), but I rarely use that.  Pure ramboot has been a joy to work 
with compared to NFS.

On 09/13/2017 01:48 PM, Prentice Bisbal wrote:
> Okay, based on the various responses I've gotten here and on other 
> lists, I feel I need to clarify things:
>
> This problem only occurs when I'm running our NFSroot based version of 
> the OS (CentOS 6). When I run the same OS installed on a local disk, I 
> do not have this problem, using the same exact server(s).  For testing 
> purposes, I'm using LINPACK, and running the same executable  with the 
> same HPL.dat file in both instances.
>
> Because I'm testing the same hardware under different OS installs, 
> this (should) rule out the BIOS and faulty hardware as the cause. 
> This leads me to believe it's most likely a software configuration 
> issue, like a kernel tuning parameter or some other setting.
>
> These are Supermicro servers, and it seems they do not provide CPU 
> temps. I do see a chassis temp, but not the temps of the individual 
> CPUs. While I agree that should be the first thing I look at, it's not 
> an option for me. Other tools like FLIR and Infrared thermometers 
> aren't really an option for me, either.
>
> What software configuration, either a kernel parameter, 
> configuration of numad or cpuspeed, or some other setting, could 
> affect this?
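
One way to narrow this down is to dump the tunables on the same node under both boot types and compare them. The following is a minimal sketch, assuming Python 3 is available and that you have saved the output of `sysctl -a` to two text files, one from the NFSroot boot and one from the local-disk boot. The function names and file names are illustrative, not from any existing tool.

```python
import sys


def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by `sysctl -a`, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that have no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys that differ.

    A key missing on one side shows up with None for that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python sysctl_diff.py nfsroot.txt localdisk.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        changes = diff_settings(parse_sysctl(f1.read()), parse_sysctl(f2.read()))
    for key, (left, right) in changes.items():
        print("%s: %s -> %s" % (key, left, right))
```

Some keys (counters, random seeds, and the like) will differ on every boot, so expect noise. The same approach should work for any "key = value" style dump, though it will not catch differences in which services, such as numad or cpuspeed, are actually running.
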
>
> Prentice
>
> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>> Beowulfers,
>>
>> I need your assistance debugging a problem:
>>
>> I have a dozen servers that are all identical hardware: SuperMicro 
>> servers with AMD Opteron 6320 processors. Ever since we upgraded to 
>> CentOS 6, the users have been complaining of wildly inconsistent 
>> performance across these 12 nodes. I ran LINPACK on these nodes, and 
>> was able to duplicate the problem, with performance varying from ~14 
>> GFLOPS to 64 GFLOPS.
>>
>> I've identified that performance on the slower nodes starts off fine, 
>> and then slowly degrades throughout the LINPACK run. For example, on 
>> a node with this problem, during the first LINPACK test, I can see the 
>> performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, 
>> downward trend continues throughout the remaining tests. At the start 
>> of subsequent tests, performance will jump up to about 9-10 GFLOPS, 
>> but then drop to 5-6 GFLOPS at the end of the test.
>>
>> Because of the nature of this problem, I suspect this might be a 
>> thermal issue. My guess is that the processor speed is being 
>> throttled to prevent overheating on the "bad" nodes.
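
Without CPU temperature sensors, one indirect check on the throttling theory is to log the core clocks the kernel reports while LINPACK runs. This is a rough sketch, assuming Python 3 and that /proc/cpuinfo on these nodes includes the usual x86 "cpu MHz" lines. The function names are illustrative. Note that this only shows the frequency the kernel believes it has set, so throttling done entirely in hardware may not appear here.

```python
import sys
import time


def parse_cpu_mhz(cpuinfo_text):
    """Extract the per-core 'cpu MHz' values from /proc/cpuinfo text."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            speeds.append(float(line.split(":", 1)[1]))
    return speeds


def sample(path="/proc/cpuinfo"):
    """Read the current per-core clock speeds in MHz."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())


def log_frequencies(interval=5.0, count=12, out=sys.stdout):
    """Write min/mean/max core clock every `interval` seconds, `count` times."""
    for _ in range(count):
        mhz = sample()
        if mhz:
            out.write("%s min=%.0f mean=%.0f max=%.0f MHz\n" % (
                time.strftime("%H:%M:%S"),
                min(mhz), sum(mhz) / len(mhz), max(mhz)))
            out.flush()
        time.sleep(interval)
```

Calling `log_frequencies()` in a second terminal on a "good" and a "bad" node during the same HPL run would show whether the reported clocks fall as the GFLOPS numbers do. If the clocks stay flat while performance drops, frequency scaling is probably not the cause.
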
>>
>> But here's the thing: this wasn't a problem until we upgraded to 
>> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our 
>> cluster nodes, so all nodes are mounting and using the same exact 
>> read-only image of the operating system. This only happens with these 
>> SuperMicro nodes, and only with CentOS 6 on NFSroot. RHEL5 on 
>> NFSroot worked fine, and when I installed CentOS 6 on a local disk, 
>> the nodes worked fine.
>>
>> Any ideas where to look or what to tweak to fix this? Any idea why 
>> this is only occurring with RHEL 6 w/ NFS root OS?
>>

-- 
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman