[Beowulf] Varying performance across identical cluster nodes.
Joe Landman
joe.landman at gmail.com
Thu Sep 14 06:29:17 PDT 2017
On 09/14/2017 09:25 AM, John Hearns via Beowulf wrote:
> Prentice, as I understand it the problem here is that with the same
> OS and IB drivers, there is a big difference in performance between
> stateful and NFS root nodes.
> Throwing my hat into the ring, try looking to see if there is an
> excessive rate of interrupts in the nfsroot case, coming from the
> network card:
>
> watch cat /proc/interrupts
>
> You will probably need a large terminal window for this (or there may
> be a way to filter the output).
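
A minimal sketch of one way to filter that view (the "mlx" pattern is an
assumption; substitute whatever driver name your IB/NIC shows in
/proc/interrupts):

   watch -n1 "grep -E 'CPU|mlx' /proc/interrupts"

That keeps the CPU header row plus only the HCA's interrupt lines, so a
runaway counter on the nfsroot nodes stands out.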
dstat is helpful here.
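
For example, something along these lines (check "dstat --help" for the exact
flags shipped in the CentOS 6 package):

   dstat -cyn 1

-y reports interrupts and context switches per second next to CPU (-c) and
network (-n) load, so an interrupt storm on the nfsroot nodes shows up as a
climbing "int" column while the local-disk nodes stay flat.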
>
> On 14 September 2017 at 15:14, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
> Good question. I just checked using vmstat. When running xhpl on
> both systems, vmstat shows only zeros for si and so, even long
> after the performance degrades on the nfsroot instance. Just to be
> sure, I double-checked with top, which shows 0k of swap being used.
>
> Prentice
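
(For anyone repeating that check: something like

   vmstat 1

during the run will do it; non-zero values in the si/so columns mean pages
are being swapped in or out, and all zeros there, as Prentice reports, rules
swapping out.)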
>
> On 09/13/2017 02:15 PM, Scott Atchley wrote:
>> Are you swapping?
>>
>>     On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham
>>     <lathama at gmail.com> wrote:
>>
>>         ack, so maybe validate that you can reproduce it with another
>>         nfs root, e.g. a lab setup where a single server serves the
>>         nfs root to one node. If you can reproduce it that way, it
>>         would give some direction. Beyond that, it sounds like an
>>         interesting problem.
>>
>>         On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal
>>         <pbisbal at pppl.gov> wrote:
>>
>> Okay, based on the various responses I've gotten here and
>> on other lists, I feel I need to clarify things:
>>
>> This problem only occurs when I'm running our NFSroot
>> based version of the OS (CentOS 6). When I run the same
>> OS installed on a local disk, I do not have this problem,
>> using the same exact server(s). For testing purposes,
>> I'm using LINPACK, and running the same executable with
>> the same HPL.dat file in both instances.
>>
>>             Because I'm testing the same hardware with the same OS
>>             booted two different ways, this should rule out the BIOS
>>             and faulty hardware, which leads me to believe it's most
>>             likely a software configuration issue, such as a kernel
>>             tuning parameter.
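
One way to hunt for the difference, given that the hardware and kernel are
the same for both boots (the hostnames below are placeholders, and this
assumes you can boot one node each way):

   ssh node-nfsroot   'sysctl -a | sort' > nfsroot.sysctl
   ssh node-localdisk 'sysctl -a | sort' > localdisk.sysctl
   diff nfsroot.sysctl localdisk.sysctl

   ssh node-nfsroot 'chkconfig --list | egrep "cpuspeed|numad|irqbalance"'

Any tunable or service that differs between the two images is a candidate.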
>>
>> These are Supermicro servers, and it seems they do not
>> provide CPU temps. I do see a chassis temp, but not the
>> temps of the individual CPUs. While I agree that should
>> be the first thing I look at, it's not an option for me.
>> Other tools like FLIR and Infrared thermometers aren't
>> really an option for me, either.
>>
>>             What software configuration, whether a kernel parameter,
>>             the configuration of numad or cpuspeed, or some other
>>             setting, could affect this?
>>
>> Prentice
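
A quick way to confirm or rule out clock throttling during a run (cpupower
ships in kernel-tools on later CentOS 6 updates; the older equivalent is
cpufreq-info from cpufrequtils, so treat the exact tool as an assumption):

   watch -n1 "grep MHz /proc/cpuinfo | sort | uniq -c"
   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
   cpupower frequency-info

If the reported MHz sag as LINPACK runs on the nfsroot image but stay flat on
the local-disk install, then cpuspeed/cpufreq policy (or ACPI throttling) is
the place to dig rather than the hardware.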
>>
>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>
>> Beowulfers,
>>
>> I need your assistance debugging a problem:
>>
>> I have a dozen servers that are all identical
>> hardware: SuperMicro servers with AMD Opteron 6320
>>                 processors. Ever since we upgraded to CentOS 6, the
>> users have been complaining of wildly inconsistent
>> performance across these 12 nodes. I ran LINPACK on
>> these nodes, and was able to duplicate the problem,
>> with performance varying from ~14 GFLOPS to 64 GFLOPS.
>>
>> I've identified that performance on the slower nodes
>> starts off fine, and then slowly degrades throughout
>> the LINPACK run. For example, on a node with this
>>                 problem, during the first LINPACK test, I can see the
>> performance drop from 115 GFLOPS down to 11.3 GFLOPS.
>> That constant, downward trend continues throughout
>> the remaining tests. At the start of subsequent
>> tests, performance will jump up to about 9-10 GFLOPS,
>>                 but then drop to 5-6 GFLOPS at the end of the test.
>>
>> Because of the nature of this problem, I suspect this
>> might be a thermal issue. My guess is that the
>> processor speed is being throttled to prevent
>> overheating on the "bad" nodes.
>>
>> But here's the thing: this wasn't a problem until we
>> upgraded to CentOS 6. Where I work, we use a
>> read-only NFSroot filesystem for our cluster nodes,
>> so all nodes are mounting and using the same exact
>> read-only image of the operating system. This only
>> happens with these SuperMicro nodes, and only with
>>                 CentOS 6 on NFSroot. RHEL5 on NFSroot worked
>> fine, and when I installed CentOS 6 on a local disk,
>> the nodes worked fine.
>>
>> Any ideas where to look or what to tweak to fix this?
>>                 Any idea why this is only occurring with RHEL 6 w/ NFS
>> root OS?
>>
>> --
>> - Andrew "lathama" Latham lathama at gmail.com http://lathama.com -
>
>
--
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman