[Beowulf] Varying performance across identical cluster nodes.
Prentice Bisbal
pbisbal at pppl.gov
Thu Sep 14 06:24:16 PDT 2017
Another good question. The systems with the nfsroot os still have a
local disk. That local disk has a /var partition where logs are written.
Both systems do send some logs to a remote log server. While the
/etc/rsyslog.conf files were almost identical, I copied the one from the
nfsroot system to the local-os system to make sure they were identical.
This has had no impact on the performance of xhpl.
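
For anyone who wants to repeat the comparison before overwriting a file, a minimal sketch along these lines could confirm whether two rsyslog.conf copies differ in anything other than comments or whitespace. It assumes Python 3 and assumes that comment lines start with "#"; the function names are made up for illustration, and it does not follow included files such as /etc/rsyslog.d/*.conf.

```python
import difflib
import sys


def effective_lines(text):
    """Return config lines with comment lines and blank lines removed."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Collapse whitespace so tabs vs. spaces do not count as a difference.
        lines.append(" ".join(line.split()))
    return lines


def config_diff(text_a, text_b, name_a="a", name_b="b"):
    """Unified diff of the effective lines; an empty list means no difference."""
    return list(difflib.unified_diff(
        effective_lines(text_a), effective_lines(text_b),
        fromfile=name_a, tofile=name_b, lineterm=""))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (paths are placeholders): script.py <conf from nfsroot> <conf from local-os>
    path_a, path_b = sys.argv[1], sys.argv[2]
    with open(path_a) as fa, open(path_b) as fb:
        diff = config_diff(fa.read(), fb.read(), path_a, path_b)
    print("\n".join(diff) if diff else "no effective differences")
```

Run against a copy of each system's file, an empty result would suggest the two configurations behave the same as far as this simple comparison can tell.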
Prentice
On 09/13/2017 02:16 PM, Scott Atchley wrote:
> Are you logging something that goes to the disk in the local case, but that
> is competing for network bandwidth when NFS mounting?
>
> On Wed, Sep 13, 2017 at 2:15 PM, Scott Atchley
> <e.scott.atchley at gmail.com> wrote:
>
> Are you swapping?
>
> On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lathama at gmail.com>
> wrote:
>
> ack, so maybe validate you can reproduce with another nfs
> root. Maybe a lab setup where a single server is serving nfs
> root to the node. If you could reproduce in that way then it
> would give some direction. Beyond that it sounds like an
> interesting problem.
>
> On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal
> <pbisbal at pppl.gov> wrote:
>
> Okay, based on the various responses I've gotten here and
> on other lists, I feel I need to clarify things:
>
> This problem only occurs when I'm running our NFSroot
> based version of the OS (CentOS 6). When I run the same OS
> installed on a local disk, I do not have this problem,
> using the same exact server(s). For testing purposes, I'm
> using LINPACK, and running the same executable with the
> same HPL.dat file in both instances.
>
> Because I'm testing the same hardware with different OS
> installations, this should rule out the BIOS and faulty
> hardware as the cause. This leads me to believe it's
> most likely a software configuration issue, such as a
> kernel tuning parameter.
>
> These are Supermicro servers, and it seems they do not
> provide CPU temps. I do see a chassis temp, but not the
> temps of the individual CPUs. While I agree that should be
> the first thing I look at, it's not an option for me.
> Other tools like FLIR and Infrared thermometers aren't
> really an option for me, either.
>
> What software configuration, either a kernel parameter,
> configuration of numad or cpuspeed, or some other setting,
> could affect this?
>
> Prentice
>
> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>
> Beowulfers,
>
> I need your assistance debugging a problem:
>
> I have a dozen servers that are all identical
> hardware: SuperMicro servers with AMD Opteron 6320
> processors. Ever since we upgraded to CentOS 6, the
> users have been complaining of wildly inconsistent
> performance across these 12 nodes. I ran LINPACK on
> these nodes, and was able to duplicate the problem,
> with performance varying from ~14 GFLOPS to 64 GFLOPS.
>
> I've identified that performance on the slower nodes
> starts off fine, and then slowly degrades throughout
> the LINPACK run. For example, on a node with this
> problem, during the first LINPACK test, I can see the
> performance drop from 115 GFLOPS down to 11.3 GFLOPS.
> That constant, downward trend continues throughout the
> remaining tests. At the start of subsequent tests,
> performance will jump up to about 9-10 GFLOPS, but
> then drop to 5-6 GFLOPS at the end of the test.
>
> Because of the nature of this problem, I suspect this
> might be a thermal issue. My guess is that the
> processor speed is being throttled to prevent
> overheating on the "bad" nodes.
>
> But here's the thing: this wasn't a problem until we
> upgraded to CentOS 6. Where I work, we use a read-only
> NFSroot filesystem for our cluster nodes, so all nodes
> are mounting and using the same exact read-only image
> of the operating system. This only happens with these
> SuperMicro nodes, and only with the CentOS 6 on
> NFSroot. RHEL5 on NFSroot worked fine, and when I
> installed CentOS 6 on a local disk, the nodes worked fine.
>
> Any ideas where to look or what to tweak to fix this?
> Any idea why this is only occurring with RHEL 6 w/ NFS
> root OS?
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe)
> visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>
>
> --
> - Andrew "lathama" Latham lathama at gmail.com
> http://lathama.com -
>