[Beowulf] bizarre scaling behavior on a Nehalem

Craig Tierney Craig.Tierney at noaa.gov
Wed Aug 12 11:02:15 PDT 2009

Rahul Nabar wrote:
> On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney<Craig.Tierney at noaa.gov> wrote:
>> What do you mean normally?  I am running Centos 5.3 with 2.6.18-128.2.1
>> right now on a 448 node Nehalem cluster.  I am so far happy with how things work.
>> The original Centos 5.3 kernel, 2.6.18-128.1.10, had bugs in Nehalem support
>> where nodes would just randomly start running slow.  Upgrading the kernel
>> fixed that.  But that performance problem was all or nothing; I don't recall
>> it exhibiting itself in the way that Rahul described.
> For me it shows:
> Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org)
> I am a bit confused with the numbering scheme now. Is this older or
> newer than Craig's? You are right, Craig, I haven't noticed any random
> slowdowns, but my data is statistically sparse. I only have a single
> Nehalem+CentOS test node right now.

When you run uname -a, don't you get something like:

[ctierney at wfe7 serial]$ uname -a
Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux

We did build our kernel from source, but only because we ripped out
the IB support so we could build against the latest OFED stack.

You can also run:

# rpm -qa | grep kernel

and see what version is listed.

We have found a few performance problems so far.

1) Nodes would start going slow, really slow.  However, once they started
to go slow they stayed slow until a reboot cleared the problem.  This
problem was resolved by upgrading to the kernel we use now.

2) Nodes are reporting too many System Events that look like single-bit
errors.  This again would show up as nodes that would start to go slow and
wouldn't recover until a reboot.  We no longer think we had lots of
bad memory, and the latest BIOS may have fixed it.  We are deploying that BIOS
now and will start checking.

The only time I was getting variability in timings was when I wasn't pinning
processes and memory correctly.  My tests have always used all the cores
in a node, though.  I think that OpenMPI is doing the correct thing
with mpi_paffinity_alone.  For mvapich, we wrote a wrapper script (similar to
TACC's) that uses numactl directly to pin memory and processes.


Craig Tierney (craig.tierney at noaa.gov)
