[Beowulf] performance tweaks and optimum memory configs for a Nehalem

Sun Aug 9 19:34:07 PDT 2009

Hi Rahul, list

See answers inline.

Rahul Nabar wrote:
> On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa<gus at ldeo.columbia.edu> wrote:
>> Some people are reporting good results when using the
>> Nehalem hyperthreading feature (activated in the BIOS).
>> When the code permits, this virtually doubles the number
>> of cores on Nehalems.
>> That feature works very well on IBM PPC-6 processors
>> (IBM calls it "simultaneous multi-threading" SMT, IIRR),
>> and scales by a factor of >1.5, at least with the atmospheric
>> model I tried.
> 
> 
> Thanks for all the useful comments, Gus!  Hyperthreading is confusing
> the hell out of me. 

It confuses me too.
The good news is that according to all reports I read,
hyperthreading in Nehalem works well
(by contrast with the old version on Pentium-4 and
the corresponding Xeons).

> I expected to see 8 cores in cat /proc/cpuinfo. Now
> I see 16. (This means I must have left hyperthreading on I guess; I
> ought to go to the server room; reboot and check the BIOS)
> 

Most likely it is on.
Maybe it is the BIOS default, or the vendor set it up this way.
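Short of going to the server room, /proc/cpuinfo itself can tell you. Here is a minimal sketch, assuming the usual x86 Linux layout where each "processor" entry carries "physical id" and "core id" fields (the function name is mine, just for illustration). If there are more logical processors than distinct cores, hyperthreading is most likely on.

```python
def summarize_cpuinfo(text):
    """Count logical processors, physical cores and sockets in cpuinfo text."""
    logical = 0
    cores = set()    # distinct (physical id, core id) pairs
    sockets = set()  # distinct physical id values
    phys = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            # A new logical processor entry starts here.
            logical += 1
            phys = None
        elif key == "physical id":
            phys = value
            sockets.add(value)
        elif key == "core id":
            # "physical id" precedes "core id" within each entry.
            cores.add((phys, value))
    return {"logical": logical, "cores": len(cores), "sockets": len(sockets)}
```

On the node, call it as summarize_cpuinfo(open("/proc/cpuinfo").read()). On a dual-socket quad-core Nehalem I would expect 16 logical processors, 8 cores and 2 sockets with hyperthreading on, and 8 logical processors with it off.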

Unfortunately I don't have access to the Nehalem machine.
So, I can't check the /proc/cpuinfo here, play with MPI, etc.
I helped a grad student configure it, for his thesis research,
but the researcher who he works for is a PITA.  Bad politics.

> This is confusing my benchmarking too. Let's say I ran an MPI job with
> -np 4. If there was no other job on this machine would hyperthreading
> bring the other CPUs into play as well?
> 

Which MPI do you use?
IIRR, you have Gigabit Ethernet, right? (not Infiniband)

If you use OpenMPI, you can set the processor affinity,
i.e. bind each MPI process to one "processor" (which was once
a CPU, then became a core, and now is probably a virtual
processor associated with the hyperthreaded Nehalem core).
In my experience (and other people's also) this improves
performance.

On the Opteron Shanghais we have, when processor affinity is set,
"top" shows each process always paired with the same "processor",
which in this case is a core.
I presume the same will work with Nehalem, although the
processes will be paired with the hyperthreaded cores.

In OpenMPI all this takes is to add the flag:
-mca mpi_paffinity_alone 1
to the mpiexec line.

OpenMPI has finer grained control of processor affinity
through a file where you make the process-to-processor
association.  However, the setup above
may be good enough for jazz, and is quite simple.
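For the finer grained route, here is a minimal sketch that writes such a file, filling the sockets round-robin so the ranks are spread evenly. I am assuming the rankfile line syntax "rank N=host slot=socket:core" from the Open MPI 1.3 series mpirun man page; please check it against your version. The function name and the host name "node01" are hypothetical.

```python
def make_rankfile(host, sockets, cores_per_socket, nranks):
    """Return rankfile text that spreads ranks round-robin over sockets.

    Assumed line syntax: rank <N>=<host> slot=<socket>:<core>
    """
    if nranks > sockets * cores_per_socket:
        raise ValueError("more ranks than cores")
    lines = []
    for rank in range(nranks):
        socket = rank % sockets   # alternate between sockets
        core = rank // sockets    # next core once every socket has a rank
        lines.append(f"rank {rank}={host} slot={socket}:{core}")
    return "\n".join(lines) + "\n"
```

For example, make_rankfile("node01", 2, 4, 4) puts ranks 0 and 2 on socket 0 and ranks 1 and 3 on socket 1. As far as I recall, the resulting file is passed to mpiexec with the -rf option.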

Up to MPICH2 1.0.8p1 there was no such thing in MPICH.
However, I haven't checked their latest greatest version 1.1.
They may have something now.

> The reason I ask is this: I have noticed that a single 4 core job is
> slower than two 4 core jobs run simultaneously. This seems puzzling to
> me.
> 

It is possible that this is the result of not setting
processor affinity.
The Linux scheduler may not switch processes
across cores/processors efficiently.

You may check this out by logging in to a node
and using "top", hitting "1" (to show all cores/hyperthreads),
hitting "f" to change the displayed fields,
then hitting "j" (check, I am not sure it is "j")
to show the processor (core or hyperthreaded core)
where each process last ran.
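If you prefer a script to "top", the same information is in /proc. Here is a minimal sketch, assuming Linux and Python 3, that reads the "processor" field (field 39 of /proc/<pid>/stat, per the proc man page) and the set of CPUs the process is allowed to run on. The function names are mine. A bound process should show a single allowed CPU and an unchanging last CPU.

```python
import os


def last_cpu_from_stat(stat_text):
    """Return the CPU a process last ran on, given the text of /proc/<pid>/stat."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split after the last ')'. What follows starts at field 3 (state).
    rest = stat_text.rsplit(")", 1)[1].split()
    return int(rest[36])  # field 39 ("processor") is index 39 - 3 = 36


def describe(pid):
    """Return (last CPU, sorted allowed CPUs) for a pid. Linux only."""
    with open(f"/proc/{pid}/stat") as f:
        last = last_cpu_from_stat(f.read())
    return last, sorted(os.sched_getaffinity(pid))
```

Calling describe(pid) a few times for each MPI process, a few seconds apart, should show whether the scheduler is moving it around.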

I would guess you can bind 6 processes to 6 hyperthreaded
"processors" on each socket.
This would give a symmetric
and probably load balanced distribution of work.
This would also handle 12 processes per node, and fully
utilize your 24GB of memory, on your production jobs that
require 2GB/process.
(Not sure whether you actually have 24GB or 16GB, though.
You didn't say how much memory you bought.)
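The arithmetic behind that guess, as a minimal sketch: take the smaller of what memory allows and what the logical processors allow, then round down to a multiple of the socket count so both sockets get the same load. The function name is mine, and the 16 logical processors and 2 sockets are what I assume for your nodes.

```python
def ranks_per_node(mem_gb, gb_per_rank, logical_cpus, sockets):
    """Largest symmetric process count that fits in memory and on the CPUs."""
    by_memory = int(mem_gb // gb_per_rank)  # how many processes memory allows
    limit = min(by_memory, logical_cpus)    # never exceed the logical CPUs
    return limit - limit % sockets          # same count on every socket
```

With 24GB and 2GB/process this gives 12 per node (6 per socket); with 16GB it gives 8 per node (4 per socket), i.e. one process per physical core.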

I would be curious to learn what you get, with processor
affinity on Nehalems.  I would guess it should work,
as it does on physical cores.
At least on the IBM PPC-6 it does work and improves
the performance.
I read a report from somebody that it works well also with Nehalems,
specifically with an ocean model, getting a decent scaling
around 1.4 using 16 processes per node, IIRR.

I hope this helps.

Good luck!

Gus Correa