[Beowulf] bizarre scaling behavior on a Nehalem

Tue Aug 11 15:57:11 PDT 2009

2009/8/11 Rahul Nabar <rpnabar at gmail.com>

> On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley<bill at cse.ucdavis.edu>
> wrote:
> > Looks to me like you fit in the barcelona 512KB L2 cache (and get good
> > scaling) and do not fit in the nehalem 256KB L2 cache (and get poor
> > scaling).
>
> Thanks Bill! I never realized that the L2 cache of the Nehalem is
> actually smaller than that of the Barcelona!
>
> I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe.
> The size of the L2 cache is fixed across the steppings of the Nehalem
> isn't it?

I think it will probably be fixed only in newer models, or only in
Westmere (the 32 nm shrink of Nehalem).

>
>
> > Were the binaries compiled specifically to target both architectures?
> > As a first guess I suggest trying pathscale (RIP) or open64 for amd,
> > and intel's compiler for intel.  But portland group does a good job at
> > both in most cases.
>
> We used the intel compilers. One of my fellow grad students did the
> actual compilation for VASP but I believe he used the "correct" [sic]
> flags to the best of our knowledge. I could post them on the list
> perhaps. There was no cross-compilation. We compiled a fresh binary
> for the Nehalem.
>
> > I'm curious about the hyperthreading on data point as well.
>
> Didn't test for VASP yet but for our other two DFT codes i.e. DACAPO
> and GPAW hyperthreading "off" seems to be about 10% faster.
>
>
> > A doubling of the can have that effect.  The Intel L3 cannot come
> > anywhere close to feeding 4 cores running flat out.
>
> Could you explain this more? I am a little lost with the processor
> dynamics. Does this mean using a quad core for HPC on the Nehalem is
> not likely to work well for scaling? Or do you imply a solution so
> that I could fix this somehow?
>

Nehalem and Barcelona have the following cache architecture:

L1 cache: Nehalem: 64 KB (32 KB data + 32 KB instruction),
          Barcelona: 128 KB (64 KB data + 64 KB instruction), per core
L2 cache: Barcelona: 512 KB, Nehalem: 256 KB, per core
L3 cache: Barcelona: 2 MB, Nehalem: 8 MB, shared among all cores
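
To make the sizes above concrete, here is a rough Python sketch that
reports the first cache level able to hold a given per-core working set.
The function name, the 400 KB working set, and the "equal share of L3 per
core" rule are my own illustrative assumptions, not measurements of VASP
or of either chip.

```python
# Per-core L2 and shared L3 sizes in KB, taken from the list above.
CACHES_KB = {
    "Barcelona": {"L2": 512, "L3": 2 * 1024},
    "Nehalem": {"L2": 256, "L3": 8 * 1024},
}


def smallest_fitting_level(cpu, working_set_kb, active_cores=1):
    """Return the first cache level that holds the per-core working set.

    L2 is private to each core. L3 is shared, so as a simplification
    (an assumption of this sketch) the combined working sets of all
    active cores must fit in it. Data shared between cores and
    inclusive-cache overhead are ignored.
    """
    sizes = CACHES_KB[cpu]
    if working_set_kb <= sizes["L2"]:
        return "L2"
    if working_set_kb * active_cores <= sizes["L3"]:
        return "L3"
    return "memory"


# Hypothetical 400 KB working set per core, 4 cores busy.
for cpu in CACHES_KB:
    print(cpu, smallest_fitting_level(cpu, working_set_kb=400, active_cores=4))
```

With these assumed numbers the working set stays in L2 on Barcelona but
spills to the shared L3 on Nehalem, which is the situation Bill described.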

In both Barcelona and Nehalem, the "uncore" (everything outside the cores,
such as the L3 and the memory controllers) runs at a lower clock speed than
the cores. All cores communicate through the L3, so it has to handle
coherence traffic as well. This makes it impossible for the L3 to feed all
cores at full speed if the L2 caches have high miss ratios.

So, what is happening with your program is something like:

The working set fits in Barcelona's 512 KB L2 cache, so it has, say, a 10%
miss rate, but it doesn't fit in Nehalem's 256 KB L2 cache, so it has, say,
a 50% miss rate. So on Nehalem the shared L3 cache has to handle many more
requests from all cores than on Barcelona, and it becomes a big bottleneck.
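
As back-of-the-envelope arithmetic for the example above, the sketch below
estimates how many requests per core clock the shared L3 sees. The 10% and
50% miss rates are the illustrative figures from this mail; the 0.5 L2
accesses per cycle per core and the function name are assumptions made up
for the sketch, not measured values.

```python
def l3_requests_per_cycle(cores, l2_accesses_per_cycle, l2_miss_rate):
    """Aggregate L3 request rate: every L2 miss on any core goes to L3."""
    return cores * l2_accesses_per_cycle * l2_miss_rate


CORES = 4
L2_ACCESSES_PER_CYCLE = 0.5  # assumed, per core; purely illustrative

barcelona = l3_requests_per_cycle(CORES, L2_ACCESSES_PER_CYCLE, 0.10)
nehalem = l3_requests_per_cycle(CORES, L2_ACCESSES_PER_CYCLE, 0.50)

print(f"Barcelona L3 load: {barcelona:.2f} requests/cycle")
print(f"Nehalem   L3 load: {nehalem:.2f} requests/cycle")
print(f"Ratio: {nehalem / barcelona:.1f}x")
```

Whatever access rate you assume, the ratio between the two only depends on
the miss rates, and the load grows linearly with the number of active
cores. That is why the effect shows up as poor scaling rather than as a
slow single-core run.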