[Beowulf] bizarre scaling behavior on a Nehalem

Tue Aug 11 10:06:32 PDT 2009

Rahul Nabar wrote:
> Exactly! But I thought this was the big advance with the Nehalem that
> it has removed the CPU<->Cache<->RAM bottleneck.

I'm not sure I'd say removed, but they have made a huge improvement, to the
point where a single-socket Intel is better than a dual-socket Barcelona.

> So if the code scaled
> with the AMD Barcelona then it would continue to scale with the
> Nehalem right?

That is a gross oversimplification.  Sure, with a microbenchmark that tests
only memory bandwidth it wouldn't be a terrible approximation, but something
like VASP is far from a simple microbenchmark.
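
For reference, a bandwidth-only microbenchmark of the kind meant here can be
very small.  What follows is a minimal sketch in Python, not a substitute for
a proper tool such as STREAM.  The name copy_bandwidth_gbs, the buffer size
and the repeat count are assumptions made for illustration, and interpreter
overhead means the absolute numbers are only rough.

```python
import time


def copy_bandwidth_gbs(n_bytes=64 * 1024 * 1024, repeats=5):
    """Rough copy bandwidth in GB/s from timing a large buffer copy.

    Hypothetical helper for illustration only. The default buffer size is
    an assumption, chosen to be much larger than any cache level.
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy: reads n_bytes and writes n_bytes
        best = min(best, time.perf_counter() - start)
    # Count both the read traffic and the write traffic.
    return 2 * n_bytes / best / 1e9


if __name__ == "__main__":
    print(f"approx. copy bandwidth: {copy_bandwidth_gbs():.1f} GB/s")
```

A test like this touches memory in one simple streaming pattern, which is
why its scaling says little about a code with a mixed access pattern.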

> I'm posting a copy of my scaling plot here if it helps.
> 
> http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg

Looks to me like you fit in the Barcelona's 512KB L2 cache (and get good
scaling) and do not fit in the Nehalem's 256KB L2 cache (and get poor scaling).
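
To make that guess concrete, here is a small sketch of the arithmetic.  The
L2 sizes are the ones quoted above; the 400KB per-core working set is a purely
hypothetical number, since the real working set of this run is not known.

```python
KB = 1024

# Per-core L2 sizes as quoted in the post.
L2_BYTES = {"barcelona": 512 * KB, "nehalem": 256 * KB}


def fits_in_l2(working_set_bytes, cpu):
    """Return True if a per-core working set fits in that CPU's L2."""
    return working_set_bytes <= L2_BYTES[cpu]


if __name__ == "__main__":
    working_set = 400 * KB  # hypothetical value, for illustration only
    for cpu in L2_BYTES:
        verdict = "fits in L2" if fits_in_l2(working_set, cpu) else "spills out of L2"
        print(cpu, verdict)
```

Any per-core working set between 256KB and 512KB would behave this way, if
the cache explanation is the right one.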

Were the binaries compiled specifically to target each architecture?  As a
first guess I suggest trying PathScale (RIP) or Open64 for AMD, and Intel's
compiler for Intel.  But Portland Group does a good job on both in most cases.

> Hyperthreading OFF
> 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration
> X5550

I'm curious about the data point with hyperthreading on as well.

> Even if we explained away the bizzare performance of the 4 node case
> to the Turbo effect what is most confusing is how the 8 core data
> point could be so much slower than the corresponding 8 core point on a
> old AMD Barcelona.

A doubling of the L2 cache size can have that effect.  The Intel L3 cannot
come anywhere close to feeding 4 cores running flat out.