[Beowulf] bizarre scaling behavior on a Nehalem

Wed Aug 12 08:14:09 PDT 2009

I've been working on a pthread memory benchmark that is loosely modeled on
McCalpin's STREAM.  It's been quite a challenge to remove all the noise and
lost performance from the benchmark to get close to the performance I
expected.  Some of the obstacles:
* For the compilers that tend to be better at stream (open64 and pathscale),
  you lose the performance if you just replace double a[],b[],c[] with
  double *a,*b,*c. Patch[1] available.  I don't have a workaround for
  this; suggestions welcome.  Is it really necessary for dynamic arrays
  to be substantially slower than static?
* You have to be very careful with pointer alignment, both relative to
  cache lines and relative to each other
* CPU affinity (pinning threads by CPU id)
* NUMA placement (by socket id)
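
For reference, the alignment and affinity items above can be sketched
roughly as follows.  This is not the pstream code, just a minimal sketch
under stated assumptions: a 64-byte cache line, a 4096-byte page, and Linux
with glibc.  The helper names (alloc_aligned, pin_to_cpu, triad) are made up
for illustration.  Whether the restrict qualifiers win back the static-array
performance with open64 or pathscale is untested here.

```c
#define _GNU_SOURCE            /* for CPU_SET and sched_setaffinity */
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64          /* assumed cache line size in bytes */
#define PAGE_BYTES 4096        /* assumed page size in bytes */

/* Allocate n doubles on a page boundary, then shift the usable pointer by
 * offset_lines cache lines so that a, b and c do not all start at the same
 * offset within a page.  *base receives the pointer to hand to free(). */
static double *alloc_aligned(size_t n, size_t offset_lines, void **base)
{
    size_t bytes = n * sizeof(double) + offset_lines * CACHE_LINE;
    if (posix_memalign(base, PAGE_BYTES, bytes) != 0)
        return NULL;
    return (double *)((char *)*base + offset_lines * CACHE_LINE);
}

/* Pin the calling thread to one CPU id.  On Linux, pid 0 means "the
 * calling thread", so each worker can call this on itself.
 * Returns 0 on success, -1 on failure. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* STREAM-style triad on pointer arguments.  restrict promises the compiler
 * that the three arrays do not overlap. */
static void triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```

A worker thread would call pin_to_cpu() first and only then initialize its
own slice of the arrays.  Under Linux's default first-touch policy the pages
then land on the socket that thread runs on, which is one way to get the
"by socket id" placement without calling libnuma directly.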

The results are relatively smooth graphs.  Here's an example; it's uselessly
busy until you toggle off a few graphs (by clicking on the key):

http://cse.ucdavis.edu/bill/pstream.svg

The biggest puzzle I have now is why the previous-generation Intel quads, the
current-generation AMD quads, and numerous other CPUs show a big benefit in
L1, while the Nehalem shows no benefit.
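
To make "benefit in L1" concrete: the comparison is bandwidth as a function
of working-set size, where the three arrays together either fit in L1 or do
not.  A single-threaded sketch of such a sweep, again not the pstream code,
might look like the following.  The names triad_mbs and print_sweep, the
repetition counts, and the size range are all assumptions picked for
illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile double sink;      /* keeps results observably "used" */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run a triad over n doubles per array, reps times, and return MB/s
 * counting 24 bytes per element per repetition (two loads, one store),
 * as STREAM does.  Returns -1.0 on allocation failure or if the timer
 * reports no elapsed time. */
static double triad_mbs(size_t n, int reps)
{
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double mbs = -1.0;

    if (a && b && c) {
        for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = now_sec();
        for (int r = 0; r < reps; r++) {
            double s = (double)r;          /* vary the scalar per pass */
            for (size_t i = 0; i < n; i++)
                a[i] = b[i] + s * c[i];
            sink = a[(size_t)r % n];       /* discourage dropping passes */
        }
        double dt = now_sec() - t0;

        if (dt > 0.0)
            mbs = 3.0 * sizeof(double) * (double)n * reps / dt / 1e6;
    }
    free(a); free(b); free(c);
    return mbs;
}

/* Sweep the total working set (3 arrays) from 12 KB to about 48 MB. */
static void print_sweep(void)
{
    for (size_t n = 512; n <= 2u * 1024 * 1024; n *= 2) {
        int reps = (int)(64u * 1024 * 1024 / n);   /* roughly equal work */
        printf("%10zu bytes  %10.1f MB/s\n",
               3 * n * sizeof(double), triad_mbs(n, reps));
    }
}
```

Calling print_sweep() from main() prints one line per size.  On a CPU where
L1 helps, the smallest sizes should stand out above the rest.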

[1] http://cse.ucdavis.edu/bill/stream-malloc.patch