[Beowulf] bizarre scaling behavior on a Nehalem
Bill Broadley
bill at cse.ucdavis.edu
Wed Aug 12 08:14:09 PDT 2009
I've been working on a pthread memory benchmark that is loosely modeled on
McCalpin's stream. It's been quite a challenge to remove all the noise/lost
performance from the benchmark to get close to the performance I expected.
Some of the obstacles:
* For the compilers that tend to be better at stream (open64 and pathscale),
you lose the performance if you just replace double a[],b[],c[] with
double *a,*b,*c. Patch[1] available. I don't have a workaround for
this, suggestions welcome. Is it really necessary for dynamic arrays
to be substantially slower than static?
* You have to be very careful with pointer alignment, both relative to cache
lines and relative to each other
* cpu_affinity (by CPU id)
* numa (by socket id)
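
To make the alignment and affinity items above concrete, here is a minimal
sketch in C. It is not the benchmark's actual code. The helper names
(alloc_skewed, triad, pin_self_to_cpu), the 64-byte cache-line size, and the
idea of offsetting each array by a different number of cache lines are all
assumptions for illustration. The restrict qualifiers are one thing commonly
tried when pointer-based arrays are slower than static ones; whether that
helps with open64 or pathscale here is untested. NUMA placement by socket is
not shown, since it needs something like libnuma.

```c
#define _GNU_SOURCE            /* needed for CPU_SET and sched_setaffinity */
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64          /* assumed cache-line size in bytes */

/* Allocate room for n doubles on a cache-line boundary, then shift the
 * returned pointer forward by `skew` cache lines so that several arrays
 * do not all start at the same offset relative to each other.
 * The caller must free(*base), not the returned pointer. */
double *alloc_skewed(size_t n, size_t skew, void **base)
{
    void *p = NULL;
    size_t bytes = n * sizeof(double) + skew * CACHE_LINE;

    if (posix_memalign(&p, CACHE_LINE, bytes) != 0)
        return NULL;
    assert((uintptr_t)p % CACHE_LINE == 0);
    *base = p;
    return (double *)((char *)p + skew * CACHE_LINE);
}

/* Stream-style triad. restrict promises the compiler that the three
 * arrays do not overlap, which it can already see for static arrays. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

/* Pin the calling thread to a single CPU id (Linux-specific).
 * With pid 0, sched_setaffinity applies to the calling thread.
 * Returns 0 on success, -1 on failure. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

In a threaded run, each worker would call pin_self_to_cpu() first and only
then allocate and touch its own arrays, e.g. alloc_skewed(n, 0, ...),
alloc_skewed(n, 1, ...), alloc_skewed(n, 2, ...) for a, b and c.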
The results are relatively smooth graphs. Here's an example; it's uselessly
busy until you toggle off a few graphs (by clicking on the key):
http://cse.ucdavis.edu/bill/pstream.svg
The biggest puzzle I have now is why the previous generation Intel quads, the
current generation AMD quads, and numerous other CPUs show a big benefit in
L1, while the Nehalem shows no benefit.
[1] http://cse.ucdavis.edu/bill/stream-malloc.patch