[Beowulf] Memory latency (was woodcrest)

Thu Aug 17 14:52:47 PDT 2006

For those interested in latency.

I wrote a pthread-based latency tester in which each thread accesses N
integers in random order, loading each array element exactly once.
All the numbers below are for N=1,000,000 integers.
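
For anyone who wants to try something similar, here is a minimal sketch of
one way to build such a tester. It is not the code behind the numbers in
this post. It assumes a pointer-chasing design: each thread builds a single
random cycle over its own array (Sattolo's algorithm), then follows it for N
steps so that every element is loaded exactly once and each load depends on
the previous one. All names (worker_t, build_cycle, run_latency) are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 64

/* All names here are hypothetical; this is not the code from the post. */
typedef struct {
    size_t n;          /* integers in this thread's private array */
    unsigned seed;     /* per-thread seed for the shuffle */
    double start_ns;   /* timestamp just before the timed loop */
    double finish_ns;  /* timestamp just after the timed loop */
    int final_idx;     /* where the chase ended; also keeps the loop live */
} worker_t;

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return 1e9 * (double)ts.tv_sec + (double)ts.tv_nsec;
}

/* Sattolo's algorithm: turns the identity into one random cycle of length n,
   so following next[] from slot 0 visits every slot exactly once. */
static void build_cycle(int *next, size_t n, unsigned *seed) {
    for (size_t i = 0; i < n; i++)
        next[i] = (int)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand_r(seed) % i;  /* 0 <= j < i */
        int tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

static void *worker(void *arg) {
    worker_t *w = arg;
    int *next = malloc(w->n * sizeof *next);
    assert(next != NULL);
    build_cycle(next, w->n, &w->seed);  /* setup is not timed */

    int idx = 0;
    w->start_ns = now_ns();
    for (size_t k = 0; k < w->n; k++)
        idx = next[idx];  /* each load depends on the previous one */
    w->finish_ns = now_ns();

    w->final_idx = idx;  /* a full cycle ends back at slot 0 */
    free(next);
    return NULL;
}

/* Runs nthreads workers over n integers each and returns the mean
   per-thread latency in ns per load. Timestamps are left in w[]. */
double run_latency(int nthreads, size_t n, worker_t *w) {
    pthread_t tid[MAX_THREADS];
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    assert(n > 0 && n <= (size_t)INT_MAX);
    for (int t = 0; t < nthreads; t++) {
        w[t].n = n;
        w[t].seed = 1234u + (unsigned)t;
        int rc = pthread_create(&tid[t], NULL, worker, &w[t]);
        assert(rc == 0);
        (void)rc;
    }
    double sum = 0.0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        sum += (w[t].finish_ns - w[t].start_ns) / (double)n;
    }
    return sum / nthreads;
}
```

Calling run_latency(4, 1000000, w) with a worker_t w[4] from main() and
printing the result gives one per-thread figure; build with -pthread. This
sketch does not synchronize thread starts, so a more careful tester would
probably add a barrier before the timed loop.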

The first number is the latency per thread, so it increases with memory
contention.  The second number is the "effective" ns, where I divide the
run time[1] across all threads by the total number of integers retrieved.
It should decrease as threads are added if the machine has enough CPU
and memory system parallelism to avoid contention.
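
To make the two figures concrete, here is a small sketch of how they could
be computed from per-thread timestamps. The run time follows footnote [1];
treating the first number as the mean over threads is my assumption, and
span_t, per_thread_ns, and effective_ns are hypothetical names. With perfect
scaling the effective figure approaches the per-thread figure divided by the
thread count, e.g. 85/4 = 21.25 against the 21.72 reported for the Dual
Opteron 275 at 4 threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one thread's timed region, in nanoseconds. */
typedef struct {
    double start_ns;
    double finish_ns;
} span_t;

/* Mean per-thread latency: each thread's own elapsed time divided by the
   number of loads that thread performed (assumed: averaged over threads). */
double per_thread_ns(const span_t *s, size_t nthreads, size_t loads_per_thread) {
    assert(nthreads > 0 && loads_per_thread > 0);
    double sum = 0.0;
    for (size_t t = 0; t < nthreads; t++)
        sum += (s[t].finish_ns - s[t].start_ns) / (double)loads_per_thread;
    return sum / (double)nthreads;
}

/* Effective latency: run time = max(finish) - min(start), divided by the
   total number of loads across all threads. */
double effective_ns(const span_t *s, size_t nthreads, size_t loads_per_thread) {
    assert(nthreads > 0 && loads_per_thread > 0);
    double first_start = s[0].start_ns;
    double last_finish = s[0].finish_ns;
    for (size_t t = 1; t < nthreads; t++) {
        if (s[t].start_ns < first_start) first_start = s[t].start_ns;
        if (s[t].finish_ns > last_finish) last_finish = s[t].finish_ns;
    }
    return (last_finish - first_start) / (double)(nthreads * loads_per_thread);
}
```

Note that threads starting at different times inflate the effective figure
without changing the per-thread one, which is one reason the two need not
differ by exactly the thread count.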

All values are per-thread/effective latency in ns.

                              1 thread        2 threads       4 threads
Dual Opteron 275[2]           83.69/83.69     80/52.08        85/21.72
Quad Opteron 846[3]           108.07/108.07   115/61.39       110/27.89
Dual Woodcrest-2.66[2]        107.18/107.18   108/54.03       118/29.69
Dual core amd64-2.2GHz[5]     89.45/89.45     89.45/44.72     145/52.76
AMD64 3200[4]-2.0GHz          69.74/69.74     69/69.31        137/69.85
Dual socket Nocona 3.4GHz[6]  130.45/130.45   133/66.72       230/67.72
Dual core p4-3.0[6]           115.45/115.46   185/101.03      283/92.67
Dual it2-1.4GHz[6]            200.47/200.47   203/101.92      362/101.57

I'm happy to say that PathScale, Intel, GCC-3, and GCC-4 all deliver
nearly identical performance, although I had to be very careful with
PathScale to keep the benchmark routine from being optimized away.
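
For reference, one common way to keep a compiler from discarding a load loop
is to make its result observable. The sketch below accumulates the loaded
values and stores the total to a volatile variable. This is a general
technique, not necessarily what was needed for PathScale here; sink and
touch_all are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sink: a volatile store is an observable side effect, so the
   compiler must actually compute the value written to it. */
static volatile long sink;

/* Loads data[order[i]] for every i and folds the values into a sum.
   Without a use of the sum, an optimizer may delete the whole loop. */
long touch_all(const int *data, const int *order, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[order[i]];
    sink = sum;  /* forces the loads above to be performed */
    return sum;
}
```

Returning the sum and using it in the caller (printing it, for instance)
has a similar effect. Either way, it is worth checking the generated
assembly to confirm the loads are still there.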

Anyone have a Rev F Opteron handy?

[1] Where runtime = max(finishtimes)-min(starttimes)
[2] Dual socket, dual core = 4 cores
[3] Quad socket, single core = 4 cores
[4] Single core/single socket = 1 core
[5] Dual core/single socket = 2 cores
[6] Dual socket, single core = 2 cores

-- 
Bill Broadley
Computational Science and Engineering
UC Davis