[Beowulf] Memory latency (was woodcrest)
Bill Broadley
bill at cse.ucdavis.edu
Thu Aug 17 14:52:47 PDT 2006
For those interested in latency:
I wrote a pthread-based latency tester in which each thread accesses N
integers in random order. Every member of the array is loaded exactly once.
All the numbers below are for N=1,000,000 integers.
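The access pattern described above can be illustrated with a minimal single-threaded sketch. This is not the pthread tester itself: the function names (`random_order`, `touch_all`, `time_random_loads`) are hypothetical, and a shuffled index list is only one assumed way of visiting every integer exactly once in random order.

```python
import random
import time


def random_order(n, seed=None):
    """Return the indices 0..n-1 in random order, each exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def touch_all(data, order):
    """Load data[i] for every i in order; the sum keeps the loads in use."""
    total = 0
    for i in order:
        total += data[i]
    return total


def time_random_loads(n, seed=None):
    """Return (elapsed ns per load, checksum) for one pass over n integers."""
    data = list(range(n))
    order = random_order(n, seed)
    start = time.perf_counter_ns()
    checksum = touch_all(data, order)
    elapsed = time.perf_counter_ns() - start
    return elapsed / n, checksum


if __name__ == "__main__":
    ns_per_load, checksum = time_random_loads(1_000_000, seed=0)
    print(f"{ns_per_load:.2f} ns per load (checksum {checksum})")
```

In Python the interpreter overhead dominates the timing, so the figure printed here says little about DRAM latency; the sketch only shows the "each element once, in random order" pattern.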
The first number is the latency per thread, so it increases with memory
contention. The second number is the "effective" ns, where I take the
run time[1] across all threads and divide it by the number of integers
retrieved. It should decrease as threads are added if the machine has the
CPU and memory-system parallelism to avoid contention.
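The two numbers can be sketched as follows, using the run time definition from footnote [1]. The function name `latency_metrics` is hypothetical, and two details are assumptions rather than statements from the post: the per-thread figure is taken as the mean over threads of each thread's own elapsed time per load, and "integers retrieved" is taken as threads times N.

```python
def latency_metrics(spans_ns, n_per_thread):
    """Compute (per-thread ns per load, effective ns per load).

    spans_ns: list of (start_ns, finish_ns) pairs, one per thread.
    n_per_thread: number of integers each thread loads (assumed equal).
    """
    if not spans_ns or n_per_thread <= 0:
        raise ValueError("need at least one thread and a positive load count")

    # Per-thread latency: each thread's own elapsed time per load, averaged
    # over threads (assumption about how the first number is reported).
    per_thread = [(finish - start) / n_per_thread for start, finish in spans_ns]
    mean_per_thread = sum(per_thread) / len(per_thread)

    # Footnote [1]: runtime = max(finish times) - min(start times).
    runtime = max(f for _, f in spans_ns) - min(s for s, _ in spans_ns)

    # Effective latency: wall-clock run time divided by all integers loaded.
    effective = runtime / (len(spans_ns) * n_per_thread)
    return mean_per_thread, effective


if __name__ == "__main__":
    # Two hypothetical threads that overlap completely, 100 ms each.
    spans = [(0, 100_000_000), (0, 100_000_000)]
    per_thread, effective = latency_metrics(spans, 1_000_000)
    print(f"{per_thread:.2f} ns per thread, {effective:.2f} ns effective")
```

With a single thread the two values coincide, which matches the 1-thread column. With perfectly overlapping threads the effective value is the per-thread value divided by the thread count.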
All values are in ns, given as latency per thread / effective latency.

                               1 thread         2 threads        4 threads
Dual Opteron 275[2]            83.69/83.69      80/52.08         85/21.72
Quad Opteron 846[3]            108.07/108.07    115/61.39        110/27.89
Dual Woodcrest-2.66[2]         107.18/107.18    108/54.03        118/29.69
Dual core amd64-2.2GHz[5]      89.45/89.45      89.45/44.72      145/52.76
AMD64 3200[4]-2.0GHz           69.74/69.74      69/69.31         137/69.85
Dual socket Nocona 3.4GHz[6]   130.45/130.45    133/66.72        230/67.72
Dual core p4-3.0[6]            115.45/115.46    185/101.03       283/92.67
Dual it2-1.4GHz[6]             200.47/200.47    203/101.92       362/101.57
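One way to read the table is to divide the 1-thread effective latency by the 4-thread effective latency, which gives a rough scaling factor per machine. The sketch below does this for four of the rows; the effective values are copied from the table, while the names `EFFECTIVE_NS`, `speedup`, and `scaling_report` are hypothetical.

```python
# Effective ns per load copied from the table: (1 thread, 4 threads).
EFFECTIVE_NS = {
    "Dual Opteron 275": (83.69, 21.72),
    "Quad Opteron 846": (108.07, 27.89),
    "Dual Woodcrest-2.66": (107.18, 29.69),
    "AMD64 3200-2.0GHz": (69.74, 69.85),
}


def speedup(one_thread_ns, four_thread_ns):
    """Ratio of effective latencies; larger means better scaling."""
    return one_thread_ns / four_thread_ns


def scaling_report(table):
    """Map each machine name to its 1-thread / 4-thread effective ratio."""
    return {name: speedup(*pair) for name, pair in table.items()}


if __name__ == "__main__":
    for name, factor in scaling_report(EFFECTIVE_NS).items():
        print(f"{name:22s} {factor:.2f}x")
```

A factor near 4 on a four-core machine suggests the memory system keeps up with all four threads, while a factor near 1 on the single-core AMD64 3200 is what one would expect when extra threads only time-share one core.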
I'm happy to say that Pathscale, Intel, GCC-3, and GCC-4 all deliver
nearly identical performance, although I had to be very careful with
Pathscale to keep the benchmark routine from being optimized away.
Anyone have a Rev F opteron handy?
[1] Where runtime = max(finishtimes)-min(starttimes)
[2] Dual socket, dual core = 4 cores
[3] Quad socket, single core = 4 cores
[4] Single core/single socket = 1 core
[5] Dual core/single socket = 2 cores
[6] Dual socket, single core = 2 cores.
--
Bill Broadley
Computational Science and Engineering
UC Davis