[Beowulf] latency and bandwidth micro benchmarks
Bill Broadley
bill at cse.ucdavis.edu
Mon Aug 28 22:47:51 PDT 2006
On Tue, Aug 15, 2006 at 09:02:12AM -0400, Lawrence Stewart wrote:
> As has been mentioned here, the canonical bandwidth benchmark is
> streams.
Agreed.
> AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
> the lmbench suite.
Really? Seems like more of a prefetch test than a latency benchmark.
With a fixed stride, the hardware can guess the address of load n+1
before load n has completed.
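For example (a minimal sketch of the pattern, not lmbench's actual
source): thread a chain through an array at a fixed stride, then walk
it. Every load depends on the previous one, but the constant stride
makes the next address easy to guess:

    #include <stddef.h>

    /* Build a chain through arr[] at a fixed stride, then walk it.
     * The loads are dependent, but a hardware prefetcher can guess
     * address n+1 before load n completes, hiding much of the true
     * memory latency. */
    static void *chase_fixed_stride(void **arr, size_t len, size_t stride)
    {
        size_t i;
        for (i = 0; i + stride < len; i += stride)
            arr[i] = &arr[i + stride];   /* predictable next pointer */
        arr[i] = NULL;                   /* terminate the chain */

        void **p = arr;
        while (p != NULL)                /* same p = a[p] idea as below */
            p = (void **)*p;
        return p;   /* return final pointer so the walk isn't optimized out */
    }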
I ran the full lmbench:
Host       OS            Description        Mhz  tlb   cache  mem     scal
                                                 pages line   par     load
                                                       bytes
---------  ------------- -----------------  ---- ----- ------ ------- ----
amd-2214   Linux 2.6.9-3 x86_64-linux-gnu   2199    32    128  4.4800    1
xeon-5150  Linux 2.6.9-3 x86_64-linux-gnu   2653     8    128  5.5500    1
Strangely, the Linux kernel disagrees with lmbench on the cache line
size for the AMD (from dmesg):
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
> Secondarily, streams is a compiler test of loop unrolling, software
> pipelining, and prefetch.
Indeed.
> Streams is easy meat for hardware prefetch units, since the access
> patterns are sequential, but that is OK. It is a bandwidth test.
Agreed.
> latency is much harder to get at. lat_mem_rd tries fairly hard to
> defeat hardware prefetch units by threading a chain of pointers
> through a random set of cache blocks. Other tests that don't do this
> get screwy results.
A random set of cache blocks?
You mean:
http://www.bitmover.com/lmbench/
I got the newest lmbench3.
The benchmark runs as two nested loops. The outer loop is the stride
size. The inner loop is the array size.
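Schematically, something like this (my paraphrase of that structure,
not the actual lmbench source; the names are made up):

    /* For each stride, scan a range of array sizes and report the
     * average load latency for each (size, stride) pair. */
    for (size_t stride = MIN_STRIDE; stride <= MAX_STRIDE; stride *= 2)
        for (size_t size = MIN_SIZE; size <= MAX_SIZE; size *= 2)
            measure_load_latency(size, stride);   /* hypothetical helper */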
The memory results:
Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
--------------------------------------------------------------------------
Host       OS             Mhz  L1 $    L2 $    Main mem  Rand mem  Guesses
---------  -------------  ---- ------  ------  --------  --------  -------
amd-2214   Linux 2.6.9-3  2199 1.3650  5.4940      68.4     111.3
xeon-5150  Linux 2.6.9-3  2653 1.1300  5.3000     101.5     114.2
> lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
> main memory plateaus.
>
> This is all leadup to asking for lat_mem_rd results for Woodcrest
> (and Conroe, if there are any out there), and for dual-core
> Opterons (275).
The above amd-2214 is the DDR2 version of the Opteron 275.
My latency numbers with plat are 98.5 ns for a 38 MB array, a bit
better than lmbench's.
> With both streams and lat_mem_rd, one can run one copy or multiple
> copies, or use a single copy in multithread mode. Many cited test
> results I have been able to find use very vague English to describe
> exactly what they have tested. I
My code is pretty simple; for an array of N ints I do:

    while (p != 0)
    {
        p = a[p];      /* each load depends on the previous one */
    }
That to me is random memory latency. Although doing a two-stage loop:

    for 0 to N pages
        pick a random page
        for 0 to M (cache lines per page)
            pick a random cache line

would minimize the time spent on page (TLB) overhead; a sketch of that
construction is below.
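Something like this (a sketch of the idea, assuming 4 KB pages and a
128-byte line; the real code differs in detail):

    #include <stdlib.h>
    #include <stddef.h>

    #define PAGE 4096   /* assumed page size */
    #define LINE 128    /* assumed cache line size */

    /* Fisher-Yates shuffle of v[0..n-1]. */
    static void shuffle(size_t *v, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = v[i]; v[i] = v[j]; v[j] = t;
        }
    }

    /* Chain every cache line of every page: pages in random order,
     * lines within each page in random order.  Walking the chain
     * with p = *p then pays the page (TLB) cost once per page
     * instead of on nearly every load.  buf must hold
     * npages * PAGE bytes; error checks omitted. */
    static void **build_chain(char *buf, size_t npages)
    {
        size_t nlines = PAGE / LINE;
        size_t *pg = malloc(npages * sizeof *pg);
        size_t *ln = malloc(nlines * sizeof *ln);
        void **head = NULL, **prev = NULL;

        for (size_t i = 0; i < npages; i++)
            pg[i] = i;
        shuffle(pg, npages);

        for (size_t p = 0; p < npages; p++) {
            for (size_t i = 0; i < nlines; i++)
                ln[i] = i;
            shuffle(ln, nlines);
            for (size_t i = 0; i < nlines; i++) {
                void **slot = (void **)(buf + pg[p] * PAGE + ln[i] * LINE);
                if (prev)
                    *prev = slot;    /* previous line points here */
                else
                    head = slot;     /* first line in the chain */
                prev = slot;
            }
        }
        *prev = NULL;                /* terminate the chain */
        free(pg);
        free(ln);
        return head;                 /* walk with p = *p as above */
    }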
> prefer running two copies of stream rather than using OpenMP - I
> want to measure bandwidth, not inter-core synchronization.
I prefer it synchronized. Otherwise the two stream copies might get
out of sync: one gets 8 GB/sec and the other gets 8 GB/sec, but they
didn't do it at the same time. In my benchmark I take the min of all
start times and the max of all stop times. That way there is no
cheating.
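Roughly (a sketch of the bookkeeping, not the benchmark's actual
source; each worker calls now() just before and just after its copy
loop):

    #include <stddef.h>
    #include <sys/time.h>

    /* Wall-clock time in seconds. */
    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    struct result { double start, stop, bytes; };

    /* After joining all threads: charge the combined byte count
     * against the window from the earliest start to the latest stop,
     * so time a thread spends running alone still counts. */
    static double aggregate_bw(const struct result *r, int nthreads)
    {
        double t0 = r[0].start, t1 = r[0].stop, bytes = 0.0;
        for (int i = 0; i < nthreads; i++) {
            if (r[i].start < t0) t0 = r[i].start;  /* min of starts */
            if (r[i].stop  > t1) t1 = r[i].stop;   /* max of stops  */
            bytes += r[i].bytes;
        }
        return bytes / (t1 - t0);                  /* bytes/second */
    }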
> I'm interested in results for a single thread, but I am also
> interested in results for multiple threads on dual-core chips and in
> machines with multiple sockets of single or dual core chips.
Since you're spending most of your time waiting on DRAM, there isn't
much contention:
http://cse.ucdavis.edu/~bill/intel-1vs4t.png
> The bandwidth of a two-socket single-core machine, for example,
> should be nearly twice the bandwidth of a single-socket dual-core
> machine simply because the threads are using different memory
> controllers.
Judge for yourself:
http://cse.ucdavis.edu/~bill/quad-numa.png (quad opteron)
http://cse.ucdavis.edu/~bill/altix-dplace.png
http://cse.ucdavis.edu/~bill/intel-5150.png (woodcrest + ddr2-667)
> Is this borne out by tests?
> Four threads on a dual-dual should give similar bandwidth per core
> to a single socket dual-core. True?
Yes; alas, I don't have graphs of single-socket dual-core systems
handy.
> Next, considering a dual-core chip, to the extent that a single core
> can saturate the memory controller, when both cores are active,
> there should be a substantial drop in bandwidth per core.
Right.
> Latency is much more difficult. I would expect that dual-core
> lat_mem_rd results with both cores active should show only a slight
> degradation of latency, due to occasional bus contention or resource
> scheduling conflicts between the cores. A single memory controller
> should be able to handle pointer chasing activity from multiple
> cores. True?
Right, see above graphs for 1 vs 4t.
> Our server farm here is all dual-processor single core (Opteron 248)
> and they seem to behave as expected: running two copies of stream
> gives nearly double performance, and the latency degradation due to
> running two copies of lat_mem_rd is nearly undetectable. We don't
> have any dual-core chips or any Intel chips.
Right.
--
Bill Broadley
Computational Science and Engineering
UC Davis