[Beowulf] latency and bandwidth micro benchmarks
Lawrence Stewart
larry.stewart at sicortex.com
Tue Aug 15 06:02:12 PDT 2006
As has been mentioned here, the canonical bandwidth benchmark is
STREAM.
AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
the lmbench suite.
STREAM is ultimately a test of the bandwidth path between the DRAMs
and the core: if you make the buffers large enough, you will overflow
any cache. If you keep making them larger than that, you will wash out
edge effects, such as the dirty cache lines that never need to be
written back at the end of the test.
Secondarily, STREAM is a compiler test of loop unrolling, software
pipelining, and prefetch.
STREAM is easy meat for hardware prefetch units, since the access
patterns are sequential, but that is OK. It is a bandwidth test.
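For anyone who wants to see the shape of such a test, here is a minimal sketch of a triad-style bandwidth loop. It is not the STREAM source. The names triad_bandwidth, n, and reps are made up for illustration, and it assumes a POSIX system with clock_gettime. The byte count uses two reads plus one write per element and ignores any write-allocate traffic.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Seconds from a monotonic clock. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* One triad pass: a[i] = b[i] + scalar * c[i]. Purely sequential access. */
static void triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/*
 * Best observed triad bandwidth in bytes per second over `reps` passes,
 * or -1.0 if allocation fails or no pass could be timed.
 * n (hypothetical parameter) is the element count per array; choose it so
 * that 3 * n * sizeof(double) is many times the largest cache.
 */
double triad_bandwidth(size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double best = -1.0;
    volatile double sink = 0.0;  /* reads a result so the stores stay live */

    if (a && b && c) {
        /* Touch every page first so page faults are not timed. */
        for (size_t i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
            c[i] = 2.0;
        }
        for (int r = 0; r < reps; r++) {
            double t0 = now_sec();
            triad(a, b, c, 3.0 + r, n);
            double dt = now_sec() - t0;
            sink += a[n / 2];
            if (dt > 0.0) {
                /* Assumed accounting: 2 reads + 1 write per element. */
                double bw = 3.0 * (double)n * sizeof(double) / dt;
                if (bw > best)
                    best = bw;
            }
        }
    }
    free(a);
    free(b);
    free(c);
    return best;
}
```

Calling triad_bandwidth with n in the tens of millions and a handful of reps should land well outside any cache on current machines. Smaller n will mostly measure the caches instead.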
Latency is much harder to get at. lat_mem_rd tries fairly hard to
defeat hardware prefetch units by threading a chain of pointers
through a random set of cache blocks. Tests that don't do this get
screwy results, because the prefetcher hides the very latency they
are trying to measure.
lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
main memory plateaus.
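The pointer-chain idea can be sketched in a few lines of C. This is an illustration of the technique, not the lat_mem_rd source. The names build_chain, chase, and chase_latency_ns are hypothetical, the 64-byte line size is an assumption, and the random order comes from Sattolo's algorithm, which I picked because it guarantees one single cycle through all the blocks.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64  /* assumed cache line size; adjust for the target CPU */

/*
 * Allocate nlines line-sized nodes and link them into one random cycle.
 * Sattolo's algorithm always yields a single cycle, so a chase visits
 * every node before it repeats. Returns NULL on allocation failure.
 */
static char *build_chain(size_t nlines)
{
    assert(nlines > 0);
    char *buf = malloc(nlines * LINE_BYTES);
    size_t *perm = malloc(nlines * sizeof *perm);
    if (!buf || !perm) {
        free(buf);
        free(perm);
        return NULL;
    }
    for (size_t i = 0; i < nlines; i++)
        perm[i] = i;
    for (size_t i = nlines - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;  /* 0 <= j < i, never i itself */
        size_t tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
    /* The first word of node i holds the address of node perm[i]. */
    for (size_t i = 0; i < nlines; i++)
        *(void **)(buf + i * LINE_BYTES) = buf + perm[i] * LINE_BYTES;
    free(perm);
    return buf;
}

/* Follow the chain for `steps` loads; each load depends on the previous one. */
static void *chase(void *start, size_t steps)
{
    void **p = start;
    while (steps--)
        p = (void **)*p;
    return p;
}

/* Average nanoseconds per dependent load, or -1.0 on allocation failure. */
double chase_latency_ns(size_t nlines, size_t steps)
{
    assert(steps > 0);
    struct timespec t0, t1;
    char *buf = build_chain(nlines);
    if (!buf)
        return -1.0;

    void *volatile sink = chase(buf, nlines);  /* warm-up lap */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = chase(buf, steps);                  /* volatile store keeps the loop */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    free(buf);
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)steps;
}
```

Sweeping nlines from a few KB worth of lines up to well past the largest cache, and plotting the returned value against the footprint, should give a stepped curve of the kind described above. TLB misses will also show up at large footprints, which this sketch makes no attempt to separate out.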
This is all lead-up to asking for lat_mem_rd results for Woodcrest
(and Conroe, if there are any out there), and for dual-core
Opterons (275).
With both STREAM and lat_mem_rd, one can run one copy or multiple
copies, or use a single copy in multithread mode. Many cited test
results I have been able to find use very vague English to describe
exactly what they have tested. I prefer running two copies of STREAM
rather than using OpenMP - I want to measure bandwidth, not
inter-core synchronization. For lat_mem_rd, the -P 2 switch seems
fine; it just forks two copies of the test.
I'm interested in results for a single thread, but I am also
interested in results for multiple threads on dual-core chips and in
machines with multiple sockets of single- or dual-core chips.
The bandwidth of a two-socket single-core machine, for example,
should be nearly twice the bandwidth of a single-socket dual-core
machine simply because the threads are using different memory
controllers. Is this borne out by tests? Four threads on a
dual-socket dual-core machine should give similar bandwidth per core
to a single-socket dual-core machine. True?
Next, considering a dual-core chip, to the extent that a single core
can saturate the memory
controller, when both cores are active, there should be a substantial
drop in bandwidth
per core.
Latency is much more difficult. I would expect that dual-core
lat_mem_rd results with
both cores active should show only a slight degradation of latency,
due to occasional
bus contention or resource scheduling conflicts between the cores. A
single memory
controller should be able to handle pointer chasing activity from
multiple cores. True?
Our server farm here is all dual-processor single-core (Opteron 248)
and the machines seem to behave as expected: running two copies of
STREAM gives nearly double the bandwidth, and the latency degradation
due to running two copies of lat_mem_rd is nearly undetectable. We
don't have any dual-core chips or any Intel chips.
-Larry