[Beowulf] latency and bandwidth micro benchmarks

Tue Aug 15 06:02:12 PDT 2006

As has been mentioned here, the canonical bandwidth benchmark is
streams.

AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
the lmbench suite.

Streams is ultimately a test of the bandwidth path between the DRAMs
and the core, in that if you turn the buffer size up high enough, you
will overflow any cache. If you keep turning it up well beyond that,
you will wash out edge effects such as not needing to write back the
dirty cache lines at the end of the test. Secondarily, streams is a
compiler test of loop unrolling, software pipelining, and prefetch.
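
To make the buffer-size point concrete, here is a minimal sketch of a
triad-style bandwidth loop. It is not the actual STREAM source; the
function name triad_bandwidth and its parameters are made up for
illustration, and it assumes a POSIX system with clock_gettime.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not part of STREAM.
 * Runs a[i] = b[i] + scalar * c[i] over three arrays of n doubles,
 * reps times, and returns the best bandwidth seen in bytes per second.
 * Each pass is counted as 3 * n * sizeof(double) bytes
 * (two arrays read, one array written).
 * Returns -1.0 if the arrays cannot be allocated. */
double triad_bandwidth(size_t n, int reps)
{
    assert(n > 0 && reps > 0);

    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a);
        free(b);
        free(c);
        return -1.0;
    }

    /* Touch every element first so page faults are not timed. */
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    const double scalar = 3.0;
    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (secs > 0.0) {
            double bw = 3.0 * (double)n * sizeof(double) / secs;
            if (bw > best)
                best = bw;
        }
    }

    /* Read a result back so the stores are observable. */
    volatile double sink = a[n / 2];
    (void)sink;

    free(a);
    free(b);
    free(c);
    return best;
}
```

For a DRAM number, n has to be large enough that each array is many
times the size of the largest cache; something like
triad_bandwidth(8 * 1000 * 1000, 10) gives 64 MB per array. Small n
mostly measures cache bandwidth.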

Streams is easy meat for hardware prefetch units, since the access
patterns are sequential, but that is OK. It is a bandwidth test.

Latency is much harder to get at. lat_mem_rd tries fairly hard to
defeat hardware prefetch units by threading a chain of pointers through
a random set of cache blocks. Other tests that don't do this get screwy
results, because the prefetcher hides part of the miss latency.
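
A minimal sketch of the pointer-chasing idea follows. This is not the
lmbench code; line_t, build_chain, chase, and ns_per_load are names
invented here, the 64-byte line size is an assumption, and the timing
assumes POSIX clock_gettime. The chain is built with Sattolo's
algorithm so that all the slots form one cycle and every load depends
on the one before it.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64  /* assumed cache-line size; adjust per machine */

/* One slot per cache line; the first word points at the next slot. */
typedef struct line {
    struct line *next;
    char pad[LINE_BYTES - sizeof(struct line *)];
} line_t;

/* Small xorshift generator, so the sketch does not depend on RAND_MAX. */
static unsigned long long next_random(unsigned long long *state)
{
    unsigned long long x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Link n slots into a single random cycle (Sattolo's algorithm).
 * Returns 0 on success, -1 if the scratch array cannot be allocated. */
int build_chain(line_t *lines, size_t n)
{
    assert(lines != NULL && n >= 2);

    size_t *idx = malloc(n * sizeof *idx);
    if (!idx)
        return -1;
    for (size_t i = 0; i < n; i++)
        idx[i] = i;

    unsigned long long state = 88172645463325252ULL;  /* any nonzero seed */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(next_random(&state) % i);  /* 0 <= j < i */
        size_t tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }

    for (size_t i = 0; i < n; i++)
        lines[i].next = &lines[idx[i]];

    free(idx);
    return 0;
}

/* Follow the chain for the given number of dependent loads. */
line_t *chase(line_t *p, size_t steps)
{
    while (steps--)
        p = p->next;
    return p;
}

/* Average nanoseconds per dependent load over `steps` loads. */
double ns_per_load(line_t *start, size_t steps)
{
    assert(start != NULL && steps > 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    line_t *volatile end = chase(start, steps);  /* keep the loads live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)end;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Allocate n slots with malloc(n * sizeof(line_t)), call build_chain,
then call ns_per_load with a step count of several times n. Sweeping n
from a few KB worth of slots up to well past the last cache level is
what produces the plateaus. Note that this sketch does nothing about
TLB misses, which will show up in the large sizes.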

lat_mem_rd produces a graph, and it is easy to see the L1, L2, and main
memory plateaus.

This is all lead-up to asking for lat_mem_rd results for Woodcrest (and
Conroe, if there are any out there), and for dual-core Opterons (275).

With both streams and lat_mem_rd, one can run one copy or multiple
copies, or use a single copy in multithread mode. Many cited test
results I have been able to find use very vague English to describe
exactly what they have tested. I prefer running two copies of stream
rather than using OpenMP: I want to measure bandwidth, not inter-core
synchronization. For lat_mem_rd, the -P 2 switch seems fine; it just
forks two copies of the test.
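
The "independent copies" approach can be sketched as a plain fork and
wait, with no shared state between the copies. This is only an
illustration of the idea, not what lat_mem_rd does internally;
run_copies and demo_work are hypothetical names, and demo_work is a
stand-in for a real benchmark kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder workload; a real test would run the benchmark loop here.
 * Returns 0 on success. */
int demo_work(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
    return 0;
}

/* Fork `copies` children that each run work() independently, then wait
 * for all of them. Returns how many children exited with status 0. */
int run_copies(int copies, int (*work)(void))
{
    assert(copies > 0 && work != NULL);

    int started = 0;
    for (int i = 0; i < copies; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* could not start another copy */
        if (pid == 0)
            _exit(work() == 0 ? 0 : 1); /* child: run and leave */
        started++;
    }

    int ok = 0;
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status)
            && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

This sketch does not pin the copies to particular cores or line up
their start times. A careful run would want both, so that the copies
really overlap and it is known which cores and sockets they ran on.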

I'm interested in results for a single thread, but I am also interested
in results for multiple threads on dual-core chips and in machines with
multiple sockets of single- or dual-core chips.

The bandwidth of a two-socket single-core machine, for example, should
be nearly twice the bandwidth of a single-socket dual-core machine
simply because the threads are using different memory controllers. Is
this borne out by tests? Four threads on a dual-dual should give
similar bandwidth per core to a single-socket dual-core. True?

Next, considering a dual-core chip, to the extent that a single core
can saturate the memory controller, when both cores are active, there
should be a substantial drop in bandwidth per core.

Latency is much more difficult. I would expect that dual-core
lat_mem_rd results with both cores active should show only a slight
degradation of latency, due to occasional bus contention or resource
scheduling conflicts between the cores. A single memory controller
should be able to handle pointer chasing activity from multiple cores.
True?

Our server farm here is all dual-processor, single-core (Opteron 248)
and they seem to behave as expected: running two copies of stream gives
nearly double performance, and the latency degradation due to running
two copies of lat_mem_rd is nearly undetectable. We don't have any
dual-core chips or any Intel chips.

-Larry