[Beowulf] latency and bandwidth micro benchmarks

Mon Aug 28 22:47:51 PDT 2006

On Tue, Aug 15, 2006 at 09:02:12AM -0400, Lawrence Stewart wrote:
> As has been mentioned here, the canonical bandwidth benchmark is
> streams.

Agreed.

> AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
> the lmbench suite.

Really?  Seems like more of a prefetch test than a latency benchmark.
A fixed stride lets the hardware guess the address of load n+1 before
load n has completed.

I ran the full lmbench:
Host                 OS Description              Mhz  tlb  cache  mem   scal
                                                     pages line   par   load
                                                           bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
amd-2214  Linux 2.6.9-3        x86_64-linux-gnu 2199    32   128 4.4800    1
xeon-5150 Linux 2.6.9-3        x86_64-linux-gnu 2653     8   128 5.5500    1

Strangely, the Linux kernel disagrees with lmbench's 128-byte cache line
size for the AMD (from dmesg):
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)

> Secondarily, streams is a compiler test of loop unrolling, software  
> pipelining, and prefetch.

Indeed.

> Streams is easy meat for hardware prefetch units, since the access  
> patterns are
> sequential, but that is OK. It is a bandwidth test.

Agreed.

> latency is much harder to get at.  lat_mem_rd tries fairly hard to  
> defeat hardware
> prefetch units by threading a chain of pointers through a random set  
> of cache
> blocks.   Other tests that don't do this get screwy results.

A random set of cache blocks?

You mean:
http://www.bitmover.com/lmbench/

I got the newest lmbench3.  Its lat_mem_rd documentation says:
       The  benchmark  runs as two nested loops.  The outer loop is the stride
       size.  The inner loop is the array size.

The memory results:
Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
--------- -------------   ---   ----   ----    --------    --------    -------
amd-2214  Linux 2.6.9-3  2199 1.3650 5.4940   68.4       111.3
xeon-5150 Linux 2.6.9-3  2653 1.1300 5.3000  101.5       114.2

> lat_mem_rd produces a graph, and it is easy to see the L1, L2, and  
> main memory plateaus.
> 
> This is all leadup to asking for lat_mem_rd results for Woodcrest  
> (and Conroe, if there
> are any out there), and for dual-core Opterons (275)

The above amd-2214 is the ddr2 version of the opteron 275.

My latency numbers with plat are 98.5ns for a 38MB array.  A bit better than
lmbench.

> With both streams and lat_mem_rd, one can run one copy or multiple  
> copies, or use a
> single copy in multithread mode.  Many cited test results I have been  
> able to find use
> very vague english to describe exactly what they have tested.  I  

My code is pretty simple; for an array of N ints I do:
  while (p != 0)
    {
      p = a[p];
    }
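
For anyone who wants to reproduce that loop, here is a minimal sketch of
one way to set up the array so the chase touches every element in random
order.  The names build_chain and chase are made up for illustration, it
uses size_t rather than int, and the shuffle (Sattolo's algorithm with
rand() as a placeholder generator) is an assumption, not necessarily what
plat does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill a[0..n-1] so that following p = a[p] from index 0 visits every
 * element exactly once before coming back to 0 (a single n-cycle).
 * Sattolo's shuffle guarantees one cycle; rand() is only a placeholder. */
static void build_chain(size_t *a, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        a[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1], never i itself */
        size_t tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

/* Follow the chain from index 0 until it returns to 0.  Each load depends
 * on the previous one, so the loads cannot overlap.  Returns the number
 * of loads performed, which is n for a chain from build_chain(). */
static size_t chase(const size_t *a)
{
    size_t p = a[0];
    size_t loads = 1;
    while (p != 0) {
        p = a[p];
        loads++;
    }
    return loads;
}
```

Timing one call to chase() and dividing by the returned count gives an
average time per dependent load, provided the array is much larger than
the caches.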

That, to me, is random memory latency.  Although a two-stage loop:
for 0 to N pages
   pick a random page
   for 0 to M (cachelines per page)
       pick a random cacheline

would minimize the time spent on per-page overhead.
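
A rough sketch of that two-stage ordering, under my own assumptions: the
function names are hypothetical, each slot of next[] stands for one cache
line (a real benchmark would pad each slot to the line size), and rand()
is a placeholder generator.  The chain finishes all lines of one page
before moving to the next randomly chosen page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of idx[0..n-1]; rand() is a placeholder generator. */
static void shuffle(size_t *idx, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i-1] */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Link next[0 .. pages*lines-1] into one cycle that visits the pages in
 * random order and, inside each page, every line in random order.
 * Stores the first slot visited in *start.  Returns 0 on success, -1 if
 * the scratch arrays cannot be allocated. */
static int build_paged_chain(size_t *next, size_t pages, size_t lines,
                             size_t *start)
{
    assert(pages > 0 && lines > 0);
    size_t *porder = malloc(pages * sizeof *porder);
    size_t *lorder = malloc(lines * sizeof *lorder);
    if (porder == NULL || lorder == NULL) {
        free(porder);
        free(lorder);
        return -1;
    }
    for (size_t i = 0; i < pages; i++) porder[i] = i;
    for (size_t i = 0; i < lines; i++) lorder[i] = i;
    shuffle(porder, pages);

    size_t first = 0, prev = 0;
    int linked = 0;                          /* becomes 1 after first slot */
    for (size_t p = 0; p < pages; p++) {
        shuffle(lorder, lines);              /* fresh line order per page */
        for (size_t l = 0; l < lines; l++) {
            size_t slot = porder[p] * lines + lorder[l];
            if (linked)
                next[prev] = slot;
            else
                first = slot;
            prev = slot;
            linked = 1;
        }
    }
    next[prev] = first;                      /* close the cycle */
    *start = first;
    free(porder);
    free(lorder);
    return 0;
}

/* Walk the cycle once; return the number of loads and store the number of
 * page-to-page transitions in *switches. */
static size_t walk_chain(const size_t *next, size_t start, size_t lines,
                         size_t *switches)
{
    size_t p = start, loads = 0;
    *switches = 0;
    do {
        size_t q = next[p];
        if (q / lines != p / lines)
            (*switches)++;
        p = q;
        loads++;
    } while (p != start);
    return loads;
}
```

With 8 pages of 64 lines, one walk does 512 loads but changes page only 8
times, where a fully random chain would change page on nearly every load.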

> prefer running
> two copies of stream rather than using OpenMP - I want to measure  
> bandwidth, not
> inter-core synchronization. 

I prefer it synchronized.  Otherwise two streams might get out of sync:
one gets 8GB/sec and the other gets 8GB/sec, but they didn't do it at the
same time.  In my benchmark I take the min of all start times and the max
of all stop times.  That way there is no cheating.
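
The min-start/max-stop accounting can be sketched like this.  The function
name and the use of doubles for timestamps in seconds are my assumptions
for illustration; the only part taken from the description above is the
interval, earliest start to latest stop.

```c
#include <assert.h>
#include <stddef.h>

/* Aggregate bandwidth in bytes/second for n workers.
 * start[i] and stop[i] are wall-clock times in seconds from a clock shared
 * by all workers; bytes[i] is the data moved by worker i.
 * The interval is min(start) .. max(stop), so runs that did not overlap
 * are not credited as if they had run concurrently. */
static double aggregate_bandwidth(const double *start, const double *stop,
                                  const double *bytes, size_t n)
{
    assert(n > 0);
    double first = start[0], last = stop[0], total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (start[i] < first) first = start[i];
        if (stop[i] > last) last = stop[i];
        total += bytes[i];
    }
    assert(last > first);
    return total / (last - first);
}
```

Two workers that each move 8e9 bytes in one second, one after the other,
come out at 8e9 bytes/sec in total; only if they really overlap for the
whole second does the result reach 16e9.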

> I'm interested in results for a single thread, but I am also  
> interested in results for
> multiple threads on dual-core chips and in machines with multiple  
> sockets of single
> or dual core chips.

Since you're spending most of your time waiting on DRAM, there isn't much
contention:
	http://cse.ucdavis.edu/~bill/intel-1vs4t.png

> The bandwidth of a two-socket single-core machine, for example,  
> should be nearly twice
> the bandwidth of a single-socket dual-core machine simply because the  
> threads are
> using different memory controllers.

Judge for yourself:
	http://cse.ucdavis.edu/~bill/quad-numa.png (quad opteron)
	http://cse.ucdavis.edu/~bill/altix-dplace.png
	http://cse.ucdavis.edu/~bill/intel-5150.png (woodcrest + ddr2-667)

>  Is this borne out by tests?   
> Four threads on
> a dual-dual should give similar bandwidth per core to a single socket  
> dual-core.  True?

Yes.  Alas, I don't have graphs of single-socket dual-core systems
handy.

> Next, considering a dual-core chip, to the extent that a single core  
> can saturate the memory
> controller, when both cores are active, there should be a substantial  
> drop in bandwidth
> per core.

Right.

> Latency is much more difficult.  I would expect that dual-core  
> lat_mem_rd results with
> both cores active should show only a slight degradation of latency,  
> due to occasional
> bus contention or resource scheduling conflicts between the cores. A  
> single memory
> controller should be able to handle pointer chasing activity from  
> multiple cores.  True?

Right; see the 1 vs 4 thread graph above.

> Our server farm here is all dual-processor single core (Opteron 248)  
> and they seem
> to behave as expected: running two copies of stream gives nearly  
> double performance,
> and the latency degradation due to running two copies of lat_mem_rd  
> is nearly
> indetectable.  We don't have any dual-core chips or any Intel chips.

Right.

-- 
Bill Broadley
Computational Science and Engineering
UC Davis