[Beowulf] latency and bandwidth micro benchmarks

Tue Aug 29 07:36:39 PDT 2006

On Aug 29, 2006, at 1:47 AM, Bill Broadley wrote:

> On Tue, Aug 15, 2006 at 09:02:12AM -0400, Lawrence Stewart wrote:
>
>> AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
>> the lmbench suite.
>
> Really?  Seems like more of a prefetch test than a latency benchmark.
> A fixed stride allows a guess at the (n+1)th address before the nth
> address is loaded.

So after some study of the lmbench sources...  The basic idea is to follow
a chain of pointers, causing the loads to be serialized.  There are three
different initialization routines:

stride_initialize - steps through memory in a predictable pattern
thrash_initialize - random order of cache lines for the entire block,
   which can (should) cause both a TLB miss and a cache miss on every load
mem_initialize - threads through each cache line on a page in a random
   order before going to the next page
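
A rough sketch of how a predictable chain and a fully random chain differ,
assuming a buffer of nlines * line bytes where line is a multiple of the
pointer size.  This is not the lmbench source: link_chain, stride_chain,
random_chain and chain_length are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: store in the first word of each cache line a pointer
 * to the next line in order[], closing the chain into a circle. */
static void link_chain(char *buf, size_t line, const size_t *order, size_t nlines)
{
    for (size_t i = 0; i < nlines; i++) {
        char **from = (char **)(buf + order[i] * line);
        *from = buf + order[(i + 1) % nlines] * line;
    }
}

/* Predictable order: line 0, 1, 2, ...  Easy meat for a stride prefetcher. */
void stride_chain(char *buf, size_t nlines, size_t line, size_t *order)
{
    assert(nlines > 0 && line >= sizeof(char *));
    for (size_t i = 0; i < nlines; i++)
        order[i] = i;
    link_chain(buf, line, order, nlines);
}

/* Random order over the whole block, using a Fisher-Yates shuffle.
 * rand() is good enough for a sketch; a real benchmark may want better. */
void random_chain(char *buf, size_t nlines, size_t line, size_t *order)
{
    assert(nlines > 0 && line >= sizeof(char *));
    for (size_t i = 0; i < nlines; i++)
        order[i] = i;
    for (size_t i = nlines - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
    link_chain(buf, line, order, nlines);
}

/* Follow the chain from buf and count loads until it comes back to buf. */
size_t chain_length(char *buf)
{
    size_t n = 0;
    char **p = (char **)buf;
    do {
        p = (char **)*p;
        n++;
    } while ((char *)p != buf);
    return n;
}
```

Either way the chain visits every line exactly once per lap, so
chain_length() returns nlines for both; only the order of the visits, and
hence how well a prefetcher can guess the next one, changes.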

Evidently the mem_initialize routine was the one I was thinking of.  It
seems to be used by lat_dram_page rather than by lat_mem_rd.  I'll stare at
this some more.  So far I am having trouble getting gmake's attention.

Does your program touch each cache block just once?  Or does it, in random
order, touch all the words in the line?  The latter case should give a
somewhat lower access time than the full latency all the way to the DRAMs,
since the later touches of a line hit in cache.

>
> I ran the full lmbench:
> Host                 OS Description              Mhz  tlb  cache  mem   scal
>                                                      pages line   par   load
>                                                            bytes
> --------- ------------- ----------------------- ---- ----- ----- ------ ----
> amd-2214  Linux 2.6.9-3        x86_64-linux-gnu 2199    32   128 4.4800    1
> xeon-5150 Linux 2.6.9-3        x86_64-linux-gnu 2653     8   128 5.5500    1
>
> Strangely, the linux kernel disagrees on the cache line size for the amd
> (from dmesg):
> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
> CPU: L2 Cache: 1024K (64 bytes/line)
>
>> Secondarily, streams is a compiler test of loop unrolling, software
>> pipelining, and prefetch.
>
> Indeed.
>
>> Streams is easy meat for hardware prefetch units, since the access
>> patterns are sequential, but that is OK. It is a bandwidth test.
>
> Agreed.
>
>> Latency is much harder to get at.  lat_mem_rd tries fairly hard to
>> defeat hardware prefetch units by threading a chain of pointers through
>> a random set of cache blocks.   Other tests that don't do this get
>> screwy results.
>
> A random set of cache blocks?

Exactly.  Also, the prefetcher isn't exactly useless, since there is some
chance that the prefetch will load a line that hasn't yet been touched, and
that won't be evicted before it is used.

The inner loop is an unrolled sequence of p = (char **) *p; statements.
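
For illustration, a minimal sketch of such a loop with a timer around it.
The unroll factor of 10, the HOP macros and the names chase and ns_per_load
are assumptions of this sketch, not what lmbench actually uses, and
clock_gettime assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* One dependent load: the next address is not known until this load returns. */
#define HOP   p = (char **)*p;
#define HOP10 HOP HOP HOP HOP HOP HOP HOP HOP HOP HOP

/* Follow the chain for 10 * rounds loads.  The final pointer is returned so
 * the compiler cannot throw the loads away as dead code. */
char **chase(char **p, size_t rounds)
{
    while (rounds-- > 0) {
        HOP10
    }
    return p;
}

/* Average wall-clock nanoseconds per load over 10 * rounds loads.
 * *end receives the final pointer, again to keep the loads alive. */
double ns_per_load(char **start, size_t rounds, char ***end)
{
    struct timespec t0, t1;

    assert(rounds > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *end = chase(start, rounds);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (10.0 * (double)rounds);
}
```

Unrolling keeps the loop-counter overhead small relative to the loads, which
matters most when the chain fits in L1 and each load takes only a few cycles.
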
>
> You mean:
> http://www.bitmover.com/lmbench/
>
> I got the newest lmbench3.
>        The benchmark runs as two nested loops.  The outer loop is the
>        stride size.  The inner loop is the array size.
>
> The memory results:
> Memory latencies in nanoseconds - smaller is better
>     (WARNING - may not be correct, check graphs)
> ------------------------------------------------------------------------------
> Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
> --------- -------------   ---   ----   ----    --------    --------    -------
> amd-2214  Linux 2.6.9-3  2199 1.3650 5.4940   68.4       111.3
> xeon-5150 Linux 2.6.9-3  2653 1.1300 5.3000  101.5       114.2
>
>> lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
>> main memory plateaus.
>>
>> This is all leadup to asking for lat_mem_rd results for Woodcrest
>> (and Conroe, if there are any out there), and for dual-core Opterons (275)
>
> The above amd-2214 is the ddr2 version of the opteron 275.
>
> My latency numbers with plat are 98.5ns for a 38MB array.  A bit better
> than lmbench.
>
>> With both streams and lat_mem_rd, one can run one copy or multiple
>> copies, or use a single copy in multithread mode.  Many cited test
>> results I have been able to find use very vague English to describe
>> exactly what they have tested.  I
>
> My code is pretty simple, for an array of N ints I do:
>   while (p != 0)
>     {
>       p = a[p];
>     }
>
> That to me is random memory latency.  Although doing a 2 stage loop:
>    for 0 to N pages
>       pick a random page
>       for 0 to M (cachelines per page)
>           pick a random cacheline
>
> would minimize time spent with the page overhead.
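
A sketch of that two-stage ordering for an int array walked with
p = a[p], assuming one int is touched per cache line and that the chain
ends when it returns to index 0.  The names build_two_stage, walk and
shuffle, and the parameter ints_per_line, are hypothetical; this is not
Bill's actual program.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of v[0..n-1]; safe for n == 0 and n == 1. */
static void shuffle(size_t *v, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = v[i - 1];
        v[i - 1] = v[j];
        v[j] = t;
    }
}

/* Fill a[] (pages * lines * ints_per_line ints) so that p = a[p] visits the
 * first int of every cache line: pages in random order, and within each
 * page all lines in random order before moving on.  Returns 0 on success,
 * -1 if scratch memory could not be allocated. */
int build_two_stage(int *a, size_t pages, size_t lines, size_t ints_per_line)
{
    size_t n = pages * lines, k = 0;
    size_t *seq = malloc(n * sizeof *seq);
    size_t *pg = malloc(pages * sizeof *pg);
    size_t *ln = malloc(lines * sizeof *ln);

    if (!seq || !pg || !ln) {
        free(seq); free(pg); free(ln);
        return -1;
    }
    for (size_t i = 0; i < pages; i++)
        pg[i] = i;
    shuffle(pg, pages);                      /* stage 1: random page order */
    for (size_t i = 0; i < pages; i++) {
        for (size_t j = 0; j < lines; j++)
            ln[j] = j;
        shuffle(ln, lines);                  /* stage 2: random line order */
        for (size_t j = 0; j < lines; j++)
            seq[k++] = (pg[i] * lines + ln[j]) * ints_per_line;
    }
    for (size_t i = 0; i < n; i++)           /* link each stop to the next */
        a[seq[i]] = (int)seq[(i + 1) % n];
    free(seq); free(pg); free(ln);
    return 0;
}

/* The walk from the message: follow a[] until the index returns to 0.
 * Returns the number of loads performed. */
size_t walk(const int *a)
{
    size_t loads = 1;
    int p = a[0];
    while (p != 0) {
        p = a[p];
        loads++;
    }
    return loads;
}
```

With 4 KB pages, 64-byte lines and 4-byte ints, that would be lines = 64 and
ints_per_line = 16; one lap of walk() then performs pages * 64 loads while
paying for a new page only once per 64 of them.
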
>
>
>> prefer running two copies of stream rather than using OpenMP - I want
>> to measure bandwidth, not inter-core synchronization.
>
> I prefer it synchronized.  Otherwise 2 streams might get out of sync, and
> while one gets 8GB/sec, and another gets 8GB/sec, they didn't do it at the
> same time.   In my benchmark I take the min of all start times and the max
> of all stop times.  That way there is no cheating.
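
The min-start / max-stop accounting can be sketched as below.  The function
name aggregate_bandwidth and the choice of seconds and bytes as units are
assumptions for illustration; the timestamps are taken to come from one
shared wall clock.

```c
#include <assert.h>
#include <stddef.h>

/* start[i] and stop[i] are wall-clock seconds for copy i, bytes[i] is how
 * much data it moved.  Returns total bytes divided by the span from the
 * earliest start to the latest stop, in bytes per second. */
double aggregate_bandwidth(const double *start, const double *stop,
                           const double *bytes, size_t n)
{
    assert(n > 0);
    double first = start[0], last = stop[0], total = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (start[i] < first)
            first = start[i];
        if (stop[i] > last)
            last = stop[i];
        total += bytes[i];
    }
    return last > first ? total / (last - first) : 0.0;
}
```

Two copies that each move 8 GB in one second but run back to back (starts
0 and 1, stops 1 and 2) come out at 8 GB/sec in aggregate, not 16; only if
they really overlap (both 0 to 1) does the sum reach 16 GB/sec.
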
>
>> I'm interested in results for a single thread, but I am also
>> interested in results for multiple threads on dual-core chips and in
>> machines with multiple sockets of single or dual core chips.
>
> Since you're spending most of your time waiting on dram, there isn't much
> contention:
> 	http://cse.ucdavis.edu/~bill/intel-1vs4t.png
>
>
>> The bandwidth of a two-socket single-core machine, for example,
>> should be nearly twice the bandwidth of a single-socket dual-core
>> machine simply because the threads are using different memory
>> controllers.
>
> Judge for yourself:
> 	http://cse.ucdavis.edu/~bill/quad-numa.png (quad opteron)
>     http://cse.ucdavis.edu/~bill/altix-dplace.png
> 	http://cse.ucdavis.edu/~bill/intel-5150.png (woodcrest + ddr2-667)
> 	
>> Is this borne out by tests?  Four threads on a dual-dual should give
>> similar bandwidth per core to a single socket dual-core.  True?
>
> Yes, alas I don't have graphs of single socket dual core systems
> handy.
>
>> Next, considering a dual-core chip, to the extent that a single core
>> can saturate the memory controller, when both cores are active, there
>> should be a substantial drop in bandwidth per core.
>
> Right.
>
>> Latency is much more difficult.  I would expect that dual-core
>> lat_mem_rd results with both cores active should show only a slight
>> degradation of latency, due to occasional bus contention or resource
>> scheduling conflicts between the cores. A single memory controller
>> should be able to handle pointer chasing activity from multiple cores.
>> True?
>
> Right, see above graphs for 1 vs 4t.
>
>> Our server farm here is all dual-processor single core (Opteron 248)
>> and they seem to behave as expected: running two copies of stream gives
>> nearly double performance, and the latency degradation due to running
>> two copies of lat_mem_rd is nearly undetectable.  We don't have any
>> dual-core chips or any Intel chips.
>
> Right.
>
> -- 
> Bill Broadley
> Computational Science and Engineering
> UC Davis