[Beowulf] latency and bandwidth micro benchmarks

Tue Aug 29 04:28:46 PDT 2006

On Mon, 28 Aug 2006, Bill Broadley wrote:

>> latency is much harder to get at.  lat_mem_rd tries fairly hard to
>> defeat hardware
>> prefetch units by threading a chain of pointers through a random set
>> of cache
>> blocks.   Other tests that don't do this get screwy results.
>
> A random set of cache blocks?
>
> You mean:
> http://www.bitmover.com/lmbench/
>
> I got the newest lmbench3.
>       The  benchmark  runs as two nested loops.  The outer loop is the stride
>       size.  The inner loop is the array size.
>
> The memory results:
> Memory latencies in nanoseconds - smaller is better
>    (WARNING - may not be correct, check graphs)
> ------------------------------------------------------------------------------
> Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
> --------- -------------   ---   ----   ----    --------    --------    -------
> amd-2214  Linux 2.6.9-3  2199 1.3650 5.4940   68.4       111.3
> xeon-5150 Linux 2.6.9-3  2653 1.1300 5.3000  101.5       114.2

In benchmaster there is a memory read test that does indeed have a
random "shuffled access order" option.  It begins by filling a vector
with its own index list, then shuffling its contents.  Then it reads its
way through the list, using each index that it reads as the index of the
next read.  It shows a really dramatic difference in timing, as in (on
my admittedly slow and plodgy 1.87 MHz laptop) 0.4 nsec in streaming
access mode and 70 nsec in random access mode.  Yes, that is seven-zero.

Note that this is for a fairly long main (target) block (10,000,000
ints), and that the code is identical in the two cases -- the streaming
access mode does the same test but omits the pre-test shuffle.  This
defeats both prefetch and cache.

Make of it what you will.  Modern hardware "likes" streaming access.
Barring that, it likes local access, that fits into L2.  When your code
bounces all over God's own kindgom with its memory accesses, the very
prefetch units that optimize streaming very likely work against you,
constantly pulling in more memory than you are actually going to use in
the vicinity of the last memory reference and having to flush it all
away as it eventually languishes and the cache space is reclaimed.

    rgb

> My code is pretty simple, for an array of N ints I do:
>  while (p != 0)
>    {
>      p = a[p];
>    }
>
> That to me is random memory latency.  Although doing a 2 stage loop
> for 0 to N pages
>   pick a random page
>   for 0 to M (cachelines per page)
>       pick a random cacheline
>
> Would minimize time spent with the page overhead.

Ya, something like this but with a[p] shuffled.  I'm trying to remember
if I wrote the shuffle in such a way as to guarantee no loops (all
elements visited exactly one time) and think that I did but would have
to check the source to be sure (even though it is my own source, sigh).
If I ever do get back to work on benchmaster, I'll definitely make it
work this way if it doesn't already.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu