[Beowulf] AMD performance (was 500GB systems)

Bill Broadley bill at cse.ucdavis.edu
Sat Jan 12 19:23:28 PST 2013


On 01/12/2013 07:29 AM, Vincent Diepeveen wrote:
> Yes i was the inventor of that test to jump using a RNG randomly. 
> Paul Hsieh then modified it from calling the RNG and correcting for
> the RNG, to the direct pointer math as you show here.

Oh come now Vincent, "inventor" is a very strong word for something
that has been common knowledge for decades or longer.

Randomly accessing memory to quantify random memory latency is
completely obvious.  After all, the R in RAM stands for Random.

In particular, Larry McVoy's lmbench series went through several
iterations of measuring latency and clock speed because of
improvements in compilers and smarter CPU prefetching.  I had several
discussions with him about this, though I'm not sure if it was in
comp.*.sgi or maybe comp.arch.  Even a phone call or two.

In 1996 Larry McVoy defined memory latency, and I quote:
   lmbench measures back-to-back-load latency  because it is the only
   measurement that may be easily measured from software and because we
   feel that it is  what most software developers consider to be memory
   latency. Consider the following C code fragment:
        p = head;
        while (p->p_next)
           p = p->p_next;

Larry doesn't claim this is novel, just that it's what most software
developers would consider main memory latency... in 1996.
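
For reference, a minimal sketch of that style of measurement in C (the
pointer chain is assumed to be built already; the names here are mine,
not lmbench's):

    #include <stdio.h>
    #include <time.h>

    /* Chase a pre-built chain of pointers.  Every load depends on the
       previous one, so the CPU cannot overlap or prefetch the misses;
       the time per hop is the back-to-back load latency. */
    static double chase_ns(void **start, long hops)
    {
        struct timespec t0, t1;
        void **p = start;
        long i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < hops; i++)
            p = (void **)*p;            /* dependent load */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (p == NULL)                  /* keep p live; never true */
            printf("impossible\n");
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / (double)hops;
    }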

> This test is not so strong however when using multiple cores. It's
> only ok for 1 core.

Seems reasonable for multiple cores to me, especially since running
one or more jobs per CPU core is an extremely common use case in HPC.

> Setting up the pattern you can do way quicker than Paul Hsieh
> proposed. You can directly run around with the RNG setting it up, as
> we can prove that most RNG's that are pretty ok, don't use the built
> in RNG, they run around in O ( n log n ).

Since you only need to outsmart the on-chip prefetcher, which has just
0.5ns to make a prediction, just about anything even vaguely random
should work.  Anything but a fixed stride.
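
To be concrete, here's a rough sketch of the kind of setup I have in
mind, done once before anything is timed (assuming 64-byte cache
lines; the xorshift generator is just one example of "vaguely
random"):

    #include <stdlib.h>

    #define LINE 64                     /* assume 64-byte cache lines */

    static unsigned long long rng = 88172645463325252ULL;

    static unsigned long long xorshift64(void)
    {
        rng ^= rng << 13;
        rng ^= rng >> 7;
        rng ^= rng << 17;
        return rng;
    }

    /* Link every cache-line-sized slot in buf into one random cycle
       (Fisher-Yates shuffle), so each hop lands on an unpredictable
       line and a fixed-stride prefetcher has nothing to work with. */
    static void **build_chain(char *buf, size_t lines)
    {
        size_t i, j, tmp, first;
        size_t *order = malloc(lines * sizeof *order);

        for (i = 0; i < lines; i++)
            order[i] = i;
        for (i = lines - 1; i > 0; i--) {
            j = xorshift64() % (i + 1);
            tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (i = 0; i < lines; i++)
            *(void **)(buf + order[i] * LINE) =
                buf + order[(i + 1) % lines] * LINE;

        first = order[0];
        free(order);
        return (void **)(buf + first * LINE);
    }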

> The distance which you jump around influences the latency a lot that
> you will achieve.

Well, there are 3 useful distances:
1) in the same cache line
2) on an open page
3) anywhere else

As long as your requests are all in the "anywhere else" category you
are going to have minimal cache hits and minimal TLB hits.  Intel
machines often have a BIOS option to automatically prefetch the
adjacent cache line, so a minimum distance of 2 cache lines removes
that small factor.  The chance of a random jump picking an adjacent
cache line out of 400GB is pretty close to zero anyway.
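
To put a number on that: with 64-byte lines, the adjacent lines on
either side cover 128 bytes, so a uniformly random jump into 400GB
lands on one of them with probability of roughly 128 / 400e9, i.e.
around 3e-10 per hop.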

> 200 ns is very fast. Yet your test is crappy.
> 
> The glory of DDR3. Yet you write that you visit 3.3 bytes per
> cacheline. Which is less than 16GB of data.

Even 1 bit per cache line is plenty to measure latency, but I'm reading
64 bits per cache line.  In any case the fraction of a cacheline that
you read is unrelated to the latency of loading a random cacheline.

Additionally, reading any fraction of the cache lines gives you a good
measurement of cache line latency, as long as the memory they span is
much larger than the L3 cache.  Granted, you need an even larger
footprint if you want to quantify the TLB limitations, and larger
still if you use huge pages.
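
For scale: a 20MB L3 (to pick a plausible size) holds about 330,000
cache lines, while 400GB spans over 6 billion, so any given random
line has well under a 0.01% chance of already being resident.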

My measurement was the worst case: each cache line could be anywhere
in 400GB.

> I'm getting more bytes per jump out of every cacheline and in my test
> of course every core can read from the same cache line like other
> cores can. So i simply splitted it up in 8 bytes chunks the entire
> RAM.

That is different from my test: for multiple threads I give each
thread its own area of memory.  That doesn't make either
implementation more or less right; they are just measuring different
things.
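
Roughly, the layout looks something like this (a pthread sketch with
illustrative names only; each worker would build and chase a chain in
its own region using its own seed):

    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 8

    struct worker {
        char   *region;                 /* private memory, never shared */
        size_t  bytes;
        unsigned long long seed;        /* distinct seed per thread     */
        double  start, stop;            /* filled in by run()           */
    };

    static void *run(void *arg)
    {
        struct worker *w = arg;
        /* ... build a chain in w->region using w->seed, record
           w->start, chase the chain for a while, record w->stop ... */
        (void)w;
        return NULL;
    }

    int main(void)
    {
        struct worker w[NTHREADS];
        pthread_t     t[NTHREADS];
        size_t        per_thread = (size_t)1 << 28;   /* 256MB each */
        int i;

        for (i = 0; i < NTHREADS; i++) {
            w[i].bytes  = per_thread;
            w[i].region = malloc(per_thread);
            w[i].seed   = 12345 + 67890ULL * i;  /* different per core */
            pthread_create(&t[i], NULL, run, &w[i]);
        }
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }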

> Parallellizing the above code with direct pointer math is not ok as 
> every core jumps in the same manner. So clever caches will predict
> you.

I use a different random sequence for each core, so no matter how
clever the cache is, it won't know which address to prefetch until the
pending memory read finishes.

> That's why i initialize every cores RNG with different values and
> jump according to the RNG. Later on i correct for the time it takes
> to calculate the RNG value, which is under 3 ns anyway at most
> architectures. Tad more at itanium as it doesn't know how to rotate,
> a crucial thing for most RNG's!

Ah, maybe that's why your numbers are so far off.  Why not set up the
pointer chain ahead of time?  After all, no matter how fast/efficient
the RNG is, you are measuring more than just memory latency.

> Another issue is the way to measure. I'm not measuring start and stop
> of every core. I let them run for a while, then i take a long
> interval of a second or 30 to measure, have all cores run another 30
> seconds further while the measurement goes so that not a bunch of
> cores can measure while the other cores already finished.
> 
> Cheating not allowed!

Of course.  I take the most pessimistic timing: I calculate it from
min(list of thread start times) to max(list of thread stop times).  So
no cheating is allowed.
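
In code, that window works out to something like this (just a sketch,
given per-thread start/stop timestamps):

    /* Pessimistic window: from the first thread to start to the last
       thread to finish, so a core that finishes early cannot make the
       aggregate result look better than it is. */
    static double timing_window(const double *starts,
                                const double *stops, int n)
    {
        double begin = starts[0], end = stops[0];
        int i;

        for (i = 1; i < n; i++) {
            if (starts[i] < begin) begin = starts[i];
            if (stops[i]  > end)   end   = stops[i];
        }
        return end - begin;             /* seconds spanning all threads */
    }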


