Athlon memory speed asymmetry

Tue Feb 25 16:20:32 PST 2003

On 25 Feb 2003, Joe Landman wrote:

> Hi Robert:
> 
> I am sure someone has pointed it out to you, but ...
... 
> >  for (i=0; i<size; i+=stride){
> >    aindex = ai[i];
> >  }
> 
> But this one is.  
> 
> What compiler and optimization level do you use, and did you look at the
> disassembled binary to verify that these loops are represented as you
> wrote them?

CFLAGS = -O1 -g $(DEFINES) -fno-inline

and gcc.  According to its documentation, I don't really need the
-fno-inline flag; the compiler isn't supposed to unroll loops or
inline functions at -O1 OR -O2 (and I see no difference in results when
I change it).

Also, both ai and aindex are global variables (or pointers).  From my
past discussions on this, that seems to matter.  The compiler is
less likely to optimize away code that references outside the local
address space of a subroutine, as it cannot be certain that other code
threads aren't "simultaneously" accessing the variable.
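
A minimal sketch of the loop in question, assuming ai and aindex are
declared at file scope as described above.  The function names and the
returned load count are hypothetical, added only for illustration.  The
volatile variant is not part of the original benchmark; it is shown as a
stricter alternative, since accesses to a volatile object may not be
optimized away.

```c
#include <stddef.h>

int *ai;                /* global index vector, as in the post */
int aindex;             /* global destination, as in the post */
volatile int vaindex;   /* assumption: volatile variant, not in the original */

/* Hypothetical helper: walk ai[] with the given stride, storing each
   element in the global aindex.  Returns the number of loads issued.
   stride must be nonzero or the loop never terminates. */
size_t read_stride(size_t size, size_t stride)
{
    size_t i, loads = 0;

    for (i = 0; i < size; i += stride) {
        aindex = ai[i];         /* the statement being timed */
        loads++;
    }
    return loads;
}

/* Same walk, but every store goes to a volatile object, so the
   compiler has to perform each assignment. */
size_t read_stride_volatile(size_t size, size_t stride)
{
    size_t i, loads = 0;

    for (i = 0; i < size; i += stride) {
        vaindex = ai[i];
        loads++;
    }
    return loads;
}
```

Looking at the output of gcc -S for both versions would be one way to
check whether the plain global store survives at a given -O level.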

I confess I haven't looked at the disassembled code, but the benchmark
in noisy mode:

  a) Does scale the "empty loop" timing with its size in believable
ways, so it is very likely doing the loop itself.

  b) CHANGES the time if I comment out the aindex = ai[i], decreasing by
about a third (0.4 nsec) relative to the loop with the aindex = ai[i] in
it.

  c) CHANGES AGAIN if I put two aindex = ai[i] lines in the loop,
increasing by about a third (0.4 nsec).

So in principle one can be at least reasonably sure that the
statement isn't going away and takes roughly 0.4 nsec to complete.  This
would be half a clock cycle (?) on a 1.33 GHz system, in L1 cache?
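
The arithmetic behind that estimate, as a small sketch.  The function
names are hypothetical.

```c
/* Clock period in nanoseconds for a clock rate given in GHz. */
double clock_period_nsec(double ghz)
{
    return 1.0 / ghz;
}

/* Number of clock cycles spanned by a time given in nanoseconds. */
double nsec_to_cycles(double nsec, double ghz)
{
    return nsec * ghz;
}
```

At 1.33 GHz the period comes out at about 0.75 nsec, so 0.4 nsec is
roughly 0.53 cycles, consistent with "half a clock cycle".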

This doesn't mean that your concerns are groundless, however.  One thing
I perpetually struggle with is getting answers for some test not to
depend on the surrounding code, so they can be "definitively correct",
and in particular to get a good subtraction of the "empty" loop times to
be able to get a decent measure of the timing of its contents.
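
The empty-loop subtraction can be sketched as follows.  This is not the
cpu_rate code; the helper names are hypothetical, clock() stands in for
whatever timer the benchmark really uses, and the volatile loop counter
is an assumption made here so that the empty loop is not removed.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical "empty" loop: same loop control, no body.  The counter
   is volatile (an assumption of this sketch) so the loop is kept. */
void empty_loop(size_t size, size_t stride)
{
    volatile size_t i;

    for (i = 0; i < size; i += stride)
        ;
}

/* CPU seconds consumed by one call of fn(size, stride). */
double time_call(void (*fn)(size_t, size_t), size_t size, size_t stride)
{
    clock_t start = clock();

    fn(size, stride);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time per iteration of the loop contents, in nanoseconds, after
   subtracting the empty-loop time from the full-loop time. */
double per_iter_nsec(double full_sec, double empty_sec, size_t iters)
{
    return (full_sec - empty_sec) / (double)iters * 1e9;
}
```

The subtraction is only as good as the assumption that the loop
overhead is the same in both versions, which is exactly what is hard to
guarantee.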

This turns out to be nearly impossible -- sometimes changing the code
seems to change critical alignments so that even though I make a change
in part a) and perhaps get sane changes in timings there, the timings in
part b) suddenly change even though the code THERE didn't change at all.
Then there are all the "interesting" things that can happen if you do
certain benchmarks multiplying things or dividing things by certain
numbers, e.g. 2.0.

In a way, this is all fair, since I generally don't hand optimize (at
the assembler level) production code, and I have found no way to control
alignment in gcc on i386, so one seems to be stuck with whatever the
compiler turns out.

At any rate, I think that the numbers cpu_rate returns are accurate in
the sense of being reproducible, though probably not as accurate as I
would like in absolute terms (but not so inaccurate as to be irrelevant).

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu