Athlon memory speed asymmetry

Tue Feb 25 18:12:55 PST 2003

On Tue, 2003-02-25 at 19:20, Robert G. Brown wrote:

> So in principle one can be at least reasonably sure that the
> statement isn't going away and takes roughly 0.4 nsec to complete.  This
> would be half a clock cycle (?) on a 1.33 GHz system, in L1 cache?

Possibly.  Many factors to consider (prefetch, etc).  Precision in these
things involves some deep hardware/assembler/state dives.
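
As a quick sanity check on the arithmetic in the quoted question: 1 GHz is
one cycle per nanosecond, so 0.4 nsec at 1.33 GHz is about 0.53 cycles,
i.e. roughly half a cycle. A minimal sketch (the helper name ns_to_cycles
is made up for illustration):

```c
#include <assert.h>

/* Hypothetical helper: convert a time in nanoseconds to clock cycles.
   1 GHz is one cycle per nanosecond, so cycles = ns * GHz. */
static double ns_to_cycles(double ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return ns * clock_ghz;
}
```

This only converts units. It says nothing about whether the statement
really retires in that time once prefetch, pipelining and the like are
taken into account.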

Have a look at Troy Baer's excellent lperfex, and the oprofile work, to
give you at least a "grand canonical" view of the code.  With perfex on
the SGI systems, I was often able to catch interesting behavior.  I have
not used lperfex as extensively as perfex, but it is a good tool for
understanding what is happening in the code (and it enables even more
precise experimentation of the black-box variety that you have engaged
in).  Timing calipers are nice, and deltas as a coarse function of the
extra code are good.  Lperfex and its ilk let you measure what is
happening via the processor's hardware performance counters.  Quite
useful if you are working on a deep understanding of a code.

> This doesn't mean that your concerns are groundless, however.  One thing
> I perpetually struggle with is getting answers for some test not to
> depend on the surrounding code, so they can be "definitively correct",
> and in particular to get a good subtraction of the "empty" loop times to
> be able to get a decent measure of the timing of its contents.

A good optimizer will make the empty loop go away.  It might be better
to insert a non-optimizable section into both loops, for example:

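long long big = 0;
for (/* ... loop bounds ... */)
  {
   /* ... normal code */

   /* non-optimizable stuff */
   big += aindex;
  }
printf("big = %lld\n", big);   /* %lld, since big is a long long */
for (/* ... same loop bounds ... */)
  {
   /* non-optimizable stuff */
   big += aindex;
  }
printf("big = %lld\n", big);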

This is quite similar in concept to what John McCalpin did in STREAM.
I believe that he found that the optimizers made the loops go away
unless he did this.  There is a slight overhead, but it is the same
overhead for each loop, so it subtracts out and the measurements become
more accurate w.r.t. the optimizers.
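
To make the idea concrete, here is a minimal sketch of the two loops with
timing calipers around each.  Everything specific in it is an assumption
for illustration: the function names, the use of clock() as the caliper,
the a[index] array standing in for the accumulated value, and the
floating-point update standing in for the "normal code".

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical caliper: CPU time in seconds via the standard clock(). */
static double now_sec(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Loop containing the code under test plus the accumulator. */
static long long loop_with_work(const int *a, double *x, size_t n,
                                double *elapsed)
{
    assert(a != NULL && x != NULL && elapsed != NULL);
    long long big = 0;
    double t0 = now_sec();
    for (size_t index = 0; index < n; index++) {
        x[index] = x[index] * 1.000001 + 1.0; /* stand-in "normal code" */
        big += a[index];                      /* non-optimizable stuff */
    }
    *elapsed = now_sec() - t0;
    return big;
}

/* Baseline loop: the same accumulator and nothing else. */
static long long loop_baseline(const int *a, size_t n, double *elapsed)
{
    assert(a != NULL && elapsed != NULL);
    long long big = 0;
    double t0 = now_sec();
    for (size_t index = 0; index < n; index++)
        big += a[index];                      /* non-optimizable stuff */
    *elapsed = now_sec() - t0;
    return big;
}
```

The estimate for the contents is then the first elapsed time minus the
second.  The returned sums have to be used (for instance printed with
"%lld"), and the array should be filled at run time; otherwise the
compiler may still be able to drop or fold the loops.  For short loops
the resolution of clock() is coarse, so n needs to be large or the
measurement repeated.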

> This turns out to be nearly impossible -- sometimes changing the code
> seems to change critical alignments so that even though I make a change
> in part a) and perhaps get sane changes in timings there, the timings in
> part b) suddenly change even though the code THERE didn't change at all.
> Then there are all the "interesting" things that can happen if you do
> certain benchmarks multiplying things or dividing things by certain
> numbers, e.g. 2.0.

Alignments... have you looked at -malign-double (an x86 target option),
-falign-loops, and -falign-functions?  See
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize%20Options
for details on the -falign-* options.  Sometimes they work well,
sometimes not so well.

The great thing is that (unlike on many RISC processors) you don't get
traps on unaligned accesses.  The terrible thing is that you get
scheduler bubbles and stalls while the processor waits for the rest of
the data.  It is hard to force this to be deterministic with gcc.
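
One thing that can be pinned down from the source is data alignment (as
opposed to the code alignment that -falign-loops and friends affect).
The sketch below is only an illustration under assumptions: the 64-byte
cache-line size, the buffer, and the helper names are all made up here,
and it relies on C11 _Alignas.  It places a buffer on a cache-line
boundary, checks addresses, and reads or writes a double at a
deliberately misaligned offset through memcpy so the C stays well
defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed cache-line size; check your CPU */

/* Buffer whose start is forced onto a cache-line boundary (C11). */
static _Alignas(CACHE_LINE) unsigned char buffer[4 * CACHE_LINE];

/* Return nonzero if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Store a double at a possibly misaligned address. */
static void store_double(unsigned char *p, double d)
{
    memcpy(p, &d, sizeof d);
}

/* Load a double from a possibly misaligned address. */
static double load_double(const unsigned char *p)
{
    double d;
    memcpy(&d, p, sizeof d);
    return d;
}
```

With these, buffer + CACHE_LINE - 4 is an address where an 8-byte double
straddles two cache lines, while buffer itself is fully aligned.  Putting
the same timing calipers around loops over each case gives a controlled
comparison, although the compiler is free to turn the memcpy into
whatever load it likes.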

-- 
Joseph Landman <landman at scalableinformatics.com>