IBM ASCI White & FLOPS algorithm -

Robert G. Brown rgb at phy.duke.edu
Fri Jul 7 12:18:32 PDT 2000


On Fri, 7 Jul 2000, Lechner, David wrote:

> Hi Dr. Brown - 
> We are going to code this, and plan to add a step that increases Size to
> show the effect of passing the cache limit.
> Could you provide the actual Athlon results, though?  (You referenced them,
> but they don't appear to be at the bottom of the note.)
> Could you also clarify an example of Size and count for a 256 KB cache?  I
> am also a bit confused by the "COUNT*SIZE=250 million of the following in a
> second" -
> do we measure the elapsed time to do {250 million/size} passes of the 4
> ops per element (note 4 = mult., add, div., sub) - so results should be in
> GFLOPS once the division is done (flops/time)?
> 
> W/Regards/
>  Dave Lechner

I'm in the middle of making this into a very nice little routine that
does the whole thing -- the arithmetic in what I posted was done by hand
and I didn't get really accurate deltas with gettimeofday().
Unfortunately, I got sidetracked cleaning up a feature of lmbench and
haven't finished it.  Let's see...ok, I took the time and finished it
off (still somewhat crudely).
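
For concreteness, the delta I'm talking about is just a pair of
gettimeofday() calls bracketing the timed loop.  A minimal, hypothetical
illustration (not the code in the tarball; the scalar kernel here is only
there to give the timer something to measure):

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
   struct timeval t0, t1;
   volatile double x = 1.0;   /* volatile so the loop isn't optimized away */
   double delta;
   long i;

   gettimeofday(&t0, NULL);
   for (i = 0; i < 100000000L; i++)
      x = (1.0 + x)*(2.0 - x)/2.0;   /* the same four-op expression */
   gettimeofday(&t1, NULL);

   /* microsecond resolution is plenty if the region runs for ~a second */
   delta = (double)(t1.tv_sec - t0.tv_sec)
         + 1.0e-6*(double)(t1.tv_usec - t0.tv_usec);
   printf("delta = %f seconds (x = %f)\n", delta, x);
   return 0;
}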

I hereby cast this out to the world.  It is still highly experimental -
even with the "-r" option, for example, there is considerable variance
from run to run, so I'll probably arrange for multiple calls from a
script to do the averaging and standard deviation instead of (as many)
multiple calls within the binary, which don't appear to produce much
variance even with a full second of sleep in between.
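
The averaging itself is nothing fancier than the usual mean and sample
standard deviation over the per-run rates.  A rough sketch of the
reporting end (the names here are just for illustration, not what's in
the tarball):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Print mean +/- sample standard deviation of nruns BogoMFLOPS values. */
void report_rate(const double *rate, int nruns)
{
   double mean = 0.0, var = 0.0;
   int i;

   if (nruns < 1) return;
   for (i = 0; i < nruns; i++) mean += rate[i];
   mean /= nruns;
   for (i = 0; i < nruns; i++) var += (rate[i] - mean)*(rate[i] - mean);
   if (nruns > 1) var /= (nruns - 1);
   printf("BogoMFLOPS = %.1f +/- %.1f\n", mean, sqrt(var));
}

int main(int argc, char **argv)
{
   double rate[64];
   int i, n = 0;

   /* usage: feed it the rates from the separate runs on the command line */
   for (i = 1; i < argc && n < 64; i++) rate[n++] = atof(argv[i]);
   report_rate(rate, n);
   return 0;
}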

One thing it already does is allow you to see in gory detail the dropoff
in performance when the L1 cache boundaries are crossed.  For example,
with -s 1000 (which should fit in cache) my 300 MHz PII produces a
creditable ~200 BogoMFLOPS.  Up the size to 100000 (which doesn't fit
into cache) and it drops to around 40 BogoMFLOPS.  Oh, well.  This is
the kind of differential that makes ATLAS a good idea and suggests that
we invent some standard system autotuning parameters (perhaps to
publish in /proc) to use to drive portable autotuned code.

Just a thought, of course...
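
To make the cache knee concrete, the sweep I have in mind boils down to
something like the following hypothetical standalone program.  It skips
the empty-loop correction and the rest of the care the real tool tries
to take, so the rates it prints are only rough:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* wall-clock seconds, so two calls can be subtracted for a delta */
static double wall_seconds(void)
{
   struct timeval tv;
   gettimeofday(&tv, NULL);
   return (double)tv.tv_sec + 1.0e-6*(double)tv.tv_usec;
}

int main(void)
{
   long size, i, k, count;
   double t, *x;

   for (size = 1000; size <= 1000000; size *= 10) {
      count = 400000000L/size;    /* keep the total work roughly constant */
      x = (double *) malloc(size*sizeof(double));
      for (i = 0; i < size; i++) x[i] = 1.0;

      t = wall_seconds();
      for (k = 0; k < count; k++)
         for (i = 0; i < size; i++)
            x[i] = (1.0 + x[i])*(2.0 - x[i])/2.0;
      t = wall_seconds() - t;

      /* 4 floating point ops per element per pass */
      printf("size %7ld: %6.1f BogoMFLOPS (x[0] = %f)\n",
             size, 4.0*count*size/t/1.0e6, x[0]);
      free(x);
   }
   return 0;
}

Printing x[0] keeps the compiler from discarding the loop entirely; the
rate should drop visibly once size*sizeof(double) no longer fits in L1,
and again past L2.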

   rgb

P.S. -- This is a total ALPHA of the code, although it seems to work
pretty well.  I plan to add separate tests of the transcendentals
(possibly in separate tools) a la Savage.  This is NOT intended to ever
do complex measurements -- only simple/atomic ones.
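
For the transcendental test, what I have in mind is a savage-style
kernel along these lines (sketched from memory, so treat the iteration
count and the exact chain of calls as approximate, not gospel):

#include <stdio.h>
#include <math.h>

int main(void)
{
   const long LOOPS = 2500000L;   /* arbitrary iteration count */
   double a = 1.0;
   long i;

   /* Each pass should add exactly 1.0 to a, so the distance of the
      final value from LOOPS + 1 is a crude measure of libm accuracy
      and the elapsed time a crude measure of transcendental speed. */
   for (i = 0; i < LOOPS; i++)
      a = tan(atan(exp(log(sqrt(a*a))))) + 1.0;

   printf("final a = %f (ideal %ld)\n", a, LOOPS + 1);
   return 0;
}

(Link with -lm and time it externally, or with the same gettimeofday()
bracket as above.)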

> 
> -----Original Message-----
> From: Robert G. Brown [mailto:rgb at phy.duke.edu]
> Sent: Thursday, July 06, 2000 2:45 PM
> To: Greg Lindahl
> Cc: Steven Timm; beowulf at beowulf.org
> Subject: RE: IBM ASCI White
> 
> 
> On Thu, 6 Jul 2000, Greg Lindahl wrote:
> 
> > > The $/GFlop is pretty good too, $8.9K/Gflop.  Has anyone
> > > beat this?
> > 
> > Of course -- it's hard to build a cluster that costs that much for list
> > price. The FSL system was cheaper than that.
> 
> I'd second this -- whatever a "GFLOP" is in a cluster environment
> (apparently, by agreement, it's the simple aggregate sum of the single-CPU
> GFLOPS, which didn't mean much in the first place;-).  900 MHz Athlons
> can deliver just about exactly a billion floats per second for simple
> loops that live in cache (see below), for a node cost of order $1-1.5K.
> Dual high-clock PIIIs can likely deliver just about a billion floats
> per second for perhaps $2.5K (I'm not being very careful about getting
> absolutely current pricing, so don't sue me if I'm off by a few hundred
> dollars).
> 
> Oh -- my definition of a GFLOP is COUNT*SIZE=250 million of the
> following in a second:
> 
> for(i=0;i<SIZE;i++){
>   x[i] = 1.0;
> }
> 
> for(k=1;k<=COUNT;k++){
>         for(i=0;i<SIZE;i++){
>                 x[i] = (1.0 + x[i])*(2.0 - x[i])/2.0;
>         }
> }
> 
> Which is one addition, one multiplication, one subtraction and one
> division per element (plus loop and address arithmetic that I try to
> compensate for in the timing), and it is stable (the final x[i] stays
> 1.0 to within system roundoff) so you can do it a lot of times.  SIZE
> needs to be small enough for the x[] vector to fit into L1 cache.  Then
> one can figure out the effect of cache, and so forth.
> 
> Sure, it's not a LINPACK GFLOP.  Nor does it tell one much about e.g.
> transcendentals, the effect of L1 and L2 cache speeds and latencies,
> main memory bandwidths and latencies, the effect of context switches and
> much, much more.  However, it is very close to what most people think of
> when they speak of a "floating point operation" and it is a real-world
> measurement made with actual compiled code, not a theoretical peak.
> FWIW, a 400 MHz PII comes in at about 250 MFLOPS and a 667 MHz Alpha
> comes in at about 800 MFLOPS (using the Digital compiler).
> 
> Perhaps we're not QUITE at 1 MFLOP/dollar yet.  However, we're within
> spitting distance of it -- even if one (cynically enough;-) degrades
> these measurements by a factor of two or four we're within a factor of
> two to four.  So the IBM price/performance is high by (surprise)
> approximately a factor of 2-4, depending of course on what the GFLOP
> rating is with this simple measure.
> 
> [Those who don't like my GFLOPS are welcome to dislike them, BTW -- I'm
> only moderately fond of them myself.  Standard Disclaimer:  The only
> meaningful benchmark is YOUR APPLICATION.  The rest of them are just tea
> leaves.]
> 
>    rgb
> 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> (Note -- I subtract out the empty loop time).
> 
> #============================================================
> # cbench benchmark run on host webtest
> # CPU = Athlon at 900 MHz, Total RAM = 64 MB
> # Time (empty) = empty loop
> # 5.55user 0.01system 0:05.58elapsed
> # Time (full) = (10 Billion flops)
> # 14.77user 0.00system 0:14.77elapsed
> 
> speed = 10B/(14.77 - 5.55) = 1085 MFLOPS.  Astounding.  Must pipeline
> the floats at least.  Too bad this speed isn't reflected in my Monte
> Carlo routines...
> 
> 
> #============================================================
> # cbench benchmark run on host b1
> # CPU = PII at 400 MHz, Total RAM = 512 MB
> # Time (empty) = (empty loop)
> # 12.67user 0.00system 0:12.67elapsed
> # Time (full) = (10 Billion flops)
> # 53.46user 0.00system 0:53.45elapsed
> 
> speed = 10B/(53.46 - 12.67) = 245 MFLOPS.  Not bad.
> 
> #============================================================
> # cbench benchmark run on host qcd1
> # CPU = alpha_21264 at 667 MHz, Total RAM = 512 MB
> # Compiler is ccc, forced to actually execute the loop.
> # Time (empty) =
> # 0.00user 0.00system 0:00.00elapsed
> # Time (full) = (doing 10 Billion FLOPS)
> # 12.93user 0.00system 0:12.94elapsed
> 
> So, speed = 10B/(12.93 - 0.0) = 773 MFLOPS.  Not bad at all.  Of course
> it COSTS a whole lot more per FLOP than an Athlon or P6...
> 
> 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpu-rate-0.0.1.tgz
Type: application/octet-stream
Size: 21147 bytes
Desc: Tarball of BogoMFLOP generating code.  Have Fun.
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20000707/8cf2168e/attachment.obj>

