How many Gflops?
Robert G. Brown rgb at phy.duke.edu
Fri May 11 14:07:32 PDT 2001
On Fri, 11 May 2001, Rob Simac wrote:

> I would like to find out if anyone knows how many Gflops at Athlon
> 1.3Ghz CPU can perform at peak.

There is the perpetual question of "what's a gigaflop" that makes this question ambiguous if not meaningless. However, I'll give you at least one answer and you can judge how meaningful it really is for you.

In L2 I've measured about 270-275 peak MFLOPS (double precision) with cpu-rate (http://www.phy.duke.edu/brahma) on a 1.2 GHz Tbird Athlon. cpu-rate averages the rate at which addition, subtraction, multiplication and division occur, where division is generally very slow and a rate limiting factor. Extrapolating (as is pretty reasonable to do in this case, in cache) to a 1.33 GHz Tbird one might get 300 MFLOPS (or only 0.3 of a GFLOPS -- not even ONE GFLOPS).

However, as one increases the size of the memory vectors one operates on (running out of main memory instead of cache) the rate drops off to about 115 MFLOPS. At least part of that good a performance (and it really is quite good, comparatively -- only Alphas benchmark faster out there) is due to the use of DDR, as in this regime floating point is limited by streaming large memory access speed (so stream MFLOPS becomes a viable measure of floating point speed if you prefer them to cpu-rate).

This is not peak, though. The cpu-rate numbers aren't peak either. It is quite possible for aggressive optimization, different compiler choices, hand-coded assembler, and perhaps the use of e.g. prefetch to improve them, and then there are the manufacturer's quoted theoretical peak "maximum FLOPS" which I've never seen or even heard of anybody who has seen but which might exist. cpu-rate also always involves SOME sort of vector addressing -- it doesn't just multiply four static variables a gazillion times and evaluate the rate, so it arguably isn't even close to a register-to-register peak rate without any need to access memory at all.
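
The clock extrapolation above (270-275 MFLOPS at 1.2 GHz to roughly 300 MFLOPS at 1.33 GHz) is plain linear scaling, which is only reasonable within one CPU family and while the working set stays in cache. A minimal sketch of that arithmetic follows; the function name `scale_by_clock` is made up for illustration and is not part of cpu-rate.

```python
def scale_by_clock(mflops, clock_ghz, target_clock_ghz):
    """Linearly scale an in-cache MFLOPS figure to another clock speed.

    Assumes the same CPU family and a working set that fits in cache,
    so that the rate is proportional to the core clock.
    """
    if clock_ghz <= 0 or target_clock_ghz <= 0:
        raise ValueError("clock speeds must be positive")
    return mflops * target_clock_ghz / clock_ghz


# Measured range quoted above: 270-275 MFLOPS at 1.2 GHz.
low = scale_by_clock(270.0, 1.2, 1.33)
high = scale_by_clock(275.0, 1.2, 1.33)
print(f"estimated in-cache rate at 1.33 GHz: {low:.0f}-{high:.0f} MFLOPS")
print(f"that is {low / 1000:.3f}-{high / 1000:.3f} GFLOPS")
```

The same scaling should not be applied to the out-of-cache figure of about 115 MFLOPS, since that rate is set by memory bandwidth rather than by the core clock.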
However, the cpu-rate numbers are based on straightforward compiled code and are at least MAYBE relevant to certain common operations in core loops.

Then there are LINPACK MFLOPS and probably others. MFLOPS is really a pretty meaningless measure, especially given that "peak" MFLOPS will seriously increase if the operation(s) in question is just addition and/or multiplication (which are often heavily optimized in the chip design). As an example of another trap, I've learned the hard way that many vendors (Intel, for example) optimize division by integers that are a power of two so that it is done by a bit shift instead of a full floating point division algorithm -- a measure of "FLOPS" based on (floating point!) multiplication or division by numbers that happen to be integers can be skewed by more than a factor of 2 up. Are these "peak" FLOPS? Or just absurdly unlikely accidents in most real code?

A more useful way of viewing and using measures like MFLOPS with all its many possible definitions is comparatively. The fact that an Athlon 1200 Tbird with DDR gets 270 or so peak double precision MFLOPS on cpu-rate is really pretty irrelevant unless your application EXACTLY resembles cpu-rate in its main core loop. However, the fact that it gets 270 peak while a 933 MHz PIII with ordinary PC133 gets only a bit more than 100 peak while an Athlon 800 MHz Tbird with PC133 gets perhaps 177 peak and a lowly 466 MHz Celeron gets about 50 peak is possibly relevant.

In both cases the peak scales nearly perfectly with CPU clock WITHIN families (Athlon vs P6-family) which gives us a certain amount of warm fuzziness -- the benchmark is insensitive to the (>>very<< different) main memory speeds, as it should be in this range (for vectors maybe 40-80K in length that fit easily in all the L2 caches). It also shows that for code of this type, the Athlon blows the pants off of the P6.
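
The within-family scaling claim can be checked by dividing each quoted cpu-rate peak by its clock. The sketch below uses only the approximate figures quoted above ("a bit more than 100" and "about 50" are entered as 100 and 50, which is an assumption); the helper names are hypothetical.

```python
from collections import defaultdict

# Approximate cpu-rate peaks quoted above:
# (system, family, clock in MHz, double precision MFLOPS).
RESULTS = [
    ("Athlon Tbird 1200, DDR", "Athlon", 1200, 270.0),
    ("Athlon Tbird 800, PC133", "Athlon", 800, 177.0),
    ("PIII 933, PC133", "P6", 933, 100.0),  # "a bit more than 100"
    ("Celeron 466", "P6", 466, 50.0),       # "about 50"
]


def mflops_per_mhz(mflops, clock_mhz):
    """Normalize a rate by clock so different clock speeds are comparable."""
    return mflops / clock_mhz


def family_ranges(results):
    """Return {family: (min, max)} of MFLOPS per MHz across that family."""
    per_family = defaultdict(list)
    for _name, family, clock_mhz, mflops in results:
        per_family[family].append(mflops_per_mhz(mflops, clock_mhz))
    return {family: (min(v), max(v)) for family, v in per_family.items()}


for name, family, clock_mhz, mflops in RESULTS:
    print(f"{name:<26} {mflops_per_mhz(mflops, clock_mhz):.3f} MFLOPS/MHz")

for family, (low, high) in family_ranges(RESULTS).items():
    print(f"{family}: {low:.3f} to {high:.3f} MFLOPS/MHz")
```

With these rounded inputs each family lands within a couple of percent of a single MFLOPS/MHz value, and the Athlon value is roughly twice the P6 one. That holds only for this in-cache benchmark.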
HOWEVER, other code that I run shows the Athlon slightly underperforming equivalent clock compared to the P6 family. Then there are the very different and not particularly CPU clock-speed proportional results that hold when the vectors are much bigger than L2. Then there is the fact that cache sizes differ. Then there are latency dominated (instead of streaming vector memory dominated) results to consider. Your mileage can and almost certainly will vary.

Aside from this sort of VERY crude rough comparison, the only really useful purpose for the FLOPS rating of a system (any of them!) is to put it into a grant proposal or bandy it around to impress the more ignorant and impressionable of your friends. Otherwise you should seek to prototype and benchmark your actual application, or hope that your code nearly exactly resembles lmbench, or LINPACK, or stream, or cpu-rate, or any of the various components of SPEC.

   rgb

-- 
Robert G. Brown	       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
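
As a rough illustration of "prototype and benchmark your actual application", the sketch below times a stand-in core loop at several vector lengths and reports a rate for each. Everything here is an assumption for illustration: `kernel` is a hypothetical placeholder for your own loop, and in Python the interpreter overhead dominates, so the printed numbers say nothing about the hardware's floating point rate. Only the structure (count the operations, time several repeats, keep the best, vary the working set) is the point.

```python
import time
from array import array


def kernel(x, y):
    """Hypothetical stand-in for an application's core loop: y[i] += 2.5 * x[i].

    Returns the number of floating point operations performed
    (one multiply and one add per element).
    """
    for i in range(len(x)):
        y[i] += 2.5 * x[i]
    return 2 * len(x)


def best_rate(n, repeats=3):
    """Return the best observed rate, in MFLOPS, for vectors of length n."""
    x = array("d", [1.0] * n)
    y = array("d", [0.0] * n)
    best = float("inf")
    flops = 0
    for _ in range(repeats):
        start = time.perf_counter()
        flops = kernel(x, y)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    return flops / max(best, 1e-9) / 1e6


if __name__ == "__main__":
    # Vary the working set; in compiled code the rate would be expected
    # to drop once the vectors no longer fit in cache.
    for n in (1_000, 10_000, 100_000):
        print(f"n = {n:>7}: {best_rate(n):.1f} MFLOPS (interpreter-bound)")
```

Replacing `kernel` with a call into the real application code, compiled, is what makes the measurement meaningful.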
More information about the Beowulf mailing list
