[Beowulf] Picking a processor
Robert G. Brown
rgb at phy.duke.edu
Mon Dec 13 05:13:37 PST 2004
On Sun, 12 Dec 2004, SF Husain wrote:
> I'm currently designing a beowulf style parallel processor and am trying to
> decide which processor to use for the nodes. My project requires my final
> design for the parallel processor to be able to provide a sustained throughput
> of 0.25 TFlops.
> My research tells me that in general the flop rate scales up linearly.
> My trouble is that I'm having difficulty finding estimates for the flop rates
> of the processors I'm looking at.
> I've looked at the specfp2000 results but as far as I can tell their numbers
> do not easily convert to a flop rate. Could anyone tell me how I can find
> estimates for the flop rates of processors or if there is any rough sort of
> conversion that I can do on these spec (or any other) benchmark results.
> I'm aware that the actual rate is dependent on type of work given to the
> processors. However my project is only a design exercise aimed at developing
> research skills so even a very rough conversion or source of sample results
> would be suitable for my purposes.
> If anyone could help that would be great.
Sigh. I suppose that the first question one has to ask is "what's a
FLOPS", isn't it? "Floating point operations per second" seems a bit
ambiguous, given the wide range of things that can be considered a
floating point operation.
The second question one MIGHT ask is why you are designing a system with
a targeted FLOPS rating regardless of budget and regardless of the
relationship between FLOPS and the work you actually want to
accomplish. This is not a specious question -- in actual fact the
"correct" thing to do for a variety of fairly obvious reasons is to
design a system with a mind towards a particular work capacity of the
work you want to accomplish, or more reasonably, to take what you can
afford to spend and design a machine that can do as much of that work as
possible per dollar spent. Optimizing cost-benefit is what the game is
all about, not being able to boast of 0.25 TFLOPS (whatever that means).
This matters quite a bit, because an optimal design for a real parallel
project may well spend your budget in ways other than just optimizing FLOPS.
This will almost certainly be true if your application has any sort of
rich structure at all -- interprocessor communications over the network,
a large memory footprint, a mix of local and nonlocal memory accesses,
transcendental function calls.
After the lecture, I suppose I'll answer your question. There are
several benchmarks that return FLOPS. "The" benchmark that returns
FLOPS is likely linpack, a linear algebra benchmark. However, stream
returns MFLOPS fractionated across four distinct floating point
operations -- copying a vector of floats, scaling a vector of floats,
multiplying two vectors of floats, and multiplying and adding three
vectors of floats (multiply/add is a single pipelined operation on many
processors). Stream acts strictly on a sequential vector too large
(>4x) to fit in cache and is as much a memory speed benchmark as it is
floating point. cpu_rate contains embedded stream with the ability to
vary vector size, so you can measure FLOPS for vectors inside cache.
This will (naturally) give you a much higher rate (and let you reach
your design goal with many fewer processors) but won't necessarily mean
anything to your application, which you assert doesn't matter but
obviously does. It also gives you a stream-like +-*/ (including
DIVISION, which is much slower than multiplication or addition) test
that returns "bogoFLOPS" and which will let you pump the number of
processors much HIGHER. It contains a "savage" benchmark which measures
transcendental rates. lmbench contains microbenchmark code for directly
measuring floating point rates. Finally, you can always look up vendor
spec per processor and use "theoretical peak" FLOPS, which will be
something like a multiplier of 0.5-2x the CPU clock.
Hmmm, that doesn't really answer your question, but it does indicate why
the question is a silly one. Pick your definition and you can push the
FLOPS rating of any given CPU up and down over
nearly an order of magnitude, and be able to fully justify the number
either way. You might as well design a system with a target aggregate
(bits*clock) where bits is the datapath width and clock is the CPU
clock. That's likely to be within a factor of two or so of comparable
across many architectures....
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu