energy costs

Wed Mar 12 07:23:50 PST 2003

> >> PS Pentium 4 sustained performance from memory is about
> >>    5% of peak (stream triad).
> >
> >that should be 50%, I think.
> 
> Nope ... not "from memory".
> 
> A 2.8 GHz P4 using SSE2 instructions can deliver two
> 64-bit floating point results per clock or 5.6 Gflops
> peak performance at this clock.  The stream triad (a 
> from-memory, multiply-add operation) for a 2.8 GHz 
> P4 produces only 200 Mflops (see stream website). The 
> arithmetic is then:
> 
> 200/5600 = .0357 or 3.57% (so 5% is a gift)

oh, I see.  to me, that's a strange definition of "peak",
since stream is, by design, always bottlenecked on memory
bandwidth.  the P4's FSB is either 3.2 or 4.3 GB/s, and it'll
deliver roughly 50% of that to stream.
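
to make the two notions of "peak" concrete, here is a rough sketch of the
bandwidth-bound ceiling for triad.  assumptions, not measurements: the 50%
delivered fraction is my estimate from above, and the 24 bytes and 2 flops
per iteration follow stream's own counting convention for
a[i] = b[i] + q*c[i] (which ignores any write-allocate traffic).  the
function and constant names are just for illustration.

```python
# Back-of-envelope ceiling for STREAM triad: a[i] = b[i] + q*c[i]
FLOPS_PER_ITER = 2        # one multiply + one add
BYTES_PER_ITER = 3 * 8    # two 64-bit loads + one 64-bit store


def triad_mflops(bandwidth_gb_s, delivered_fraction=0.5):
    """Mflops attainable if triad is limited purely by memory bandwidth.

    delivered_fraction is an assumed share of nominal FSB bandwidth
    that actually reaches the kernel.
    """
    bytes_per_s = bandwidth_gb_s * 1e9 * delivered_fraction
    return bytes_per_s / BYTES_PER_ITER * FLOPS_PER_ITER / 1e6


# ALU peak from the quoted text: 2.8 GHz x 2 results per clock
PEAK_MFLOPS = 2.8e3 * 2

for fsb in (3.2, 4.3):
    mflops = triad_mflops(fsb)
    print(f"FSB {fsb} GB/s: ~{mflops:.0f} Mflops, "
          f"{100 * mflops / PEAK_MFLOPS:.1f}% of ALU peak")
```

under those assumptions the ceiling lands in the same low-hundreds-of-Mflops
range as the stream number quoted above, which is the point: the small
percentage reflects the choice of denominator, not how well the memory
system is being used.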

> As you suggest, the P4 will (as does the Cray X1) do 
> significantly better when cache use/re-use is a 
> significant factor.

no, it's not a matter of reuse, but of what you consider "peak".

I think the real take-home message is that this sort of 
fraction-of-theoretical-peak is useless, and you need to look
at the actual numbers, possibly scaled by price.

as a matter of fact, I'm always slightly puzzled by this sort
of conversation.  yes, crays and vector computers in general 
are big/wide memory systems with a light scattering of ALUs,
a much different ratio than in the "cache-based" computing world.

but if your data is huge and uniform, don't you win big by 
partitioning (data or work grows as dim^2, while communication
at partition boundaries scales much more slowly)?  that would argue,
for instance, that you should run on a cluster of e7205 machines,
where each node delivers a bit more than the 200 Mflops above for
under $2k, and should scale quite nicely until your interconnect
runs out of steam at, say, several hundred CPUs.  the point is really
that stream-like codes are almost embarrassingly parallel.
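
a minimal sketch of the scaling argument, under assumptions I'm adding for
illustration only: a 2D n x n grid split into p square tiles (p a perfect
square), with each node exchanging a one-cell-wide halo on its four edges.
halo_ratio is a hypothetical helper, not anything from a real code.  work
per node goes as (n/sqrt(p))^2, communication as 4*n/sqrt(p), so the ratio
shrinks as the problem grows.

```python
import math


def halo_ratio(n, p):
    """Boundary cells exchanged per cell computed, per node.

    Assumes an n x n grid cut into p square tiles (p a perfect square)
    and a 1-cell halo on each of the four tile edges.
    """
    side = n / math.isqrt(p)   # tile edge length in cells
    work = side * side         # cells each node computes
    comm = 4 * side            # cells on the tile's four edges
    return comm / work


for n in (1_000, 10_000, 100_000):
    print(f"n={n}: p=16 -> {halo_ratio(n, 16):.5f}, "
          f"p=256 -> {halo_ratio(n, 256):.5f}")
```

a pure triad needs no exchange at all, so this is, if anything, pessimistic
for stream-like codes.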

so what's the cost per stream-triad gflop from Cray?
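
for the cluster side of that comparison, the arithmetic is just node price
over sustained triad rate.  a sketch using only the round numbers from this
thread ($2k per node, 200 Mflops from stream), ignoring interconnect, power
and admin costs; dollars_per_gflop is a name I made up for the example.

```python
def dollars_per_gflop(node_cost_usd, node_mflops):
    """Hardware cost per sustained Gflop for a single node."""
    return node_cost_usd * 1000.0 / node_mflops


# round figures from the thread: $2k node, 200 Mflops stream triad
print(dollars_per_gflop(2000, 200))
```

since the node is under $2k and delivers a bit more than 200 Mflops, the
real per-node figure would come in somewhat below what this prints.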