[Beowulf] crunch per kilowatt: GPU vs. CPU

Mon May 18 13:09:05 PDT 2009

Joe Landman wrote:
> Hi David
> 
> David Mathog wrote:
>> Although the folks now using CUDA are likely most interested in crunch
>> per unit time (time efficiency), perhaps some of you have measurements
>> and can comment on the energy efficiency of GPU vs. CPU computing?  That
>> is, which uses the fewest kilowatt-hours per unit of computation.  My guess
> 
> Using theoretical rather than "actual" performance, unless you get the
> same code doing the same computation on both units:
> 
> 1 GPU ~ 960 GFLOP single precision, ~100 GFLOP double precision @ 160W
> 
> 1 CPU ~ 4x (3 GHz x 4 DP flops/cycle) = 48 GFLOP double precision @ 75W
> 
> You can argue about these numbers a bit, but these are reasonably close.
> 
> This said, any application will not likely be 100% efficient.  So
> ignoring the efficiency issue, it's close to a wash for double precision
> FP calculation.  Not even close for single precision or integer
> calculations though, with the GPU providing far more computing power per
> watt than the CPU.
> 

And you didn't include the fact that a GPU needs a CPU to run.  So what
is the "right" way to compare that?  Does the GPU node now take
160W+110W?  The 110W is my estimate of the complete power draw of a node
with only one socket.  The flops are nice, but what attracts me to
GPUs is the memory bandwidth.  One of our weather models is much more
memory bandwidth bound than compute bound.
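
A minimal sketch of the two ways to do that accounting, using only the
theoretical peak numbers quoted in this thread (160W GPU, 75W CPU, 110W
for a complete one-socket node).  The variable names, and the choice to
show the GPU node both with and without the host CPU's flops counted,
are assumptions made for illustration, not measurements:

```python
# Theoretical double-precision peaks and power figures quoted in the thread.
GPU_DP_GFLOPS = 100.0   # ~100 GFLOP/s double precision per GPU
GPU_WATTS = 160.0
CPU_DP_GFLOPS = 48.0    # 4 cores x 3 GHz x 4 DP flops/cycle
CPU_WATTS = 75.0
NODE_WATTS = 110.0      # estimated draw of a complete one-socket node


def gflops_per_watt(gflops, watts):
    """Peak GFLOP/s delivered per watt drawn."""
    return gflops / watts


# Chip-only comparison (ignores the host the GPU needs).
chip_gpu = gflops_per_watt(GPU_DP_GFLOPS, GPU_WATTS)   # 0.625
chip_cpu = gflops_per_watt(CPU_DP_GFLOPS, CPU_WATTS)   # 0.64

# Node-level comparison: the GPU node draws 160W + 110W.
node_cpu = gflops_per_watt(CPU_DP_GFLOPS, NODE_WATTS)
node_gpu_only = gflops_per_watt(GPU_DP_GFLOPS, GPU_WATTS + NODE_WATTS)
# Assumption: the host CPU also computes while the GPU runs.
node_gpu_plus_cpu = gflops_per_watt(
    GPU_DP_GFLOPS + CPU_DP_GFLOPS, GPU_WATTS + NODE_WATTS
)

for label, value in [
    ("GPU chip only", chip_gpu),
    ("CPU chip only", chip_cpu),
    ("CPU node", node_cpu),
    ("GPU node, GPU flops only", node_gpu_only),
    ("GPU node, GPU + CPU flops", node_gpu_plus_cpu),
]:
    print(f"{label:28s} {value:.3f} GFLOP/s per watt")
```

On these peak figures the answer depends on whether the host CPU is
counted as doing useful work, which is the accounting question above.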

Our current thinking is to have 1 GPU per socket.  So comparing two GTX285s
(159 GB/s each, 318 GB/s total) to a dual-socket Harpertown node
(12.8 GB/s per node):

318/12.8 = 24.8x

Nehalem changes the game though.  I think (please correct me if I am wrong)
that the theoretical peak bandwidth is 25.6 GB/s per socket, or 51.2 GB/s
for a dual-socket node.  So the ratio is:

318/51.2 = 6.2x

Not nearly as good.  We should see a good bump in memory bandwidth at the
end of the year with the GT300.  Leaked (unofficial) specs put the memory
bandwidth at 256 GB/s per GPU, which would change the ratio to 10x
(512/51.2).  A nice speed-up, but I am concerned that the next generation
NVIDIA still won't provide enough of an advantage to justify the work to
use it (but we will push forward because we don't really know what will
happen).
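
The three ratios above can be reproduced with a short sketch.  It assumes
a two-socket node with one GPU per socket, which is what 318 = 2 x 159
and 51.2 = 2 x 25.6 imply; the function name is hypothetical, and the
GT300 figure is the leaked, unofficial number:

```python
GPUS_PER_NODE = 2  # assumption: one GPU per socket on a two-socket node


def bandwidth_ratio(gpu_gbs_each, host_gbs_per_node, gpus=GPUS_PER_NODE):
    """Aggregate GPU memory bandwidth divided by host node bandwidth."""
    return gpus * gpu_gbs_each / host_gbs_per_node


HARPERTOWN_NODE_GBS = 12.8       # per node
NEHALEM_NODE_GBS = 2 * 25.6      # 25.6 GB/s per socket, two sockets

gtx285_vs_harpertown = bandwidth_ratio(159.0, HARPERTOWN_NODE_GBS)
gtx285_vs_nehalem = bandwidth_ratio(159.0, NEHALEM_NODE_GBS)
gt300_vs_nehalem = bandwidth_ratio(256.0, NEHALEM_NODE_GBS)  # leaked spec

print(f"GTX285 vs Harpertown: {gtx285_vs_harpertown:.1f}x")
print(f"GTX285 vs Nehalem:    {gtx285_vs_nehalem:.1f}x")
print(f"GT300 vs Nehalem:     {gt300_vs_nehalem:.1f}x")
```

These are peak-to-peak ratios only; sustained bandwidth on either side
will be lower.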

Craig

> 
>> is that the most energy efficient solution is still CPU based,  because
>> Intel and AMD have both worked hard on that issue.  Seems like this sort
>> of information would be useful when comparing CPU and GPU based
>> solutions, in order to determine the overall cost efficiency (total
>> crunch / (purchase price + operating expense)).
> 
> NVIDIA has recently been marketing these as "replaced a large cluster
> with a much smaller one".  Then we have seen infrastructure cost
> comparisons.  These make sense if your applications map onto the CUDA
> systems very well (GPU-HMMer), and you can distribute computations
> across many of them (MPI-HMMer) at once to realize these performance
> deltas.
> 
> Not all applications will work this way though, so YMMV.  And your power
> too...
> 
>>
>> Thanks,
>>
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 

-- 
Craig Tierney (craig.tierney at noaa.gov)