[Beowulf] gpu numbers

Massimiliano Fatica mfatica at gmail.com
Sun Nov 23 17:33:56 PST 2008

On the GT200, there are 30 multiprocessors, each with 8 single-precision
(SP) units, 1 double-precision (DP) unit and 1 special function unit
(SFU).
Each SP and DP unit can perform a multiply-add per clock. The SFU, when
it is not busy computing a transcendental function, can also perform a
single-precision multiply (this is where the 3 comes from in the peak
performance number for single precision).
So the peak performance numbers are:
    SP: 240*3*Clock
    DP: 30*2*Clock

The C1060 has a clock of 1.296 GHz (SP peak = 933 Gflops, DP peak = 77
Gflops); the S1070 has a clock of 1.44 GHz (SP peak = 1036 Gflops, DP
peak = 86 Gflops per GPU). These are peak numbers. In practice the
difference between single and double precision is between 4x and 6x,
because most double-precision codes run at close to 80-90% of peak: you
can actually keep the DP unit fed with data.
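
As a quick check of the arithmetic above, here is a minimal Python sketch that recomputes the per-GPU peaks from the unit counts and clocks quoted in this message. The function and constant names are made up for illustration only.

```python
# Per-GPU peak rates for the GT200, from the unit counts quoted above.
MULTIPROCESSORS = 30
SP_UNITS_PER_MP = 8   # single-precision units per multiprocessor
DP_UNITS_PER_MP = 1   # double-precision units per multiprocessor


def peak_gflops(clock_ghz):
    """Return (sp_peak, dp_peak) in Gflops for one GPU at the given clock."""
    sp_units = MULTIPROCESSORS * SP_UNITS_PER_MP  # 240
    dp_units = MULTIPROCESSORS * DP_UNITS_PER_MP  # 30
    # SP: multiply-add (2 flops) plus one SFU multiply (1 flop) per clock.
    sp = sp_units * 3 * clock_ghz
    # DP: multiply-add only (2 flops) per clock.
    dp = dp_units * 2 * clock_ghz
    return sp, dp


for name, clock in (("C1060", 1.296), ("S1070 (per GPU)", 1.44)):
    sp, dp = peak_gflops(clock)
    print(f"{name}: SP {sp:.1f} Gflops, DP {dp:.1f} Gflops, "
          f"peak ratio {sp / dp:.0f}x")
```

The peak SP/DP ratio comes out at 12x for either clock, which is the "1/12" figure in the quoted message below. The 4x to 6x observed in practice reflects how close real codes get to each peak.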

The power numbers include not only the GPU but also the memory (and we
are talking about 4 GB of GDDR3), which can account for several tens of
watts.
There was a CUDA tutorial at SC08. These are some numbers presented
by Hess Corporation on the performance of a GPU cluster for seismic
imaging: a 128-GPU cluster (32 S1070s) outperforms a 3000-CPU
cluster, with speedups varying from 5x to 60x depending on the

If we are talking double precision, here is a preliminary Linpack
result for a small problem (only 4 GB) on a standard Sun Ultra 24 (one
Core2 Extreme CPU Q6850 @ 3 GHz, standard 530 W power supply) with one
Tesla C1060:
T/V                N    NB     P     Q               Time                 Gflops
WR00L2L2       23040   960     1     1              97.91              8.328e+01
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0048141 ...... PASSED
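
The Gflops figure in this output can be reproduced from N and the run time. The sketch below assumes the usual HPL operation count of 2/3*N^3 + 3/2*N^2 and 8-byte matrix elements. The function names are illustrative and are not part of HPL.

```python
def hpl_gflops(n, seconds):
    """Gflops for an N x N solve, assuming the standard HPL flop count."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


def matrix_gib(n):
    """Size of an N x N double-precision matrix in GiB (8 bytes/element)."""
    return n * n * 8 / 2**30


n, seconds = 23040, 97.91  # values taken from the output above
print(f"matrix size: {matrix_gib(n):.2f} GiB")
print(f"rate: {hpl_gflops(n, seconds):.2f} Gflops")
```

This gives a matrix of just under 4 GiB and a rate that agrees with the reported 8.328e+01 Gflops.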

The workstation alone performs at around 38 Gflops. So even in
double precision, you can use a cheap single-socket machine and still
get results comparable to a more expensive server configuration with
Xeons and a multi-socket motherboard. If you are talking clusters, you
can reduce the number of nodes and get a significant saving on the
network if you are using IB or 10GigE.


On Sun, Nov 23, 2008 at 3:00 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
> one thing I was surprised at is the substantial penalty that the current
> gtx280-based gpus pay for double-precision.
> I think I understand the SP throughput - since these are genetically
> graphics processors, their main flop-relevant op is blend:
>        pixA * alpha + pixB * beta
> that's 3 sp flops, and indeed the quoted 933 gflops = 240 cores @ 1.3 GHz *
> 2mul1add/cycle.  I'm a little surprised
> that they quote only 78 DP gflops - 1/12 the SP rate.
> I counted ops when doing base-10 multiplication on paper,
> and it seemed to require only 4x each SP mul.  I guess the problem might
> simply be that each core isn't OOO like CPUs,
> or that emulating DP doesn't optimally utilize the available 2mul+add.
> note also: 78 DP Gflops/~200W.  3.2 GHz QC CPU: 51 DP Gflops/~200W.
> figuring power is a bit tricky, but price is even worse.  for power,
> NV claims <200W (not less than 150 in any of the GTX280 reviews, though).
> but you have to add in a host, which will probably be around 300W;
> assuming you go for the C1070, the final is 4*78/(800+300).
> a comparison CPU-based machine would be something like 2*51/350W.
> amusingly, almost the same DP flops per watt ;)
> does anyone know whether the reputed hordes of commercial Cuda apps
> mostly stick to SP?