[Beowulf] The GPU power envelope (was difference between accelerators)

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Thu Mar 14 20:52:25 PDT 2013

I think what you've got here is basically the idea that things that are
closer together consume less power and cost less, because you don't have
the "interface cost".

An FPU sitting on the bus with the integer ALU inside the chip has minimum
overhead: no going on and off chip with the associated level shifting, no
charging and discharging of transmission lines, etc.

A coprocessor sitting on the bus with the CPU is a bit worse. The
connection has to go off chip, so you have to change voltage levels and
physically charge and discharge a longer trace/transmission line.

A graphics card on a PCI bus has not only the on/off chip transition, it
has more than one, because the PCI interface also goes through that. More
capacitance to charge and discharge, too.

A second node connected with some wideband interconnect, but in a
different box...

You get the idea.
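
To put rough numbers on that progression, here is a minimal sketch of the
energy spent per signal transition, using the usual C*V^2 estimate for one
charge/discharge cycle of a capacitive load. The capacitance and voltage
values are illustrative placeholders I made up to show the trend, not
measurements of any real part, and the function names are hypothetical.

```python
def switching_energy_joules(capacitance_farads, swing_volts):
    """Energy drawn from the supply for one charge/discharge cycle: C * V^2."""
    return capacitance_farads * swing_volts ** 2


# (name, load capacitance in farads, signal swing in volts)
# Placeholder values chosen only to illustrate the trend, NOT measured data.
LINKS = [
    ("on-chip wire (FPU next to ALU)", 0.2e-12, 1.0),
    ("off-chip trace to a coprocessor", 5e-12, 1.8),
    ("PCI card (trace + connector + second chip)", 15e-12, 3.3),
]


def relative_costs(links):
    """Return (name, energy relative to the first link) for each link."""
    base = switching_energy_joules(links[0][1], links[0][2])
    return [(name, switching_energy_joules(c, v) / base) for name, c, v in links]


for name, ratio in relative_costs(LINKS):
    print(f"{name}: {ratio:.0f}x the on-chip energy per transition")
```

The point is only the scaling: each step outward raises both the
capacitance being driven and (usually) the voltage swing, and the swing
counts squared.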

This is why people are VERY interested in on-chip optical transmitters and
receivers (e.g., VCSEL lasers and avalanche photodiodes, APDs).  You could
envision a processor with an array of transmitters and receivers creating
point-to-point links to other processors within its field of view.  Only
one "change of media".

On 3/14/13 4:29 AM, "Vincent Diepeveen" <diep at xs4all.nl> wrote:

>On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:
>>>> I think HSA is potentially interesting for HPC, too.  I really expect
>>>> AMD and/or Intel to ship products this year that have a C/GPU chip
>>>> mounted on the same interposer as some high-bandwidth ram.
>>> How can an integrated gpu outperform a gpgpu card?
>> if you want dedicated gpu computation, a gpu card is ideal.
>> obviously, integrated GPUs reduce the PCIe latency overhead,
>> and/or have an advantage in directly accessing host memory.
>> I'm merely pointing out that the market has already transitioned to
>> integrated gpus - the vote on this is closed.
>> the real question is what direction the onboard gpu takes:
>> how integrated it becomes with the cpu, and how it will take
>> advantage of upcoming 2.5d-stacked in-package dram.
>Integrated gpu's will of course always have a very limited power budget.
>So the gpgpu cards with the same generation gpu for gpgpu from the
>same manufacturer with a bigger power envelope
>is always going to be 10x faster of course.
>If you'd get 10 computers with 10 apu's, even for a small price, you
>still would need an expensive network and switch to
>handle those, so that's 10 ports. So that's 1000 dollar a port
>roughly, so that's $10k extra, and we assume then that your
>massive supercomputer doesn't get into trouble further up in
>bandwidth otherwise your network cost suddenly gets $3000 a port
>instead of $2k a port, with factor 10 ports more.
>That's always going to lose it moneywise from a single gpgpu card
>that's 10x faster.
>Whether that's Xeon Phi version X Nvidia Kx0X, it's always going to
>be 10x faster of course and 10x cheaper for massive supercomputing.
>>> Something like what is it 25 watt versus 250 watt, what will be
>>> faster?
>> per-watt?  per dollar?  per transaction?
>> the integrated gpu is, of course, merely a smaller number of cores than
>> the separate card, so will perform the same, relative to a proportional
>> slice of the appropriate-generation add-in-card.
>> trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,
>> delivering 584 SP gflops - 65W iirc.  but only 30 GB/s for it and
>> the CPU.
>> let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
>> about 1/3 the cores, flops, and something less than 1/5 the bandwidth.
>> no doubt the lower bandwidth will hurt some codes, and the lower
>> host-gpu latency will help others.  I don't know whether APUs have the
>> same SP/DP ratio as comparable add-in cards.
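
The ratios quoted above can be checked with a few lines of Python. This is
just a sketch using the figures as given in the quoted text; the peak-flops
formula assumes 2 single-precision flops per core per clock (one fused
multiply-add), which is my assumption about how the 584 figure was derived,
and the variable names are mine.

```python
# Figures as quoted in the thread (a10-5700 APU vs. a 6930 add-in card).
apu = {"cores": 384, "clock_mhz": 760, "gflops": 584, "gb_per_s": 30}
card = {"cores": 1280, "gflops": 1920, "gb_per_s": 154}


def peak_sp_gflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Peak SP Gflops, ASSUMING one multiply-add (2 flops) per core per clock."""
    return cores * clock_mhz * flops_per_core_per_clock / 1000.0


# APU as a fraction of the card, for each quoted metric.
ratios = {k: apu[k] / card[k] for k in ("cores", "gflops", "gb_per_s")}

# Memory bandwidth available per flop of peak compute (bytes per flop).
apu_bytes_per_flop = apu["gb_per_s"] / apu["gflops"]
card_bytes_per_flop = card["gb_per_s"] / card["gflops"]

print(f"APU peak from cores x clock: {peak_sp_gflops(384, 760):.0f} Gflops")
for key, value in ratios.items():
    print(f"{key}: APU is {value:.2f} of the card")
print(f"bytes/flop: APU {apu_bytes_per_flop:.3f}, card {card_bytes_per_flop:.3f}")
```

So the "about 1/3" and "less than 1/5" figures hold, and the APU has
noticeably less memory bandwidth per flop than the card, bearing in mind
that its 30 GB/s is also shared with the CPU.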
>>> I assume you will not build 10 nodes with 10 cpu's with integrated
>>> gpu in order to rival a single card.
>> no, as I said, the premise of my suggestion of in-package ram is
>> that it would permit glueless tiling of these chips.  the number
>> you could tile in a 1U chassis would primarily be a question of
>> power dissipation.
>> 32x 40W units would be easy.  perhaps 20 60W units.  since I'm just
>> making up numbers here, I'm going to claim that performance will be
>> twice that of trinity (a nice round 1 Tflop apiece), or 20 Tflops/RU.
>> I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s, 2GB/socket
>> - a total capacity of 40 GB/RU at 1280 GB/s.
>> compare this to a 1U server hosting 2-3 K10 cards = 4.6 Tflops and
>> 320 GB/s each.  13.8 Tflops, 960 GB/s.  similar power dissipation.
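
The per-rack-unit totals in that comparison follow from simple
multiplication; a small sketch for anyone who wants to vary the guesses.
All inputs are the speculative numbers from the quoted text, except that
the K10 figure is taken as 4.6 Tflops per card (single precision), which is
my reading of what the quoted text intends. The function and variable names
are hypothetical.

```python
def per_rack_unit(units, tflops_each, gb_each, gb_per_s_each):
    """Aggregate compute, memory capacity, and bandwidth for one 1U chassis."""
    return {
        "tflops": units * tflops_each,
        "capacity_gb": units * gb_each,
        "bandwidth_gb_per_s": units * gb_per_s_each,
    }


# Tiled APUs: 20 sockets of 60 W, 1 Tflop each, 4 chips of 4 Gbit apiece.
gb_per_socket = 4 * 4 / 8  # 16 Gbit = 2 GB
tiled = per_rack_unit(units=20, tflops_each=1.0,
                      gb_each=gb_per_socket, gb_per_s_each=64)

# 1U server with 3 K10 cards, ASSUMING 4.6 Tflops SP and 320 GB/s per card.
# Memory capacity per card is not given in the thread, so it is left at 0.
k10 = per_rack_unit(units=3, tflops_each=4.6, gb_each=0, gb_per_s_each=320)

print("tiled APUs:", tiled)
print("3x K10:    ", k10)
```

On these made-up inputs the tiled design comes out ahead by roughly 1.4x in
flops and 1.3x in bandwidth per rack unit, so the conclusion is quite
sensitive to the assumed per-socket numbers.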