[Beowulf] The GPU power envelope (was difference between accelerators)

Thu Mar 14 04:29:02 PDT 2013

On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:
>
>>> I think HSA is potentially interesting for HPC, too.  I really expect
>>> AMD and/or Intel to ship products this year that have a C/GPU chip
>>> mounted on the same interposer as some high-bandwidth ram.
>>
>> How can an integrated gpu outperform a gpgpu card?
>
> if you want dedicated gpu computation, a gpu card is ideal.
> obviously, integrated GPUs reduce the PCIe latency overhead,
> and/or have an advantage in directly accessing host memory.
>
> I'm merely pointing out that the market has already transitioned to
> integrated gpus - the vote on this is closed.
> the real question is what direction the onboard gpu takes:
> how integrated it becomes with the cpu, and how it will take  
> advantage of upcoming 2.5d-stacked in-package dram.

Integrated gpus will of course always have a very limited power budget.

So a gpgpu card with the same generation gpu from the same
manufacturer, but with a much bigger power envelope,
is always going to be something like 10x faster.

If you buy 10 computers with 10 apus, even at a low price, you
still need an expensive network and switch to connect them,
so that's 10 ports. At roughly $1000 a port that's $10k extra,
and that assumes your massive supercomputer doesn't run into
bandwidth trouble further up; otherwise your network cost suddenly
becomes $3000 a port instead of $2k a port, with a factor 10 more ports.

On cost, that is always going to lose to a single gpgpu card
that's 10x faster.
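
To put rough numbers on the port argument above, here is a minimal Python sketch. The $1000-a-port figure is the round number used above; charging exactly one port per node, and one port for the host that carries the gpgpu card, are assumptions for illustration only. `network_cost` is a made-up helper name.

```python
# Rough network-cost sketch. Port price is the round number from the
# post; one port per node is an assumption, not a measured figure.

def network_cost(nodes, dollars_per_port):
    """Total interconnect cost when every node needs one switch port."""
    return nodes * dollars_per_port

apu_cluster = network_cost(10, 1000)   # ten APU nodes at ~$1000 a port
single_host = network_cost(1, 1000)    # one host carrying the gpgpu card

print(apu_cluster)                 # 10000
print(apu_cluster - single_host)   # 9000
```

If the per-port price rises by $1000 because a bigger fabric is needed, the same function shows the penalty scaling with the node count as well.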

Whether that's Xeon Phi version X or Nvidia Kx0X, the card is always
going to be 10x faster of course and 10x cheaper for massive supercomputing.

>
>> Something like, what is it, 25 watt versus 250 watt: what will be
>> faster?
>
> per-watt?  per dollar?  per transaction?
>
> the integrated gpu is, of course, merely a smaller number of the same
> cores as the separate card, so will perform the same, relative to a
> proportional slice of the appropriate-generation add-in-card.
>
> trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,  
> delivering 584 SP gflops - 65W iirc.  but only 30 GB/s for it and  
> the CPU.
>
> let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
> about 1/3 the cores, flops, and something less than 1/5 the bandwidth.
> no doubt the lower bandwidth will hurt some codes, and the lower
> host-gpu latency will help others.  I don't know whether APUs have
> the same SP/DP ratio as comparable add-in cards.
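
The quoted figures can be checked with a few lines of Python. This is only a sketch: it assumes peak single precision is counted as 2 flops per core per cycle (one fused multiply-add), which is the usual convention but is not stated in the quoted text. `peak_sp_gflops` is a made-up helper name.

```python
# Peak-rate sanity check for the quoted APU vs add-in-card numbers.
# Assumption: 2 SP flops per core per cycle (one multiply-add).

def peak_sp_gflops(cores, clock_mhz, flops_per_cycle=2):
    """Theoretical peak single-precision Gflops."""
    return cores * flops_per_cycle * clock_mhz / 1000.0

apu_gflops = peak_sp_gflops(384, 760)   # trinity a10-5700
card_gflops = 1920.0                    # 6930 card, as quoted
apu_bw, card_bw = 30.0, 154.0           # GB/s, as quoted

print(round(apu_gflops))                    # 584
print(round(384 / 1280, 2))                 # 0.3   (cores)
print(round(apu_gflops / card_gflops, 2))   # 0.3   (flops)
print(round(apu_bw / card_bw, 2))           # 0.19  (bandwidth)
```

So the "about 1/3" and "less than 1/5" figures are consistent with the quoted specs under that assumption.
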
>
>> I assume you will not build 10 nodes with 10 cpus with integrated
>> gpu in order to rival a single card.
>
> no, as I said, the premise of my suggestion of in-package ram is  
> that it would permit glueless tiling of these chips.  the number  
> you could tile in a 1U chassis would primarily be a question of  
> power dissipation.
> 32x 40W units would be easy.  perhaps 20 60W units.  since I'm just
> making up numbers here, I'm going to claim that performance will be
> twice that of trinity (a nice round 1 Tflop apiece), or 20 Tflops/RU.
> I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s and
> 2GB/socket - a total capacity of 40 GB/RU at 1280 GB/s.
>
> compare this to a 1U server hosting 2-3 K10 cards at 4.6 Tflops and
> 320 GB/s each: 13.8 Tflops and 960 GB/s for three cards.  similar
> power dissipation.
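
For anyone who wants to replay the per-rack-unit comparison, a small Python sketch follows. The tiled-APU numbers are the speculative ones from the quoted text (20 units, 1 Tflop and 64 GB/s each); the K10 figure is taken as 4.6 Tflops SP per card with three cards per 1U. The bytes-per-flop column is an extra derived figure, not something claimed in the thread, and `rack_unit` is a made-up helper name.

```python
# Per-rack-unit totals for the two configurations discussed above.
# All inputs are the round numbers from the thread, not measurements.

def rack_unit(units, tflops_each, gbs_each):
    """Return (total Tflops, total GB/s, bytes of bandwidth per peak flop)."""
    total_tflops = units * tflops_each
    total_gbs = units * gbs_each
    # GB/s divided by Gflops gives bytes per flop
    bytes_per_flop = total_gbs / (total_tflops * 1000.0)
    return total_tflops, total_gbs, bytes_per_flop

tiled = rack_unit(20, 1.0, 64)    # 20 tiled chips with in-package ram
k10 = rack_unit(3, 4.6, 320)      # 3 K10 cards in a 1U server

print("%.1f Tflops, %d GB/s, %.3f B/flop" % tiled)  # 20.0 Tflops, 1280 GB/s, 0.064 B/flop
print("%.1f Tflops, %d GB/s, %.3f B/flop" % k10)    # 13.8 Tflops, 960 GB/s, 0.070 B/flop
```

On these made-up numbers the two boxes end up with roughly the same bandwidth per flop, so the difference is mostly in total peak and in memory capacity.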