[Beowulf] The GPU power envelope (was difference between accelerators)
Vincent Diepeveen
diep at xs4all.nl
Thu Mar 14 04:29:02 PDT 2013
On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:
>
>>> I think HSA is potentially interesting for HPC, too.
>>> I really expect
>>> AMD and/or Intel to ship products this year that have a C/GPU chip
>>> mounted on
>>> the same interposer as some high-bandwidth ram.
>>
>> How can an integrated gpu outperform a gpgpu card?
>
> if you want dedicated gpu computation, a gpu card is ideal.
> obviously, integrated GPUs reduce the PCIe latency overhead,
> and/or have an advantage in directly accessing host memory.
>
> I'm merely pointing out that the market has already transitioned to
> putting integrated gpus - the vote on this is closed.
> the real question is what direction the onboard gpu takes:
> how integrated it becomes with the cpu, and how it will take
> advantage of upcoming 2.5d-stacked in-package dram.
Integrated gpus will of course always have a very limited power
budget. So the gpgpu card with the same generation of gpu, from the
same manufacturer but with a bigger power envelope, is of course
always going to be 10x faster.
If you bought 10 computers with 10 apus, even at a small price, you
would still need an expensive network and switch to connect them.
That's 10 ports at roughly $1,000 a port, so about $10k extra - and
that assumes your massive supercomputer doesn't get into trouble
further up in bandwidth, otherwise your network suddenly costs
$3,000 a port instead of $2,000, with a factor of 10 more ports.
Moneywise that's always going to lose to a single gpgpu card that's
10x faster.
Whether that card is a Xeon Phi version X or an Nvidia Kx0X, it's
always going to be 10x faster and 10x cheaper for massive
supercomputing, of course.
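In Python, the rough arithmetic (the dollar figures are the loose
assumptions above, not vendor quotes):

    apu_nodes = 10     # APU boxes needed to rival one big gpgpu card
    port_cheap = 1000  # ~$ per switch port at modest bandwidth
    port_fat = 3000    # ~$ per port if you need more bandwidth up the tree

    print(apu_nodes * port_cheap)  # 10000: extra fabric cost, cheap case
    print(apu_nodes * port_fat)    # 30000: extra fabric cost, fat case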
>
>> Something like what is it 25 watt versus 250 watt, what will be
>> faster?
>
> per-watt? per dollar? per transaction?
>
> the integrated gpu is, of course, merely a smaller number of the
> same cores as the separate card, so it will perform the same
> relative to a proportional slice of the appropriate-generation
> add-in card.
>
> trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,
> delivering 584 SP Gflops - 65W iirc. but only 30 GB/s shared
> between it and the CPU.
>
> let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
> about 1/3 the cores and flops, and something less than 1/5 the
> bandwidth. no doubt the lower bandwidth will hurt some codes, and
> the lower host-gpu latency will help others. I don't know whether
> APUs have the same SP/DP ratio as comparable add-in cards.
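(A quick sanity check of those figures, assuming the usual 2 SP flops
per core per cycle and a ~750 MHz clock for the 6930 - the clock is
an assumption, chosen because it matches the quoted 1920 Gflops:

    def sp_gflops(cores, ghz):
        # peak single precision: cores * clock * 2 flops/cycle (fma)
        return cores * ghz * 2

    print(sp_gflops(384, 0.760))   # ~584: trinity a10-5700
    print(sp_gflops(1280, 0.750))  # 1920: HD 6930

so the quoted flops numbers are internally consistent.)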
>
>> I assume you will not build 10 nodes with 10 cpus with integrated
>> gpus in order to rival a single card.
>
> no, as I said, the premise of my suggestion of in-package ram is
> that it would permit glueless tiling of these chips. the number
> you could tile in a 1U chassis would primarily be a question of
> power dissipation.
> 32x 40W units would be easy, perhaps 20 60W units. since I'm just
> making up numbers here, I'm going to claim that performance will be
> twice that of trinity (a nice round 1 Tflop apiece, or 20 Tflops/RU).
> I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s and
> 2GB/socket - a total capacity of 40 GB/RU at 1280 GB/s.
>
> compare this to a 1U server hosting 2-3 K10 cards at 4.6 Tflops and
> 320 GB/s each: 13.8 Tflops, 960 GB/s total, at similar power
> dissipation.
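(Spelling out those rack-level totals - all speculative numbers from
the paragraph above:

    sockets = 20          # 60W tiled APUs per 1U
    print(sockets * 1.0)  # 20.0 Tflops/RU at ~1 Tflop apiece
    print(sockets * 64)   # 1280 GB/s aggregate from in-package gddr5
    print(sockets * 2)    # 40 GB total capacity per RU

    k10_cards = 3           # take the 3-card case for the 1U K10 server
    print(k10_cards * 4.6)  # ~13.8 Tflops
    print(k10_cards * 320)  # 960 GB/s

so on these made-up numbers the tiled-APU chassis comes out ahead on
both flops and bandwidth, on paper.)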