[Beowulf] The GPU power envelope ------ Re: Beowulf Digest, Vol 109, Issue 22 ----

Sat Mar 16 20:01:21 PDT 2013

What budget do you have to build that FPGA?

With a project team budget under a few million dollars to design that
FPGA, you won't easily beat a GPU with an FPGA for things like matrix
calculations, and especially not for double precision floating point
calculations, generally speaking.

It would be interesting to know whether there are exceptions to the
rule as of today... ...like a guy such as Feng-hsiung Hsu (the Deep
Blue chip programmer), who single-handedly programmed a chip.

Every year that passes, the budget you need to beat the latest GPUs
increases.
Note I would want to extend the GPU versus FPGA discussion a tad more
and also include Xeon Phi in this list, as well as the latest IBM
incarnation, though that isn't a GPU of course. Yet it already runs,
what is it, 18 cores @ 72 threads or something (didn't check lately)?

IBM, so to speak, is scaling up its BlueGenes in number of cores
quicker than the GPUs are currently advancing there...

Beating these specialized HPC/GPGPU processors with an FPGA seems to
get tougher and tougher.

Of course you can do it for specific calculations, especially prime
numbers...  ...but at what budget?

On Mar 16, 2013, at 9:11 AM, Reiner Hartenstein wrote:

>
>
> which accelerators have the better power envelope?   GPUs or FPGAs?
>
>
> Best regards,
> Reiner
>
>
> On 15.03.2013 20:00, beowulf-request at beowulf.org wrote:
>> Today's Topics:
>>
>>    1. Re: The GPU power envelope (was difference between
>>       accelerators) (Lux, Jim (337C))
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Fri, 15 Mar 2013 03:52:25 +0000
>> From: "Lux, Jim (337C)" <james.p.lux at jpl.nasa.gov>
>> Subject: Re: [Beowulf] The GPU power envelope (was difference
>>     between accelerators)
>> To: Beowulf List <beowulf at beowulf.org>
>> Message-ID: <CD67E739.2E0F8%james.p.lux at jpl.nasa.gov>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> I think what you've got here is basically the idea that "things
>> that are closer consume less power and cost less" because you
>> don't have the "interface cost".
>>
>> A FPU sitting on the bus with the integer ALU inside the chip has
>> minimum overhead. No going on and off chip and the associated
>> level shifting, no charging and discharging of the transmission
>> lines, etc.
>>
>> A coprocessor sitting on the bus with the CPU is a bit worse. The
>> connection has to go off chip, so you have to change voltage
>> levels, and physically charge and discharge a longer
>> trace/transmission line.
>>
>> A graphics card on a PCI bus has not only the on/off chip
>> transition, it has more than one because the PCI interface also
>> goes through that. More capacitors to charge and discharge too.
>>
>> A second node connected with some wideband interconnect, but in a
>> different box... You get the idea.
>>
>> This is why people are VERY interested in on chip optical
>> transmitters and receivers (e.g. things like VCSELs and APDs). You
>> could envision a processor with an array of transmitters and
>> receivers to create point to point links to other processors that
>> are within the field of view. Only one "change of media".
>>
>> On 3/14/13 4:29 AM, "Vincent Diepeveen" <diep at xs4all.nl> wrote:
>>>
>>> On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:
>>>>
>>>>>> I think HSA is potentially interesting for HPC, too. I really
>>>>>> expect AMD and/or Intel to ship products this year that have a
>>>>>> C/GPU chip mounted on the same interposer as some
>>>>>> high-bandwidth ram.
>>>>> How can an integrated gpu outperform a gpgpu card?
>>>> if you want dedicated gpu computation, a gpu card is ideal.
>>>> obviously, integrated GPUs reduce the PCIe latency overhead,
>>>> and/or have an advantage in directly accessing host memory. I'm
>>>> merely pointing out that the market has already transitioned to
>>>> integrated gpus - the vote on this is closed. the real
>>>> question is what direction the onboard gpu takes: how integrated
>>>> it becomes with the cpu, and how it will take advantage of
>>>> upcoming 2.5d-stacked in-package dram.
>>> Integrated GPUs will of course always have a very limited power
>>> budget. So a gpgpu card with the same generation GPU from the
>>> same manufacturer, with a bigger power envelope, is always going
>>> to be 10x faster of course. If you'd get 10 computers with 10
>>> APUs, even for a small price, you would still need an expensive
>>> network and switch to connect them, so that's 10 ports. At
>>> roughly 1000 dollars a port, that's $10k extra, and we assume
>>> then that your massive supercomputer doesn't get into trouble
>>> further up in bandwidth; otherwise your network cost suddenly
>>> becomes $3000 a port instead of $2k a port, with a factor 10
>>> more ports. Moneywise that's always going to lose to a single
>>> gpgpu card that's 10x faster. Whether that's Xeon Phi version X
>>> or Nvidia Kx0X, it's always going to be 10x faster of course and
>>> 10x cheaper for massive supercomputing.
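
The network-cost argument in the paragraph above can be put into a
rough sketch. The per-port prices are the approximate figures quoted
there; the function name and the 100-port case ("factor 10 ports
more") are illustrative assumptions, not vendor numbers.

```python
def network_cost(ports, dollars_per_port):
    """Total network cost, assuming one switch port per node."""
    return ports * dollars_per_port


# 10 APU nodes standing in for one bigger card: 10 ports at roughly $1000 each.
small_cluster = network_cost(10, 1000)

# Assumed scale-up to 10x more ports, at the two per-port prices quoted.
big_at_2k = network_cost(100, 2000)
big_at_3k = network_cost(100, 3000)

print(f"10 ports at $1000 each:  ${small_cluster:,}")
print(f"100 ports at $2000 each: ${big_at_2k:,}")
print(f"100 ports at $3000 each: ${big_at_3k:,}")
print(f"penalty for the pricier fabric: ${big_at_3k - big_at_2k:,}")
```

The point of the sketch is only that the network bill grows with node
count and with per-port price at the same time, which is the cost the
single-card option avoids.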
>>>>
>>>>> Something like what is it 25 watt versus 250 watt, what will be  
>>>>> faster?
>>>> per-watt? per dollar? per transaction? the integrated gpu is, of
>>>> course, merely a smaller number of cores than the separate card,
>>>> so will perform the same, relative to a proportional slice of
>>>> the appropriate-generation add-in-card. trinity a10-5700 has 384
>>>> radeon 69xx cores running at 760 MHz, delivering 584 SP gflops -
>>>> 65W iirc. but only 30 GB/s for it and the CPU. let's compare
>>>> that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops. about
>>>> 1/3 the cores, flops, and something less than 1/5 the bandwidth.
>>>> no doubt the lower bandwidth will hurt some codes, and the lower
>>>> host-gpu latency will help others. I don't know whether APUs
>>>> have the same SP/DP ratio as comparable add-in cards.
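
The APU-versus-card figures quoted above can be checked with a few
lines of arithmetic. This is a minimal sketch, assuming peak single
precision throughput is cores x 2 flops (one multiply-add) x clock;
that formula and the function name are assumptions for illustration,
while the core counts, clock, bandwidth and card Gflops are the
numbers from the quoted text.

```python
def peak_sp_gflops(cores, clock_mhz, flops_per_cycle=2):
    """Peak SP GFLOPS, assuming one multiply-add (2 flops) per core per cycle."""
    return cores * flops_per_cycle * clock_mhz / 1000.0


# Trinity A10-5700 figures as quoted: 384 cores at 760 MHz, 30 GB/s shared.
apu_gflops = peak_sp_gflops(384, 760)
apu_bw = 30.0

# Add-in card figures as quoted: 1280 cores, 1920 Gflops, 154 GB/s.
card_cores, card_gflops, card_bw = 1280, 1920.0, 154.0

core_ratio = 384 / card_cores
flops_ratio = apu_gflops / card_gflops
bw_ratio = apu_bw / card_bw

print(f"APU peak: {apu_gflops:.0f} SP Gflops")
print(f"cores ratio:     {core_ratio:.2f}")
print(f"flops ratio:     {flops_ratio:.2f}")
print(f"bandwidth ratio: {bw_ratio:.2f}")
```

Under these assumptions the core and flops ratios come out near 0.3
and the bandwidth ratio just under 0.2, in line with the "about 1/3"
and "less than 1/5" estimates in the quoted text.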
>>>>>
>>>>> I assume you will not build 10 nodes with 10 cpu's with  
>>>>> integrated gpu in order to rival a single card.
>>>> no, as I said, the premise of my suggestion of in-package ram is
>>>> that it would permit glueless tiling of these chips. the number
>>>> you could tile in a 1U chassis would primarily be a question of
>>>> power dissipation. 32x 40W units would be easy. perhaps 20 60W
>>>> units. since I'm just making up numbers here, I'm going to claim
>>>> that performance will be twice that of trinity (a nice round 1
>>>> Tflop apiece), or 20 Tflops/RU. I speculate that 4x 4Gb in-package
>>>> gddr5 would deliver 64 GB/s, 2GB/socket - a total capacity of
>>>> 40 GB/RU at 1280 GB/s. compare this to a 1U server hosting 2-3
>>>> K10 cards = 4.6 Tflops and 320 GB/s each. 13.8 Tflops, 960 GB/s.
>>>> similar power dissipation.
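
The per-rack-unit comparison above is easy to reproduce. This is a
rough sketch using the speculative numbers from the quoted text (20
tiled units at 1 Tflop, 64 GB/s and 2 GB each) and taking the K10 at
roughly 4.6 Tflops single precision and 320 GB/s per card, with three
cards per 1U server. The function name is hypothetical.

```python
def rack_unit_totals(units, tflops_each, gb_per_s_each):
    """Aggregate peak Tflops and memory bandwidth (GB/s) for one rack unit."""
    return units * tflops_each, units * gb_per_s_each


# Speculative tiled APUs: 20 units of 60 W, 1 Tflop and 64 GB/s apiece.
tiled_tflops, tiled_bw = rack_unit_totals(20, 1.0, 64.0)
tiled_capacity_gb = 20 * 2  # 2 GB of in-package memory per socket

# 1U server with three K10 cards (upper end of the "2-3" in the text).
k10_tflops, k10_bw = rack_unit_totals(3, 4.6, 320.0)

print(f"tiled APUs: {tiled_tflops:.1f} Tflops, {tiled_bw:.0f} GB/s, "
      f"{tiled_capacity_gb} GB per RU")
print(f"3x K10:     {k10_tflops:.1f} Tflops, {k10_bw:.0f} GB/s per RU")
```

With these inputs the tiled design comes out ahead on both peak flops
and aggregate bandwidth per rack unit, but every tiled figure is a
guess, as the quoted text itself says.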
>>> _______________________________________________ Beowulf mailing  
>>> list, Beowulf at beowulf.org sponsored by Penguin Computing To  
>>> change your subscription (digest mode or unsubscribe) visit  
>>> http://www.beowulf.org/mailman/listinfo/beowulf