[Beowulf] gpgpu
Li, Bo
libo at buaa.edu.cn
Wed Aug 27 23:20:15 PDT 2008
Hi Vincent,
Yes, the libraries can't cover every calculation; they only accelerate certain kinds of computation on the GPU.
GPGPU is just a small step towards many-core architectures. It offers great power, but with serious weaknesses.
When a complex calculation can be arranged into many pieces and each piece can work independently, it runs well on a GPU; otherwise the GPU fails to pump its power.
Currently, the double-precision performance of GPUs is not as good as we expected, only about 1/8 to 1/10 of the single-precision flops. That is also a problem.
I would suggest hybrid computation platforms, combining GPU, CPU, and processors like ClearSpeed. Programming models for such platforms may be a good research topic.
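To illustrate the first point, here is a minimal, hypothetical CUDA sketch (names made up, error checking omitted) of the kind of work a GPU handles well: every thread operates on its own element, independently of all the others.

    // z[i] = a * x[i] + y[i]; each thread owns one element and no thread
    // depends on another, so the many cores can all be kept busy.
    __global__ void scale_add(const double *x, const double *y,
                              double *z, double a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = a * x[i] + y[i];  // in double precision this runs at
                                     // roughly 1/8-1/10 of the SP rate
    }

    // host side: scale_add<<<(n + 255) / 256, 256>>>(dx, dy, dz, 2.0, n);

When the calculation cannot be cut into independent pieces like this, the GPU fails to deliver its power.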
Regards,
Li, Bo
----- Original Message -----
From: "Vincent Diepeveen" <diep at xs4all.nl>
To: "Li, Bo" <libo at buaa.edu.cn>
Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf" <beowulf at beowulf.org>
Sent: Thursday, August 28, 2008 12:22 AM
Subject: Re: [Beowulf] gpgpu
> Hi Bo,
>
> Thanks for your message.
>
> What library do I call to find primes?
>
> Currently it's searching here for primes (PRPs) of the form p
> = (2^n + 1) / 3
>
> where n is about 1.5 million bits, roughly, as we speak.
>
> For SSE2-type processors there is George Woltman's assembler code
> (MiT) to do the squaring + implicit modulo;
> how do you plan to beat that kind of really optimized number crunching
> on a GPU?
>
> You'll have to figure out a way to find an instruction-level
> parallelism of at least 32,
> which also doesn't write to the same cache line, I *guess* (there is no
> documentation to verify that, in fact).
>
> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes
>
> In fact the first problem to solve is doing some sort of squaring
> really quickly.
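>
> (To make concrete what that means on a GPU - a rough, hypothetical CUDA
> sketch of naive schoolbook squaring, one thread per output column, with
> carry propagation and the reduction mod (2^n + 1) / 3 left to a later
> pass. Real codes like Woltman's use FFT-based multiplication instead;
> this only shows where the parallel work would have to come from.)
>
>     // a[] holds 16-bit limbs in 32-bit words, so column sums fit in 64 bits
>     __global__ void square_columns(const unsigned int *a,
>                                    unsigned long long *col, int limbs)
>     {
>         int k = blockIdx.x * blockDim.x + threadIdx.x;   // output column
>         if (k >= 2 * limbs - 1) return;
>         int lo = (k < limbs) ? 0 : k - limbs + 1;
>         int hi = (k < limbs) ? k : limbs - 1;
>         unsigned long long acc = 0;
>         for (int i = lo; i <= hi; ++i)       // col[k] = sum of a[i] * a[k-i]
>             acc += (unsigned long long)a[i] * a[k - i];
>         col[k] = acc;       // carries are propagated in a separate pass
>     }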
>
> Even if you figured that out, experience on the PC teaches that you're
> still losing a potential factor of 8,
> thanks to another zillion optimizations.
>
> You can't afford to lose a factor of 8. The 52 gflops a GPU can deliver
> on paper @ 250 watt TDP (you bet it will consume that
> when you make it work that hard) means the GPU effectively delivers less
> than 7 gflops double precision thanks to inefficient code.
>
> Additionally, remember the P4. On paper, the claim when it was released
> was that it would be able to execute 4 integer instructions a
> cycle; in reality it was a processor with an IPC far under 1
> for most integer codes. All kinds of stuff ran badly on it.
>
> Experience teaches that it's the same for today's GPUs: the
> scientists who have run codes on them so far, and who are really
> experienced CUDA programmers, found that the speed they deliver is a
> very big bummer.
>
> Additionally, a 250 watt TDP for massive number crunching is too much.
>
> It's well over a factor of 2 more power consumption than a quad-core. I can
> soon take a look in China myself at what the power prices
> are over there, but I can assure you they will rise soon.
>
> Now that's a lot less than what a quad-core delivers with a TDP far under
> 100 watts.
>
> Now I explicitly mention the n's I'm searching here, as they should fit
> within the caches.
> So I'm not even teasing you with the very secret bandwidth you can
> practically achieve (as we know, Nvidia lobotomized the
> bandwidth in the GPU cards; only the Tesla type seems not to be
> lobotomized).
>
> This is true for any type of code. You lose it in the details.
> Only custom-tailored solutions will work,
> simply because they're factors faster.
>
> Thanks,
> Vincent
>
> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>
>> Hello,
>> IMHO, it is better to call BLAS or a similar library rather
>> than programming your own functions. And CUDA treats the GPU as a
>> cluster, so .CU code does not work like our normal code. If you have
>> a lot of matrix or vector computation, it is better to use Brook+/
>> CAL, which can show the great power of AMD GPUs.
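>> For the Nvidia side, for instance, calling the SGEMM that ships with
>> CUBLAS rather than writing your own kernel looks roughly like this
>> (a sketch assuming the legacy CUBLAS interface; matrix names are made
>> up and error checking is omitted):
>>
>>     #include <cublas.h>
>>
>>     /* C = A * B for n x n single-precision matrices, column-major */
>>     void gemm_on_gpu(int n, const float *A, const float *B, float *C)
>>     {
>>         float *dA, *dB, *dC;
>>         cublasInit();
>>         cublasAlloc(n * n, sizeof(float), (void **)&dA);
>>         cublasAlloc(n * n, sizeof(float), (void **)&dB);
>>         cublasAlloc(n * n, sizeof(float), (void **)&dC);
>>         cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
>>         cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
>>         cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
>>         cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
>>         cublasFree(dA); cublasFree(dB); cublasFree(dC);
>>         cublasShutdown();
>>     }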
>> Regards,
>> Li, Bo
>> ----- Original Message -----
>> From: "Mikhail Kuzminsky" <kus at free.net>
>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>> Cc: "Beowulf" <beowulf at beowulf.org>
>> Sent: Wednesday, August 27, 2008 2:35 AM
>> Subject: Re: [Beowulf] gpgpu
>>
>>
>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>> 00:30:30 +0200):
>>>> Hi Mikhail,
>>>>
>>>> I'd say they're OK for black-box 32-bit calculations that can make do
>>>> with
>>>> a GB or 2 of RAM;
>>>> other than that, they're just luxurious electric heating.
>>>
>>> I also want to have a simple black box, but 64-bit (Tesla C1060 or
>>> FireStream 9170 or 9250). Unfortunately, life isn't restricted to
>>> BLAS/LAPACK/FFT :-)
>>>
>>> So I'll need to program something else. People say that the best
>>> choice for Nvidia is CUDA. When I look at the sgemm source, it has
>>> about a
>>> thousand (or more) lines in *.cu files. Therefore I think that a
>>> somewhat more difficult algorithm, such as some special matrix
>>> diagonalization, will require a lot of programming work :-(.
>>>
>>> It's interesting that when I read the FireStream Brook+ "kernel
>>> function"
>>> source example - for the addition of 2 vectors ("Building a High Level
>>> Language Compiler For GPGPU",
>>> Bixia Zheng (bixia.zheng at amd.com)
>>> Derek Gladding (dereked.gladding at amd.com)
>>> Micah Villmow (micah.villmow at amd.com)
>>> June 8th, 2008)
>>>
>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>> which were omitted from this example?
>>>
>>>
>>>> Vincent
>>>> P.S. If you ask me, honestly, 250 watts or so for the latest GPU is
>>>> really
>>>> too much.
>>>
>>> 250 W is the TDP; the declared average value is about 160 W. I don't
>>> remember which GPU - from AMD or Nvidia - has a lot of special
>>> functional units for sin/cos/exp/etc. If they are not used, maybe
>>> the
>>> power will be a bit lower.
>>>
>>> As for the FireStream 9250, AMD says about 150 W (although I'm not
>>> absolutely sure that it's the TDP) - the same as for some
>>> Intel Xeon quad-core chips with names beginning with X.
>>>
>>> Mikhail
>>>
>>>
>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>
>>>>> BTW, why are GPGPUs considered vector systems?
>>>>> Taking into account that GPGPUs contain many (identical) execution
>>>>> units,
>>>>> I think it might be not the SIMD but the SPMD model. Or does it
>>>>> depend on the software tools used (CUDA etc.)?
>>>>>
>>>>> Mikhail Kuzminsky
>>>>> Computer Assistance to Chemical Research Center
>>>>> Zelinsky Institute of Organic Chemistry
>>>>> Moscow
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>