[Beowulf] gpgpu
Mikhail Kuzminsky
kus at free.net
Thu Aug 28 10:52:23 PDT 2008
In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008 14:20:15
+0800):
> ...
>Currently, the DP performance of GPUs is not as good as we expected -
>only 1/8 to 1/10 of SP FLOPS. It is also a problem.
AMD data: FireStream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W
for DP, i.e. DP is 5 times slower than SP.
The FireStream 9250 has 1 TFLOPS SP, therefore 1/5 of that is about
200 GFLOPS DP. The price will be, I suppose, about $2000 - as for the
9170.
Let me look at a modern dual-socket quad-core Beowulf node priced at
about $4000+, for example. For the Opteron 2350/2 GHz chips I use, peak
DP performance is 64 GFLOPS (8 cores x 4 DP flops/cycle x 2 GHz). For
3 GHz Xeon chips it is about 100 GFLOPS.
Therefore GPGPU peak DP performance is only 1.5-2 times higher than
with CPUs. Is that enough for an essential calculation speedup, taking
into account the time for data transmission to/from the GPU?
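A rough back-of-the-envelope model (a sketch only: the PCIe rate, the
DGEMM size and the peak numbers below are illustrative assumptions,
not measurements):

/* Compare one 4096^3 DGEMM on the CPU vs. on the GPU, including the
 * PCIe transfer of A, B and C.  Illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    double flops      = 2.0 * 4096.0 * 4096.0 * 4096.0;
    double bytes      = 3.0 * 4096.0 * 4096.0 * 8.0; /* 3 double matrices */
    double cpu_gflops = 64.0;    /* dual Opteron 2350 peak DP, as above  */
    double gpu_gflops = 200.0;   /* FireStream 9250 peak DP, as above    */
    double pcie_gbs   = 4.0;     /* assumed effective PCIe x16 bandwidth */

    double t_cpu = flops / (cpu_gflops * 1e9);
    double t_gpu = flops / (gpu_gflops * 1e9) + bytes / (pcie_gbs * 1e9);

    printf("CPU %.2f s, GPU+transfer %.2f s, speedup %.2fx\n",
           t_cpu, t_gpu, t_cpu / t_gpu);
    return 0;
}

For such a compute-bound kernel the transfer cost eats only part of
the speedup; for anything bandwidth-bound the picture gets much worse.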
>I would suggest hybrid computation platforms, with GPU, CPU, and
>processors like ClearSpeed. It may be a good topic for a programming
>model.
ClearSpeed, unless there is new hardware now, doesn't have enough DP
performance in comparison with typical modern servers on quad-core CPUs.
Yours
Mikhail
>Regards,
>Li, Bo
>----- Original Message -----
>From: "Vincent Diepeveen" <diep at xs4all.nl>
>To: "Li, Bo" <libo at buaa.edu.cn>
>Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf"
><beowulf at beowulf.org>
>Sent: Thursday, August 28, 2008 12:22 AM
>Subject: Re: [Beowulf] gpgpu
>
>
>> Hi Bo,
>>
>> Thanks for your message.
>>
>> What library do I call to find primes?
>>
>> Currently it's searching here for primes (PRPs) of the form
>> p = (2^n + 1) / 3,
>> where n is about 1.5 million bits, roughly, as we speak.
>>
>> For SSE2-type processors there is George Woltman's assembler code
>> (MIT) to do the squaring + implicit modulo;
>> how do you plan to beat that kind of really optimized number
>> crunching on a GPU?
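>> For the curious: the PRP test here boils down to a Fermat test on
>> N = (2^n + 1) / 3. A naive GMP sketch of it (nothing like the
>> optimized FFT squaring with implicit modulo, just the bare operation):
>>
>> /* Naive Fermat PRP test for Wagstaff numbers N = (2^n + 1)/3. */
>> #include <stdio.h>
>> #include <gmp.h>
>>
>> static int is_wagstaff_prp(unsigned long n)   /* n must be odd */
>> {
>>     mpz_t N, base, r;
>>     mpz_init(N); mpz_init_set_ui(base, 3); mpz_init(r);
>>
>>     mpz_ui_pow_ui(N, 2, n);     /* N = 2^n                       */
>>     mpz_add_ui(N, N, 1);        /* N = 2^n + 1                   */
>>     mpz_divexact_ui(N, N, 3);   /* N = (2^n + 1)/3, exact: n odd */
>>
>>     mpz_sub_ui(r, N, 1);
>>     mpz_powm(r, base, r, N);    /* r = 3^(N-1) mod N             */
>>
>>     int prp = (mpz_cmp_ui(r, 1) == 0);
>>     mpz_clear(N); mpz_clear(base); mpz_clear(r);
>>     return prp;
>> }
>>
>> int main(void)
>> {
>>     /* n = 43 gives the known Wagstaff prime (2^43 + 1)/3 */
>>     printf("n = 43: %s\n", is_wagstaff_prp(43) ? "PRP" : "composite");
>>     return 0;
>> }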
>>
>> You'll have to figure out a way to find an instruction-level
>> parallelism of at least 32,
>> which also doesn't write to the same cache line, I *guess* (no
>> documentation to verify that, in fact).
>>
>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes.
>>
>> In fact the first problem to solve is to do some sort of squaring
>> really quickly.
>>
>> If you figured that out on a PC, experience teaches that you're
>> still losing a potential factor of 8,
>> thanks to another zillion optimizations.
>>
>> You're not allowed to lose a factor of 8: the 52 GFLOPS a GPU can
>> deliver on paper @ 250 watt TDP (you bet it will consume that when
>> you make it work so hard) means the GPU effectively delivers less
>> than 7 GFLOPS double precision, thanks to inefficient code.
>>
>> Additionally, remember the P4. On paper, the claim at its release
>> was that it would be able to execute 4 integer instructions a
>> cycle; in reality it was a processor with an IPC far under 1
>> for most integer codes. All kinds of stuff sucked on it.
>>
>> Experience teaches that this is the same for today's GPUs: the
>> scientists who have run codes on them so far and are really
>> experienced CUDA programmers figured out that the speed they
>> deliver is a very big bummer.
>>
>> Additionally, 250 watt TDP for massive number crunching is too
>> much; it's well over a factor of 2 of the power consumption of a
>> quad-core. Now I can soon take a look in China myself at what power
>> prices are over there, but I can assure you they will rise soon.
>>
>> And that effective throughput is a lot less than what a quad-core
>> delivers with a TDP far under 100 watt.
>>
>> Now I explicitly mention the n's I'm searching here, as they should
>> fit within caches.
>> So I'm not even teasing you with the very secret bandwidth you can
>> practically achieve (as we know, Nvidia lobotomized the bandwidth
>> in the consumer GPU cards; only the Tesla type seems not to be
>> lobotomized).
>>
>> This is true for any type of code: you're losing it to the details.
>> Only custom-tailored solutions will work,
>> simply because they're factors faster.
>>
>> Thanks,
>> Vincent
>>
>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>
>>> Hello,
>>> IMHO, it is better to call BLAS or a similar library rather
>>> than program your own functions. And CUDA treats the GPU as a
>>> cluster, so .cu code doesn't work like our normal code. If you
>>> have many matrix or vector computations, it is better to use
>>> Brook+/CAL, which can show the great power of AMD GPUs.
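>>> For example, a minimal sketch with the legacy CUBLAS interface
>>> (matrix size illustrative, error checking omitted):
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <cublas.h>
>>>
>>> int main(void)
>>> {
>>>     const int n = 512;
>>>     size_t bytes = (size_t)n * n * sizeof(float);
>>>     float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes);
>>>     float *hc = (float *)malloc(bytes);
>>>     for (int i = 0; i < n * n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }
>>>
>>>     cublasInit();
>>>     float *da, *db, *dc;                    /* device pointers */
>>>     cublasAlloc(n * n, sizeof(float), (void **)&da);
>>>     cublasAlloc(n * n, sizeof(float), (void **)&db);
>>>     cublasAlloc(n * n, sizeof(float), (void **)&dc);
>>>     cublasSetMatrix(n, n, sizeof(float), ha, n, da, n);
>>>     cublasSetMatrix(n, n, sizeof(float), hb, n, db, n);
>>>
>>>     /* C = A*B, column-major, no transposes */
>>>     cublasSgemm('N', 'N', n, n, n, 1.0f, da, n, db, n, 0.0f, dc, n);
>>>
>>>     cublasGetMatrix(n, n, sizeof(float), dc, n, hc, n);
>>>     printf("c[0] = %g (expect %g)\n", hc[0], 2.0f * n);
>>>     cublasFree(da); cublasFree(db); cublasFree(dc);
>>>     cublasShutdown();
>>>     return 0;
>>> }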
>>> Regards,
>>> Li, Bo
>>> ----- Original Message -----
>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>> Subject: Re: [Beowulf] gpgpu
>>>
>>>
>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>>> 00:30:30 +0200):
>>>>> Hi Mikhail,
>>>>>
>>>>> I'd say they're OK for black-box 32-bit calculations that can
>>>>> make do with a GB or 2 of RAM;
>>>>> other than that, they're just luxurious electric heating.
>>>>
>>>> I also want to have a simple black box, but 64-bit (Tesla C1060 or
>>>> FireStream 9170 or 9250). Unfortunately, life isn't restricted to
>>>> BLAS/LAPACK/FFT :-)
>>>>
>>>> So I'll need to program something else. People say that the best
>>>> choice is CUDA for Nvidia. When I look at the sgemm source, it has
>>>> about a thousand lines (or more) in *.cu files. Therefore I think
>>>> that a somewhat more difficult algorithm, such as some special
>>>> matrix diagonalization, will require a lot of programming work :-(.
>>>>
>>>> It's interesting that when I read the FireStream Brook+ "kernel
>>>> function" source example - for the addition of 2 vectors ("Building
>>>> a High Level Language Compiler For GPGPU",
>>>> Bixia Zheng (bixia.zheng at amd.com)
>>>> Derek Gladding (dereked.gladding at amd.com)
>>>> Micah Villmow (micah.villmow at amd.com)
>>>> June 8th, 2008)
>>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>>> which were omitted from this example?
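>>>> For comparison, the CUDA analogue really is that short on the
>>>> kernel side; what such examples usually leave out is the host-side
>>>> boilerplate. A minimal sketch, sizes illustrative:
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <cuda_runtime.h>
>>>>
>>>> __global__ void vadd(const float *a, const float *b, float *c, int n)
>>>> {
>>>>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>>>>     if (i < n)
>>>>         c[i] = a[i] + b[i];      /* the whole "kernel function" */
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     const int n = 1 << 20;
>>>>     size_t bytes = n * sizeof(float);
>>>>     float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes);
>>>>     float *hc = (float *)malloc(bytes);
>>>>     for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }
>>>>
>>>>     /* the part the slides omit: allocate, copy, launch, copy back */
>>>>     float *da, *db, *dc;
>>>>     cudaMalloc((void **)&da, bytes);
>>>>     cudaMalloc((void **)&db, bytes);
>>>>     cudaMalloc((void **)&dc, bytes);
>>>>     cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
>>>>     cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
>>>>     vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
>>>>     cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
>>>>
>>>>     printf("hc[100] = %g (expect 300)\n", hc[100]);
>>>>     cudaFree(da); cudaFree(db); cudaFree(dc);
>>>>     return 0;
>>>> }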
>>>>
>>>>
>>>>> Vincent
>>>>> p.s. If you ask me, honestly, 250 watt or so for the latest GPU
>>>>> is really too much.
>>>>
>>>> 250 W is the TDP; the declared average value is about 160 W. I
>>>> don't remember which GPU - from AMD or Nvidia - has a lot of
>>>> special functional units for sin/cos/exp/etc. If they are not
>>>> used, maybe the power will be a bit lower.
>>>>
>>>> As for the FireStream 9250, AMD says about 150 W (although I'm not
>>>> absolutely sure that it's TDP) - the same as for some Intel Xeon
>>>> quad-core chips with names beginning with X.
>>>>
>>>> Mikhail
>>>>
>>>>
>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>
>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>> Taking into account that GPGPUs contain many (identical)
>>>>>> execution units,
>>>>>> I think it might be not a SIMD but an SPMD model. Or does it
>>>>>> depend on the software tools used (CUDA etc.)?
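>>>>>> To make the question concrete, a hypothetical CUDA kernel: the
>>>>>> source is written SPMD-style (each thread takes its own branch),
>>>>>> while the hardware runs threads in SIMD groups (warps) and
>>>>>> serializes divergent branches with masking:
>>>>>>
>>>>>> __global__ void spmd_example(const float *x, float *y, int n)
>>>>>> {
>>>>>>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>>>>>>     if (i >= n)
>>>>>>         return;
>>>>>>     if (x[i] > 0.0f)        /* threads of one warp may disagree */
>>>>>>         y[i] = 2.0f * x[i]; /* the warp runs this path first... */
>>>>>>     else
>>>>>>         y[i] = -x[i];       /* ...then this one, serially       */
>>>>>> }
>>>>>>
>>>>>> (launched e.g. as spmd_example<<<(n + 255) / 256, 256>>>(x, y, n),
>>>>>> with host setup as in the vector-addition sketch above)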
>>>>>>
>>>>>> Mikhail Kuzminsky
>>>>>> Computer Assistance to Chemical Research Center
>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>> Moscow
>>>>>>
>>>>>
>>>>
>>>
>>>
>>