[Beowulf] gpgpu

Mikhail Kuzminsky kus at free.net
Fri Aug 29 05:01:13 PDT 2008


In message from "Li, Bo" <libo at buaa.edu.cn> (Fri, 29 Aug 2008 08:15:42 
+0800):
>Yes, Firestream has great performance on paper, but how can you 
>actually get one?
>As for the cost, if you don't mind using consumer-grade components, 
>you can try their gaming cards, which are much cheaper. We bought 
>Nvidia's last flagship card, the 8800 Ultra, for 600 Euro, which is a 
>crazy price, and now you can buy two GTX280s for less. If you can live 
>with SP, you will get 936 GFLOPS from each, and we have achieved 40% 
>of that peak performance, which sounds good.

But what percentage of peak can you get on an x86 CPU?
If it's something like sgemm, then it doesn't look too attractive to me 
:-( :
on an ordinary x86 CPU I can obtain about 90% of peak performance, and 
the DP performance difference between Xeon/Opteron CPUs and a GPU is 
not that large :-( 

Mikhail


>Regards,
>Li, Bo
>----- Original Message ----- 
>From: "Mikhail Kuzminsky" <kus at free.net>
>To: "Li, Bo" <libo at buaa.edu.cn>
>Cc: "Vincent Diepeveen" <diep at xs4all.nl>; <beowulf at beowulf.org>
>Sent: Friday, August 29, 2008 1:52 AM
>Subject: Re: [Beowulf] gpgpu
>
>
>> In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008 
>>14:20:15 
>> +0800):
>>> ...
>>>Currently, the DP performance of GPUs is not as good as we expected, 
>>>only 1/8 to 1/10 of the SP FLOPS. That is also a problem.
>> 
>> According to AMD data, Firestream 9170 SP performance is 5 GFLOPS/W 
>> vs. 1 GFLOPS/W for DP, i.e. DP is 5 times slower than SP.
>> 
>> The Firestream 9250 has 1 TFLOPS for SP, so 1/5 of that is about 
>> 200 GFLOPS DP. The price will be, I suppose, about $2000, as for 
>> the 9170.
>> 
>> Let me look at a modern dual-socket quad-core Beowulf node priced at 
>> about $4000+, for example. For the Opteron 2350/2 GHz chips I use, 
>> peak DP performance is 64 GFLOPS (8 cores); for 3 GHz Xeon chips, 
>> about 100 GFLOPS. 
>> 
>> Therefore GPGPU peak DP performance is only 1.5-2 times higher than 
>> with CPUs. Is that enough for a substantial calculation speedup, 
>> taking into account the time for data transmission to/from the GPU?
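>> 
>> A rough estimate of that transfer overhead for a dgemm-sized problem 
>> (the 4 GB/s PCIe rate and the 200 GFLOPS DP rate are assumed round 
>> numbers for illustration only; host-only, compiles with gcc or nvcc):
>> 
>> /* When does PCIe traffic stay small next to GPU dgemm time? */
>> #include <stdio.h>
>> 
>> int main(void)
>> {
>>     double n         = 4096.0;              /* matrix dimension            */
>>     double bytes     = 3.0 * n * n * 8.0;   /* A, B, C in double precision */
>>     double flops     = 2.0 * n * n * n;     /* dgemm operation count       */
>>     double pcie_bps  = 4.0e9;               /* assumed PCIe bandwidth, B/s */
>>     double gpu_flops = 200.0e9;             /* assumed DP peak, FLOP/s     */
>> 
>>     printf("transfer %.2f s, compute %.2f s\n",
>>            bytes / pcie_bps, flops / gpu_flops);   /* ~0.10 s vs ~0.69 s */
>>     return 0;
>> }
>> 
>> For large dense matrices the compute time dominates, but for smaller 
>> problems or lower arithmetic intensity the transfers can easily eat 
>> the 1.5-2x advantage.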
>> 
>>>I would suggest hybrid computation platforms, with GPUs, CPUs, and 
>>>processors like Clearspeed. That could be a good topic for a 
>>>programming model.
>> 
>> Clearspeed, unless there is new hardware now, does not have enough DP 
>> performance in comparison with typical modern servers built on 
>> quad-core CPUs.
>> 
>> Yours
>> Mikhail 
>> 
>>>Regards,
>>>Li, Bo
>>>----- Original Message ----- 
>>>From: "Vincent Diepeveen" <diep at xs4all.nl>
>>>To: "Li, Bo" <libo at buaa.edu.cn>
>>>Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf" 
>>><beowulf at beowulf.org>
>>>Sent: Thursday, August 28, 2008 12:22 AM
>>>Subject: Re: [Beowulf] gpgpu
>>>
>>>
>>>> Hi Bo,
>>>> 
>>>> Thanks for your message.
>>>> 
>>>> What library do I call to find primes?
>>>> 
>>>> Currently it's searching here for primes (PRP's) of the form 
>>>> p = (2^n + 1) / 3
>>>> 
>>>> n here is about 1.5 million bits, roughly, as we speak.
>>>> 
>>>> For SSE2-type processors there is the George Woltman assembler code 
>>>> (MiT) to do the squaring + implicit modulo;
>>>> how do you plan to beat that kind of really optimized number 
>>>> crunching on a GPU?
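>>>> 
>>>> To make concrete what that crunching is: a PRP test of this form is 
>>>> essentially n modular squarings in a row. A toy sketch with 64-bit 
>>>> words (the real search replaces mulmod with FFT-based squaring of 
>>>> 1.5-million-bit numbers, which is what the Woltman code optimizes):
>>>> 
>>>> /* Fermat 3-PRP test: p passes if 3^(p-1) == 1 (mod p). Uses the
>>>>    gcc/clang __uint128_t extension for the 128-bit intermediate. */
>>>> #include <stdio.h>
>>>> #include <stdint.h>
>>>> 
>>>> static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
>>>> {
>>>>     return (uint64_t)((__uint128_t)a * b % m);   /* (a*b) mod m */
>>>> }
>>>> 
>>>> static int is_prp(uint64_t p)
>>>> {
>>>>     uint64_t result = 1, base = 3 % p, e = p - 1;
>>>>     while (e) {
>>>>         if (e & 1)
>>>>             result = mulmod(result, base, p);
>>>>         base = mulmod(base, base, p);            /* repeated squaring */
>>>>         e >>= 1;
>>>>     }
>>>>     return result == 1;
>>>> }
>>>> 
>>>> int main(void)
>>>> {
>>>>     /* 2^31 - 1 is prime, 2^31 + 1 is not: prints "1 0" */
>>>>     printf("%d %d\n", is_prp(2147483647ULL), is_prp(2147483649ULL));
>>>>     return 0;
>>>> }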
>>>> 
>>>> You'll have to figure out a way to find instruction-level 
>>>> parallelism of at least 32, which also doesn't write to the same 
>>>> cache line, I *guess* (no documentation to verify that, in fact).
>>>> 
>>>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes.
>>>> 
>>>> In fact the first problem to solve is to do some sort of squaring 
>>>> really quickly.
>>>> 
>>>> Even if you figure that out on a PC, experience shows you're still 
>>>> losing a potential factor of 8 to another zillion optimizations.
>>>> 
>>>> You're not allowed to lose a factor of 8. The 52 GFLOPS a GPU can 
>>>> deliver on paper at 250 W TDP (you bet it will consume that when 
>>>> you make it work that hard) means the GPU effectively delivers less 
>>>> than 7 GFLOPS double precision, thanks to inefficient code.
>>>> 
>>>> Additionally, remember the P4. On paper, the claim at its release 
>>>> was that it would be able to execute 4 integer instructions per 
>>>> cycle; in reality it was a processor with an IPC far under 1 for 
>>>> most integer codes. All kinds of code performed badly on it.
>>>> 
>>>> Experience shows it's the same for today's GPUs: the scientists who 
>>>> have run codes on them so far, and who are really experienced CUDA 
>>>> programmers, found that the speed they deliver is a very big 
>>>> bummer.
>>>> 
>>>> Additionally, 250 W TDP for massive number crunching is too much.
>>>> 
>>>> It's well over twice the power consumption of a quad-core. I can 
>>>> soon take a look in China myself at what power prices are over 
>>>> there, but I can assure you they will rise soon.
>>>> 
>>>> And that effective throughput is a lot less than what a quad-core 
>>>> delivers with a TDP far under 100 W.
>>>> 
>>>> Now I explicitly mention the n's I'm searching here, as the numbers 
>>>> should fit within the caches. So I'm not even teasing you with the 
>>>> very secret bandwidth you can practically achieve (as we know, 
>>>> Nvidia lobotomized the bandwidth in the consumer GPU cards; only 
>>>> the Tesla type seems not to be lobotomized).
>>>> 
>>>> This is true for any type of code: you lose it in the details. 
>>>> Only custom-tailored solutions will work, simply because they're 
>>>> factors faster.
>>>> 
>>>> Thanks,
>>>> Vincent
>>>> 
>>>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>>> 
>>>>> Hello,
>>>>> IMHO, it is better to call BLAS or a similar library rather than 
>>>>> programming your own functions. And CUDA treats the GPU as a 
>>>>> cluster, so .cu code does not work like our normal code. If you 
>>>>> have a lot of matrix or vector computation, it is better to use 
>>>>> Brook+/CAL, which can show the great power of AMD GPUs.
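>>>>> 
>>>>> As a minimal sketch of what calling the library looks like on the 
>>>>> Nvidia side (this assumes the legacy cublas.h interface; error 
>>>>> checking omitted for brevity):
>>>>> 
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <cublas.h>
>>>>> 
>>>>> int main(void)
>>>>> {
>>>>>     const int n = 512;
>>>>>     float *a = (float *)malloc(n * n * sizeof(float));
>>>>>     float *b = (float *)malloc(n * n * sizeof(float));
>>>>>     float *c = (float *)malloc(n * n * sizeof(float));
>>>>>     for (int i = 0; i < n * n; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }
>>>>> 
>>>>>     cublasInit();
>>>>>     float *da, *db, *dc;
>>>>>     cublasAlloc(n * n, sizeof(float), (void **)&da);
>>>>>     cublasAlloc(n * n, sizeof(float), (void **)&db);
>>>>>     cublasAlloc(n * n, sizeof(float), (void **)&dc);
>>>>>     cublasSetMatrix(n, n, sizeof(float), a, n, da, n);
>>>>>     cublasSetMatrix(n, n, sizeof(float), b, n, db, n);
>>>>> 
>>>>>     /* C = 1.0 * A * B + 0.0 * C, column-major storage */
>>>>>     cublasSgemm('N', 'N', n, n, n, 1.0f, da, n, db, n, 0.0f, dc, n);
>>>>> 
>>>>>     cublasGetMatrix(n, n, sizeof(float), dc, n, c, n);
>>>>>     printf("c[0] = %f\n", c[0]);    /* every entry is 2 * 512 = 1024 */
>>>>> 
>>>>>     cublasFree(da); cublasFree(db); cublasFree(dc);
>>>>>     cublasShutdown();
>>>>>     free(a); free(b); free(c);
>>>>>     return 0;
>>>>> }
>>>>> 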
>>>>> Regards,
>>>>> Li, Bo
>>>>> ----- Original Message -----
>>>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>>>> Subject: Re: [Beowulf] gpgpu
>>>>>
>>>>>
>>>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>>>>> 00:30:30 +0200):
>>>>>>> Hi Mikhail,
>>>>>>>
>>>>>>> I'd say they're OK for black-box 32-bit calculations that can 
>>>>>>> make do with a GB or 2 of RAM; other than that, they're just 
>>>>>>> luxurious electric heating.
>>>>>>
>>>>>> I also want to have a simple black box, but 64-bit (Tesla C1060 or
>>>>>> Firestream 9170 or 9250). Unfortunately life isn't restricted to
>>>>>> BLAS/LAPACK/FFT :-)
>>>>>>
>>>>>> So I'll need to program something else. People say that the best
>>>>>> choice for Nvidia is CUDA. When I look at the sgemm source, it has
>>>>>> about a thousand lines (or more) in the *.cu files. Therefore I
>>>>>> think that a somewhat more difficult algorithm, such as a special
>>>>>> matrix diagonalization, will require a lot of programming work :-(.
>>>>>>
>>>>>> It's interesting that when I read the Firestream Brook+ "kernel
>>>>>> function" source example for the addition of 2 vectors ("Building
>>>>>> a High Level Language Compiler For GPGPU",
>>>>>> Bixia Zheng (bixia.zheng at amd.com)
>>>>>> Derek Gladding (dereked.gladding at amd.com)
>>>>>> Micah Villmow (micah.villmow at amd.com)
>>>>>> June 8th, 2008)
>>>>>>
>>>>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>>>>> which were omitted from this example?
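>>>>>> 
>>>>>> For comparison, here is roughly what the same two-vector addition
>>>>>> looks like in CUDA once the host-side allocation, copies and kernel
>>>>>> launch are written out - a generic textbook sketch, not the source
>>>>>> of the AMD example:
>>>>>> 
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <cuda_runtime.h>
>>>>>> 
>>>>>> __global__ void vadd(const float *a, const float *b, float *c, int n)
>>>>>> {
>>>>>>     int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
>>>>>>     if (i < n)
>>>>>>         c[i] = a[i] + b[i];
>>>>>> }
>>>>>> 
>>>>>> int main(void)
>>>>>> {
>>>>>>     const int n = 1 << 20;
>>>>>>     size_t bytes = n * sizeof(float);
>>>>>>     float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes);
>>>>>>     float *c = (float *)malloc(bytes);
>>>>>>     for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
>>>>>> 
>>>>>>     float *da, *db, *dc;
>>>>>>     cudaMalloc((void **)&da, bytes);
>>>>>>     cudaMalloc((void **)&db, bytes);
>>>>>>     cudaMalloc((void **)&dc, bytes);
>>>>>>     cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
>>>>>>     cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
>>>>>> 
>>>>>>     vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
>>>>>> 
>>>>>>     cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
>>>>>>     printf("c[0] = %f\n", c[0]);    /* expect 3.0 */
>>>>>> 
>>>>>>     cudaFree(da); cudaFree(db); cudaFree(dc);
>>>>>>     free(a); free(b); free(c);
>>>>>>     return 0;
>>>>>> }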
>>>>>>
>>>>>>
>>>>>>> Vincent
>>>>>>> P.S. If you ask me, honestly, 250 watts or so for the latest GPU
>>>>>>> is really too much.
>>>>>>
>>>>>> 250 W is the TDP; the declared average value is about 160 W. I
>>>>>> don't remember which GPU - AMD's or Nvidia's - has a lot of special
>>>>>> functional units for sin/cos/exp/etc. If they are not used, maybe
>>>>>> the power will be a bit lower.
>>>>>>
>>>>>> As for the Firestream 9250, AMD says about 150 W (although I'm not
>>>>>> absolutely sure that it's the TDP) - that's the same as for some
>>>>>> Intel Xeon quad-core chips whose names begin with X.
>>>>>>
>>>>>> Mikhail
>>>>>>
>>>>>>
>>>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>>>
>>>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>>>> Taking into account that GPGPUs contain many (identical) execution
>>>>>>>> units, I think it might be not a SIMD but an SPMD model. Or does
>>>>>>>> it depend on the software tools used (CUDA etc.)?
>>>>>>>>
>>>>>>>> Mikhail Kuzminsky
>>>>>>>> Computer Assistance to Chemical Research Center
>>>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>>>> Moscow
>>>>>>>> _______________________________________________
>>>>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>>> To change your subscription (digest mode or unsubscribe) visit  
>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>
>>>>>
>>>>
>>



