[Beowulf] gpgpu
Li, Bo
libo at buaa.edu.cn
Thu Aug 28 17:15:42 PDT 2008
Yes, the FireStream has great paper performance, but how can you actually get it?
As for cost: if you don't mind using non-professional components, you can try their gaming cards, which are much cheaper. We bought NVIDIA's last flagship card, the 8800 Ultra, for 600 Euro, which was a crazy price, and now you can buy two GTX 280s for less. If you can live with SP, you get 936 GFLOPS from each, and we have achieved 40% of that peak performance, which sounds good.
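As a rough sketch of what those figures imply (taking the 40% efficiency we measured as an assumption that may not carry over to other codes):

```python
# Rough effective-throughput estimate for two GTX 280 cards, assuming
# the ~40% of peak we measured holds (an assumption, not a guarantee
# for other kernels).
PEAK_SP_GFLOPS = 936.0   # single-precision peak per GTX 280
EFFICIENCY = 0.40        # fraction of peak achieved in our runs
NUM_CARDS = 2

effective = PEAK_SP_GFLOPS * EFFICIENCY * NUM_CARDS
print(f"Effective SP throughput: {effective:.0f} GFLOPS")  # ~749 GFLOPS
```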
Regards,
Li, Bo
----- Original Message -----
From: "Mikhail Kuzminsky" <kus at free.net>
To: "Li, Bo" <libo at buaa.edu.cn>
Cc: "Vincent Diepeveen" <diep at xs4all.nl>; <beowulf at beowulf.org>
Sent: Friday, August 29, 2008 1:52 AM
Subject: Re: [Beowulf] gpgpu
> In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008 14:20:15
> +0800):
>> ...
>>Currently, the DP performance of GPUs is not as good as we expected,
>>only 1/8 to 1/10 of the SP FLOPS. That is also a problem.
>
> AMD's data: FireStream 9170 SP performance is 5 GFLOPS/W vs. 1
> GFLOPS/W for DP, i.e. DP is 5 times slower than SP.
>
> Firestream 9250 has 1 TFLOPS for SP, therefore 1/5 is about 200 GFLOPS
> DP. The price will be, I suppose, about $2000 - as for 9170.
>
> Let me look at a modern dual-socket quad-core Beowulf node priced at
> about $4000+, for example. For the Opteron 2350/2 GHz chips (which I
> use), peak DP performance is 64 GFLOPS (8 cores). For 3 GHz Xeon
> chips - about 100 GFLOPS.
>
> Therefore GPGPU peak DP performance is 1.5-2 times higher than with
> CPUs. Is that enough for an essential calculation speedup - taking
> into account the time for data transmission to/from the GPU?
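The peak numbers above follow from the usual cores x clock x FLOPs-per-cycle arithmetic (4 DP FLOPs per core per cycle assumed here for both chips):

```python
# Peak DP GFLOPS = cores * GHz * DP FLOPs per core per cycle.
# 4 DP FLOPs/cycle is assumed for both the Barcelona Opteron and the
# Core-based Xeon; real sustained rates are lower.
def peak_dp_gflops(cores, ghz, flops_per_cycle=4):
    return cores * ghz * flops_per_cycle

opteron_node = peak_dp_gflops(cores=8, ghz=2.0)  # dual-socket quad 2 GHz
xeon_node    = peak_dp_gflops(cores=8, ghz=3.0)  # dual-socket quad 3 GHz
firestream   = 1000 / 5                          # 1 TFLOPS SP at 1/5 DP ratio

print(opteron_node, xeon_node, firestream)       # 64.0 96.0 200.0
```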
>
>>I would suggest hybrid computation platforms, with GPUs, CPUs, and
>>processors like ClearSpeed. It may be a good topic for a programming
>>model.
>
> ClearSpeed, if there is no new hardware by now, does not have enough
> DP performance in comparison with typical modern servers on quad-core
> CPUs.
>
> Yours
> Mikhail
>
>>Regards,
>>Li, Bo
>>----- Original Message -----
>>From: "Vincent Diepeveen" <diep at xs4all.nl>
>>To: "Li, Bo" <libo at buaa.edu.cn>
>>Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf"
>><beowulf at beowulf.org>
>>Sent: Thursday, August 28, 2008 12:22 AM
>>Subject: Re: [Beowulf] gpgpu
>>
>>
>>> Hi Bo,
>>>
>>> Thanks for your message.
>>>
>>> What library do I call to find primes?
>>>
>>> Currently it is searching here for primes (PRPs) of the form
>>>
>>> p = (2^n + 1) / 3
>>>
>>> where n is roughly 1.5 million bits as we speak.
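For small n, a PRP test on numbers of this form can be sketched in a few lines. This is a naive illustration only: base 3 is used because 2 has small order modulo divisors of 2^n + 1, and nothing here resembles the optimized FFT squaring discussed below.

```python
def wagstaff_prp(n):
    """Fermat PRP test, base 3, for p = (2**n + 1) // 3.

    A naive sketch: real searches at n ~ 1.5 million use FFT-based
    squaring with an implicit modulo, not Python's generic pow().
    """
    p = (2**n + 1) // 3
    return pow(3, p - 1, p) == 1

# n = 13 gives p = 2731 (prime); n = 9 gives p = 171 = 9 * 19 (composite)
print(wagstaff_prp(13), wagstaff_prp(9))  # True False
```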
>>>
>>> For SSE2-type processors there is George Woltman's assembler code
>>> (MiT) to do the squaring + implicit modulo;
>>> how do you plan to beat that kind of really optimized number
>>> crunching on a GPU?
>>>
>>> You'll have to figure out a way to find an instruction-level
>>> parallelism of at least 32,
>>> which also doesn't write to the same cache line, I *guess* (no
>>> documentation to verify that, in fact).
>>>
>>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes.
>>>
>>> In fact the first problem to solve is to do some sort of squaring
>>> really quickly.
>>>
>>> Even if you have figured that out on a PC, experience shows you're
>>> still losing a potential factor of 8,
>>> thanks to another zillion optimizations.
>>>
>>> You're not allowed to lose a factor of 8: the 52 GFLOPS a GPU can
>>> deliver on paper @ 250 W TDP (you can bet it will consume that
>>> when you make it work that hard) means the GPU effectively delivers
>>> less than 7 GFLOPS double precision, thanks to inefficient code.
>>>
>>> Additionally, remember the P4. On paper, the claim at its release
>>> was that it would be able to execute 4 integer instructions per
>>> cycle; in reality it was a processor with an IPC far under 1
>>> for most integer codes. All kinds of stuff ran badly on it.
>>>
>>> Experience shows the same holds for today's GPUs: the scientists
>>> who have run codes on them so far, and who are really experienced
>>> CUDA programmers, figured out that the speed they deliver is a very
>>> big disappointment.
>>>
>>> Additionally, 250 W TDP for massive number crunching is too much.
>>>
>>> It's well over a factor of 2 more than the power consumption of a
>>> quad-core. I can take a look soon in China myself at what power
>>> prices are over there, but I can assure you they will rise soon.
>>>
>>> And that is for a lot less performance than a quad-core delivers,
>>> at a TDP far under 100 W.
>>>
>>> Now I explicitly mention the n's I'm searching here, as they should
>>> fit within the caches.
>>> So with the very secret bandwidth you can achieve in practice (as
>>> we know, NVIDIA lobotomized the bandwidth in the consumer GPU
>>> cards; only the Tesla type seems not to be lobotomized),
>>> I'm not even teasing you about that.
>>>
>>> This is true for any type of code: you lose it in the details.
>>> Only custom-tailored solutions will work,
>>> simply because they're factors faster.
>>>
>>> Thanks,
>>> Vincent
>>>
>>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>>
>>>> Hello,
>>>> IMHO, it is better to call BLAS or a similar library rather
>>>> than programming your own functions. And CUDA treats the GPU as a
>>>> cluster, so .CU code does not work like our normal code. If you
>>>> have a lot of matrix or vector computation, it is better to use
>>>> Brook+/CAL, which can show the great power of AMD GPUs.
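The advice above (call a tuned library instead of hand-writing the functions) is easy to demonstrate even on the CPU side. Below is a minimal sketch with NumPy standing in for a BLAS-backed GEMM; it is an illustration only, not the Brook+/CAL or CUBLAS APIs themselves:

```python
import numpy as np

# NumPy's matmul dispatches to an optimized BLAS GEMM; the triple loop
# below is the "program your own function" alternative being advised
# against. Both give the same result, but BLAS is far faster.
def naive_matmul(a, b):
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for l in range(k):
                c[i, j] += a[i, l] * b[l, j]
    return c

rng = np.random.default_rng(0)
a, b = rng.random((32, 32)), rng.random((32, 32))
assert np.allclose(naive_matmul(a, b), a @ b)
```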
>>>> Regards,
>>>> Li, Bo
>>>> ----- Original Message -----
>>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>>> Subject: Re: [Beowulf] gpgpu
>>>>
>>>>
>>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>>>> 00:30:30 +0200):
>>>>>> Hi Mikhail,
>>>>>>
>>>>>> I'd say they're OK for black-box 32-bit calculations that can
>>>>>> do with a GB or 2 of RAM;
>>>>>> other than that, they're just luxurious electric heating.
>>>>>
>>>>> I also want a simple black box, but 64-bit (Tesla C1060 or
>>>>> FireStream 9170 or 9250). Unfortunately, life isn't restricted to
>>>>> BLAS/LAPACK/FFT :-)
>>>>>
>>>>> So I'll need to program something else. People say that the best
>>>>> choice is CUDA for NVIDIA. When I look at the sgemm source, it
>>>>> has about a thousand (or more) lines in the *.cu files. Therefore
>>>>> I think that a somewhat more difficult algorithm, such as some
>>>>> special matrix diagonalization, will require a lot of programming
>>>>> work :-(.
>>>>>
>>>>> It's interesting that when I read the FireStream Brook+ "kernel
>>>>> function" source example - for the addition of 2 vectors
>>>>> ("Building a High Level Language Compiler For GPGPU",
>>>>> Bixia Zheng (bixia.zheng at amd.com),
>>>>> Derek Gladding (dereked.gladding at amd.com),
>>>>> Micah Villmow (micah.villmow at amd.com),
>>>>> June 8th, 2008)
>>>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>>>> which were omitted from this example?
>>>>>
>>>>>
>>>>>> Vincent
>>>>>> p.s. if you ask me, honestly, 250 watt or so for the latest GPU
>>>>>> is really too much.
>>>>>
>>>>> 250 W is the TDP; the declared average value is about 160 W. I
>>>>> don't remember which GPU - from AMD or NVIDIA - has a lot of
>>>>> special functional units for sin/cos/exp/etc. If they are not
>>>>> used, maybe the power will be a bit lower.
>>>>>
>>>>> As for the FireStream 9250, AMD says about 150 W (although I'm
>>>>> not absolutely sure that it's the TDP) - the same as some Intel
>>>>> Xeon quad-core chips with names beginning with X.
>>>>>
>>>>> Mikhail
>>>>>
>>>>>
>>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>>
>>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>>> Taking into account that GPGPUs contain many (equal) execution
>>>>>>> units, I think it might be not SIMD but an SPMD model. Or does
>>>>>>> it depend on the software tools used (CUDA etc.)?
>>>>>>>
>>>>>>> Mikhail Kuzminsky
>>>>>>> Computer Assistance to Chemical Research Center
>>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>>> Moscow
>>>>>>> _______________________________________________
>>>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>