[Beowulf] gpgpu
Bruno Coutinho
coutinho at dcc.ufmg.br
Fri Aug 29 08:04:18 PDT 2008
In this article:
http://arstechnica.com/news.ars/post/20080430-ps3s-cell-cpu-tops-high-performance-computing-benchmark.html
they obtained at most 30% of peak performance on x86 processors, while on Cell
and Niagara 2 they obtained about 60% of peak.
It seems that for memory-intensive codes, the processor must have massive
memory bandwidth to get anywhere near peak performance.
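
A quick back-of-the-envelope way to see this is a roofline-style estimate:
attainable FLOP/s are capped by min(peak, bandwidth * flops-per-byte). A
minimal sketch in C (the peak and bandwidth numbers below are illustrative
assumptions, not measurements from the article):

#include <stdio.h>

/* Roofline-style bound: a kernel cannot exceed either the peak FLOP
 * rate or (memory bandwidth) * (flops per byte of traffic). */
static double attainable(double peak_gflops, double bw_gbs,
                         double flops_per_byte)
{
    double mem_bound = bw_gbs * flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    /* daxpy (y = a*x + y): 2 flops per 24 bytes of DP traffic. */
    double ai = 2.0 / 24.0;

    /* Two hypothetical machines: same peak, different bandwidth. */
    printf("10 GB/s machine: %.2f GFLOP/s attainable\n",
           attainable(80.0, 10.0, ai));
    printf("25 GB/s machine: %.2f GFLOP/s attainable\n",
           attainable(80.0, 25.0, ai));
    return 0;
}

For a low-intensity kernel like that, both machines sit on the memory-bound
side of the roofline, so the machine with more bandwidth per flop gets
closer to its peak.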
2008/8/29 Mikhail Kuzminsky <kus at free.net>
> In message from "Li, Bo" <libo at buaa.edu.cn> (Fri, 29 Aug 2008 08:15:42
> +0800):
>
>> Yes, FireStream has great paper performance, but how much of it can you
>> actually get? As for cost, if you don't mind using non-professional
>> components, you can try their gaming cards, which are much cheaper. We bought
>> NVidia's last flagship card, the 8800 Ultra, for 600 Euro, which was a crazy
>> price, and now you can buy two GTX 280s for less. If you can live with SP,
>> you get 936 GFLOPS from each card. We have achieved 40% of that peak
>> performance, which sounds good.
>>
>
> But what percentage of peak can you get on an x86 CPU?
> If the workload is something like sgemm, then it doesn't look too attractive
> to me :-( : on an ordinary x86 I can obtain about 90% of peak performance,
> and on DP the performance difference between Xeon/Opteron CPUs and a GPU is
> not too high :-(
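>
> (That 90% figure is what a well-tuned BLAS delivers; a minimal sketch of the
> kind of call being measured, assuming a CBLAS implementation such as
> GotoBLAS, ATLAS or MKL is installed, with the matrix size picked
> arbitrarily:)
>
> #include <cblas.h>
> #include <stdlib.h>
>
> int main(void)
> {
>     int n = 2048;                  /* one call = 2*n^3 flops */
>     float *a = malloc(sizeof(float) * n * n);
>     float *b = malloc(sizeof(float) * n * n);
>     float *c = malloc(sizeof(float) * n * n);
>     for (int i = 0; i < n * n; i++) {
>         a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f;
>     }
>     /* C = 1.0*A*B + 0.0*C; time this call and divide 2*n^3 by the
>      * elapsed seconds to get the sustained FLOP rate. */
>     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
>                 n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
>     free(a); free(b); free(c);
>     return 0;
> }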
> Mikhail
>
>
>> Regards,
>> Li, Bo
>> ----- Original Message ----- From: "Mikhail Kuzminsky" <kus at free.net>
>> To: "Li, Bo" <libo at buaa.edu.cn>
>> Cc: "Vincent Diepeveen" <diep at xs4all.nl>; <beowulf at beowulf.org>
>> Sent: Friday, August 29, 2008 1:52 AM
>> Subject: Re: [Beowulf] gpgpu
>>
>>
>>> In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008 14:20:15
>>> +0800):
>>>
>>>> ...
>>>> Currently, the DP performance of GPUs is not as good as we expected: only
>>>> 1/8 to 1/10 of the SP FLOPS. That is also a problem.
>>>>
>>>
>>> AMD's data: FireStream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W for
>>> DP, i.e. DP is 5 times slower than SP.
>>>
>>> The FireStream 9250 gives 1 TFLOPS for SP, so 1/5 of that is about 200
>>> GFLOPS DP. The price, I suppose, will be about $2000, as for the 9170.
>>>
>>> Now let me look at a modern dual-socket quad-core Beowulf node priced
>>> around $4000+, for example. For the Opteron 2350 / 2 GHz chips I use, peak
>>> DP performance is 64 GFLOPS (8 cores); for 3 GHz Xeon chips it is about
>>> 100 GFLOPS.
>>>
>>> GPGPU peak DP performance is therefore only 1.5-2 times higher than the
>>> CPUs'. Is that enough for a substantial speedup, taking into account the
>>> time for data transmission to and from the GPU?
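>>>
>>> (The arithmetic behind those CPU figures: both Barcelona and Core-based
>>> Xeon cores can retire 4 DP flops per cycle, so 8 cores * 2 GHz * 4
>>> flops/cycle = 64 GFLOPS for the Opteron node, and 8 * 3 GHz * 4 = 96,
>>> roughly 100, GFLOPS for the Xeon node.)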
>>>
>>>> I would suggest hybrid computation platforms with GPUs, CPUs, and
>>>> processors like ClearSpeed. Programming models for them could be a good
>>>> research topic.
>>>>
>>>
>>> ClearSpeed, unless there is new hardware out now, does not have enough DP
>>> performance compared with typical modern servers on quad-core CPUs.
>>>
>>> Yours
>>> Mikhail
>>>
>>>> Regards,
>>>> Li, Bo
>>>> ----- Original Message ----- From: "Vincent Diepeveen" <diep at xs4all.nl>
>>>> To: "Li, Bo" <libo at buaa.edu.cn>
>>>> Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf" <beowulf at beowulf.org>
>>>> Sent: Thursday, August 28, 2008 12:22 AM
>>>> Subject: Re: [Beowulf] gpgpu
>>>>
>>>>
>>>>> Hi Bo,
>>>>>
>>>>> Thanks for your message.
>>>>>
>>>>> What library do I call to find primes?
>>>>>
>>>>> Currently it's searching here for probable primes (PRPs) of the form
>>>>> p = (2^n + 1) / 3,
>>>>>
>>>>> where n is about 1.5 million as we speak, so p is roughly 1.5 million
>>>>> bits long.
>>>>>
>>>>> For SSE2-type processors there is George Woltman's assembler code
>>>>> (MIT) that does the squaring plus the implicit modulo;
>>>>> how do you plan to beat that kind of hand-optimized number crunching
>>>>> on a GPU?
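>>>>>
>>>>> (For context: a PRP test of that form is essentially one modular
>>>>> squaring per bit of the exponent. At these sizes the squaring has to be
>>>>> FFT-based, which is exactly what Woltman's code optimizes; the control
>>>>> structure, though, is just repeated squaring. A toy sketch with
>>>>> machine-word integers, nothing like the optimized real thing:)
>>>>>
>>>>> #include <stdint.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> /* Product mod m; safe here because a, b < 2^32. */
>>>>> static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
>>>>> {
>>>>>     return (a * b) % m;
>>>>> }
>>>>>
>>>>> /* Fermat test: is 3^(p-1) == 1 (mod p)? Binary exponentiation,
>>>>>  * one squaring per bit of the exponent. */
>>>>> static int is_prp(uint64_t p)
>>>>> {
>>>>>     uint64_t result = 1, base = 3 % p, e = p - 1;
>>>>>     while (e > 0) {
>>>>>         if (e & 1)
>>>>>             result = mulmod(result, base, p);
>>>>>         base = mulmod(base, base, p);   /* the repeated squaring */
>>>>>         e >>= 1;
>>>>>     }
>>>>>     return result == 1;
>>>>> }
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>     uint64_t p = ((1ULL << 11) + 1) / 3;   /* = 683, which is prime */
>>>>>     printf("%llu %s a 3-PRP\n", (unsigned long long)p,
>>>>>            is_prp(p) ? "is" : "is not");
>>>>>     return 0;
>>>>> }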
>>>>>
>>>>> You'll have to figure out a way to find instruction-level
>>>>> parallelism of at least 32,
>>>>> where the parallel writes also don't hit the same cache line, I *guess*
>>>>> (there is no documentation to verify that).
>>>>>
>>>>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes
>>>>>
>>>>> In fact the first problem to solve is doing some sort of squaring really
>>>>> quickly.
>>>>>
>>>>> Even if you figure that out on a PC, experience shows you're still
>>>>> losing a potential factor of 8,
>>>>> thanks to another zillion optimizations.
>>>>>
>>>>> You're not allowed to lose a factor of 8. The 52 GFLOPS a GPU can
>>>>> deliver on paper at 250 W TDP (and you can bet it will consume that
>>>>> when you make it work that hard) means the GPU effectively delivers less
>>>>> than 7 GFLOPS double precision (52 / 8 = 6.5) thanks to inefficient code.
>>>>>
>>>>> Additionally, remember the P4. On paper, the claim at its release was
>>>>> that it could execute 4 integer instructions per
>>>>> cycle; in reality it was a processor with an IPC far under 1
>>>>> on most integer codes. All kinds of things ran badly on it.
>>>>>
>>>>> Experience shows the same is true of today's GPUs: the
>>>>> scientists who have run codes on them so far, and who are really
>>>>> experienced CUDA programmers, have found the delivered speed to be a
>>>>> very big disappointment.
>>>>>
>>>>> Additionally, 250 W TDP is too much for massive number crunching.
>>>>>
>>>>> It's well over twice the power consumption of a quad-core. I can
>>>>> take a look soon in China myself at what power prices
>>>>> are over there, but I can assure you they will rise soon.
>>>>>
>>>>> And those effective 7 GFLOPS are a lot less than a quad-core delivers
>>>>> with a TDP far under 100 W.
>>>>>
>>>>> Note that I explicitly mention the n's I'm searching here, as the
>>>>> working set should fit within caches.
>>>>> So I'm not even teasing you with the very secret bandwidth you can
>>>>> achieve in practice (as we know, Nvidia crippled the
>>>>> bandwidth in its consumer GPU cards; only the Tesla line seems not to
>>>>> be crippled).
>>>>>
>>>>> This is true for any type of code: you lose it in the details.
>>>>> Only custom-tailored solutions will work,
>>>>> simply because they're factors faster.
>>>>>
>>>>> Thanks,
>>>>> Vincent
>>>>>
>>>>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>>>>
>>>>>> Hello,
>>>>>> IMHO, it is better to call BLAS or a similar library rather than
>>>>>> programming your own functions. Note that CUDA treats the GPU as a
>>>>>> cluster of processors, so .cu code does not behave like our normal
>>>>>> code. If you have a lot of matrix or vector computation, it is better
>>>>>> to use Brook+/CAL, which can show the great power of AMD GPUs.
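>>>>>>
>>>>>> (On the Nvidia side, the library route looks roughly like this; a
>>>>>> minimal CUBLAS sketch, with error checking omitted and the matrix size
>>>>>> picked arbitrarily:)
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include "cublas.h"
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>     int n = 1024;
>>>>>>     float *a = (float *)malloc(sizeof(float) * n * n);
>>>>>>     float *b = (float *)malloc(sizeof(float) * n * n);
>>>>>>     float *c = (float *)malloc(sizeof(float) * n * n);
>>>>>>     float *da, *db, *dc;
>>>>>>     for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
>>>>>>
>>>>>>     cublasInit();
>>>>>>     cublasAlloc(n * n, sizeof(float), (void **)&da);
>>>>>>     cublasAlloc(n * n, sizeof(float), (void **)&db);
>>>>>>     cublasAlloc(n * n, sizeof(float), (void **)&dc);
>>>>>>     cublasSetMatrix(n, n, sizeof(float), a, n, da, n);
>>>>>>     cublasSetMatrix(n, n, sizeof(float), b, n, db, n);
>>>>>>     /* C = 1.0*A*B + 0.0*C on the GPU (column-major, as in
>>>>>>      * Fortran BLAS). */
>>>>>>     cublasSgemm('n', 'n', n, n, n, 1.0f, da, n, db, n, 0.0f, dc, n);
>>>>>>     cublasGetMatrix(n, n, sizeof(float), dc, n, c, n);
>>>>>>     printf("c[0] = %f\n", c[0]);
>>>>>>     cublasFree(da); cublasFree(db); cublasFree(dc);
>>>>>>     cublasShutdown();
>>>>>>     free(a); free(b); free(c);
>>>>>>     return 0;
>>>>>> }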
>>>>>> Regards,
>>>>>> Li, Bo
>>>>>> ----- Original Message -----
>>>>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>>>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>>>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>>>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>>>>> Subject: Re: [Beowulf] gpgpu
>>>>>>
>>>>>>
>>>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>>>>>> 00:30:30 +0200):
>>>>>>>
>>>>>>>> Hi Mikhail,
>>>>>>>>
>>>>>>>> I'd say they're OK for black-box 32-bit calculations that can make
>>>>>>>> do with
>>>>>>>> a GB or 2 of RAM;
>>>>>>>> other than that, they're just luxurious electric heating.
>>>>>>>>
>>>>>>>
>>>>>>> I also want a simple black box, but a 64-bit one (Tesla C1060 or
>>>>>>> FireStream 9170 or 9250). Unfortunately, life isn't restricted to
>>>>>>> BLAS/LAPACK/FFT :-)
>>>>>>>
>>>>>>> So I'll need to program something else myself. People say that the
>>>>>>> best choice for Nvidia is CUDA. When I look at the sgemm source, it
>>>>>>> has about a
>>>>>>> thousand lines (or more) in the *.cu files. Therefore I think that a
>>>>>>> somewhat more difficult algorithm, such as some special matrix
>>>>>>> diagonalization, will require a lot of programming work :-(.
>>>>>>>
>>>>>>> It's interesting that when I read the FireStream Brook+ "kernel
>>>>>>> function"
>>>>>>> source example for the addition of 2 vectors ("Building a High Level
>>>>>>> Language Compiler For GPGPU",
>>>>>>> Bixia Zheng (bixia.zheng at amd.com)
>>>>>>> Derek Gladding (dereked.gladding at amd.com)
>>>>>>> Micah Villmow (micah.villmow at amd.com)
>>>>>>> June 8th, 2008)
>>>>>>>
>>>>>>> - it looks SIMPLE. Maybe a lot of details/source lines
>>>>>>> were omitted from this example?
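>>>>>>>
>>>>>>> (For comparison, the equivalent CUDA kernel really is short too; a
>>>>>>> minimal sketch, with the size and launch configuration picked
>>>>>>> arbitrarily. The length of real .cu sources comes from tiling,
>>>>>>> shared-memory staging and host-side plumbing, not from the kernel
>>>>>>> body itself:)
>>>>>>>
>>>>>>> #include <stdio.h>
>>>>>>> #include <stdlib.h>
>>>>>>>
>>>>>>> __global__ void vecadd(const float *a, const float *b, float *c,
>>>>>>>                        int n)
>>>>>>> {
>>>>>>>     int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
>>>>>>>     if (i < n)
>>>>>>>         c[i] = a[i] + b[i];
>>>>>>> }
>>>>>>>
>>>>>>> int main(void)
>>>>>>> {
>>>>>>>     int n = 1 << 20;
>>>>>>>     size_t bytes = n * sizeof(float);
>>>>>>>     float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes);
>>>>>>>     float *c = (float *)malloc(bytes);
>>>>>>>     float *da, *db, *dc;
>>>>>>>     for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
>>>>>>>     cudaMalloc((void **)&da, bytes);
>>>>>>>     cudaMalloc((void **)&db, bytes);
>>>>>>>     cudaMalloc((void **)&dc, bytes);
>>>>>>>     cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
>>>>>>>     cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
>>>>>>>     vecadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
>>>>>>>     cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
>>>>>>>     printf("c[0] = %f\n", c[0]);   /* expect 3.0 */
>>>>>>>     cudaFree(da); cudaFree(db); cudaFree(dc);
>>>>>>>     free(a); free(b); free(c);
>>>>>>>     return 0;
>>>>>>> }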
>>>>>>>
>>>>>>>
>>>>>>>> Vincent
>>>>>>>> P.S. If you ask me, honestly, 250 watts or so for the latest GPU is
>>>>>>>> really
>>>>>>>> too much.
>>>>>>>>
>>>>>>>
>>>>>>> 250 W is the TDP; the declared average value is about 160 W. I don't
>>>>>>> remember which GPU - AMD's or Nvidia's - has a lot of special
>>>>>>> functional units for sin/cos/exp/etc. If those are not used, maybe
>>>>>>> the
>>>>>>> power will be a bit lower.
>>>>>>>
>>>>>>> As for the FireStream 9250, AMD says about 150 W (although I'm not
>>>>>>> absolutely sure that it's the TDP) - that's about the same as some
>>>>>>> Intel quad-core Xeon chips whose names begin with X.
>>>>>>>
>>>>>>> Mikhail
>>>>>>>
>>>>>>>
>>>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>>>>
>>>>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>>>>> Taking into account that GPGPUs contain many (identical) execution
>>>>>>>>> units,
>>>>>>>>> I think the model might be not SIMD but SPMD. Or does it depend on
>>>>>>>>> the software tools used (CUDA etc.)?
>>>>>>>>>
>>>>>>>>> Mikhail Kuzminsky
>>>>>>>>> Computer Assistance to Chemical Research Center
>>>>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>>>>> Moscow
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>