[Beowulf] gpgpu
Li, Bo
libo at buaa.edu.cn
Thu Aug 28 17:15:42 PDT 2008
Yes, the FireStream has great paper performance, but how can you actually get it?
As for cost: if you don't mind using non-professional components, you can try their gaming cards, which are much cheaper. We bought NVIDIA's last flagship card, the 8800 Ultra, for 600 Euro, which was a crazy price, and now you can buy two GTX 280s for less. If you can live with SP, you get 936 GFLOPS from each, and we have achieved 40% of that peak performance, which sounds good.
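As a rough sketch of what those figures imply (taking the 40% efficiency we measured as an assumption that may not carry over to other codes):

```python
# Rough effective-throughput estimate for two GTX 280 cards, assuming
# the ~40% of peak we measured holds (an assumption, not a guarantee
# for other kernels).
PEAK_SP_GFLOPS = 936.0   # single-precision peak per GTX 280
EFFICIENCY = 0.40        # fraction of peak achieved in our runs
NUM_CARDS = 2

effective = PEAK_SP_GFLOPS * EFFICIENCY * NUM_CARDS
print(f"Effective SP throughput: {effective:.0f} GFLOPS")  # ~749 GFLOPS
```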
Regards,
Li, Bo
----- Original Message -----
From: "Mikhail Kuzminsky" <kus at free.net>
To: "Li, Bo" <libo at buaa.edu.cn>
Cc: "Vincent Diepeveen" <diep at xs4all.nl>; <beowulf at beowulf.org>
Sent: Friday, August 29, 2008 1:52 AM
Subject: Re: [Beowulf] gpgpu
> In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008 14:20:15
> +0800):
>> ...
>>Currently, the DP performance of GPUs is not as good as we expected,
>>only 1/8 to 1/10 of the SP FLOPS. That is also a problem.
>
> AMD's data: FireStream 9170 SP performance is 5 GFLOPS/W vs. 1
> GFLOPS/W for DP, i.e. DP is 5 times slower than SP.
>
> Firestream 9250 has 1 TFLOPS for SP, therefore 1/5 is about 200 GFLOPS
> DP. The price will be, I suppose, about $2000 - as for 9170.
>
> Let me look at a modern dual-socket quad-core Beowulf node priced at
> about $4000+, for example. For the Opteron 2350/2 GHz chips (which I
> use), peak DP performance is 64 GFLOPS (8 cores). For 3 GHz Xeon
> chips - about 100 GFLOPS.
>
> Therefore GPGPU peak DP performance is 1.5-2 times higher than with
> CPUs. Is that enough for an essential calculation speedup - taking
> into account the time for data transmission to/from the GPU?
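The peak numbers above follow from the usual cores x clock x FLOPs-per-cycle arithmetic (4 DP FLOPs per core per cycle assumed here for both chips):

```python
# Peak DP GFLOPS = cores * GHz * DP FLOPs per core per cycle.
# 4 DP FLOPs/cycle is assumed for both the Barcelona Opteron and the
# Core-based Xeon; real sustained rates are lower.
def peak_dp_gflops(cores, ghz, flops_per_cycle=4):
    return cores * ghz * flops_per_cycle

opteron_node = peak_dp_gflops(cores=8, ghz=2.0)  # dual-socket quad 2 GHz
xeon_node    = peak_dp_gflops(cores=8, ghz=3.0)  # dual-socket quad 3 GHz
firestream   = 1000 / 5                          # 1 TFLOPS SP at 1/5 DP ratio

print(opteron_node, xeon_node, firestream)       # 64.0 96.0 200.0
```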
>
>>I would suggest hybrid computation platforms, with GPUs, CPUs, and
>>processors like ClearSpeed. It may be a good topic for a programming
>>model.
>
> ClearSpeed, if there is no new hardware by now, does not have enough
> DP performance in comparison with typical modern servers on quad-core
> CPUs.
>
> Yours
> Mikhail
>
>>Regards,
>>Li, Bo
>>----- Original Message -----
>>From: "Vincent Diepeveen" <diep at xs4all.nl>
>>To: "Li, Bo" <libo at buaa.edu.cn>
>>Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf"
>><beowulf at beowulf.org>
>>Sent: Thursday, August 28, 2008 12:22 AM
>>Subject: Re: [Beowulf] gpgpu
>>
>>
>>> Hi Bo,
>>>
>>> Thanks for your message.
>>>
>>> What library do I call to find primes?
>>>
>>> Currently it is searching here for primes (PRPs) of the form
>>>
>>> p = (2^n + 1) / 3
>>>
>>> where n is roughly 1.5 million bits as we speak.
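For small n, a PRP test on numbers of this form can be sketched in a few lines. This is a naive illustration only: base 3 is used because 2 has small order modulo divisors of 2^n + 1, and nothing here resembles the optimized FFT squaring discussed below.

```python
def wagstaff_prp(n):
    """Fermat PRP test, base 3, for p = (2**n + 1) // 3.

    A naive sketch: real searches at n ~ 1.5 million use FFT-based
    squaring with an implicit modulo, not Python's generic pow().
    """
    p = (2**n + 1) // 3
    return pow(3, p - 1, p) == 1

# n = 13 gives p = 2731 (prime); n = 9 gives p = 171 = 9 * 19 (composite)
print(wagstaff_prp(13), wagstaff_prp(9))  # True False
```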
>>>
>>> For SSE2-type processors there is George Woltman's assembler code
>>> (MiT) to do the squaring + implicit modulo;
>>> how do you plan to beat that kind of really optimized number
>>> crunching on a GPU?
>>>
>>> You'll have to figure out a way to find an instruction-level
>>> parallelism of at least 32,
>>> which also doesn't write to the same cache line, I *guess* (no
>>> documentation to verify that, in fact).
>>>
>>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes.
>>>
>>> In fact the first problem to solve is to do some sort of squaring
>>> really quickly.
>>>
>>> Even if you have figured that out on a PC, experience shows you're
>>> still losing a potential factor of 8,
>>> thanks to another zillion optimizations.
>>>
>>> You're not allowed to lose a factor of 8: the 52 GFLOPS a GPU can
>>> deliver on paper @ 250 W TDP (you can bet it will consume that
>>> when you make it work that hard) means the GPU effectively delivers
>>> less than 7 GFLOPS double precision, thanks to inefficient code.
>>>
>>> Additionally, remember the P4. On paper, the claim at its release
>>> was that it would be able to execute 4 integer instructions per
>>> cycle; in reality it was a processor with an IPC far under 1
>>> for most integer codes. All kinds of stuff ran badly on it.
>>>
>>> Experience shows the same holds for today's GPUs: the scientists
>>> who have run codes on them so far, and who are really experienced
>>> CUDA programmers, figured out that the speed they deliver is a very
>>> big disappointment.
>>>
>>> Additionally, 250 W TDP for massive number crunching is too much.
>>>
>>> It's well over a factor of 2 more than the power consumption of a
>>> quad-core. I can take a look soon in China myself at what power
>>> prices are over there, but I can assure you they will rise soon.
>>>
>>> And that is for a lot less performance than a quad-core delivers,
>>> at a TDP far under 100 W.
>>>
>>> Now I explicitly mention the n's I'm searching here, as they should
>>> fit within the caches.
>>> So with the very secret bandwidth you can achieve in practice (as
>>> we know, NVIDIA lobotomized the bandwidth in the consumer GPU
>>> cards; only the Tesla type seems not to be lobotomized),
>>> I'm not even teasing you about that.
>>>
>>> This is true for any type of code: you lose it in the details.
>>> Only custom-tailored solutions will work,
>>> simply because they're factors faster.
>>>
>>> Thanks,
>>> Vincent
>>>
>>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>>
>>>> Hello,
>>>> IMHO, it is better to call BLAS or a similar library rather
>>>> than programming your own functions. And CUDA treats the GPU as a
>>>> cluster, so .CU code does not work like our normal code. If you
>>>> have a lot of matrix or vector computation, it is better to use
>>>> Brook+/CAL, which can show the great power of AMD GPUs.
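The advice above (call a tuned library instead of hand-writing the functions) is easy to demonstrate even on the CPU side. Below is a minimal sketch with NumPy standing in for a BLAS-backed GEMM; it is an illustration only, not the Brook+/CAL or CUBLAS APIs themselves:

```python
import numpy as np

# NumPy's matmul dispatches to an optimized BLAS GEMM; the triple loop
# below is the "program your own function" alternative being advised
# against. Both give the same result, but BLAS is far faster.
def naive_matmul(a, b):
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for l in range(k):
                c[i, j] += a[i, l] * b[l, j]
    return c

rng = np.random.default_rng(0)
a, b = rng.random((32, 32)), rng.random((32, 32))
assert np.allclose(naive_matmul(a, b), a @ b)
```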
>>>> Regards,
>>>> Li, Bo
>>>> ----- Original Message -----
>>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>>> Subject: Re: [Beowulf] gpgpu
>>>>
>>>>
>>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>>>> 00:30:30 +0200):
>>>>>> Hi Mikhail,
>>>>>>
>>>>>> I'd say they're OK for black-box 32-bit calculations that can
>>>>>> do with a GB or 2 of RAM;
>>>>>> other than that, they're just luxurious electric heating.
>>>>>
>>>>> I also want a simple black box, but 64-bit (Tesla C1060 or
>>>>> FireStream 9170 or 9250). Unfortunately, life isn't restricted to
>>>>> BLAS/LAPACK/FFT :-)
>>>>>
>>>>> So I'll need to program something else. People say that the best
>>>>> choice is CUDA for NVIDIA. When I look at the sgemm source, it
>>>>> has about a thousand (or more) lines in the *.cu files. Therefore
>>>>> I think that a somewhat more difficult algorithm, such as some
>>>>> special matrix diagonalization, will require a lot of programming
>>>>> work :-(.
>>>>>
>>>>> It's interesting that when I read the FireStream Brook+ "kernel
>>>>> function" source example - for the addition of 2 vectors
>>>>> ("Building a High Level Language Compiler For GPGPU",
>>>>> Bixia Zheng (bixia.zheng at amd.com),
>>>>> Derek Gladding (dereked.gladding at amd.com),
>>>>> Micah Villmow (micah.villmow at amd.com),
>>>>> June 8th, 2008)
>>>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>>>> which were omitted from this example?
>>>>>
>>>>>
>>>>>> Vincent
>>>>>> p.s. if you ask me, honestly, 250 watt or so for the latest GPU
>>>>>> is really too much.
>>>>>
>>>>> 250 W is the TDP; the declared average value is about 160 W. I
>>>>> don't remember which GPU - from AMD or NVIDIA - has a lot of
>>>>> special functional units for sin/cos/exp/etc. If they are not
>>>>> used, maybe the power will be a bit lower.
>>>>>
>>>>> As for the FireStream 9250, AMD says about 150 W (although I'm
>>>>> not absolutely sure that it's the TDP) - the same as some Intel
>>>>> Xeon quad-core chips with names beginning with X.
>>>>>
>>>>> Mikhail
>>>>>
>>>>>
>>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>>
>>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>>> Taking into account that GPGPUs contain many (equal) execution
>>>>>>> units, I think it might be not SIMD but an SPMD model. Or does
>>>>>>> it depend on the software tools used (CUDA etc.)?
>>>>>>>
>>>>>>> Mikhail Kuzminsky
>>>>>>> Computer Assistance to Chemical Research Center
>>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>>> Moscow
>>>>>>> _______________________________________________
>>>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>