[Beowulf] gpgpu
Mikhail Kuzminsky
kus at free.net
Thu Aug 28 10:52:23 PDT 2008
In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008 14:20:15
+0800):
> ...
>Currently, the DP performance of GPUs is not as good as we expected -
>only 1/8 to 1/10 of SP FLOPS. It is also a problem.
AMD data: FireStream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W
for DP, i.e. DP is 5 times slower than SP.
The FireStream 9250 has 1 TFLOPS SP, therefore 1/5 of that is about
200 GFLOPS DP. The price will be, I suppose, about $2000 - as for the
9170.
Let me look at a modern dual-socket quad-core Beowulf node priced at
about $4000+, for example. For the Opteron 2350/2 GHz chips I use, peak
DP performance is 64 GFLOPS (8 cores x 4 DP flops/cycle x 2 GHz). For
3 GHz Xeon chips it is about 100 GFLOPS.
Therefore GPGPU peak DP performance is only 1.5-2 times higher than
with CPUs. Is that enough for an essential calculation speedup, taking
into account the time for data transmission to/from the GPU?
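A rough back-of-the-envelope model (a sketch only: the PCIe rate, the
DGEMM size and the peak numbers below are illustrative assumptions,
not measurements):

/* Compare one 4096^3 DGEMM on the CPU vs. on the GPU, including the
 * PCIe transfer of A, B and C.  Illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    double flops      = 2.0 * 4096.0 * 4096.0 * 4096.0;
    double bytes      = 3.0 * 4096.0 * 4096.0 * 8.0; /* 3 double matrices */
    double cpu_gflops = 64.0;    /* dual Opteron 2350 peak DP, as above  */
    double gpu_gflops = 200.0;   /* FireStream 9250 peak DP, as above    */
    double pcie_gbs   = 4.0;     /* assumed effective PCIe x16 bandwidth */

    double t_cpu = flops / (cpu_gflops * 1e9);
    double t_gpu = flops / (gpu_gflops * 1e9) + bytes / (pcie_gbs * 1e9);

    printf("CPU %.2f s, GPU+transfer %.2f s, speedup %.2fx\n",
           t_cpu, t_gpu, t_cpu / t_gpu);
    return 0;
}

For such a compute-bound kernel the transfer cost eats only part of
the speedup; for anything bandwidth-bound the picture gets much worse.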
>I would suggest hybrid computation platforms, with GPU, CPU, and
>processors like ClearSpeed. It may be a good topic for a programming
>model.
ClearSpeed, unless there is new hardware now, doesn't have enough DP
performance in comparison with typical modern servers on quad-core CPUs.
Yours
Mikhail
>Regards,
>Li, Bo
>----- Original Message -----
>From: "Vincent Diepeveen" <diep at xs4all.nl>
>To: "Li, Bo" <libo at buaa.edu.cn>
>Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf"
><beowulf at beowulf.org>
>Sent: Thursday, August 28, 2008 12:22 AM
>Subject: Re: [Beowulf] gpgpu
>
>
>> Hi Bo,
>>
>> Thanks for your message.
>>
>> What library do I call to find primes?
>>
>> Currently it's searching here for primes (PRPs) of the form
>> p = (2^n + 1) / 3,
>> where n is about 1.5 million bits, roughly, as we speak.
>>
>> For SSE2-type processors there is George Woltman's assembler code
>> (MIT) to do the squaring + implicit modulo;
>> how do you plan to beat that kind of really optimized number
>> crunching on a GPU?
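>> For the curious: the PRP test here boils down to a Fermat test on
>> N = (2^n + 1) / 3. A naive GMP sketch of it (nothing like the
>> optimized FFT squaring with implicit modulo, just the bare operation):
>>
>> /* Naive Fermat PRP test for Wagstaff numbers N = (2^n + 1)/3. */
>> #include <stdio.h>
>> #include <gmp.h>
>>
>> static int is_wagstaff_prp(unsigned long n)   /* n must be odd */
>> {
>>     mpz_t N, base, r;
>>     mpz_init(N); mpz_init_set_ui(base, 3); mpz_init(r);
>>
>>     mpz_ui_pow_ui(N, 2, n);     /* N = 2^n                       */
>>     mpz_add_ui(N, N, 1);        /* N = 2^n + 1                   */
>>     mpz_divexact_ui(N, N, 3);   /* N = (2^n + 1)/3, exact: n odd */
>>
>>     mpz_sub_ui(r, N, 1);
>>     mpz_powm(r, base, r, N);    /* r = 3^(N-1) mod N             */
>>
>>     int prp = (mpz_cmp_ui(r, 1) == 0);
>>     mpz_clear(N); mpz_clear(base); mpz_clear(r);
>>     return prp;
>> }
>>
>> int main(void)
>> {
>>     /* n = 43 gives the known Wagstaff prime (2^43 + 1)/3 */
>>     printf("n = 43: %s\n", is_wagstaff_prp(43) ? "PRP" : "composite");
>>     return 0;
>> }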
>>
>> You'll have to figure out a way to find an instruction-level
>> parallelism of at least 32,
>> which also doesn't write to the same cache line, I *guess* (no
>> documentation to verify that, in fact).
>>
>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes.
>>
>> In fact the first problem to solve is to do some sort of squaring
>> really quickly.
>>
>> If you figured that out on a PC, experience teaches that you're
>> still losing a potential factor of 8,
>> thanks to another zillion optimizations.
>>
>> You're not allowed to lose a factor of 8: the 52 GFLOPS a GPU can
>> deliver on paper @ 250 watt TDP (you bet it will consume that when
>> you make it work so hard) means the GPU effectively delivers less
>> than 7 GFLOPS double precision, thanks to inefficient code.
>>
>> Additionally, remember the P4. On paper, the claim at its release
>> was that it would be able to execute 4 integer instructions a
>> cycle; in reality it was a processor with an IPC far under 1
>> for most integer codes. All kinds of stuff sucked on it.
>>
>> Experience teaches that this is the same for today's GPUs: the
>> scientists who have run codes on them so far and are really
>> experienced CUDA programmers figured out that the speed they
>> deliver is a very big bummer.
>>
>> Additionally, 250 watt TDP for massive number crunching is too
>> much; it's well over a factor of 2 of the power consumption of a
>> quad-core. Now I can soon take a look in China myself at what power
>> prices are over there, but I can assure you they will rise soon.
>>
>> And that effective throughput is a lot less than what a quad-core
>> delivers with a TDP far under 100 watt.
>>
>> Now I explicitly mention the n's I'm searching here, as they should
>> fit within caches.
>> So I'm not even teasing you with the very secret bandwidth you can
>> practically achieve (as we know, Nvidia lobotomized the bandwidth
>> in the consumer GPU cards; only the Tesla type seems not to be
>> lobotomized).
>>
>> This is true for any type of code: you're losing it to the details.
>> Only custom-tailored solutions will work,
>> simply because they're factors faster.
>>
>> Thanks,
>> Vincent
>>
>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>
>>> Hello,
>>> IMHO, it is better to call BLAS or a similar library rather
>>> than program your own functions. And CUDA treats the GPU as a
>>> cluster, so .cu code doesn't work like our normal code. If you
>>> have many matrix or vector computations, it is better to use
>>> Brook+/CAL, which can show the great power of AMD GPUs.
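>>> For example, a minimal sketch with the legacy CUBLAS interface
>>> (matrix size illustrative, error checking omitted):
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <cublas.h>
>>>
>>> int main(void)
>>> {
>>>     const int n = 512;
>>>     size_t bytes = (size_t)n * n * sizeof(float);
>>>     float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes);
>>>     float *hc = (float *)malloc(bytes);
>>>     for (int i = 0; i < n * n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }
>>>
>>>     cublasInit();
>>>     float *da, *db, *dc;                    /* device pointers */
>>>     cublasAlloc(n * n, sizeof(float), (void **)&da);
>>>     cublasAlloc(n * n, sizeof(float), (void **)&db);
>>>     cublasAlloc(n * n, sizeof(float), (void **)&dc);
>>>     cublasSetMatrix(n, n, sizeof(float), ha, n, da, n);
>>>     cublasSetMatrix(n, n, sizeof(float), hb, n, db, n);
>>>
>>>     /* C = A*B, column-major, no transposes */
>>>     cublasSgemm('N', 'N', n, n, n, 1.0f, da, n, db, n, 0.0f, dc, n);
>>>
>>>     cublasGetMatrix(n, n, sizeof(float), dc, n, hc, n);
>>>     printf("c[0] = %g (expect %g)\n", hc[0], 2.0f * n);
>>>     cublasFree(da); cublasFree(db); cublasFree(dc);
>>>     cublasShutdown();
>>>     return 0;
>>> }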
>>> Regards,
>>> Li, Bo
>>> ----- Original Message -----
>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>> Subject: Re: [Beowulf] gpgpu
>>>
>>>
>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26 Aug 2008
>>>> 00:30:30 +0200):
>>>>> Hi Mikhail,
>>>>>
>>>>> I'd say they're OK for black-box 32-bit calculations that can
>>>>> make do with a GB or 2 of RAM;
>>>>> other than that, they're just luxurious electric heating.
>>>>
>>>> I also want to have a simple black box, but 64-bit (Tesla C1060 or
>>>> FireStream 9170 or 9250). Unfortunately, life isn't restricted to
>>>> BLAS/LAPACK/FFT :-)
>>>>
>>>> So I'll need to program something else. People say that the best
>>>> choice is CUDA for Nvidia. When I look at the sgemm source, it has
>>>> about a thousand lines (or more) in *.cu files. Therefore I think
>>>> that a somewhat more difficult algorithm, such as some special
>>>> matrix diagonalization, will require a lot of programming work :-(.
>>>>
>>>> It's interesting that when I read the FireStream Brook+ "kernel
>>>> function" source example - for the addition of 2 vectors ("Building
>>>> a High Level Language Compiler For GPGPU",
>>>> Bixia Zheng (bixia.zheng at amd.com)
>>>> Derek Gladding (dereked.gladding at amd.com)
>>>> Micah Villmow (micah.villmow at amd.com)
>>>> June 8th, 2008)
>>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>>> which were omitted from this example?
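>>>> For comparison, the CUDA analogue really is that short on the
>>>> kernel side; what such examples usually leave out is the host-side
>>>> boilerplate. A minimal sketch, sizes illustrative:
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <cuda_runtime.h>
>>>>
>>>> __global__ void vadd(const float *a, const float *b, float *c, int n)
>>>> {
>>>>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>>>>     if (i < n)
>>>>         c[i] = a[i] + b[i];      /* the whole "kernel function" */
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     const int n = 1 << 20;
>>>>     size_t bytes = n * sizeof(float);
>>>>     float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes);
>>>>     float *hc = (float *)malloc(bytes);
>>>>     for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }
>>>>
>>>>     /* the part the slides omit: allocate, copy, launch, copy back */
>>>>     float *da, *db, *dc;
>>>>     cudaMalloc((void **)&da, bytes);
>>>>     cudaMalloc((void **)&db, bytes);
>>>>     cudaMalloc((void **)&dc, bytes);
>>>>     cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
>>>>     cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
>>>>     vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
>>>>     cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
>>>>
>>>>     printf("hc[100] = %g (expect 300)\n", hc[100]);
>>>>     cudaFree(da); cudaFree(db); cudaFree(dc);
>>>>     return 0;
>>>> }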
>>>>
>>>>
>>>>> Vincent
>>>>> p.s. If you ask me, honestly, 250 watt or so for the latest GPU
>>>>> is really too much.
>>>>
>>>> 250 W is the TDP; the declared average value is about 160 W. I
>>>> don't remember which GPU - from AMD or Nvidia - has a lot of
>>>> special functional units for sin/cos/exp/etc. If they are not
>>>> used, maybe the power will be a bit lower.
>>>>
>>>> As for the FireStream 9250, AMD says about 150 W (although I'm not
>>>> absolutely sure that it's TDP) - the same as for some Intel Xeon
>>>> quad-core chips with names beginning with X.
>>>>
>>>> Mikhail
>>>>
>>>>
>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>
>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>> Taking into account that GPGPUs contain many (identical)
>>>>>> execution units,
>>>>>> I think it might be not a SIMD but an SPMD model. Or does it
>>>>>> depend on the software tools used (CUDA etc.)?
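>>>>>> To make the question concrete, a hypothetical CUDA kernel: the
>>>>>> source is written SPMD-style (each thread takes its own branch),
>>>>>> while the hardware runs threads in SIMD groups (warps) and
>>>>>> serializes divergent branches with masking:
>>>>>>
>>>>>> __global__ void spmd_example(const float *x, float *y, int n)
>>>>>> {
>>>>>>     int i = blockIdx.x * blockDim.x + threadIdx.x;
>>>>>>     if (i >= n)
>>>>>>         return;
>>>>>>     if (x[i] > 0.0f)        /* threads of one warp may disagree */
>>>>>>         y[i] = 2.0f * x[i]; /* the warp runs this path first... */
>>>>>>     else
>>>>>>         y[i] = -x[i];       /* ...then this one, serially       */
>>>>>> }
>>>>>>
>>>>>> (launched e.g. as spmd_example<<<(n + 255) / 256, 256>>>(x, y, n),
>>>>>> with host setup as in the vector-addition sketch above)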
>>>>>>
>>>>>> Mikhail Kuzminsky
>>>>>> Computer Assistance to Chemical Research Center
>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>> Moscow
>>>>>>
>>>>>
>>>>
>>>
>>>
>>