[Beowulf] Nvidia, cuda, tesla and... where's my double floating point?
Vincent Diepeveen
diep at xs4all.nl
Sun Jun 15 10:36:26 PDT 2008
Joseph,
I'm even a licensed CUDA developer with Nvidia for Tesla,
but even the documents there are very poor. Knowing the
latencies is really important. Another big problem is that if you
write a program to measure the latencies, nobody is going to run it.
The Nvidia promises were for video cards released long ago,
all of them single precision. We know that for sure, as most of
those video cards were released years ago already.
Not to mention the 8800.
If you look, however, at the latest supercomputer announced: 1 petaflop
per second across 12,960 new-generation CELL CPUs, which works out to
a tad more than 77 gflops double precision per CELL.
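The per-CELL figure above follows directly from the two numbers in the article; a quick check:

```python
# Back-of-the-envelope check of the per-CELL figure quoted above.
# 1 petaflop/s spread over 12,960 CELL CPUs (numbers from the article).
total_flops = 1e15          # 1 petaflop/s, double precision
num_cells = 12960           # new-generation CELL CPUs

per_cell = total_flops / num_cells
print(f"{per_cell / 1e9:.1f} gflops per CELL")  # ~77.2 gflops
```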
It is a bit weird to claim to be NDA-bound when the news has it in
big capitals what the new IBM CELL can deliver.
See for example:
http://www.cbc.ca/cp/technology/080609/z060917A.html
I also calculated back that, all power costs together, it comes to
about 410 watts per node, each node having a dual CELL CPU. That is
network + hard drives and RAM included, of course.
Based upon this article I calculated 2.66 MW for the entire machine.
Sounds reasonably good to me.
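The 410-watt figure falls out of the machine total and the node count implied by dual-CELL nodes:

```python
# Sanity check: 2.66 MW over the whole machine, dual-CELL nodes.
total_power_watts = 2.66e6
num_cells = 12960
nodes = num_cells // 2      # two CELL CPUs per node

watts_per_node = total_power_watts / nodes
print(f"~{watts_per_node:.0f} W per node")  # ~410 W
```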
Now the interesting question is how they are going to use that CPU
power effectively.
In any case it is a lot better than what the GPUs can deliver so far
in the practical situations known to me; that is, going by claims from
programmers other than Nvidia's own engineers.
Especially for things like matrix calculations, as the weak part of the
GPU hardware is the latency to and from the machine RAM (not to be
confused with device RAM).
To and from 8800 hardware, a reliable person I know measured about
3000 messages per second, which would put the communication latency at
several hundred microseconds.
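Assuming those 3000 messages per second are synchronous round trips, the rate converts to latency directly:

```python
# 3000 synchronous round trips per second implies roughly
# 1/3000 of a second per round trip, i.e. a few hundred microseconds.
messages_per_second = 3000
latency_us = 1e6 / messages_per_second
print(f"~{latency_us:.0f} microseconds per message")  # ~333 us
```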
So a very reasonable question to ask is what the latency is from the
stream processors to the device RAM. An 8800 document I read says 600
cycles, though it doesn't mention for how many stream processors that
holds. Also surprising: I learned that RAM lookups do not get cached.
That means a lot of extra work when 128 stream processors hammer
regularly onto the device RAM for data that CPUs simply keep in their
L1 or L2 cache, and these days even L3 cache.
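To get a feel for what an uncached 600-cycle load latency means, here is an illustrative Little's-law estimate. The 600-cycle figure is from the 8800 document mentioned above; the issue rate of one load per 4 cycles is an assumed round number for illustration, not a measured one.

```python
# Illustrative only: how much concurrency is needed to hide a
# 600-cycle device-RAM latency when loads are not cached.
# 600 cycles is the figure quoted above; the issue rate is assumed.
mem_latency_cycles = 600
cycles_between_loads = 4    # assumed: a load issued every 4 cycles

# With no cache, each load pays the full latency unless enough other
# loads are in flight to cover it (Little's law: concurrency =
# latency x throughput).
loads_in_flight = mem_latency_cycles / cycles_between_loads
print(f"~{loads_in_flight:.0f} loads must be in flight to hide latency")
```

This is exactly why a CPU's L1/L2 cache makes such a difference for latency-bound code: it removes most of that 600-cycle penalty instead of asking the programmer to find 150-way memory parallelism.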
So knowing such technical data is totally crucial, as there is no way
to escape the memory controller's latency in a lot of the software
that searches for the holy grail.
Thanks,
Vincent
On Jun 15, 2008, at 3:51 PM, Joe Landman wrote:
>
>
> Vincent Diepeveen wrote:
>> Seems the next CELL is 100% confirmed double precision.
>> Yet if you look back in history, Nvidia promises on this can be
>> found years back.
>
> [scratches head /]
>
> Vincent, it may be possible that some of us on this list may in
> fact be bound by NDA (non-disclosure agreements), and cannot talk
> about hardware which has not been announced.
>
>
>> The only problem with hardware like Tesla is that it is rather
>> hard to
>> get technical information; like which instructions does Tesla
>> support in hardware?
>
> [scratches head /]
>
> Hmmm .... www.nvidia.com/cuda is a good starting point.
>
> I might suggest http://www.nvidia.com/object/cuda_what_is.html as a
> start on information. More to the point, you can look at http://
> www.nvidia.com/object/cuda_develop.html
>
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 866 888 3112
> cell : +1 734 612 4615
>