[Beowulf] Nvidia, cuda, tesla and... where's my double floating point?
Vincent Diepeveen
diep at xs4all.nl
Sun Jun 15 10:36:26 PDT 2008
Joseph,
I'm even a licensed CUDA developer with Nvidia for Tesla,
but even the documents there are very poor. Knowing the
latencies is really important. Another big problem is that if you
write a program to measure the latencies, nobody is going to run it.
The Nvidia promises were for video cards released long ago,
all of them single precision. We know that for sure, as most of
those video cards were released years ago already.
Not to mention the 8800.
If you look, however, at the latest supercomputer announced: 1 petaflop
per second across 12,960 new-generation CELL CPUs, which works out to
a tad more than 77 gflops double precision per CELL.
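The per-CELL figure above follows directly from the two numbers in the article; a quick check:

```python
# Back-of-the-envelope check of the per-CELL figure quoted above.
# 1 petaflop/s spread over 12,960 CELL CPUs (numbers from the article).
total_flops = 1e15          # 1 petaflop/s, double precision
num_cells = 12960           # new-generation CELL CPUs

per_cell = total_flops / num_cells
print(f"{per_cell / 1e9:.1f} gflops per CELL")  # ~77.2 gflops
```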
It is a bit weird to claim to be NDA-bound when the news has it in
big capitals what the new IBM CELL can deliver.
See for example:
http://www.cbc.ca/cp/technology/080609/z060917A.html
I also calculated back that, all power costs together, it comes to
about 410 watts per node, each node having a dual CELL CPU. That is
network + hard drives and RAM included, of course.
Based upon this article I calculated 2.66 MW for the entire machine.
Sounds reasonably good to me.
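The 410-watt figure falls out of the machine total and the node count implied by dual-CELL nodes:

```python
# Sanity check: 2.66 MW over the whole machine, dual-CELL nodes.
total_power_watts = 2.66e6
num_cells = 12960
nodes = num_cells // 2      # two CELL CPUs per node

watts_per_node = total_power_watts / nodes
print(f"~{watts_per_node:.0f} W per node")  # ~410 W
```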
Now the interesting question is how they are going to use that CPU
power effectively.
In any case it is a lot better than what the GPUs can deliver so far
in the practical situations known to me; that is, going by claims from
programmers other than Nvidia's own engineers.
Especially for things like matrix calculations, as the weak part of the
GPU hardware is the latency to and from the machine RAM (not to be
confused with device RAM).
To and from 8800 hardware, a reliable person I know measured about
3000 messages per second, which would put the communication latency at
several hundred microseconds.
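Assuming those 3000 messages per second are synchronous round trips, the rate converts to latency directly:

```python
# 3000 synchronous round trips per second implies roughly
# 1/3000 of a second per round trip, i.e. a few hundred microseconds.
messages_per_second = 3000
latency_us = 1e6 / messages_per_second
print(f"~{latency_us:.0f} microseconds per message")  # ~333 us
```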
So a very reasonable question to ask is what the latency is from the
stream processors to the device RAM. An 8800 document I read says 600
cycles, though it doesn't mention for how many stream processors that
holds. Also surprising: I learned that RAM lookups do not get cached.
That means a lot of extra work when 128 stream processors hammer
regularly onto the device RAM for data that CPUs simply keep in their
L1 or L2 cache, and these days even L3 cache.
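To get a feel for what an uncached 600-cycle load latency means, here is an illustrative Little's-law estimate. The 600-cycle figure is from the 8800 document mentioned above; the issue rate of one load per 4 cycles is an assumed round number for illustration, not a measured one.

```python
# Illustrative only: how much concurrency is needed to hide a
# 600-cycle device-RAM latency when loads are not cached.
# 600 cycles is the figure quoted above; the issue rate is assumed.
mem_latency_cycles = 600
cycles_between_loads = 4    # assumed: a load issued every 4 cycles

# With no cache, each load pays the full latency unless enough other
# loads are in flight to cover it (Little's law: concurrency =
# latency x throughput).
loads_in_flight = mem_latency_cycles / cycles_between_loads
print(f"~{loads_in_flight:.0f} loads must be in flight to hide latency")
```

This is exactly why a CPU's L1/L2 cache makes such a difference for latency-bound code: it removes most of that 600-cycle penalty instead of asking the programmer to find 150-way memory parallelism.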
So knowing such technical data is totally crucial, as there is no way
to escape the memory controller's latency in a lot of the software
that searches for the holy grail.
Thanks,
Vincent
On Jun 15, 2008, at 3:51 PM, Joe Landman wrote:
>
>
> Vincent Diepeveen wrote:
>> Seems the next CELL is 100% confirmed double precision.
>> Yet if you look back in history, Nvidia promises on this can be
>> found years back.
>
> [scratches head /]
>
> Vincent, it may be possible that some of us on this list may in
> fact be bound by NDA (non-disclosure agreements), and cannot talk
> about hardware which has not been announced.
>
>
>> The only problem with hardware like Tesla is that it is rather
>> hard to
>> get technical information; like which instructions does Tesla
>> support in hardware?
>
> [scratches head /]
>
> Hmmm .... www.nvidia.com/cuda is a good starting point.
>
> I might suggest http://www.nvidia.com/object/cuda_what_is.html as a
> start on information. More to the point, you can look at http://
> www.nvidia.com/object/cuda_develop.html
>
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 866 888 3112
> cell : +1 734 612 4615
>