NDAs Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?

Mon Jun 16 17:51:21 PDT 2008

Vincent Diepeveen wrote:
> Jim,
> 
> Reality is that the person who SPECULATES that something is good
> also hides behind an NDA. This is another typical case of that.

Er, who's hiding?  Get your credit card out and buy one.  CUDA
is easily available.  Today's new NVIDIA release has already put a handful of
cards on www.newegg.com, and I'm sure many others will sell them to you as well.

> On the one hand claiming a NDA, on the other hand implying that is a 
> very good product that will get
> released.

Indeed, there were numerous rumors, and even mentions from researchers with
pre-release hardware, who claimed double precision would be available in the
second half of 2007.

> If NVIDIA clearly indicates towards me that they aren't gonna release 
> any technical data nor support anyone with technical
> data (the word NDA has not been used by the way by either party, it was 
> a very clear NJET from 'em),
> and all information i've got so far is very disappointing, then hiding 
> behind a NDA is considered very bad manners.

The CUDA docs look pretty good to me.  As usual, the definitions of latency,
bandwidth, message passing, flops, and overall performance vary, so
whatever you have in mind, the best bet is to actually try it.  Fortunately,
NVIDIA makes it rather easy to get going.  If you write a microbenchmark, I
suspect you could get someone to run it.

> Then instead of a $200 pci-e card, we needed to buy expensive Tesla's 
> for that, without getting

Even the Teslas were single precision, I believe, at least until today.

> The few trying on those Tesla's, though they won't ever post this as 
> their job is fulltime GPU programming,
> report so far very disappointing numbers for applications that really 
> matter for our nations.

People seem happy.  Of course GPUs are far from general purpose, but the
progress (especially with CUDA) seems pretty good.

> Truth is that you can always do a claim of 1 teraflop of computing 
> power. If that doesn't get backed up by technical documents

You mean like this one (mentioned on the list previously)?
http://arxiv.org/abs/0709.3225

> how to get it out of the hardware if your own testprograms show that you 
> can't get that out of the hardware,
> it is rather useless to start programming for such a platform.

GPUs are hardly suited for everyone, but that doesn't make them useless.

Personally, finding a port of McCalpin's STREAM that sees 50GB/sec or so
caught my attention.  Sure, it's not recompile-and-go, but at least
it's fairly C-like.

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       50077.8888       0.0013       0.0013       0.0013
Scale:      50637.4974       0.0013       0.0013       0.0013
Add:        51090.5662       0.0019       0.0019       0.0019
Triad:      50527.6617       0.0019       0.0019       0.0019
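A quick way to sanity-check a table like this is STREAM's own byte counting.
The sketch below assumes the standard STREAM convention (Copy and Scale touch
two arrays per element, Add and Triad touch three, and 1 MB = 10^6 bytes).  The
function names and the example array size are mine, for illustration only;
they are not taken from the CUDA port.

```python
# Arrays touched per element under STREAM's counting convention:
# Copy and Scale read one array and write one; Add and Triad read two and write one.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, n_elements, elem_bytes, seconds):
    """Return a STREAM-style rate in MB/s (1 MB = 1e6 bytes)."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * elem_bytes * n_elements
    return bytes_moved / seconds / 1.0e6


def time_ratio(kernel_a, kernel_b):
    """Expected run-time ratio of two kernels at the same sustained bandwidth."""
    return ARRAYS_TOUCHED[kernel_a] / ARRAYS_TOUCHED[kernel_b]


if __name__ == "__main__":
    # Hypothetical example: one million 8-byte elements copied in 1 ms.
    print(stream_rate_mb_s("Copy", 1_000_000, 8, 0.001))
    # At equal bandwidth, Add should take 1.5x as long as Copy.
    print(time_ratio("Add", "Copy"))
```

At equal bandwidth, Add and Triad should take about 1.5x as long as Copy and
Scale, which is roughly in line with the 0.0019 vs 0.0013 second times above.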

www.nvidia.com claims said card has a 57GB/sec memory system; today's new
260/280 are in the 130-140GB/sec range (advertised, not STREAM).
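To put the measured and advertised figures side by side, here is a minimal
Python sketch.  It assumes both numbers use decimal units (1 GB = 1000 MB), as
STREAM does for its own output; whether the advertised 57GB/sec uses the same
convention is my assumption.  The helper name is made up for illustration.

```python
# STREAM rates reported in the post, in MB/s.
STREAM_MB_S = {
    "Copy": 50077.8888,
    "Scale": 50637.4974,
    "Add": 51090.5662,
    "Triad": 50527.6617,
}

# Advertised memory bandwidth quoted in the post, in GB/s (assumed decimal GB).
ADVERTISED_GB_S = 57.0


def fraction_of_peak(measured_gb_s, advertised_gb_s):
    """Return measured bandwidth as a fraction of the advertised figure."""
    return measured_gb_s / advertised_gb_s


if __name__ == "__main__":
    for kernel, rate_mb_s in STREAM_MB_S.items():
        frac = fraction_of_peak(rate_mb_s / 1000.0, ADVERTISED_GB_S)
        print(f"{kernel}: {frac:.0%} of advertised")
```

By this arithmetic all four kernels land at roughly 88 to 90 percent of the
advertised number.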

> It is questionable whether it is interesting to design some algorithms 
> for GPU's; it takes endless testing of every tiny detail to figure out
> what the GPU can and cannot do and to get accurate timings. By the time 
> you finish with that, you can also implement the same design in
> FPGA or ASIC/VLSI whatever. As that is of course the type of interested 

Sounds wildly off base to me.  But if you implement an FPGA/ASIC + memory
system that costs less than $200 qty 1 and can be programmed in a few hours
to implement STREAM at or above 50GB/sec, let me know.

> parties in GPU programming;
> considering the amount of computing power they need, for the same budget 
> they can also make their own CPU's.

Even assuming man-months of highly paid programmers, I don't see how you get
from that cost to the budget for making your own CPU.

> For other companies that i tried to get interested, there is a lot of 
> hesitation to even *investigate* that hardware, let alone give a contract
> job to port their software to such hardware. Nvidia for all those 
> civilian and military parties is very very unattractive as of now.

CUDA is the start of practical use of a GPU for non-graphics jobs.  It
seems like a good start to me.  Sure, not everything fits, but it's progress.
Today's new chips add double precision, which definitely helps.

> IBM now shows up with a working supercomputer using new generation CELL 
> processors which have a sustained 77 Gflop double precision
> a chip which means a tad more than 150 Gflop for each node. Each 
> rackmount node is relatively cheap and performs very well.

Sure, for rather specialized tasks.  Each SPE has what, 256KB of local store?  I
think of it more as a DSP than a CPU.  BTW, the previous generation Cell did
double precision as well; it was just 1/10th as fast as single precision.  The
new revision is around 100 GFlops, up from 14 or so.