NDAs Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?

Vincent Diepeveen diep at xs4all.nl
Tue Jun 17 10:07:54 PDT 2008


Jim,

I feel you are pointing out something very important here.

That is that a GPU is mainly interesting to program for hobbyists like me,
or for companies whose budget covers fewer than a dozen of them in total.

For ISPs the only thing that matters is power consumption, and for
encryption at a low TCP/IP layer it is too easy to equip all those
cheap CPUs with encryption coprocessors which draw about 1 watt and
deliver enough work to keep the 100 Mbit / 1 Gbit NICs fully busy;
for public key work they in fact run at a speed you won't reach on a
GPU even if you manage to parallelize it and get it working in a great
manner. ISPs of course look for fully scalable machines, quite the
opposite of having 1 card at 250 watt.

In fact it would be quite interesting to know how fast you can run
RSA on a GPU. Where are the benchmarks?
I seem to remember that I posted a solution to do a fast generic
modulo (of course not a new idea, but that is what you always hear
after figuring something out yourself), with a minimum of code, under
the condition that you already have multiplication code.

How fast can you multiply big numbers on those GPUs?
4096 x 4096 bits is the most interesting case there. Then of course
take the modulo quickly and repeat this for the entire exponentiation
by squaring.

That is the only interesting question IMHO: what throughput does it
deliver for RSA-4096?
I seem to remember that a big handicap there is that the older cards
(8800 etc.) can only do 16 x 16 bits == 32 bits, whereas on CPUs you
can use 64 x 64 bits == 128 bits. BIG difference
in speed.
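
To make concrete what that inner loop is, here is a minimal sketch (my
own illustration on plain 64-bit integers using gcc's __uint128_t, not
code from any real RSA library) of the square-and-multiply modular
exponentiation that RSA spends all its time in. For RSA-4096 every
mulmod below becomes a 4096 x 4096-bit big-number multiply plus a
reduction, and with schoolbook multiplication the number of partial
products grows quadratically with the number of limbs, so a 16 x 16
multiplier needs roughly 16 times as many multiplies as a 64 x 64 one
for the same operands:

#include <stdint.h>
#include <stdio.h>

/* 64 x 64 -> 128 bit multiply followed by a reduction: the width a CPU
   gives you. On a card limited to 16 x 16 -> 32 bit multiplies the same
   big-number multiply falls apart into many more partial products. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((__uint128_t)a * b) % m);
}

/* right-to-left binary exponentiation: square every step, multiply
   whenever the current exponent bit is set */
static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)                        /* multiply step */
            result = mulmod(result, base, mod);
        base = mulmod(base, base, mod);     /* squaring step */
        exp >>= 1;
    }
    return result;
}

int main(void)
{
    /* toy numbers only; real RSA-4096 uses a 4096-bit modulus */
    printf("%llu\n",
           (unsigned long long)powmod(65537ULL, 1234567ULL, 1000000007ULL));
    return 0;
}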

Yet those hobbyists, who are the people interested in GPU programming,
have limited time to get software running and a budget far smaller
than $10k. They're not even gonna buy as many Teslas as NASA will.
Not a dozen.

The state GPU programming is in now is that some big companies can
have a person toy fulltime with 1 GPU, as the idea of having a CPU
with hundreds of cores is of course very attractive and looks like a
realistic future, so companies must explore that future.

Of course every GPU/CPU company is serious in its aim to produce
products that perform well; none of us doubts that.

Yet it is only attractive to hobbyists, and those hobbyists are not
gonna get the interesting technical data needed to get the maximum
out of Nvidia's GPUs. This is a big problem. Those hobbyists have very
limited time to get their spare-time number-crunching products done,
so being busy fulltime writing test programs to learn everything about
1 specific GPU is not something they all like to do for a hobby. Just
having that information would attract the hobbyists, as they are
willing to take the risk of buying 1 Tesla and spending time on it.
That produces software. That software will have a certain performance,
and based upon that performance perhaps some companies might get
interested.

Intel and AMD will be doing better there, I hope.

Note that testing CUDA is also suboptimal: with the display attached,
a kernel just runs for 5 seconds or so at most before the watchdog
kills it, so you need a machine with a 2nd video card. That requires a
mainboard with at least 2x PCI-e 16x. How do you cluster that? My
cluster cards are PCI-X, not PCI-e: Quadrics QM400s.

I can get boards at 139 euro with 1 PCI-e 16x slot and build quad-core
Q6600 nodes at 500 euro, as soon as I have a job again.
My MacBook Pro 17'' has no free PCI-e 16x slot though.

So for number crunching, a cluster always wins from a single Nvidia
video card. Communication with the video card over PCI-e has too much
latency to parallelize software that is not embarrassingly parallel.

The majority of hobbyists will have a similar problem with Nvidia,
which is very sad in itself.

A good CUDA setup that can beat a simplistic cluster is not as cheap
and easy to program for as building that cluster is. The memory also
scales better in those clusters than it does on the cards. Even if 1
node can do less work than 1 GPU, it is still easier to get that
exponential speedup from having a shared cache across all nodes (this
is true for a lot of modern crunching algorithms).

With a GPU you're forced to do all calculation, including caching,
within the GPU and within the limited device RAM.
Now, in contradiction to what most people tend to believe, there
usually are methods to get away with a limited amount of RAM by using
modern overwriting methods of caching; even when that loses a factor
of 2, there are ways to overcome it. The biggest limitation is that
communication with other nodes is really hard.
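
To sketch what I mean by overwriting methods of caching: a fixed-size
table in which a new entry simply overwrites whatever already sits in
its slot. The names and sizes below are made up for illustration, this
is not code from any particular engine, but it shows why the memory
footprint stays bounded no matter how much you cache:

#include <stdint.h>
#include <stdio.h>

#define TABLE_ENTRIES (1u << 20)    /* fixed RAM budget: 2^20 slots */

typedef struct {
    uint64_t key;                   /* full key, to detect a wrong hit;
                                       key 0 doubles as "empty slot" here */
    int32_t  value;                 /* whatever result was cached */
} cache_entry;

static cache_entry table[TABLE_ENTRIES];

/* always-replace: no eviction bookkeeping, no growth, bounded memory */
static void cache_store(uint64_t key, int32_t value)
{
    cache_entry *e = &table[key & (TABLE_ENTRIES - 1)];
    e->key = key;
    e->value = value;
}

/* returns 1 on a hit; a miss just means the slot was overwritten */
static int cache_probe(uint64_t key, int32_t *value)
{
    cache_entry *e = &table[key & (TABLE_ENTRIES - 1)];
    if (e->key != key)
        return 0;
    *value = e->value;
    return 1;
}

int main(void)
{
    int32_t v;
    cache_store(0x0123456789abcdefULL, 42);
    if (cache_probe(0x0123456789abcdefULL, &v))
        printf("hit: %d\n", v);
    return 0;
}

The hits you lose to overwriting are the factor 2 mentioned above; the
footprint never grows beyond the table.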

Scaling to more nodes is just not gonna happen, of course, as the
latency between the nodes is really bad and the card adds an extra
slow hop: first from device RAM to host RAM, then from host RAM to the
network card, from the network card into the other node's RAM, and
from that RAM into its device RAM.
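
To illustrate those hops, this is roughly what a single GPU-to-GPU
transfer between two nodes looks like with plain CUDA plus MPI (my own
sketch, arbitrary buffer size, no error checking; it only shows the
staging through host RAM that every such transfer needs):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;                  /* 1M floats, arbitrary */
    float *host = (float *)malloc(n * sizeof(float));
    float *dev  = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));

    if (rank == 0) {
        /* hop 1: device RAM -> host RAM over PCI-e */
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        /* hop 2: host RAM -> network card -> other node's host RAM */
        MPI_Send(host, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(host, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* hop 3: host RAM -> device RAM over PCI-e again */
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    }

    cudaFree(dev);
    free(host);
    MPI_Finalize();
    return 0;
}

Each of those copies adds its own latency on top of the interconnect
itself.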

Make a list of the problems that most clusters here compute on and
you'll see how much the GPU concept still needs to mature before it
works well for most codes.

Software that needs low-latency interconnects you could therefore only
get to work within 1 card, provided the RAM access is not
bottlenecked. Yet all reports so far indicate it is, and the absence
of information there is just not very encouraging. And for
professional crunching work, which companies do have a big budget for,
building or buying in your own low-power co-processor, which so to
speak even fits into an iPod, is just too easy.

So in the end I guess some stupid extension of SSE will give a bigger
increase in crunching power than the in itself attractive GPGPU
hardware, the biggest limitation being the development time of
hobbyists.

Vincent

On Jun 17, 2008, at 4:01 PM, Jim Lux wrote:

> Quoting Linus Harling <linus at ussarna.se>, on Mon 16 Jun 2008  
> 04:31:56 PM PDT:
>
>> Vincent Diepeveen skrev:
>> <snip>
>>>
>>> Then instead of a $200 pci-e card, we needed to buy expensive  
>>> Tesla's
>>> for that, without getting
>>> very relevant indepth technical information on how to program for  
>>> that
>>> type of hardware.
>>>
>>> The few trying on those Tesla's, though they won't ever post this as
>>> their job is fulltime GPU programming,
>>> report so far very dissappointing numbers for applications that  
>>> really
>>> matter for our nations.
>> </snip>
>>
>> Tomography is kind of important to a lot of people:
>>
>> http://tech.slashdot.org/tech/08/05/31/1633214.shtml
>> http://www.dvhardware.net/article27538.html
>> http://fastra.ua.ac.be/en/index.html
>>
>> But of course, that was done with regular $500 cards, not Teslas.
>
>
> Mind you, if you go and get a tomographic scan today, they already  
> use fast hardware to do it.  Only researchers on limited budgets  
> tolerate taking days to reduce the data on a desktop PC. And, while  
> the concept of doing faster processing with a <10KEuro box is  
> attractive in that environment, I suspect it's a long way from  
> being commercially viable in that role.
>
> The current tomographic technology (e.g. GE Lightspeed) is pretty  
> impressive.  They slide you in, and 10-15 seconds later, there are
> 3D-rendered models and slices on the screen.  The equipment is
> pretty hassle free, the UI straightforward from what I could see, etc.
>
> And, of course, people are willing (currently) to pay many millions  
> for a machine to do this.  I suspect that the other costs of  
> running a CT scanner (both capital and operating) overwhelm the  
> cost of the computing power, so going from a $100K box to a $20K  
> box is a drop in the bucket.  When you're talking MRI, for  
> instance, there's the cost of the liquid helium for the magnets.
>
> That's a long way from a bunch of grad students racking up a bunch  
> of PCs.
>



