[Beowulf] NVIDIA GPUs, CUDA, MD5, and "hobbyists"

Wed Jun 18 16:34:59 PDT 2008

Greg Lindahl wrote:
> On Wed, Jun 18, 2008 at 10:51:04AM -0400, Prentice Bisbal wrote:
> 
>> Someone made the inaccurate statement that CUDA programming is difficult
>> and time-consuming.
> 
> One data point cannot prove that CUDA is easy. There are people out
> there claiming that FPGAs are easy to program, because they're one of
> the 7 people on the planet for whom programming an FPGA is easy.

*chuckle*.

> I've looked over CUDA and some examples, and while it's better looking
> than some of the other GPU programming languages out there, it's clear
> that it is more difficult and time-consuming than using traditional
> languages on traditional cpus.

Agreed.  Traditional languages are easier, but don't express parallelism well.

One approach is of course OpenMP: add a few pragmas to parallel-friendly loops
(ones whose iterations don't depend on each other) and you can get reasonable
speedups on SMP machines.
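
For reference, a minimal sketch of what that looks like for the STREAM "scale" loop. The function name scale_omp is made up for illustration, and the pragma only takes effect when the compiler is given an OpenMP flag (for example -fopenmp with GCC); otherwise the loop just runs serially.

```c
/* STREAM-style "scale" loop: b[j] = scalar * c[j].
 * Each iteration touches only index j, so the iterations are
 * independent and the pragma can split them across threads.
 * scale_omp is a hypothetical name, not part of STREAM. */
void scale_omp(double *b, const double *c, double scalar, long n)
{
    long j;
#pragma omp parallel for
    for (j = 0; j < n; j++)
        b[j] = scalar * c[j];
}
```

Calling scale_omp(b, c, 3.0, N) gives the same result whether or not OpenMP is enabled; only the number of threads doing the work changes.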

CUDA seems to take a different approach: instead of trying to auto-parallelize
a loop, it has you move the loop body into its own function, which is launched
once per thread, and that function must check its own exit condition.

In practice, though, CUDA seems rather similar to OpenMP.  Massimiliano Fatica
of NVIDIA did a STREAM port, and I'll quote pieces of his code below.

So instead of:
for (j=0; j<N; j++)
    b[j] = scalar*c[j];

You get:
__global__ void STREAM_Scale(float *a, float *b, float scale,  int len)
{
   int idx = threadIdx.x + blockIdx.x * blockDim.x;
   if (idx < len) b[idx] = scale* a[idx];
}
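
To make the index arithmetic concrete, here is a host-side emulation in plain C. This is not CUDA and not Fatica's code: the two nested loops stand in for blockIdx.x and threadIdx.x, which on a GPU run concurrently, and the names blocks_for and scale_emulated are hypothetical. It assumes the grid is sized by rounding len up to a whole number of blocks, which is why the idx < len test is needed.

```c
/* Hypothetical helper: number of blocks needed so that
 * blocks * block_size >= len (ceiling division). */
int blocks_for(int len, int block_size)
{
    return (len + block_size - 1) / block_size;
}

/* Serial emulation of the kernel's indexing. The outer loop plays the
 * role of blockIdx.x, the inner loop the role of threadIdx.x, and
 * block_size the role of blockDim.x. */
void scale_emulated(const float *a, float *b, float scale,
                    int len, int block_size)
{
    int num_blocks = blocks_for(len, block_size);
    for (int block = 0; block < num_blocks; block++) {
        for (int thread = 0; thread < block_size; thread++) {
            int idx = thread + block * block_size; /* same formula as the kernel */
            if (idx < len)                         /* last block may overshoot len */
                b[idx] = scale * a[idx];
        }
    }
}
```

With len = 10 and block_size = 4, three blocks give 12 "threads"; the last two compute idx = 10 and 11, fail the bounds test, and do nothing.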

Then to actually time and launch:
   times[1][k]= mysecond();
   STREAM_Scale<<<dimGrid,dimBlock>>>(d_b, d_c, scalar,  N);
   cudaThreadSynchronize();
   times[1][k]= mysecond() -  times[1][k];
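
A note on the timing: a kernel launch returns to the host before the GPU has finished, which is why the stop time is read only after cudaThreadSynchronize() (later CUDA releases deprecate that call in favor of cudaDeviceSynchronize()). The timer itself is ordinary host code. A minimal sketch in the style of STREAM's mysecond(), assuming a POSIX system with gettimeofday():

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds as a double.
 * Assumes POSIX gettimeofday(); resolution is at best microseconds. */
double mysecond(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}
```

Taking the difference of two mysecond() calls around the launch and the synchronize gives the elapsed wall-clock time for the kernel, including launch overhead.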

Instead of:
static double a[N+OFFSET];

You get:
  cudaMalloc((void**)&d_a, sizeof(float)*N);

Instead of:
     for (j=0; j<N; j++) {
         a[j] = 1.0;
         b[j] = 2.0;
         c[j] = 0.0;
     }

You get:
   set_array<<<dimGrid,dimBlock>>>(d_a, 2.f, N);
   set_array<<<dimGrid,dimBlock>>>(d_b, .5f, N);
   set_array<<<dimGrid,dimBlock>>>(d_c, .5f, N);

So yes, it's a change, but it does seem pretty reasonable.