[Beowulf] NVIDIA GPUs, CUDA, MD5, and "hobbyists"
Bill Broadley bill at cse.ucdavis.edu
Wed Jun 18 16:34:59 PDT 2008
Greg Lindahl wrote:
> On Wed, Jun 18, 2008 at 10:51:04AM -0400, Prentice Bisbal wrote:
>
>> Someone made the inaccurate statement that CUDA programming is difficult
>> and time-consuming.
>
> One data point cannot prove that CUDA is easy. There are people out
> there claiming that FPGAs are easy to program, because they're one of
> the 7 people on the planet for whom programming an FPGA is easy.
*chuckle*.
> I've looked over CUDA and some examples, and while it's better looking
> than some of the other GPU programming languages out there, it's clear
> that it is more difficult and time-consuming than using traditional
> languages on traditional cpus.
Agreed. Traditional languages are easier, but they don't express parallelism
well. One approach is of course OpenMP: add a few pragmas to parallel-friendly
loops (iterations that don't depend on each other) and you can get reasonable
speedups on SMP machines. CUDA seems to take a different approach: instead of
trying to auto-parallelize a loop, it requires a function pointer to the code,
and the function must declare its exit condition.
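For reference, the OpenMP approach described above might look like this in C for the STREAM scale loop. This is only a minimal sketch: the function name stream_scale_omp and its signature are made up for illustration and are not from the STREAM source. Built without OpenMP support, the pragma is simply ignored and the loop runs serially.

```c
#include <stddef.h>

/* Hypothetical wrapper around the STREAM scale loop: b[j] = scalar * c[j].
   Each iteration touches only index j, so the iterations are independent
   and a single pragma is enough to split them across threads. */
void stream_scale_omp(double *b, const double *c, double scalar, size_t n)
{
    long j;  /* signed loop index, accepted by all OpenMP versions */

    #pragma omp parallel for
    for (j = 0; j < (long)n; j++)
        b[j] = scalar * c[j];
}
```

With gcc this would typically be built with -fopenmp; the result is the same with or without the flag, only the speed differs.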
CUDA code still ends up looking rather similar to OpenMP. Massimiliano Fatica
of NVIDIA did a STREAM port, and I'll quote pieces of his code below.
So instead of:
for (j=0; j<N; j++)
    b[j] = scalar*c[j];
You get:
__global__ void STREAM_Scale(float *a, float *b, float scale, int len)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len) b[idx] = scale * a[idx];
}
Then to actually time and launch:
times[1][k]= mysecond();
STREAM_Scale<<<dimGrid,dimBlock>>>(d_b, d_c, scalar, N);
cudaThreadSynchronize();
times[1][k]= mysecond() - times[1][k];
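To make the index arithmetic in STREAM_Scale concrete, here is a plain C sketch that emulates the launch on the host, with two loops standing in for blockIdx.x and threadIdx.x. The names emulate_stream_scale, grid_dim and block_dim are hypothetical, and rounding the grid size up is an assumption (a common convention), since the quoted code doesn't show how dimGrid and dimBlock are set. It mainly shows why the idx < len check matters: when len isn't a multiple of the block size, the last block has surplus threads that must do nothing.

```c
/* Hypothetical host-side stand-in for launching STREAM_Scale.
   Returns the number of blocks used, so callers can inspect the rounding. */
int emulate_stream_scale(const float *a, float *b, float scale,
                         int len, int block_dim)
{
    /* Assumed convention: round up so grid_dim * block_dim >= len. */
    int grid_dim = (len + block_dim - 1) / block_dim;
    int block, thread;

    for (block = 0; block < grid_dim; block++) {          /* blockIdx.x  */
        for (thread = 0; thread < block_dim; thread++) {  /* threadIdx.x */
            int idx = thread + block * block_dim;  /* same formula as the kernel */
            if (idx < len)                         /* surplus threads do nothing */
                b[idx] = scale * a[idx];
        }
    }
    return grid_dim;
}
```

For example, len = 10 with block_dim = 4 gives 3 blocks and 12 "threads"; the last two fail the idx < len test and leave memory past the end of the array untouched.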
Instead of:
static double a[N+OFFSET];
You get:
cudaMalloc((void**)&d_a, sizeof(float)*N);
Instead of:
for (j=0; j<N; j++) {
    a[j] = 1.0;
    b[j] = 2.0;
    c[j] = 0.0;
}
You get:
set_array<<<dimGrid,dimBlock>>>(d_a, 2.f, N);
set_array<<<dimGrid,dimBlock>>>(d_b, .5f, N);
set_array<<<dimGrid,dimBlock>>>(d_c, .5f, N);
So yes, it's a change, but it does seem pretty reasonable.
