[Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters

Mark Hahn hahn at mcmaster.ca
Thu Nov 20 08:23:31 PST 2008


> [shameless plug]
>
> A project I have spent some time with is showing 117x on a 3-GPU machine over 
> a single core of a host machine (3.0 GHz Opteron 2222).  The code is 
> mpihmmer, and the GPU version of it.  See http://www.mpihmmer.org for more 
> details.  Ping me offline if you need more info.
>
> [/shameless plug]

I'm happy for you, but to me, you're stacking the deck by comparing to a 
quite old CPU.  you could break out the prices directly, but comparing 3x
GPU (modern?  sounds like pci-express at least) to a current entry-level 
cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) would be more appropriate.

at the VERY least, honesty requires comparing one GPU against all the cores
in a current CPU chip.  with your numbers, I expect that would change the 
speedup from 117x to around 15x (your 117x is over a single core; spread 
across the 8 cores of a current node, 117 / 8 ≈ 15).  still very respectable.

I apologize for not RTFcode, but does the host version of hmmer you're 
comparing with vectorize using SSE?
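for concreteness, this is the kind of SSE2 inner loop I have in mind - a
minimal, hypothetical sketch (the function and names are mine, not
mpihmmer's) of a Farrar-style striped DP-row update doing 16 cells per
instruction with saturating byte math:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* hypothetical sketch, not mpihmmer's actual code: one DP-row
       update, 16 cells per instruction, saturating unsigned-byte
       scores.  nvec = row length / 16. */
    void dp_row_sse2(const __m128i *prev, const __m128i *match,
                     __m128i *cur, int nvec)
    {
        int i;
        for (i = 0; i < nvec; i++) {
            __m128i h = _mm_adds_epu8(prev[i], match[i]); /* extend path */
            cur[i] = _mm_max_epu8(h, cur[i]);             /* keep best  */
        }
    }

a CPU baseline vectorized like that typically gains several-fold over
plain scalar code, which matters for the honesty of the comparison.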

>> or more generally: fairly small data, accessed data-parallel or with very 
>> regular and limited sharing, with high work-per-data.
>
> ... not small data.  You can stream data.

can you sustain your 117x speedup if your data is in host memory?
by small, I meant mainly the on-card GPU memory, which is fast but 
often much more limited than host memory.
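to make the question concrete, here is a minimal streaming sketch (the
kernel and names are mine, purely illustrative): even with copy/compute
overlap via two CUDA streams, throughput is capped by the PCIe link,
not by on-card memory bandwidth.

    #include <cuda_runtime.h>

    /* stand-in for the real scoring kernel */
    __global__ void score_chunk(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];
    }

    /* stream 'total' floats through the card in two ping-pong buffers.
       host_in/host_out must be pinned (cudaMallocHost) for the async
       copies to overlap; 'total' assumed a multiple of 'chunk'. */
    void stream_all(const float *host_in, float *host_out,
                    size_t total, size_t chunk)
    {
        cudaStream_t s[2];
        float *d_in[2], *d_out[2];
        int b;
        size_t off;
        for (b = 0; b < 2; b++) {
            cudaStreamCreate(&s[b]);
            cudaMalloc((void **)&d_in[b],  chunk * sizeof(float));
            cudaMalloc((void **)&d_out[b], chunk * sizeof(float));
        }
        for (off = 0, b = 0; off < total; off += chunk, b ^= 1) {
            /* buffer b copies in while buffer 1-b computes */
            cudaMemcpyAsync(d_in[b], host_in + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            score_chunk<<<(unsigned)((chunk + 255) / 256), 256, 0, s[b]>>>
                       (d_in[b], d_out[b], (int)chunk);
            cudaMemcpyAsync(host_out + off, d_out[b], chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaThreadSynchronize(); /* the CUDA 2.x-era name; later
                                    toolkits call it cudaDeviceSynchronize */
        for (b = 0; b < 2; b++) {
            cudaFree(d_in[b]);
            cudaFree(d_out[b]);
            cudaStreamDestroy(s[b]);
        }
    }

pci-express x16 tops out at roughly 3-6 GB/s in practice, so an app
with low work-per-byte will see its speedup melt away once the working
set no longer fits on the card.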

sidebar: it's interesting that ram is incredibly cheap these days,
and we typically spec a middle-of-the-road machine at 2GB/core.
even 4GB/core is not much more expensive, though to be honest,
the number of users who need that much is fairly small.

>> GP-GPU tools are currently immature, and IMO the hardware probably needs a 
>> generation of generalization before it becomes really widely used.
>
> Hrmm...  Cuda is pretty good.  Still needs some polish, but people can use 
> it, and are generating real apps from it.  We are seeing pretty wide use ... 
> I guess the issue is what one defines as "wide".

Cuda is NV-only, and forces the programmer to face a lot of limits and 
weaknesses.  at least that's what I'm told by our Cuda users - things like 
having to re-jigger code to avoid running out of registers (a sketch of 
what I mean follows below).  from my perspective, a random science prof 
is going to be fairly put off by that sort of thing unless the workload 
is essentially impossible to handle any other way.  (compared to the 
traditional cluster+MPI approach, which is portable, scalable and 
at least short-term future-proof.)
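for anyone who hasn't hit it: each resident thread gets a slice of the
multiprocessor's register file (8K registers on G80, 16K on GT200), so a
kernel needing too many registers per thread limits occupancy or fails
to launch at the requested block size.  a toy illustration (my own, not
from any real app):

    /* toy kernel: every extra live temporary costs a register per
       thread.  capping registers with nvcc's -maxrregcount=N flag
       (or, in later toolkits, a __launch_bounds__ qualifier) forces
       spills to slow local memory - that trade-off is the
       "re-jiggering" our users complain about. */
    __global__ void poly_kernel(const float *a, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        /* s0..s3 are all live at once, so each occupies its own
           register; deeper unrolling raises the count further */
        float s0 = a[i];
        float s1 = s0 * s0;
        float s2 = s1 * s0;
        float s3 = s2 * s0;
        out[i] = s0 + s1 + s2 + s3;
    }

compiling with nvcc -Xptxas -v prints the per-thread register count the
compiler settled on, which is how people usually discover the problem.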

thanks, mark.


