[Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters

Michael Brown spambox at emboss.co.nz
Thu Nov 20 11:01:31 PST 2008


Jeff Layton wrote:
>> offhand, I'd guess that adaptive grids will be substantially harder
>> to run efficiently on a GPU than a uniform grid.
>
> One key thing is that unstructured grid codes don't work as well.
> The problem is the indirect addressing.

Bingo. GPUs are still GPUs, and are still heavily optimized for coherent 
data access patterns. If the computation for cell (x, y) reads data at 
(x, y), then the computation for cell (x + 1, y) had better read data at 
(x + 1, y), or performance will suffer terribly. In C-speak:
  x += C[i][j];
is good, and
  x += C[Idx[i][j]];
is bad. Similarly bad is non-coherent branching, due to the thread grouping.
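
To make that concrete, here are two toy CUDA kernels (illustrative only, not 
from any real code) doing the same update with direct and indirect 
addressing:

  // Coherent: thread (i, j) reads C at (i, j), so adjacent threads in a
  // warp touch adjacent memory and the loads coalesce nicely.
  __global__ void update_direct(float *x, const float *C, int w, int h)
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < h && j < w)
          x[i * w + j] += C[i * w + j];
  }

  // Indirect: the extra gather through Idx scatters the loads, the hardware
  // can no longer coalesce them, and effective bandwidth collapses.
  __global__ void update_indirect(float *x, const float *C,
                                  const int *Idx, int w, int h)
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < h && j < w)
          x[i * w + j] += C[Idx[i * w + j]];
  }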

The ideal workload is one that has minimal or no branching and can be 
mapped onto a computational model where you have a 1-, 2-, or 3-dimensional 
arrangement of cells, and where the computation (including the relative 
position of any data lookups) is the same for every cell. IME, as soon as 
you depart significantly from this workload, you often start to see 
order-of-magnitude drops in performance. Additionally, the round-trip 
CPU->GPU->CPU latency is horrific (on the order of 1 ms on my 8800GTX on 
Vista, though I'm not sure about the newer cards or other OSes), so unless 
you can get a good pipeline going, bouncing computation between the CPU and 
GPU can wreck the overall performance. This also makes it very hard to scale 
out to more than one card.
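
As a rough sketch of the kind of pipelining I mean (all names hypothetical, 
no error checking, and it assumes pinned host buffers and a card/driver that 
can overlap copies with kernel execution):

  /* Cycle batches through a few CUDA streams so that upload, kernel,
   * download, and CPU post-processing overlap instead of paying the full
   * round-trip latency for every batch.  host_buf[] is assumed to be
   * allocated with cudaMallocHost(); step_kernel, fill_next_batch and
   * process_on_cpu stand in for the application's own work. */
  #define NSTREAMS 4

  void run_batches(float *host_buf[NSTREAMS], float *dev_buf[NSTREAMS],
                   size_t bytes, int nbatches, dim3 grid, dim3 block)
  {
      cudaStream_t stream[NSTREAMS];
      for (int s = 0; s < NSTREAMS; s++)
          cudaStreamCreate(&stream[s]);

      for (int b = 0; b < nbatches; b++) {
          int s = b % NSTREAMS;
          if (b >= NSTREAMS) {
              cudaStreamSynchronize(stream[s]);  /* batch b - NSTREAMS done */
              process_on_cpu(host_buf[s]);       /* CPU half of that batch  */
          }
          fill_next_batch(host_buf[s]);          /* CPU prepares batch b    */
          cudaMemcpyAsync(dev_buf[s], host_buf[s], bytes,
                          cudaMemcpyHostToDevice, stream[s]);
          step_kernel<<<grid, block, 0, stream[s]>>>(dev_buf[s]);
          cudaMemcpyAsync(host_buf[s], dev_buf[s], bytes,
                          cudaMemcpyDeviceToHost, stream[s]);
      }

      /* drain the batches still in flight, then clean up */
      for (int s = 0; s < NSTREAMS; s++) {
          cudaStreamSynchronize(stream[s]);
          process_on_cpu(host_buf[s]);
          cudaStreamDestroy(stream[s]);
      }
  }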

I've spent a fair amount of time tweaking a piece of software that at its 
core is an RKF45 adaptive integrator over a number of independent entities, 
with some other GPU-unfriendly code (very branchy, and with PRNGs). The best 
arrangement I've found for this code is to do the integration substeps on 
the GPU but all other processing on the CPU. The GPU doesn't worry about 
whether the requested substep has excessive error; it just passes the 
recommended step size back to the CPU and leaves the data untouched. The CPU 
then notices that the returned "next" step size is smaller than the step 
size it sent, and handles the situation accordingly. Subdividing steps on 
the GPU (or simply looping with smaller step sizes until the error is 
sufficiently small) is a performance loss. Additionally, since the entities 
are essentially independent, I can have multiple sets in flight at once. The 
sweet spot seems to be four sets, presumably corresponding to one being sent 
to the GPU, one being processed on the GPU, one coming back from the GPU, 
and one being processed on the CPU. The performance gain going from 1 set 
to 4 is about a factor of 2.5.
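
In outline, the host side of that accept/reject scheme looks something like 
this (hypothetical names, simplified to a single step-size loop):

  /* launch_rkf45_substep() stands in for an assumed wrapper that runs one
   * RKF45 attempt on the GPU: it returns the step size it recommends for
   * the next attempt and only commits the new state if the error estimate
   * was acceptable.  A rejected step therefore shows up on the CPU as a
   * returned step size smaller than the one that was sent. */
  void integrate(float t, float t_end, float h, void *dev_state)
  {
      while (t < t_end) {
          float h_next = launch_rkf45_substep(dev_state, t, h);
          if (h_next < h) {
              h = h_next;      /* step rejected: state untouched, retry   */
          } else {
              t += h;          /* step accepted: advance time...          */
              h = h_next;      /* ...and take the suggested step next     */
          }
      }
  }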


Cheers,
Michael 



