[Beowulf] OT? GPU accelerators for finite difference time domain

Sun Apr 1 11:22:33 PDT 2007

> Electromagnetics Research Symposium" in Verona.  There appears
> to be a considerable buzz now around FDTD calculations on GPUs.

the very latest generation of GPUs (G80 and the as-yet-unreleased R600) make
very interesting coprocessors for vector-ish calculations which can be 
expressed using integer or single-precision operations.

> Has anyone any experience of this? How do these products stack
> up against the traditional Beowulf solution?

they _are_ in the spirit of Beowulf, which is all about hacking 
commodity hardware to suit HPC purposes.

> We are planning to buy a new Beowulf in the next few months. I'm
> wondering whether I should set aside some funds for GPU instead
> of CPU...

as with any purchase, you need to figure out what your workload needs,
and how you can feed it.  GPGPU requires substantial custom programming 
effort - there is no standardized interface (like MPI) to do it.

GPGPU makes a lot of sense where you have a research project which:
 	- has a large amount of skilled programming effort available
 	(say, a top grad student for at least 6 months).
 	- is going to be seriously limited on normal hardware (i.e.,
 	runs will take 2 years each).
 	- has some promise of running well on GPU hardware (very SIMD,
 	needs to fit into limited memories, integer or 32-bit float, etc.)
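
to make the "very SIMD" criterion concrete for FDTD, here is a minimal sketch
of a 1D update step in Python.  it assumes normalized units and a fixed
coefficient of 0.5, and fdtd_step is a hypothetical helper, not code from any
product mentioned in this thread.  each cell's update reads only its nearest
neighbours in the other field, so every cell in a sweep could be computed
independently, which is the pattern GPU hardware favours.

```python
from array import array


def fdtd_step(ez, hy, coeff=0.5):
    """Advance a 1D FDTD grid by one time step, in place.

    ez, hy: equal-length arrays of field values (normalized units, assumed).
    coeff:  update coefficient (Courant number in this normalization, assumed).
    """
    n = len(ez)
    # H sweep: hy[k] reads only ez[k] and ez[k+1]; no hy value depends on
    # another hy value, so all iterations are independent of each other.
    for k in range(n - 1):
        hy[k] += coeff * (ez[k + 1] - ez[k])
    # E sweep: ez[k] reads only hy[k-1] and hy[k]; again fully independent.
    # the end cells are left untouched (fixed boundaries, for simplicity).
    for k in range(1, n - 1):
        ez[k] += coeff * (hy[k] - hy[k - 1])


def make_grid(n):
    """Return zeroed E and H arrays stored as 32-bit floats."""
    # array('f') stores single-precision values, like the GPUs discussed here,
    # although Python itself performs the arithmetic in double precision.
    return array('f', [0.0]) * n, array('f', [0.0]) * n
```

on a GPU, each loop body would become the work of one thread or fragment; the
two sweeps still have to run one after the other.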

the speedup from a GPU is around an order of magnitude (big hand wave here).
the main drawback is that the effort is probably not portable to other configs,
probably not to future hardware, and probably conflicts with development of 
portable/scalable approaches (say f90/MPI).  really, this issue is quite 
similar to the tradeoff in pursuing FPGA acceleration.

in short, I think the opportunity for GPU is great if you have a pressing
need which cannot be practically satisfied using the conventional approach,
and you're able to dedicate an intense burst of effort at porting to the GPU.

as far as I know, there are not any well-developed libraries which simply
harness whatever GPU you provide without requiring your whole program to 
be GPU-ized.  the cost of sharing data with a GPU is significant, but 
BLAS-3 might have a high enough work-to-size ratio to make it feasible.
3D FFTs might also be expressible in GPU-friendly terms (the trick would
be to exploit, not fight, the GPU's inherent memory-access preferences).
perhaps some MCMC stuff might be SIMD-able?  I doubt that sequence analysis
would make much sense, since GPUs are not well-tuned to access host memory,
and sequence programs are not actually that compute-intensive.  I'd guess 
that anything involving sparse matrices would be difficult to do on a GPU.
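
the work-to-size argument can be put in rough numbers.  the sketch below
counts floating-point operations per byte moved for a square matrix multiply
(BLAS-3) and for a vector update (BLAS-1).  it assumes single precision and
that each operand crosses the host-GPU link exactly once; the function names
are made up for illustration, and real transfer patterns will differ.

```python
def gemm_intensity(n, bytes_per_elem=4):
    """Flops per byte moved for C = A*B with n x n matrices (BLAS-3)."""
    flops = 2 * n ** 3                        # n*n outputs, n multiply-adds each
    bytes_moved = 3 * n * n * bytes_per_elem  # send A and B, fetch C (assumed)
    return flops / bytes_moved


def axpy_intensity(n, bytes_per_elem=4):
    """Flops per byte moved for y = a*x + y with length-n vectors (BLAS-1)."""
    flops = 2 * n                             # one multiply and one add per element
    bytes_moved = 3 * n * bytes_per_elem      # send x and y, fetch y (assumed)
    return flops / bytes_moved
```

under these assumptions the matrix multiply does n/6 flops per byte, so the
ratio grows with problem size, while the vector update stays at 1/6 no matter
how large n gets.  that is why BLAS-3 looks like a candidate for offload and
BLAS-1 does not.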

my organization will probably build a GPU-oriented cluster soon; I'm pushing
for it, but I'm fearful that we might not have users who are prepared to 
invest the intense effort necessary to take advantage of it.  I have some 
suspicion also that when Intel and AMD talk about greater integration between
CPU and GPU, they're headed in the direction of majorly extended SSE, rather
than something which still has parts called shader, vertex or texture.