Could those of you who have used or researched the general-purpose
use of GPUs (vendors, buyers, builders) to augment the data-parallel
compute power of your clusters add to, subtract from, and/or comment
on the following summary of the current options in this area?  What have
I failed to realize?  What other vendors are out there?  How difficult
are the programming environments to use?  What performance gains have
you observed?  Do you forecast Cell-based COTS-like clusters?  Are there
interface issues with MPI?  Etc.

Thanks in advance ...


GPGPU compute space options micro-summary:

Option 1:

   Purchase high-performance graphics cards (GeForce, Radeon)
   for ~$400, drop them into your PCI-X slot (PCI-e soon to be
   available), learn some Cg programming, and you're ready to get
   10s of additional Gflops per node if you have stream-able kernels.
   You are limited to 32-bit floating point (and maybe non-IEEE).
   You are also limited by the input/output bandwidth asymmetry of the
   graphics card and by its rigid compute pipeline, which has limited
   conditional capability and programmability.
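
   To make the 32-bit limitation concrete, here is a minimal sketch
   (plain Python, nothing GPU-specific) that emulates single-precision
   accumulation on the host.  The helper names to_float32, sum_float32
   and sum_float64 are made up for illustration, and the emulation
   assumes IEEE round-to-nearest, which a given graphics card may not
   follow exactly.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_float32(values):
    """Accumulate with a rounding to 32-bit precision after every add."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total

def sum_float64(values):
    """Accumulate in ordinary 64-bit precision for comparison."""
    total = 0.0
    for v in values:
        total += v
    return total

# A 32-bit float has a 24-bit significand, so above 2**24 the spacing
# between representable values is 2 and adding 1.0 is lost to rounding.
values = [2.0 ** 24, 1.0, 1.0, 1.0, 1.0]

print("32-bit sum:", sum_float32(values))
print("64-bit sum:", sum_float64(values))
```

   The 32-bit sum stays at 2**24 while the 64-bit sum picks up all four
   increments, which is the kind of drift to watch for in long
   reductions on a single-precision card.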

Option 2:

  Purchase ClearSpeed array-processing cards and software for your
  cluster (much more expensive, how much?) to get ~50 Gflops of additional
  compute power on stream-able kernels.  The programming environment is
  better (is it?), and you get full IEEE 64-bit floating point.  Do you
  have the same bandwidth asymmetry issues?
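
  For any add-in card, one rough way to reason about the bandwidth
  asymmetry question is a back-of-the-envelope model: offloading only
  pays off if the time saved on arithmetic exceeds the time spent moving
  data to and from the card.  The sketch below is only that, a sketch.
  The function offload_speedup and every number fed to it are
  hypothetical placeholders, not measurements of any product mentioned
  here, and the model ignores latency and any overlap of transfer with
  compute.

```python
def offload_speedup(flops, bytes_in, bytes_out,
                    host_gflops, card_gflops,
                    upload_gbs, download_gbs):
    """Estimated speedup of running a kernel on the card vs. the host.

    Transfers are modeled as serialized with compute; upload and
    download rates are separate because they may be asymmetric.
    """
    host_time = flops / (host_gflops * 1e9)
    card_time = (flops / (card_gflops * 1e9)
                 + bytes_in / (upload_gbs * 1e9)
                 + bytes_out / (download_gbs * 1e9))
    return host_time / card_time

# Hypothetical kernel: 1e9 flops on 4e8 bytes in, 4e8 bytes out.
# All rates below are assumed values chosen only to show the effect.
symmetric = offload_speedup(1e9, 4e8, 4e8,
                            host_gflops=4, card_gflops=40,
                            upload_gbs=2.0, download_gbs=2.0)
asymmetric = offload_speedup(1e9, 4e8, 4e8,
                             host_gflops=4, card_gflops=40,
                             upload_gbs=2.0, download_gbs=0.2)

print("speedup with symmetric links: %.2f" % symmetric)
print("speedup with slow readback:   %.2f" % asymmetric)
```

  A result below 1.0 means the card is slower than the host once
  transfers are counted, so kernels with little arithmetic per byte
  moved are the ones hurt most by a slow readback path.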

Option 3:

  Your budget is big, you are interested in the Cell processor from
  IBM, and you want a complete package.  You call up Mercury Computer
  Systems, Inc. and buy their 16 Tflop, 7-blade rack of dual-Cell boards,
  with high-performance libraries and presumably even better programming
  tools.  You get great IEEE 32-bit performance, not-bad 64-bit capability,
  and support.  Has anybody used or benchmarked this system?


