[Beowulf] Options for augmenting cluster vector/data-parallel computing power ...

Tue Jun 13 15:41:16 PDT 2006

Richard Walsh wrote:

> All,
>
> Could those of you who have perhaps used or researched the general
> purpose use of GPUs (vendors, buyers, builders) to augment the data-
> parallel compute power of your clusters add, subtract, and/or comment
> on the following summary of the current options in this area?  What have
> I failed to realize?  What other vendors are out there?  How difficult
> are the programming environments to use?  What performance gains have you
> observed?  Do you forecast Cell-based COTS-like clusters?  Interface
> issues with MPI? Etc.
>
> Thanks in advance ...
>
> rbw
>
> GPGPU compute space options micro-summary:
>
> Option 1:
>
>   Purchase high-performance graphics cards (Geforce, Radeon)
>   for ~$400, drop them into your PCI-X slot (PCI-e soon to be
>   available), learn some Cg programming, and you're ready to get
>   10s of additional Gflops per node if you have stream-able kernels.
>   You are limited to 32-bit floating-point (and maybe non-IEEE).
>   Also limited by the input/output bandwidth asymmetry of the
>   graphics card and its rigid compute pipeline with limited conditional
>   capability and programmability.
>
> Option 2:
>
>  Purchase ClearSpeed Array processing cards and software for your
>  cluster (much more expensive, how much?) to get ~50 Gflops of additional
>  compute power on stream-able kernels. The programming environment is
>  presumably better (is it?), and you get full IEEE 64-bit floating point.
>  Do you have the same bandwidth asymmetry issues?

Well I work for Clearspeed, so I can give some factual information on 
our product, but I will refrain from any hard cell (sic).

Each board has 1GB of memory and 2 CPUs. Each CPU has a serial unit and 
96 SIMD parallel units (PEs).
Each PE has both 64-bit IEEE FP add and multiply units and because of 
VLIW can issue a fused Muladd at a rate of one every clock tick.
The CPU clocks at 250MHz - this keeps the power consumption down to only 
25W per dual-cpu board.
So theoretical peak performance is 96GF per board (2 CPUs x 96 PEs x 
2 flops per clock x 250MHz), but for marketing reasons we quote the more 
realistic 50GF that you might expect to see from a real (albeit well 
tuned) app.
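
As a quick check of that arithmetic, here is a small Python sketch. The function name is made up for illustration, and counting each fused Muladd as 2 flops is an assumption about how the peak figure is derived; the board numbers are the ones given above.

```python
def peak_gflops(cpus_per_board, pes_per_cpu, flops_per_pe_per_clock, clock_hz):
    """Theoretical peak in GFLOPS, assuming every PE issues on every clock."""
    return cpus_per_board * pes_per_cpu * flops_per_pe_per_clock * clock_hz / 1e9

# Figures from the text: 2 CPUs per board, 96 PEs per CPU, 250 MHz clock.
# Assumption: one fused multiply-add per clock is counted as 2 flops.
board_peak = peak_gflops(2, 96, 2, 250e6)
print(board_peak)  # 96.0

# The quoted 50GF figure expressed as a fraction of that peak,
# and both figures per watt for a 25W board.
quoted_gflops = 50.0
board_watts = 25.0
print(quoted_gflops / board_peak)   # fraction of peak a tuned app might reach
print(board_peak / board_watts)     # peak GFLOPS per watt
print(quoted_gflops / board_watts)  # quoted GFLOPS per watt
```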

If you just do DGEMM or say 2D FFTs, then the libraries are already 
there - just change your LD_LIBRARY_PATH and it will intercept those 
ACML and FFTW calls. If what you do is different - then there is a C 
compiler for the board. Standard C - just with a prefix of 'poly' before 
any variable that you want to have 96 instances rather than just one, 
and of course parallel implementations of sin(), sqrt() et al.
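
To give a feel for what "96 instances rather than just one" means, here is a rough emulation in plain Python. This is not the board's C dialect and says nothing about its real syntax or library; the helper names poly() and poly_sqrt() are hypothetical and only model the idea of one private copy per PE.

```python
import math

NUM_PES = 96  # PEs per CPU, from the text


def poly(init):
    """Emulate a 'poly' variable: one private value per PE (hypothetical helper)."""
    return [init(pe) for pe in range(NUM_PES)]


def poly_sqrt(values):
    """Emulate a parallel sqrt(): conceptually every PE computes its own result."""
    return [math.sqrt(v) for v in values]


scale = 2.0                          # ordinary variable: a single instance
x = poly(lambda pe: float(pe * pe))  # 'poly' variable: 96 instances, one per PE
y = [scale * r for r in poly_sqrt(x)]

print(len(y), y[3])  # 96 6.0
```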

In your host application - you just use the provided API to initialise 
one or more boards, load a binary onto it, send data across and launch 
your preloaded binary, then poll to read the results back as they are 
generated.
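
The shape of that host-side flow can be sketched as follows. This is a mock in Python, not the real API: MockBoard and all of its methods are invented names, the "binary" is just a Python function, and the mock computes its results eagerly where a real board would run asynchronously.

```python
class MockBoard:
    """Stand-in for an accelerator board; every name here is hypothetical."""

    def __init__(self):
        self.kernel = None
        self.data = []
        self.pending = []  # results generated but not yet read back

    def load(self, kernel):
        """Load the 'binary' (here simply a Python callable)."""
        self.kernel = kernel

    def send(self, data):
        """Copy input data across to the board's memory."""
        self.data = list(data)

    def launch(self):
        """Start the preloaded kernel; the mock finishes immediately."""
        self.pending = [self.kernel(x) for x in self.data]

    def poll(self):
        """Return whatever results are ready so far (possibly none)."""
        ready, self.pending = self.pending, []
        return ready


def run_on_board(kernel, data):
    board = MockBoard()   # 1. initialise a board
    board.load(kernel)    # 2. load a binary onto it
    board.send(data)      # 3. send data across
    board.launch()        # 4. launch the preloaded binary
    results = []
    while len(results) < len(data):   # 5. poll until all results are back
        results.extend(board.poll())
    return results


print(run_on_board(lambda x: x * x, [1.0, 2.0, 3.0]))  # [1.0, 4.0, 9.0]
```

With a real asynchronous device the polling loop would normally do other host work, or sleep briefly, between polls rather than spin.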

Reads and writes to the board's memory should run close to whatever your 
PCI-X/PCIe chipset can sustain.

Oh yes, and the list price is $8000 - yes, a bit more than your GeForce.

Dr. Daniel Kidger, Technical Consultant, Clearspeed plc, Bristol UK
E: dan.kidger at clearspeed.com
T: +44 117 317 2030
M: +44 7738 458742
"Write a wise saying and your name will live forever." - Anonymous.