[Beowulf] Vector coprocessors

Daniel Pfenniger daniel.pfenniger at obs.unige.ch
Thu Mar 16 00:04:32 PST 2006

The shipment of this accelerator card has been delayed many times; I last
asked in October 2005. Apparently the first shipment was made this
month, for a Japanese supercomputer with 10^4 Opterons. The price is not
published, but a price somewhere above $8000 per card would put it outside
commodity hardware. I wouldn't be surprised if more performance could
be obtained in most applications with commodity clustering.

If Clearspeed considered mass production at a price of around $100 to $500
per card, the market would be huge, because the card would be competing with
multi-core processors like the IBM-Sony Cell.

The most interesting niche for the Clearspeed cards seems to me to be
accelerating proprietary applications like Matlab, Mathematica and particularly
Excel, which run on a single PC and can hardly be reprogrammed by their
users to run on a distributed cluster.


Bill Broadley wrote:
> I noticed a few news reports on Intel/AMD considering the Clearspeed
> co-processor.
> Looks like a fairly interesting widget, here's an Intel/Clearspeed paper
> that describes it:
> http://www.clearspeed.com/downloads/Intel%20Math%20Kernel%20whitepaper.pdf
> Some interesting snippets on the Clearspeed advance board:
> * 192 pipelines, 2 flops per clock (not fused), 250 MHz, peak 96GFlops
>   (I believe this is for 2 chips)
> * 50 GFlops sustained with the DGEMM kernel
> * 1 GB of ram per board.
> * 128 registers per PE, register file allows 3 reads 2 writes per clock
> * 1.44 MB of SRAM that can deliver one word per FP op per clock.
> * 800MB/sec over pci-x, enough for 50 GFlops on DGEMM.
> * Less than 10 watts while sustaining 25 GFlops
> * 1-D complex FFTs of 1024 elements @ 400k per second (20 GFlops with 32-bit),
>   but only 1/4th of that streaming because of pci-x bottlenecks.
> * 12 GFlops when running 2-d FFTs (512x512 single precision) that are
>   resident on board (in the 1GB)
> In any case it looks like an interesting development.
> Speaking of which, what is the double precision peak rate of today's p4 
> and opteron?  One 128 bit SSE operation every other cycle (so 1 64 bit
> flop per cycle)?  I believe Intel mentioned doubling this rate at IDF
> (shipping sometime in the 2nd half of this year).
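
As a rough sanity check of the figures quoted above, the arithmetic can be
sketched in a few lines of Python. The function names are made up for
illustration, and the FFT estimate assumes the usual 5*N*log2(N) flop-count
convention for a complex transform of length N, which the quoted paper may or
may not use.

```python
import math

def peak_gflops(pipelines, flops_per_clock, clock_mhz):
    """Peak rate = pipelines x flops per clock x clock frequency."""
    return pipelines * flops_per_clock * clock_mhz / 1000.0

def fft_gflops(n, transforms_per_sec):
    """Rate for 1-D complex FFTs, assuming 5*N*log2(N) flops per transform
    (a common counting convention, not taken from the quoted paper)."""
    flops_per_fft = 5 * n * math.log2(n)
    return flops_per_fft * transforms_per_sec / 1e9

# Numbers taken from the quoted list.
board_peak = peak_gflops(192, 2, 250)     # 192 pipelines, 2 flops/clock, 250 MHz
dgemm_fraction = 50 / board_peak          # 50 GFlops sustained on DGEMM
fft_rate = fft_gflops(1024, 400_000)      # 1024-point FFTs at 400k per second
flops_per_byte = 50e9 / 800e6             # flops needed per byte moved over PCI-X

print(f"board peak:           {board_peak:.1f} GFlops")
print(f"DGEMM fraction:       {dgemm_fraction:.0%} of peak")
print(f"1-D FFT rate:         {fft_rate:.2f} GFlops")
print(f"DGEMM data reuse:     {flops_per_byte:.1f} flops per PCI-X byte")
```

Under these assumptions the 96 GFlops peak and the roughly 20 GFlops FFT
figure are both consistent with the quoted numbers, and sustaining 50 GFlops
over an 800 MB/sec link requires on the order of 60 flops per byte
transferred, which only a kernel with heavy data reuse such as DGEMM can
provide.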
