[Beowulf] Vector coprocessors

Thu Mar 16 07:15:17 PST 2006

Daniel Pfenniger wrote:
> The shipment of this accelerator card has been delayed many times. The
> last time I asked was October 2005.  Apparently the first shipment was
> made this month, for a Japanese supercomputer with 10^4 Opterons.  The
> cost is not indicated, but something above $8000 per card would put it
> outside commodity hardware.  I wouldn't be astonished if more performance
> could be obtained in most applications with commodity clustering.
>

Commodity clustering isn't going to give you 50 GFlops for 25 watts.  They
are providing canned libraries to use the card, like BLAS.  If all you
had to do was relink your program and it worked, that would be
significantly cheaper than porting your code to MPI.  Well, unless you
happen to have access to a lot of grad students.
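
To put rough numbers on the power argument, here is a minimal sketch in
Python.  The board figures are the ones quoted in this thread; the cluster
node figures (NODE_GFLOPS, NODE_WATTS) are purely hypothetical placeholders,
not measurements, so substitute your own.

```python
import math

def gflops_per_watt(sustained_gflops, watts):
    """Sustained GFlops delivered per watt of power drawn."""
    return sustained_gflops / watts

def nodes_to_match(target_gflops, node_gflops):
    """Whole nodes needed to reach target_gflops, ignoring parallel overhead."""
    return math.ceil(target_gflops / node_gflops)

# Figures quoted in this thread for the Clearspeed board.
board_efficiency = gflops_per_watt(50.0, 25.0)   # 2.0 GFlops per watt
chip_efficiency = gflops_per_watt(25.0, 10.0)    # 2.5, a lower bound ("less than 10 watts")

# HYPOTHETICAL commodity node: both values are assumptions for illustration.
NODE_GFLOPS = 4.0    # assumed sustained DGEMM rate per node
NODE_WATTS = 200.0   # assumed power draw per node

nodes = nodes_to_match(50.0, NODE_GFLOPS)
cluster_watts = nodes * NODE_WATTS
cluster_efficiency = gflops_per_watt(nodes * NODE_GFLOPS, cluster_watts)

print(f"board:   {board_efficiency:.2f} GFlops/W")
print(f"cluster: {nodes} nodes, {cluster_watts:.0f} W, "
      f"{cluster_efficiency:.3f} GFlops/W")
```

With those assumed node figures the cluster needs two orders of magnitude
more power for the same sustained rate; the conclusion obviously depends on
what you plug in for the node.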

> If Clearspeed would consider mass production with a cost like $100-$500
> per card the market would be huge, because the card would be competing with
> multi-core processors like the IBM-Sony Cell.
> 

Do you have pricing on the Cell blades or co-processor boards?  I doubt
they will be $8k, but I doubt they will be $100-$500.

Craig

> Possibly the most interesting niche for the Clearspeed cards appears to me
> to be accelerating proprietary applications like Matlab, Mathematica, and
> particularly Excel, which run on a single PC and can hardly be
> reprogrammed by their users to run on a distributed cluster.
> 
> Dan
> 
> 
> Bill Broadley wrote:
>> I noticed a few news reports on Intel/AMD considering the Clearspeed
>> co-processor.
>>
>> Looks like a fairly interesting widget, here's an Intel/Clearspeed paper
>> that describes it:
>> http://www.clearspeed.com/downloads/Intel%20Math%20Kernel%20whitepaper.pdf 
>>
>>
>> Some interesting snippets on the Clearspeed Advance board:
>> * 192 pipelines, 2 flops per clock (not fused), 250 MHz, peak 96 GFlops
>>   (I believe this is for 2 chips)
>> * 50 GFlops sustained with the DGEMM kernel
>> * 1 GB of RAM per board.
>> * 128 registers per PE, register file allows 3 reads 2 writes per clock
>> * 1.44 MB of SRAM that can deliver one word per FP op per clock.
>> * 800 MB/sec over PCI-X, enough for 50 GFlops on DGEMM.
>> * Less than 10 watts while sustaining 25 GFlops
>> * 1-D complex FFTs of 1024 elements @ 400k per second (20 GFlops with
>>   32-bit), but only 1/4th of that streaming because of PCI-X bottlenecks.
>> * 12 GFlops when running 2-D FFTs (512x512 single precision) that are
>>   resident on board (in the 1 GB)
>>
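
The figures in the list above can be cross-checked with a little
arithmetic.  The sketch below is in Python and rests on assumptions that
are not in the whitepaper summary: the conventional 5*N*log2(N) flop count
for a complex FFT, 8 bytes per single-precision complex element, one pass
of the FFT data over the bus, and a square n x n DGEMM that moves four
matrices (A, B, C in, C out) over PCI-X.

```python
import math

def peak_gflops(pipelines, flops_per_clock, clock_hz):
    """Peak rate: every pipeline issues flops_per_clock on every cycle."""
    return pipelines * flops_per_clock * clock_hz / 1e9

def fft_gflops(n, ffts_per_second):
    """Rate implied by the usual 5*N*log2(N) estimate for a complex FFT."""
    return 5 * n * math.log2(n) * ffts_per_second / 1e9

def streamed_ffts_per_second(n, bytes_per_element, bus_bytes_per_second):
    """FFTs per second the bus can feed if each one crosses the bus once."""
    return bus_bytes_per_second / (n * bytes_per_element)

def min_dgemm_size(gflops, bus_bytes_per_second, matrices_moved=4, word_bytes=8):
    """Smallest n for which an n x n DGEMM is compute bound, not bus bound.

    Compute time 2*n^3 / F must be at least the transfer time
    matrices_moved * n^2 * word_bytes / B, which gives
    n >= matrices_moved * word_bytes * F / (2 * B).
    """
    flops_per_second = gflops * 1e9
    return matrices_moved * word_bytes * flops_per_second / (2 * bus_bytes_per_second)

PCI_X = 800e6  # bytes per second, from the list above

peak = peak_gflops(192, 2, 250e6)                      # 96 GFlops
fft_rate = fft_gflops(1024, 400e3)                     # about 20.5 GFlops
streamed = streamed_ffts_per_second(1024, 8, PCI_X)    # about 97,656 per second
streamed_fraction = streamed / 400e3                   # about 0.24
n_min = min_dgemm_size(50, PCI_X)                      # about 1000

print(peak, fft_rate, streamed, streamed_fraction, n_min)
```

Under those assumptions the quoted numbers hang together: the peak follows
directly from pipelines x flops x clock, 400k FFTs per second is about
20 GFlops, the bus can feed roughly a quarter of that rate, and DGEMM
stays compute bound at 50 GFlops once the matrices are about 1000 x 1000
or larger.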
>> In any case it looks like an interesting development.
>>
>> Speaking of which, what is the double precision peak rate of today's
>> P4 and Opteron?  One 128 bit SSE operation every other cycle (so 1 64 bit
>> flop per cycle)?  I believe Intel mentioned doubling this rate at IDF
>> (shipping sometime in the 2nd half of this year).
>>
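
For scale, the host-CPU question above reduces to clock rate times flops
per cycle.  A minimal sketch in Python, taking the post's guess of one
double-precision flop per cycle at face value (it is posed as a question,
not a confirmed figure) and using a hypothetical 2.4 GHz clock:

```python
def peak_dp_gflops(clock_ghz, dp_flops_per_cycle):
    """Peak double-precision GFlops for one core."""
    return clock_ghz * dp_flops_per_cycle

CLOCK_GHZ = 2.4  # HYPOTHETICAL clock rate, chosen only for illustration

guessed = peak_dp_gflops(CLOCK_GHZ, 1)   # the post's guess: 1 flop per cycle
doubled = peak_dp_gflops(CLOCK_GHZ, 2)   # if that rate were doubled

BOARD_PEAK_GFLOPS = 96.0                 # board peak quoted above
board_vs_host = BOARD_PEAK_GFLOPS / guessed

print(f"host peak (guess): {guessed:.1f} GFlops, doubled: {doubled:.1f} GFlops")
print(f"board peak is about {board_vs_host:.0f}x the guessed host peak")
```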