[Beowulf] Vector coprocessors

Thu Mar 16 07:11:26 PST 2006

Jim Lux wrote:
> At 12:04 AM 3/16/2006, Daniel Pfenniger wrote:
> 
>> The shipment of this accelerator card has been delayed many times.
>> Last time I asked was October 2005.  Apparently the first shipment has
>> been made this month for a Japanese supercomputer with 10^4 Opterons.
>> The cost is not indicated, but something like above $8000.- per card
>> would put it outside commodity hardware.  I wouldn't be astonished that
>> more performance can be obtained in most applications with commodity
>> clustering.

I think under $10k keeps it commodity (read: what most managers could 
likely sign for themselves without needing to walk the approval ladder).

> There are probably applications where a dedicated card can blow the 
> doors off a collection of PCs.  At some point, the interprocessor 
> communication latency inherent in any sort of cabling between processors 
> would start to dominate.

There are numerous such examples in life sciences, in chemistry, and 
other areas.  Such cards are not universal; they cannot be viewed as 
general-purpose processors.  You have to view them as dedicated attached 
processors.

Each Clearspeed card carries 2 of their co-processors, and each 
co-processor has 96 FP units.  I believe the architecture is a systolic 
array.  To program them at a high level, you can use a C variant, or you 
can hand-code assembly.  The latter is hard.

The issue for these cards is the memory bandwidth in and out of the 
PCI-X based interface.  There are tricks you can play in a well 
designed system, but you cannot escape the bandwidth ceiling of PCI-X. 
For many algorithms of potential interest to this list, memory bandwidth 
is as important as FP performance.  Having effectively 100 processors on 
the far side of a narrow pipe means you have to design algorithms with 
that pipe width in mind.
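
To put rough numbers on the pipe-width point, here is a back-of-envelope 
sketch in Python.  The bus figure is the theoretical peak of 64-bit, 
133 MHz PCI-X; the 50 GFLOP/s card peak is purely an assumed round 
number for illustration, not a vendor or measured figure, and the 
function names are mine.

```python
# Back-of-envelope check of the "narrow pipe" argument.
# All numbers are illustrative assumptions, not measured or vendor figures.

PCIX_BUS_WIDTH_BYTES = 8          # 64-bit PCI-X
PCIX_CLOCK_HZ = 133e6             # 133 MHz PCI-X
ASSUMED_CARD_PEAK_FLOPS = 50e9    # hypothetical peak for a two-chip card


def bus_peak_bytes_per_s(width_bytes=PCIX_BUS_WIDTH_BYTES,
                         clock_hz=PCIX_CLOCK_HZ):
    """Theoretical peak bus bandwidth; real transfers achieve less."""
    return width_bytes * clock_hz


def min_flops_per_byte(card_flops=ASSUMED_CARD_PEAK_FLOPS,
                       bus_bytes_per_s=None):
    """Flops needed per byte moved so that the bus is not the bottleneck."""
    if bus_bytes_per_s is None:
        bus_bytes_per_s = bus_peak_bytes_per_s()
    return card_flops / bus_bytes_per_s


def attainable_flops(flops_per_byte, card_flops=ASSUMED_CARD_PEAK_FLOPS,
                     bus_bytes_per_s=None):
    """Roofline-style bound: min(compute peak, intensity * bandwidth)."""
    if bus_bytes_per_s is None:
        bus_bytes_per_s = bus_peak_bytes_per_s()
    return min(card_flops, flops_per_byte * bus_bytes_per_s)
```

Under these assumptions an algorithm needs roughly 47 flops for every 
byte it moves across the bus before the card, rather than the bus, 
becomes the limit.  Anything that streams data through once with a 
handful of flops per byte stays bus bound no matter how many FP units 
sit on the far side.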

>> If Clearspeed would consider mass production with a cost like
>> $100.-$500.- per card the market would be huge, because the card would
>> be competing with multi-core processors like the IBM-Sony Cell.

Kahan had some interesting things to say about the Cell, which I would 
summarize like this:  with Cell you get to choose one, fast or accurate. 
He was making this point in general, but pointed out some issues.  This 
is from a talk on his web site.  Caveat:  I don't have a Cell to play 
with (yes Santa, I would like 1 or 2 hundred), so I can't run paranoia 
or other fun tests.

> You need "really big" volumes to get there. Retail pricing of $200 
> implies a bill of materials cost down in the sub $20 range. 

Yup.  Volume drives lower pricing.  Economies of scale matter.  This is 
why FPGAs are where they are price-wise.  They don't have large volumes. 
If they did, pricing should be better.

> Considering 
> that a run of the mill ASIC spin costs >$1M (for a small number of parts 
> produced), your volume has to be several hundred thousand (or a million) 
> before you even cover the cost of your development.
> 
> The video card folks can do this because
> a) each successive generation of cards is derived from the past, so the 
> NRE is lower.. most of the card (and IC) is the same

I believe they are in incremental improvement mode.  This keeps redesign 
costs way down.

> b) they have truly gargantuan volumes

This is the critical thing.  Remember, these are highly pipelined 
graphical supercomputers.  The ClawHMMer project ran a 
hardware-accelerated HMMer on an nVidia 6800 GT 5x faster than on the P4 
hosting the card.

> c) they have sales from existing products to provide cash to support the 
> development of version N+1.

Cash is king.
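
Jim's break-even argument above can be sketched in a few lines of 
Python.  The $1M NRE is his ">$1M" figure taken at face value; the 
per-unit margins are hypothetical values I picked for illustration, 
since the thread gives none.

```python
import math


def break_even_units(nre_dollars, margin_per_unit):
    """Units that must ship before per-unit margin repays the NRE."""
    if margin_per_unit <= 0:
        raise ValueError("margin per unit must be positive")
    return math.ceil(nre_dollars / margin_per_unit)


NRE = 1_000_000  # the ">$1M" ASIC spin from the thread, taken as exactly $1M

# Hypothetical margins left per card for paying back development.
for margin in (2, 5, 10):
    print(f"${margin}/unit -> {break_even_units(NRE, margin):,} units")
```

With only a few dollars per card left over for development, the volume 
lands in the several hundred thousand range Jim mentions, and that is 
before counting any respins.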

> {I leave aside the possibility of magic elves, although with some 
> consumer products, I have no idea how they can design, produce, and sell 
> it at the price they do.  Making use of relative currency values can 
> also help, but that's in the non-technological magic elf category, as 
> far as I'm concerned.}

Actually lots of stuff is done outside the US these days.  Not magic 
elves per se, but Indian and Chinese engineers and scientists who are 
extremely good at what they do.  This starts getting into a cost and 
productivity discussion rather rapidly.

>> The possibly most interesting niche for the Clearspeed cards appears
>> to me accelerating proprietary applications like Matlab, Mathematica
>> and particularly Excel that run on a single PC and that can hardly be
>> reprogrammed by their users to run on a distributed cluster.
> 
> I would say that there is more potential for a clever soul to reprogram 
> the guts of Matlab, etc., to transparently share the work across 
> multiple machines.  I think that's in the back of the mind of MS, as 
> they move toward a services environment and .NET

:)

So imagine if you will an LD_PRELOAD environment variable which 
points a user's code over to the relevant libraries, which work their 
magic behind the scenes.  I would be hard-pressed to imagine using this 
for Excel, but could see it for Matlab.  Programming at high levels with 
high performance.  Of course Kahan also rips into them over accuracy ...
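
A minimal sketch of the LD_PRELOAD idea, as a Python launcher around an 
unmodified program.  The library path is hypothetical (no such library 
is named in this thread), and this only applies to dynamically linked 
programs on systems whose loader honors LD_PRELOAD, such as Linux with 
glibc.

```python
import os
import subprocess

# Hypothetical path: no such library is named in this thread.
ACCEL_LIB = "/opt/accel/lib/libaccel_blas.so"


def preload_env(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep anything already preloaded, but put our library ahead of it.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def run_accelerated(cmd, lib_path=ACCEL_LIB):
    """Run cmd unchanged, with lib_path loaded before its other libraries."""
    return subprocess.run(cmd, env=preload_env(lib_path), check=False)
```

The application itself is not touched.  Whether this buys anything 
depends on the application calling its math routines through symbols 
that the preloaded library can override.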

> 
> Jim
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615