[Beowulf] Vector coprocessors AND CILK

Tue Mar 21 19:18:54 PST 2006

----- Original Message ----- 
From: "Daniel Pfenniger" <daniel.pfenniger at obs.unige.ch>
To: "Jim Lux" <James.P.Lux at jpl.nasa.gov>
Cc: <beowulf at beowulf.org>
Sent: Thursday, March 16, 2006 6:32 PM
Subject: Re: [Beowulf] Vector coprocessors

>
>
> Jim Lux wrote:
> ...
>> There are probably applications where a dedicated card can blow the doors 
>> off a collection of PCs.  At some point, the interprocessor communication 
>> latency inherent in any sort of cabling between processors would start to 
>> dominate.
>
> As usual it depends on the applications. Vector computations
> are not universal, even if frequent in technical problems.
> In the favorable cases it is not rare to have say over 10% serial
> code that does not benefit from the card.  In the end the card, despite
> its 192 procs, may just accelerate typical applications by a factor of
> a few.
>
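As a rough sanity check of the "factor of a few" estimate quoted above,
Amdahl's law can be sketched as follows. This assumes exactly 10% serial code
and perfect scaling of the remaining 90% over the card's 192 processors; both
are illustrative assumptions, not measurements, and the function name is made
up for this sketch.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Upper bound on speedup when only the parallel part scales."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Assumed inputs: 10% serial code (from the quoted text), 192 processors.
speedup = amdahl_speedup(0.10, 192)
print(f"speedup bound with 192 procs: {speedup:.2f}x")
```

With 10% or more serial code the bound stays below 10x no matter how many
processors the card has.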
>>> If Clearspeed would consider mass production with a cost like 
>>> $100.-$500.-

If you produce such cards in low quantity, you lose roughly 100 dollars per
PCI card, basically to royalties, and then you add the chip production price.
Two big chips; well, I do not know what they cost. Sounds expensive to me.
I once discussed a single big chip for another card.

That chip had a price, when mass produced, of 50 dollars a chip.

So I estimate the bare production price of this card at around 250 dollars.
You don't want to lose big time on such a card, of course.

That means an importer price of 500 dollars and a consumer price of at least
1000 dollars.

Of course, with this type of card you skip the importer.

According to my economics book, a company can then follow two approaches.
You can try to flood the market and sell 50 million of them, which means that
the card will be priced at 1000 dollars.

Or you can be realistic and accept that even a lower price for the card will
not increase sales by more than a factor of 2.

In short, the highest price you can reasonably ask is the most interesting,
because there are plenty of universities that want a few cards to toy with.
They pay 8000+ dollars, and that's really a minimum for those guys, because
once they start toying they'll send you 100 questions, after which the card
gathers dust and stays unused.

If you're serious and you want to buy 200 of their cards, then you're a big
customer. Propose a secret deal to them: you don't publicly reveal the price
paid, and you sign that for the first 3 years you won't resell their cards,
lend them, or rent them out to other persons. Under that condition you offer
$200k for 200 cards.

After some give and take, you end up paying $2000 a card.

Not bad: a 10 Tflop double precision cluster for roughly $500k in that case.
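The arithmetic behind that figure, as a small sketch. The card count, the
per-card price, the 10 Tflop total and the roughly $500k budget come from the
text above; treating the gap between the card cost and $500k as the budget
for host machines and network is an assumption made here for illustration.

```python
CARDS = 200
PRICE_PER_CARD = 2_000      # dollars, after negotiation
CLUSTER_BUDGET = 500_000    # dollars, "roughly $500k"
CLUSTER_TFLOPS = 10.0       # double precision, as stated above

card_cost = CARDS * PRICE_PER_CARD
# Assumption: whatever is left pays for hosts, network, and so on.
remainder = CLUSTER_BUDGET - card_cost
# Per-card performance implied by 10 Tflop spread over 200 cards.
gflops_per_card = CLUSTER_TFLOPS * 1000 / CARDS
dollars_per_gflop = CLUSTER_BUDGET / (CLUSTER_TFLOPS * 1000)

print(f"cards: ${card_cost}, remainder: ${remainder}")
print(f"implied {gflops_per_card:.0f} Gflop/card, ${dollars_per_gflop:.0f}/Gflop")
```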

Note that if you build a cluster from such cards, the bandwidth of your PCI-X
bus will limit communication with other nodes anyway. The only difference now
is that the limitation applies from every point to every point. So from a
programmer's viewpoint that is in fact a more symmetric programming approach.
Nothing new there.

>>> per card the market would be huge, because the card would be competing 
>>> with
>>> multi-core processors like the IBM-Sony Cell.
>>
>> You need "really big" volumes to get there.

Such cards aren't competing with the Cell at all.

You are only competing if both products can be bought in a store on the same
day.

> Yes, but it does not seem to me unreasonable to put such a card in
> millions of PC's if the average applications run a bit faster and the
> cost increase stays below the PC cost.  After all
> the 8087 math coprocessor of the i386 era did just that.

The average user cares about his game running faster, not about some double
precision floating point monster.
Most 3D graphics operations I'd qualify as single precision, not double
precision.

> ....
>>
>> I would say that there is more potential for a clever soul to reprogram 
>> the guts of Matlab, etc., to transparently share the work across multiple 
>> machines.  I think that's in the back of the mind of MS, as they move 
>> toward a services environment and .NET
>
> Lots of people have thought about that for a long time, including
> Cleve Moler.   The potential clever soul should be well above
> average, and considering MS products, well above MS average programmer.
>
> An intriguing way to parallelize C with threads on multicore processors is
> provided by Cilk (http://supertech.lcs.mit.edu/cilk/).  Cilk consists of
> a couple of simple extensions to the C language.

> If anyone has experience with Cilk it would be nice to share.

RAISES A BIG HAND.

> Dan

I guess you googled around a bit and found the prehistoric parallel
programming language CILK.
You can indeed use it inside C code.

Yes, I know how to use it. I also know how it performed in chess programs,
namely in the famous Cilk chess program.

It is called Cilkchess and comes from MIT (Leiserson & co.). It was actually
programmed by Don Dailey, a very nice guy.

On a single CPU his program got around 180k nps (nodes per second).
Compiled with Cilk it achieved around 5000 nps.

That was on a single CPU.

When running on a 512-processor machine its performance dropped even more.

I remember playing them and their scaling was pretty bad.

They basically claimed good scaling because they assumed 1 CPU = 5000 nps.
However, I calculated their scaling with 1 CPU = 180k nps, since without Cilk
it gets 180k nps.

This scientific way of looking good on paper is very well known.

First slow down your parallel program a dozen times, in this case 20-50
times, in order to show better scaling.

MIT's scaling as calculated by me was around 2%.
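To make the baseline issue concrete, here is a minimal sketch of how the same
measurement gives a very different efficiency depending on which single-CPU
figure you divide by. The 180k nps, 5000 nps and 2% figures are the ones
mentioned above; the function names are made up for this sketch, and the 72%
it prints is only what the 2% would look like against the slower baseline,
not a number MIT published.

```python
SERIAL_NPS = 180_000     # single CPU, without Cilk
CILK_1CPU_NPS = 5_000    # single CPU, compiled with Cilk


def efficiency(total_nps: float, n_cpus: int, baseline_nps: float) -> float:
    """Fraction of ideal linear speedup relative to the chosen baseline."""
    return total_nps / (n_cpus * baseline_nps)


def rebase(eff: float, old_baseline_nps: float, new_baseline_nps: float) -> float:
    """Convert an efficiency measured against one baseline to another."""
    return eff * old_baseline_nps / new_baseline_nps


inflation = SERIAL_NPS / CILK_1CPU_NPS
against_cilk_baseline = rebase(0.02, SERIAL_NPS, CILK_1CPU_NPS)

print(f"the slower baseline inflates efficiency {inflation:.0f}x")
print(f"2% against 180k nps reads as {against_cilk_baseline:.0%} against 5000 nps")
```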

On the other hand, I've parallelized my chess program myself instead of using
Cilk.
That was of course a lot harder, and took me 1.5 years of hard programming,
but it scales very well.

It scales 50+% at 512 processors.

Actually, on paper I could probably claim 100+%, as 1 CPU could search 20k
nps, and at 460 CPUs I reached a peak of around 9.99 mln nps. The reason for
that is that I used a global transposition table.
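Worked out from those numbers (20k nps on 1 CPU, a peak of about 9.99 mln nps
on 460 CPUs), as a short sketch. The function name is only illustrative, and
since it uses the peak figure the result is an upper bound, not a sustained
efficiency.

```python
def parallel_efficiency(total_nps: float, n_cpus: int, single_cpu_nps: float) -> float:
    """Measured throughput divided by ideal linear throughput."""
    return total_nps / (n_cpus * single_cpu_nps)


speedup = 9.99e6 / 20_000
eff = parallel_efficiency(9.99e6, 460, 20_000)
print(f"speedup {speedup:.1f}x on 460 CPUs, efficiency {eff:.1%}")
```

A shared transposition table lets each CPU reuse positions already searched
by the others, which is how a figure above 100% can show up on paper.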

AFAIK Cilkchess wasn't using that, which makes their achievement even more 
pathetic.

Yet you can look at it from another viewpoint.

If you want to parallelize something, if the machine isn't yours anyway,
and if you want a quick result,
why not allocate 10000 processors and run it with Cilk?

You know, if you lose a factor of 100 or so, who cares, you're still 100
times faster than your PC!

That said there probably will be cases of programs that scale better with 
cilk, if they are really embarrassingly parallel.

But you'd better have a good high-end network with Cilk anyway.

Cilk is like programming in BASIC.

BASIC is easy for beginners and you can get a job done quickly. If it's
embarrassingly parallel, you might not even suffer too much of a performance
penalty, and you didn't lose much of your own time.

If you want something that performs better, then consider MPI.

If you want to be faster than MPI, then parallelize things without MPI
within one shared-memory node, and parallelize between the nodes with MPI.
It's more effort than CILK for sure.

But don't use CILK to get the 'maximum' performance out of a machine.
That's wishful thinking.

Vincent
