[Beowulf] Has anyone actually seen/used a cell system?

Vincent Diepeveen diep at xs4all.nl
Wed Sep 20 15:15:10 PDT 2006


----- Original Message ----- 
From: "Mark Hahn" <hahn at physics.mcmaster.ca>
To: <J.A.Delcorso at larc.nasa.gov>
Cc: <beowulf at beowulf.org>
Sent: Wednesday, September 20, 2006 6:51 PM
Subject: Re: [Beowulf] Has anyone actually seen/used a cell system?


>> Can anyone point me to a url, or tell me what their
>> experience is with this technology?  Is it as fast as
>> it's purported to be?
>
> I haven't come anywhere near a Cell, but then again, I'm not sure I'd want 
> to.  14.6 Gflops (64b, and assuming the full 8 SPE's) isn't bad, but then 
> again, a 3 GHz Core2 dual-core is 24 Gflops, and almost certainly a lot 
> more accessible, shipping now, runs linux, supported by compilers and 
> goto-blas, etc.

Come on, let's do a realistic comparison. Assuming IBM didn't totally mess 
up, let's do an objective comparison for multiplication.

Gflops is simply an overrated metric.

What determines, more than anything else, how many matrix elements you can 
multiply per second is the multiply instruction, which is slow on most CPUs.

It takes 4 cycles or so on the P4 (SSE2) and 4 cycles on the K8.

I haven't seen a Conroe document yet, but knowing it also has just a SINGLE 
execution unit doing multiplies (probably sharing the SSE2 multiplication 
unit with the FPU and also using it for integers or something), it probably 
also takes a cycle or 4.

It is possible, though, that a multiplication there doesn't block all the 
other execution units (which is what the K8 seems to do).

For the NTT I'm doing here (a bug-free form of multiplication; with the FFT 
version you never know for sure that your result is correct, and you have to 
redo it a second time to be 100% sure), what is interesting is a 
multiplication of 64 x 64 bits == 128 bits. So that's obviously integer 
calculation.
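
A minimal sketch of that 64 x 64 bits == 128 bits multiply, using Python 
integers as a stand-in for hardware registers. The names mul_64x64_128 and 
mulmod and the example modulus P are made up for illustration; they are not 
taken from any actual NTT code.

```python
# Sketch only: Python ints are arbitrary precision, so the 128-bit product
# is computed directly and then split into the two 64-bit halves that a
# hardware 64 x 64 multiplier would hand back.

MASK64 = (1 << 64) - 1

# Example 64-bit modulus (an assumption for illustration): 2**64 - 2**32 + 1
P = 0xFFFFFFFF00000001


def mul_64x64_128(a, b):
    """Return the (high, low) 64-bit halves of the 128-bit product a * b."""
    if not (0 <= a <= MASK64 and 0 <= b <= MASK64):
        raise ValueError("operands must fit in 64 bits")
    product = a * b
    return product >> 64, product & MASK64


def mulmod(a, b, p):
    """Modular multiply built on the full 128-bit product."""
    hi, lo = mul_64x64_128(a, b)
    return ((hi << 64) | lo) % p
```

The point of keeping the full 128-bit product is that the reduction modulo 
p sees every bit, so nothing is rounded away.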

If we compare Core2 there: Core2 is an ideal processor for just about 
everything, yet it has only 2 cores @ 3GHz.

2 cores @ 3GHz / 4 cycles = 1.5 GHz multiply rate

Now compare the CELL processor. I'm not sure about its latest plans (I 
vaguely remember 4GHz as its target, and I would be amazed if IBM actually 
gets it to 4GHz). Most likely it will also manage a cycle or 4 for a 
64 x 64 bits == 128 bits multiply.

Then we're talking about 8 * 4GHz / 4 = 8 GHz multiply rate.

That is potentially more than 5 times faster than Core2 for the most 
time-consuming part of matrix multiplication, namely the multiplication 
unit.
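
The back-of-the-envelope numbers above can be redone as a small sketch. It 
assumes exactly the figures used in this mail (2 cores @ 3GHz for Core2, 
8 cores @ 4GHz for CELL, 4 cycles per multiply on both); the function name 
multiply_rate_ghz is made up for illustration.

```python
def multiply_rate_ghz(cores, clock_ghz, cycles_per_multiply):
    """Peak multiplies per nanosecond across all cores, one multiplier each."""
    return cores * clock_ghz / cycles_per_multiply


# Figures as assumed in the text, not measured values.
core2 = multiply_rate_ghz(cores=2, clock_ghz=3.0, cycles_per_multiply=4)
cell = multiply_rate_ghz(cores=8, clock_ghz=4.0, cycles_per_multiply=4)

ratio = cell / core2
print(f"Core2: {core2} GHz, CELL: {cell} GHz, ratio: {ratio:.2f}")
```

With these inputs the ratio comes out at roughly 5.3; it shrinks if CELL 
ships below 4GHz or if its multiply takes more than 4 cycles.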

Now there is something to be said for SSE here, which can multiply 2 at a 
time in one go.

On the other hand, we don't know the CELL's specs there; according to one 
document I read (which could be totally outdated), it should be able to do 
more instructions per cycle than Core2.

If not, then Core2 wins back a factor of 1.5 or so in speed there, which is 
still no big deal. CELL just beats it totally there.

Now it is of course obvious that the vast majority of cluster resources go 
to matrix-multiplication-type software. That CELL might be extremely weak on 
branch mispredicts, which would make it a self-destructing chip for my chess 
software, is the other part of the story.

Say about 70% of users will be extremely happy with that chip and 30% will 
just praise Core2 to the skies.

There is, however, one positive thing about Core2 that CELL cannot claim: 
we can already order Core2 in a store.

> if you could readily get a 8-16x PCIE card with 2 or more Cell chips and a 
> bunch of ~50 GB/s local memory, for cheap, it could be quite something.

Yeah that's faster than most supercomputers for matrix calculations.

And at a CHEAP price too.

For all the high-end guys who will then say: "oh, ahh, ouch, but what about 
losing bits?"

Well, nothing is as inaccurate as FFT calculations, with floating-point 
roundoff everywhere.

NTT is totally superior there (but a factor of 2 slower).
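
For illustration, an exact NTT-based multiplication could look like the 
sketch below. It is not the code discussed in this mail: it assumes the 
small textbook modulus 998244353 (= 119 * 2**23 + 1) with primitive root 3 
instead of a 64-bit modulus, and the names ntt and multiply_exact are made 
up. The result is exact as long as every true product coefficient stays 
below the modulus.

```python
MOD = 998244353   # assumed modulus: 119 * 2**23 + 1
ROOT = 3          # a primitive root modulo MOD


def ntt(values, invert=False):
    """Recursive radix-2 number-theoretic transform; len(values) is a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    # Primitive n-th root of unity modulo MOD (inverted for the inverse transform).
    w_n = pow(ROOT, (MOD - 1) // n, MOD)
    if invert:
        w_n = pow(w_n, MOD - 2, MOD)
    even = ntt(values[0::2], invert)
    odd = ntt(values[1::2], invert)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % MOD              # the one multiply per butterfly
        out[k] = (even[k] + t) % MOD
        out[k + n // 2] = (even[k] - t) % MOD
        w = w * w_n % MOD
    return out


def multiply_exact(x, y):
    """Convolve two non-empty coefficient lists exactly, modulo MOD."""
    result_len = len(x) + len(y) - 1
    size = 1
    while size < result_len:
        size *= 2
    fx = ntt(x + [0] * (size - len(x)))
    fy = ntt(y + [0] * (size - len(y)))
    pointwise = [u * v % MOD for u, v in zip(fx, fy)]
    raw = ntt(pointwise, invert=True)
    inv_size = pow(size, MOD - 2, MOD)    # undo the scaling by size
    return [c * inv_size % MOD for c in raw[:result_len]]


print(multiply_exact([1, 2, 3], [4, 5, 6]))
```

Every step is integer arithmetic modulo MOD, so there is no rounding step 
that could silently go wrong, unlike a floating-point FFT.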

And if you really have no other argument than that, well, just run a SECOND 
cluster of Cells and have it do the calculation a second time. That gives 
you 100% verification that your FFT ran correctly too.

Of course another disadvantage of CELL will probably be limited RAM.

Certain machines (Orion!) that are relatively cheap and offer a couple of 
hundred gigabytes of RAM at an attractive price can really boost certain 
applications.

Yet pissing on CELL isn't really a good idea.

If what you need is massive calculation power, then 8 cores @ 4GHz will of 
course beat 2 cores @ 3GHz silly, especially knowing that most chip 
manufacturers don't seem to have an especially fast multiply instruction on 
their chips.

Just measuring gflops is total madness.

The N log N in those calculations is the number of multiplies.
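
As a rough sketch of that count: a plain radix-2 transform of length N does 
N/2 butterfly multiplies in each of its log2(N) stages. The helper names 
below are made up, the count ignores that some of those multiplies are by 1, 
and the time estimate assumes multiplies are the only cost.

```python
def butterfly_multiplies(n):
    """Multiplies in a plain radix-2 transform of length n (a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1       # log2(n)
    return (n // 2) * stages


def transform_seconds(n, multiplies_per_second):
    """Lower-bound time for one transform if only the multiplies cost time."""
    return butterfly_multiplies(n) / multiplies_per_second


n = 2 ** 20
print(butterfly_multiplies(n), "multiplies for a length", n, "transform")
print(transform_seconds(n, 1.5e9), "s at 1.5e9 multiplies/s")
print(transform_seconds(n, 8.0e9), "s at 8.0e9 multiplies/s")
```

The two rates plugged in at the end are the Core2 and CELL estimates from 
earlier in this mail, not measurements.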

Make a chip with 2 integer multiplication units that don't block each other, 
and NTT in integers is faster than any SSE implementation of FFT, besides 
having zero roundoff errors.

CELL is already quite ideal there in that it has 8 cores.

Yet of course it is wishful thinking that chips with 2 multiplication units 
will exist any time soon at a very cheap price (no, Itanium isn't a cheap 
chip, and besides it runs at just 1.6GHz); such a chip would simply speed up 
that calculation code by a factor of 2.

If I do integer multiplications nonstop on a machine with 2 dual-core K8 
chips (4 cores in total), then after a number of days the machine is 
sometimes just DEAD: black screen, etcetera. The chips have simply failed.

It only happens if you EXCLUSIVELY do NTT nonstop, so it seems that, at least 
for K8 dual-core chips, the multiplication unit is extremely weak and lies 
on the worst-case path.

That probably means that adding a second unit will not cost that many more 
transistors but will decrease yields, making chip production a tad more 
expensive.

So please don't piss on a chip that hopefully has 8 such units instead of 
the 2 in today's chips.

It is potentially at least a factor of 4 faster at the same clock for such 
DSP-type code.

Vincent

>> Apparently RedHat is developing
>> EL 4.3 to run on the system?
>
> to an OS, it's basically a kinda low-end PPC chip with 8 very weird FP 
> coprocessors, the latter not relevant to the OS...



