[Beowulf] Has anyone actually seen/used a cell system?
diep at xs4all.nl
Wed Sep 20 15:15:10 PDT 2006
----- Original Message -----
From: "Mark Hahn" <hahn at physics.mcmaster.ca>
To: <J.A.Delcorso at larc.nasa.gov>
Cc: <beowulf at beowulf.org>
Sent: Wednesday, September 20, 2006 6:51 PM
Subject: Re: [Beowulf] Has anyone actually seen/used a cell system?
>> Can anyone point me to a url, or tell me what their
>> experience is with this technology? Is it as fast as
>> it's purported to be?
> I haven't come anywhere near a Cell, but then again, I'm not sure I'd want
> to. 14.6 Gflops (64b, and assuming the full 8 SPE's) isn't bad, but then
> again, a 3 GHz Core2 dual-core is 24 Gflops, and almost certainly a lot
> more accessible, shipping now, runs linux, supported by compilers and
> goto-blas, etc.
Come on, let's do a realistic comparison. Assuming IBM didn't totally mess up,
let's do an objective comparison for multiplication.
Gflops is simply an overrated measure.
What determines, more than anything else, how many matrix elements you can
multiply per second
is the slow instruction on most CPUs called multiply.
It is 4 cycles or so on P4 (SSE2) and 4 cycles on K8.
I haven't seen a Conroe document yet, but knowing it also has just a SINGLE
execution unit doing multiplies (and is probably reusing the SSE2
multiplication unit for the FPU and also for integers or
something), it probably also takes a cycle or 4 there.
It is just possible that a multiplication doesn't block
all other execution units while it runs (which is what K8 seems to be doing).
For the NTT I'm doing here (that is a bug-free form of multiplication; with
the FFT version you never know for sure that your result is correct and you
have to redo it a second time to be 100% sure) what is interesting is a
multiplication of 64 x 64 bits == 128 bits. So that's obviously integer
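To make the 64 x 64 bits == 128 bits operation concrete, here is a minimal Python sketch of how such a product splits into a high and a low 64-bit word. The function name `mul64x64_128` and the 32-bit limb split are my own illustration; they are not taken from any particular chip's instruction set.

```python
MASK32 = (1 << 32) - 1

def mul64x64_128(a, b):
    """Multiply two unsigned 64-bit integers, return (hi, lo) 64-bit words.

    Hypothetical helper: it models the full-width product with four
    32 x 32 partial products, the way a software fallback would do it.
    """
    if not (0 <= a < 1 << 64 and 0 <= b < 1 << 64):
        raise ValueError("operands must fit in 64 bits")
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    ll = a_lo * b_lo              # contributes to bits 0..63
    lh = a_lo * b_hi              # contributes to bits 32..95
    hl = a_hi * b_lo              # contributes to bits 32..95
    hh = a_hi * b_hi              # contributes to bits 64..127
    # Sum everything that lands in bits 32..63, keeping the carry.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    lo = ((mid & MASK32) << 32) | (ll & MASK32)
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo
```

A chip with a native full-width multiply returns both words from one instruction; the sketch only shows what that instruction has to compute.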
If we compare core2 there: core2 is an ideal processor for just about
everything, yet it has only 2 cores @ 3 GHz.
2 cores @ 3 GHz / 4 cycles = 1.5 GHz multiply rate
Now compare the CELL processor. Not sure about its latest plans (I
vaguely remember 4 GHz as its target and I would be amazed if IBM actually
gets it to 4 GHz). Most likely it will also manage to get down to a
cycle or 4 for a multiply of 64 x 64 bits == 128 bits.
Then we're speaking about 8 * 4 GHz / 4 = 8 GHz multiply rate.
That is potentially a bit over 5 times faster (8 / 1.5 = 5.3) than core2 for
what is the most time consuming part of
matrix multiplications, namely the multiplication unit.
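The back-of-the-envelope comparison above can be written down as a tiny Python sketch. The function name `multiply_rate_ghz` is made up for illustration, and the CELL inputs (8 SPEs, 4 GHz, 4 cycles per multiply) are the guesses from this post, not published specs.

```python
def multiply_rate_ghz(cores, clock_ghz, cycles_per_multiply):
    """Peak multiplies per nanosecond, assuming one unpipelined
    multiply unit per core that is busy for the whole latency."""
    return cores * clock_ghz / cycles_per_multiply

# core2: 2 cores @ 3 GHz, assumed 4 cycles per multiply
core2 = multiply_rate_ghz(cores=2, clock_ghz=3.0, cycles_per_multiply=4)

# CELL: 8 SPEs @ an assumed 4 GHz, assumed 4 cycles per multiply
cell = multiply_rate_ghz(cores=8, clock_ghz=4.0, cycles_per_multiply=4)

ratio = cell / core2
print(f"core2: {core2} GHz, cell: {cell} GHz, ratio: {ratio:.2f}")
```

With these inputs the ratio comes out at about 5.3. Change the clock or the cycle count and the conclusion moves with it, which is the point: the result depends entirely on the multiply unit, not on a gflops figure.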
Now there is something to be said for SSE here, which can multiply 2
at a time with one instruction.
On the other hand we do not know the specs of the CELL there, which according
to one document I read (could be totally outdated) should be able to do more
instructions a cycle than core2.
If not, then core2 wins back a factor 1.5 or so in speed there, still no big
deal. CELL just beats it totally there.
Now it is of course obvious that the vast majority of cluster resources
is spent on matrix multiplication type software.
That it might be extremely weak on branch mispredicts, which makes the
CELL a self-destructing chip for my chess software, is the
other part of the story.
Say about 70% will be extremely happy with that chip and 30% will just
praise core2 to the skies.
There is however one positive thing core2 can say that CELL cannot, and
that is that we can already order core2 in a store.
> if you could readily get a 8-16x PCIE card with 2 or more Cell chips and a
> bunch of ~50 GB/s local memory, for cheap, it could be quite something.
Yeah that's faster than most supercomputers for matrix calculations.
And also for a CHEAP price.
For all the highend guys who will then say: "oh ahhh au, but how about
Well, nothing as inaccurate as FFT calculations with floating point
NTT is totally superior there (but factor 2 slower).
And if you really have no other argument than that, well, just run a SECOND
cluster of CELLs
and let those do the calculation for you a second time. Which gives a
that your FFT ran correct too.
Of course another disadvantage of CELL will probably be limited RAM.
Certain machines (orion!) which are relatively cheap and have a couple of
hundred gigabytes of RAM at an attractive price can really boost
Yet pissing on CELL isn't a real good idea.
If what you need is massive calculation power then 8 cores @ 4 GHz will of
course beat a mere 2 cores @ 3 GHz silly, especially knowing that most chip
manufacturers don't seem to have an especially fast multiply instruction on
Just measuring gflops is total madness.
The N log N in those calculations is the number of multiplies.
Make a chip with 2 integer multiplication units that don't block each other
and NTT in integers is faster than any SSE implementation of FFT, besides
having 0 round off errors.
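For readers who have not seen an NTT, here is a minimal Python sketch of exact integer multiplication through one. It is not the code I run: it assumes the small, well-known NTT prime 998244353 (primitive root 3) instead of a 64-bit modulus, and the names `ntt` and `multiply_exact` are made up for this example. The result is exact as long as every true output coefficient stays below the modulus.

```python
MOD = 998244353   # 119 * 2**23 + 1, assumed modulus for this sketch
ROOT = 3          # primitive root modulo MOD

def ntt(a, invert=False):
    """Recursive radix-2 number-theoretic transform; len(a) is a power of 2."""
    n = len(a)
    if n == 1:
        return a[:]
    w_n = pow(ROOT, (MOD - 1) // n, MOD)      # primitive n-th root of unity
    if invert:
        w_n = pow(w_n, MOD - 2, MOD)          # its modular inverse
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % MOD                  # the butterfly multiply
        out[k] = (even[k] + t) % MOD
        out[k + n // 2] = (even[k] - t) % MOD
        w = w * w_n % MOD
    return out

def multiply_exact(x, y):
    """Convolve two coefficient lists exactly, with no floating point."""
    out_len = len(x) + len(y) - 1
    size = 1
    while size < out_len:
        size *= 2
    fx = ntt(x + [0] * (size - len(x)))
    fy = ntt(y + [0] * (size - len(y)))
    prod = ntt([p * q % MOD for p, q in zip(fx, fy)], invert=True)
    inv_size = pow(size, MOD - 2, MOD)        # undo the scaling by size
    return [v * inv_size % MOD for v in prod][:out_len]

print(multiply_exact([1, 2, 3], [4, 5, 6]))   # [4, 13, 28, 27, 18]
```

Every butterfly is an integer multiply followed by a modular reduction, so the speed of the whole thing hangs on the integer multiplication unit, and there is nothing to round.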
CELL is already quite ideal there in that it has 8 cores.
Yet of course it is wishful thinking that chips with 2 multiplication
units will exist any time soon at a very cheap price (no, Itanium isn't a
cheap chip; additionally it's just 1.6 GHz), which would simply speed up that
calculation code by a factor 2.
If I do integer multiplications nonstop on a machine with 2 K8 dual core
chips (4 cores in total), then after a number of days the machine is
sometimes just DEAD. Black screen etcetera. The chips have simply failed.
It only happens if you EXCLUSIVELY do NTT nonstop, so it seems that at least
for K8 dual core chips the multiplication unit is extremely weak and belongs
to the worst case path.
That probably means that adding a second unit will not cost that many more
transistors, but will decrease yields, making the chip production a tad more
So please don't piss on a chip that hopefully has 8 such units instead of
the 2 in today's chips.
It is potentially at least factor 4 faster at the same clock for such DSP
>> Apparently RedHat is developing
>> EL 4.3 to run on the system?
> to an OS, it's basically a kinda low-end PPC chip with 8 very weird FP
> coprocessors, the latter not relevant to the OS...