[Beowulf] Has anyone actually seen/used a cell system?

Geoff Jacobs gdjacobs at gmail.com
Sun Oct 1 23:17:38 PDT 2006

Mark Hahn wrote:
>>> The same site reports that the X6800, a 2.93 GHz Core2, sees
>>> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
> hmm, those numbers are pretty low - peak should be 2.93*4 or 8,
> and I'd expect 80% of peak or 19 Gflops/core for this comparison
> (Opterons can do 90%, at least on my machine using HPL.)
I've consulted some other sources just to make sure I get this right.
We can't naively say that Core 2 maxes out at clock*4 or clock*8 for
theoretical peak flops. Port 1 on the FPU can handle 4xSP flops, but
only simple operations like FPADD. Port 2 can handle FPMUL and FPDIV
(and therefore FPADD as well) on a 4xSP vector.

So, there is a hard floor on theoretical Core 2 floating point
performance of clock*4 flops (for pure FPMUL and FPDIV), and a hard
ceiling of clock*8 flops (for a mix where FPADD is >=50%). Looking at
the source code, SGEMM is an FPMUL bruiser, which puts peak performance
closer to the floor than the ceiling. 12.5 gflops looks like an accurate
number for Core 2 SGEMM.
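
As a rough check on those bounds, here is a minimal sketch in Python. The
function name is made up for illustration, and the 4 and 8 flops per cycle
are just the floor and ceiling argued above, not measured values. The
2.93 GHz clock and the 12.5 GFLOPS result are the X6800 figures quoted
earlier in the thread.

```python
def sp_peak_bounds(clock_ghz, floor_per_cycle=4, ceiling_per_cycle=8):
    """Return (floor, ceiling) single-precision GFLOPS for one core.

    floor_per_cycle and ceiling_per_cycle are the per-cycle flop counts
    assumed in the argument above (pure FPMUL/FPDIV vs. >=50% FPADD).
    """
    return clock_ghz * floor_per_cycle, clock_ghz * ceiling_per_cycle


floor, ceiling = sp_peak_bounds(2.93)   # X6800 clock from the thread
measured = 12.5                         # ScienceMark 2.0 SP GFLOPS from the thread

print(f"floor   {floor:.2f} GFLOPS")
print(f"ceiling {ceiling:.2f} GFLOPS")
print(f"measured is {measured / floor:.0%} of floor, "
      f"{measured / ceiling:.0%} of ceiling")
```

Under these assumptions the measured number lands just above the floor
and well short of the ceiling, which is the point being made above.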

> so the paper shows 80.6 Gflops SGEMM for 8 SPE's; it's only fair to
> compare this to 2 or 4 Core2 cores (37.5 and 75 Gflops!)
Going by die size, Cell would compare with a hypothetical 3 core Core 2
CPU. (Cell is apparently ~220mm^2, Core 2 Duo ~140mm^2)
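
The die-size comparison is simple arithmetic, sketched here in Python.
The helper name is hypothetical, and it assumes each Core 2 core accounts
for half of the dual-core die, which ignores shared cache and other
non-core area. Both areas are the approximate figures given above.

```python
CELL_MM2 = 220.0        # approximate Cell die area, from the text
CORE2_DUO_MM2 = 140.0   # approximate Core 2 Duo die area (two cores), from the text


def equivalent_cores(target_mm2, die_mm2, cores_per_die):
    """How many cores of the reference die fit in target_mm2,
    assuming die area divides evenly among its cores."""
    area_per_core = die_mm2 / cores_per_die
    return target_mm2 / area_per_core


print(f"{equivalent_cores(CELL_MM2, CORE2_DUO_MM2, 2):.1f} Core 2 cores")
```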

>> indicative of per core performance on Core 2. Is it safe to say that
>> Core 2 achieves <15 gflops/core at 3ghz, assuming ~15% premium with Goto
>> BLAS?
> peak SGEMM/core would be 3*8=24, so 15 sounds quite low.
>>> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a
> do you know of something crippled in the pre-production Cell chips?
Clock speed?


> I don't think there's anything too dubious about 80% of theoretical for
> Core2.  but I also didn't think the Sequoia stuff was such a cheap hack
> as you imply (not to put words into your mouth ;)
If we are to believe what's in the LBL paper, IBM is getting ~200 gflops
peak on SGEMM with full clock Cell engineering samples, which, scaled
down by clock speed, means peak should be ~150 gflops on the 2.4 GHz
chip in the article. 52% of peak achieved by Sequoia is probably a
little low.
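
For anyone who wants to redo that estimate, a minimal sketch in Python
follows. The 3.2 GHz "full clock" value is my assumption, since the
thread never states it, and linear scaling with clock is also an
assumption. The 2.4 GHz, ~200 GFLOPS and 80.6 GFLOPS figures are the ones
quoted in this thread, and the helper name is made up.

```python
FULL_CLOCK_GHZ = 3.2     # ASSUMPTION: "full clock" is not stated in the thread
SAMPLE_CLOCK_GHZ = 2.4   # preproduction Cell clock quoted in the thread
FULL_CLOCK_PEAK = 200.0  # approx. SGEMM GFLOPS attributed to the LBL paper
SEQUOIA_SGEMM = 80.6     # GFLOPS on 8 SPEs, as quoted in the thread


def scale_by_clock(gflops, from_ghz, to_ghz):
    """Scale a throughput figure linearly with clock frequency."""
    return gflops * to_ghz / from_ghz


peak = scale_by_clock(FULL_CLOCK_PEAK, FULL_CLOCK_GHZ, SAMPLE_CLOCK_GHZ)
print(f"scaled peak: {peak:.0f} GFLOPS")
print(f"Sequoia fraction of peak: {SEQUOIA_SGEMM / peak:.0%}")
```

With these rounded inputs the fraction comes out in the low-to-mid 50s
percent, in the same neighbourhood as the 52% mentioned above.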

>> like to see a benchmark comparison of SGEMM (and DGEMM) using Core
>> 2-optimized BLAS vs. Cell-optimized BLAS, thereby making a useful
>> conclusion about how interesting Cell is for HPC.
> actually, Sequoia seems precisely like the structure you need to make Cell
> work, since its whole purpose is to express the rather constrained way
> that memory is used in Cell.  the paper is actually pretty clear on
> where the Cell spends its time, and for SGEMM, it's executing the
> "leaf" code, which is IBM's Cell library.
Its whole purpose is to express distributed computers with arbitrary
memory topologies, from SMP (NUMA and non-) to clusters. It actually
looks really cool.

> I guess the prototype might be really bad, or Sequoia might be broken in a
> way not hinted in the paper, or IBM's Cell intrinsic library could be
> terrible.  but the paper seems on the up-and-up, and the scaling curves and
> leaf-vs-communication figures surely make Cell look underwhelming,
> at least if you assume, as I do, that it has to deliver a large speedup
> to be worth investing in...
The paper is more of a statement on the capabilities of the Sequoia
compiler than the Cell processor. I don't think it's unreasonable to
assume their SGEMM implementation was written for clarity rather than speed.

Geoffrey D. Jacobs

Go to the Chinese Restaurant,
Order the Special
