[Beowulf] Has anyone actually seen/used a cell system?

Sun Oct 1 23:17:38 PDT 2006

Mark Hahn wrote:
>>> The same site reports that the X6800, a 2.93 GHz Core2, sees
>>> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
> 
> hmm, those numbers are pretty low - peak should be 2.93*4 or 8,
> and I'd expect 80% of peak or 19 Gflops/core for this comparison
> (Opterons can do 90%, at least on my machine using HPL.)
I've consulted some other sources just to make sure I get this right.
We can't naively say that Core 2 maxes out at clock*4 or clock*8 for
theoretical peak flops. Port 1 on the FPU can handle 4xSP flops, but
only simple operations like FPADD. Port 2 can handle FPMUL and FPDIV
(and therefore FPADD as well) on a 4xSP vector.

So, there is a hard floor on theoretical Core 2 floating point
performance of clock*4 flops (for pure FPMUL and FPDIV), and a hard
ceiling of clock*8 flops (for a mix where FPADD is >=50%). Looking at
the source code, SGEMM is an FPMUL bruiser, which puts peak performance
closer to the floor than the ceiling. At 2.93 GHz the floor is ~11.7
gflops, so 12.5 gflops looks like an accurate number for Core 2 SGEMM.
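
A minimal sketch of that floor/ceiling arithmetic, assuming the two-port
model described above (each port retires one 4xSP vector op per cycle, and
only Port 2 can do FPMUL/FPDIV). The function name and the add_fraction
parameter are made up for illustration and are not taken from any benchmark:

```python
def peak_sp_gflops(clock_ghz, add_fraction, vector_width=4):
    """Peak SP gflops under the assumed two-port model.

    add_fraction is the share of vector ops that are FPADD (0.0 to 1.0).
    Port 2 must run every non-add op; adds go to Port 1 first, and any
    excess adds are balanced across both ports.
    """
    if not 0.0 <= add_fraction <= 1.0:
        raise ValueError("add_fraction must be between 0 and 1")
    # Share of all ops landing on the busiest port after balancing.
    busiest_share = max(1.0 - add_fraction, 0.5)
    return clock_ghz * vector_width / busiest_share


floor = peak_sp_gflops(2.93, 0.0)    # pure FPMUL/FPDIV: clock*4
ceiling = peak_sp_gflops(2.93, 0.5)  # at least half FPADD: clock*8
print(f"floor {floor:.2f} gflops, ceiling {ceiling:.2f} gflops")
```

Anything below a 50% FPADD share falls between the two bounds, with the
multiply-capable port as the bottleneck.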

> so the paper shows 80.6 Gflops SGEMM for 8 SPE's; it's only fair to
> compare this to 2 or 4 Core2 cores (37.5 and 75 Gflops!)
Going by die size, Cell would compare with a hypothetical 3 core Core 2
CPU. (Cell is apparently ~220mm^2, Core 2 Duo ~140mm^2)
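
As a rough check on that die-size comparison, assuming the approximate
areas quoted above and 2 cores per Core 2 Duo die (the helper name is
hypothetical, and this ignores how much of each die is cache):

```python
def equivalent_cores(target_die_mm2, ref_die_mm2, ref_cores):
    """Cores of the reference design that fit in the target die area."""
    return target_die_mm2 / ref_die_mm2 * ref_cores


# ~220 mm^2 Cell vs ~140 mm^2 dual-core Core 2 Duo
cell_in_core2_cores = equivalent_cores(220.0, 140.0, 2)
print(f"Cell die ~ {cell_in_core2_cores:.1f} Core 2 cores")
```

That comes out at a little over 3 cores.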

>> indicative of per core performance on Core 2. Is it safe to say that
>> Core 2 achieves <15 gflops/core at 3ghz, assuming ~15% premium with Goto
>> BLAS?
> 
> peak SGEMM/core would be 3*8=24, so 15 sounds quite low.
> 
>>> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a
> 
> do you know of something crippled in the pre-production Cell chips?
Clock speed?

<snip>

> I don't think there's anything too dubious about 80% of theoretical for
> Core2.  but I also didn't think the Sequoia stuff was such a cheap hack
> as you imply (not to put words into your mouth ;)
If we are to believe what's in the LBL paper, IBM is getting ~200 gflops
peak on SGEMM with full clock Cell engineering samples, which means peak
should be ~150 gflops on the 2.4 GHz chip in the article. The 52% of peak
achieved by Sequoia is probably a little low.
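
The clock scaling behind those figures can be sketched as follows. The
3.2 GHz full clock and the 8 SP flops/cycle per SPE (4-wide fused
multiply-add) are assumptions here, not numbers from this thread; the
2.4 GHz clock and the 80.6 gflops result are the ones quoted above.

```python
FULL_CLOCK_GHZ = 3.2         # assumption: full-clock Cell, not stated in the thread
PREPROD_CLOCK_GHZ = 2.4      # preproduction chip, as quoted in the thread
FLOPS_PER_CYCLE_PER_SPE = 8  # assumption: 4-wide SP fused multiply-add
NUM_SPES = 8


def scale_peak(peak_gflops, from_ghz, to_ghz):
    """Scale a peak figure linearly with clock speed."""
    return peak_gflops * to_ghz / from_ghz


def theoretical_peak(clock_ghz, spes=NUM_SPES,
                     flops_per_cycle=FLOPS_PER_CYCLE_PER_SPE):
    """Theoretical SP peak in gflops across all SPEs."""
    return clock_ghz * spes * flops_per_cycle


achieved = 80.6  # Sequoia SGEMM on 8 SPEs, from the paper as quoted
scaled = scale_peak(200.0, FULL_CLOCK_GHZ, PREPROD_CLOCK_GHZ)
theoretical = theoretical_peak(PREPROD_CLOCK_GHZ)

print(f"scaled from ~200 gflops: {scaled:.1f} ({achieved / scaled:.1%} achieved)")
print(f"theoretical at 2.4 GHz:  {theoretical:.1f} ({achieved / theoretical:.1%} achieved)")
```

Either way the Sequoia result lands at a bit over half of peak.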

>> like to see a benchmark comparison of SGEMM (and DGEMM) using Core
>> 2-optimized BLAS vs. Cell-optimized BLAS, thereby making a useful
>> conclusion about how interesting Cell is for HPC.
> 
> actually, Sequoia seems precisely like the structure you need to make Cell
> work, since its whole purpose is to express the rather constrained way
> that memory is used in Cell.  the paper is actually pretty clear on
> where the Cell spends its time, and for SGEMM, it's executing the
> "leaf" code, which is IBM's Cell library.
Its whole purpose is to express distributed computers with arbitrary
memory topologies, from SMP (NUMA and non-) to clusters. It actually
looks really cool.

> I guess the prototype might be really bad, or Sequoia might be broken in a
> way not hinted in the paper, or IBM's Cell intrinsic library could be
> terrible.  but the paper seems on the up-and-up, and the scaling curves and
> leaf-vs-communication figures surely make Cell look underwhelming,
> at least if you assume, as I do, that it has to deliver a large speedup
> to be worth investing in...
The paper is more of a statement on the capabilities of the Sequoia
compiler than the Cell processor. I don't think it's unreasonable to
assume their SGEMM implementation was written for clarity rather than speed.

-- 
Geoffrey D. Jacobs

Go to the Chinese Restaurant,
Order the Special