[Beowulf] more details on Cell emerge

Eugen Leitl eugen at leitl.org
Fri Feb 11 06:06:43 PST 2005


By: David T. Wang (dwang at realworldtech.com)
Updated: 02-10-2005

Back to Basics

The fundamental task of a processor is to manage the flow of data through its
computational units. However, in the past two decades, each successive
generation of processors for personal computers has added more transistors
dedicated to increasing the performance of spaghetti-like integer code. For
example, it is well known that typical integer codes are branchy and that
branch mispredict penalties are expensive; in an effort to minimize the
impact of branch instructions, transistors were used to develop highly
accurate branch predictors. Aside from branch predictors, sophisticated cache
hierarchies with large tag arrays and predictive cache prefetch units attempt
to hide the complexity of data movement from the software, and further
increase the performance of single threaded applications. The pursuit of
single threaded performance can be observed in recent years in the proposal
of extraordinarily deeply pipelined processors designed primarily to increase
the performance of single threaded applications, at the cost of higher power
consumption and larger transistor budgets.

The fundamental idea of the CELL processor project is to reverse this trend
and give up the pursuit of single threaded performance, in favor of
allocating additional hardware resources to perform parallel computations.
That is, minimal resources are devoted toward the execution of single
threaded workloads, so that multiple DSP-like processing elements can be
added to perform more parallelizable multimedia-type computations. In the
examination of the first implementation of the CELL processor, the theme of
the shift in focus from the pursuit of single threaded integer performance to
the pursuit of multithreaded, easily parallelizable multimedia-type
performance is repeated throughout.

CELL Basics

The CELL processor is a collaboration between IBM, Sony and Toshiba. The CELL
processor is expected by this consortium to provide computing power an order
of magnitude above and beyond what is currently available to its competitors.
The International Solid-State Circuits Conference (ISSCC) 2005 was chosen by
the group as the location to describe the basic hardware architecture of the
processor and announce the first incarnation of the CELL processor family.

Members of the CELL processor family share basic building blocks, and
depending on the requirement of the application, specific versions of the
CELL processor can be quickly configured and manufactured to meet that need.
The basic building blocks shared by members of the CELL family of processors
are the following:

    * The PowerPC Processing Element (PPE)
    * The Synergistic Processing Element (SPE)
    * The L2 Cache
    * The internal Element Interconnect Bus (EIB)
    * The shared Memory Interface Controller (MIC)
    * The FlexIO interface

Each SPE is in essence a private system-on-chip (SoC), with the processing
unit connected directly to 256 KB of private Local Store (LS) memory. The PPE
is a dual threaded (SMT) PowerPC processor connected to the SPEs through the
EIB. The PPE and SPE processing elements access system memory through the
MIC, which is connected to two independent channels of Rambus XDR memory,
providing 25.6 GB/s of memory bandwidth. The connection to I/O is done
through the FlexIO interface, also provided by Rambus, providing 44.8 GB/s of
raw outbound bandwidth and 32 GB/s of raw inbound bandwidth for total I/O
bandwidth of 76.8 GB/s. At ISSCC 2005, IBM announced that the first
implementation of the CELL processor has been tested to operate at
frequencies above 4 GHz. In the CELL processor, each SPE is capable of
sustaining 4 FMADD operations per cycle. At an operating frequency of 4 GHz,
the CELL processor is thus capable of achieving a peak throughput rate of 256
GFlops from the 8 SPEs (8 SPEs x 4 FMADD x 2 operations x 4 GHz). Moreover,
the PPE can contribute some amount of additional compute power with its own
FP and VMX units.
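
The peak figures quoted above follow from simple arithmetic, which the short
Python sketch below reproduces. It uses only numbers given in this article
(the FlexIO lane counts and per-pair data rate come from the FlexIO section
further down); the function and constant names are illustrative and are not
taken from any IBM or Rambus material.

```python
# Peak-rate arithmetic for the first CELL implementation, using only
# figures quoted in the article. All names here are illustrative.

CLOCK_GHZ = 4.0          # announced operating frequency
NUM_SPES = 8             # SPEs on the first implementation
FMADD_PER_CYCLE = 4      # FMADD operations per SPE per cycle
FLOPS_PER_FMADD = 2      # one multiply plus one add


def peak_sp_gflops(clock_ghz=CLOCK_GHZ, spes=NUM_SPES):
    """Peak single precision GFlops from the SPEs alone (PPE excluded)."""
    return spes * FMADD_PER_CYCLE * FLOPS_PER_FMADD * clock_ghz


# FlexIO: each byte lane is 8 bits wide at 6.4 Gb/s per signal pair,
# which works out to 6.4 GB/s per byte lane.
LANE_GBYTES_PER_S = 8 * 6.4 / 8


def flexio_bandwidth(lanes_out=7, lanes_in=5):
    """Return (outbound, inbound, total) raw bandwidth in GB/s."""
    out_bw = lanes_out * LANE_GBYTES_PER_S
    in_bw = lanes_in * LANE_GBYTES_PER_S
    return out_bw, in_bw, out_bw + in_bw


if __name__ == "__main__":
    print(f"peak SP throughput: {peak_sp_gflops():.0f} GFlops")
    out_bw, in_bw, total = flexio_bandwidth()
    print(f"FlexIO out/in/total: {out_bw:.1f} / {in_bw:.1f} / {total:.1f} GB/s")
```

Halving the SPE count to 4, as discussed later for a possible smaller
configuration, would halve the peak SPE throughput in this model to 128
GFlops at the same clock.
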

Processor Overview

Figure 1 - Die photo of CELL processor with block diagram overlay

Figure 1 shows the die photo of the first CELL processor implementation with
8 SPEs. The sample processor tested was able to operate at a frequency of 4
GHz with Vdd of 1.1V. The power consumption characteristics of the processor
were not disclosed by IBM. However, estimates in the range of 50 to 80 Watts
@ 4 GHz and 1.1 V were given. One unconfirmed report claims that at the
extreme end of the frequency/voltage/power spectrum, one sample CELL
processor was observed to operate at 5.6 GHz with 1.4 V Vdd and consumed 180
W of power.
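
As a rough plausibility check on the unconfirmed 180 W report, the 50 to 80
Watt estimates can be scaled with the textbook dynamic power relation, in
which power is proportional to capacitance times voltage squared times
frequency. The sketch below is an assumption-laden estimate and not IBM
data: it holds switched capacitance constant, ignores leakage entirely, and
the function name is hypothetical.

```python
def scaled_dynamic_power(p_ref_w, v_ref, f_ref_ghz, v_new, f_new_ghz):
    """Scale dynamic power with P ~ C * V^2 * f, holding capacitance fixed.

    Leakage power is ignored, so this is only a first-order estimate.
    """
    return p_ref_w * (v_new / v_ref) ** 2 * (f_new_ghz / f_ref_ghz)


if __name__ == "__main__":
    # Reference point from the article: 50 to 80 W at 4 GHz and 1.1 V.
    for p_ref in (50.0, 80.0):
        p_new = scaled_dynamic_power(p_ref, 1.1, 4.0, 1.4, 5.6)
        print(f"{p_ref:.0f} W at 4 GHz / 1.1 V -> "
              f"about {p_new:.0f} W at 5.6 GHz / 1.4 V")
```

Under these assumptions the estimate scales to roughly 113 W and 181 W, so
the reported 180 W sits at the upper end of what the simple model predicts.
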

As described previously, the CELL processor with 8 SPEs operating at 4 GHz
has a peak throughput rate of over 256 GFlops. To provide the proper balance
between processing power and data bandwidth, an enormously capable system
interconnect and memory system interface are required for the CELL processor.
For that task, the CELL processor was designed as a Rambus sandwich, with the
Redwood Rambus ASIC Cell (RRAC) acting as the system interface on one end of
the CELL processor, and the XDR (formerly Yellowstone) high bandwidth DRAM
memory system interface on the other end of the CELL processor. Finally, the
CELL processor has 2954 C4 contacts to the 3-2-3 organic package, and the BGA
package is 42.5 mm by 42.5 mm in size. The BGA package contains 1236
contacts, 506 of which are signal interconnects and the remainder are devoted
to power and ground interconnects.

Logic Depth, Circuit Design, Die Size and Process Shrink

Figure 2 - Per stage circuit delay depth of 11 FO4 often left only 5~8 FO4
for logic flow

The first incarnation of the CELL processor is implemented in a 90nm SOI
process. IBM claims that while the logic complexity of each pipeline stage is
roughly comparable to other processors with a per stage logic depth of 20
FO4, aggressive circuit design, efficient layout and logic simplification
enabled the circuit designers of the CELL processor to reduce the per stage
circuit delay to 11 FO4 throughout the entire design. The design methodology
deployed for the CELL processor project provides an interesting contrast to
that of other IBM processor projects in that the first incarnation of the
CELL processor makes use of fully custom design. Moreover, the full custom
design includes the use of dynamic logic circuits in critical data paths. In
the first implementation of the CELL processor, dynamic logic was deployed
for both area minimization as well as performance enhancement to reach the
aggressive goal of 11 FO4 circuit delay per stage. Figure 2 shows that with
the circuit delay depth of 11 FO4, oftentimes only 5~8 FO4 are left for
inter-latch logic flow.
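
To put the FO4 numbers in absolute terms, the sketch below converts the per
stage budget into picoseconds. It assumes that the 11 FO4 stage delay
corresponds to one clock period at the 4 GHz operating point mentioned
earlier; the article does not state that mapping explicitly, and the function
names are illustrative.

```python
def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def implied_fo4_ps(freq_ghz, fo4_per_stage):
    """Delay of one FO4 inverter implied by a per stage FO4 budget."""
    return cycle_time_ps(freq_ghz) / fo4_per_stage


def overhead_fo4(stage_fo4, logic_fo4):
    """FO4 delays per stage that are not available to inter-latch logic."""
    return stage_fo4 - logic_fo4


if __name__ == "__main__":
    freq_ghz = 4.0   # assumption: 11 FO4 is budgeted against a 4 GHz clock
    stage_fo4 = 11
    fo4 = implied_fo4_ps(freq_ghz, stage_fo4)
    print(f"cycle time {cycle_time_ps(freq_ghz):.0f} ps, one FO4 ~ {fo4:.1f} ps")
    for logic_fo4 in (5, 8):
        print(f"{logic_fo4} FO4 of logic ~ {logic_fo4 * fo4:.0f} ps, "
              f"{overhead_fo4(stage_fo4, logic_fo4)} FO4 of overhead")
```

Under that assumption one FO4 is about 22.7 ps, so the 5 to 8 FO4 left for
logic amounts to roughly 114 to 182 ps out of a 250 ps cycle.
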

The use of dynamic logic presents itself as an interesting issue in that
dynamic logic circuits rely on the capability of logic transistors to retain
a capacitive load as temporary storage. The decreasing capacitance and
increasing leakage of each successive process generation means that dynamic
logic design becomes more challenging with each successive process
generation. In addition, dynamic circuits are reportedly even more
challenging on SOI based process technologies. However, circuit design
engineers from IBM believe that the use of dynamic logic will not present
itself as an issue in the scalability of the CELL processor down to 65 nm and
below. The argument was put forth that since the CELL processor is a full
custom design, the task of process porting with dynamic circuits is no more
and no less challenging than the task of process porting on a design without
dynamic circuits. That is, since the full custom design requires the
re-examination and re-optimization of transistor and circuit characteristics
for each process generation, if a given set of dynamic logic circuits becomes
impractical for specific functions at a given process node, that set of
circuits can be replaced with static circuits as needed.

The process portability of the CELL processor design is an interesting topic
due to the fact that the prototype CELL processor is a large device that
occupies 221 mm² of silicon area on the 90 nm process. Comparatively, the IBM
PPC970FX processor has a die size of 62 mm² on the 90 nm process. The natural
question then arises as to whether Sony will choose to reduce the number of
SPEs to 4 for the version of the CELL processor to appear in the next
generation Playstation, or keep the 8 SPEs and wait for the 65 nm process
before it ramps up the production of the next generation Playstation.
Although no announcements or hints have been given, IBM's belief in regards
to the process portability of the CEL

Figure 6 - SPE pipeline diagram

Table 1 - Unit latencies for SPE instructions.

Figure 6 shows the pipeline diagram of the SPE and Table 1 shows the unit
latency of the SPE. Figure 6 shows that the SPE pipeline makes heavy use of
the forward-and-delay concept to avoid the access latency of a register file
access in the case of dependent instructions that flow through the pipeline
in rapid succession.

One interesting aspect of the floating point pipeline is that the same arrays
are used for floating point computation as well as integer multiplication. As
a result, integer multiplies are sent to the floating point pipeline, and the
floating point pipeline bypasses the FP handling and computes the integer

SPE Schmoo Plot

Figure 7 - Schmoo plot for the SPE

Figure 7 shows the schmoo plot for the SPE. The schmoo plot shows that the
SPE can comfortably operate at a frequency of 4 GHz with Vdd of 1.1 V,
consuming approximately 4 W. The schmoo plot also reveals that due to the
careful segmentation of signal path lengths, the design is far from being
wire delay limited. Frequency scaling relative to voltage continues past 1.3
V. This schmoo plot also contributes to the plausibility of the unconfirmed
report that the CELL processor could operate at upwards of 5.6 GHz.

"Unknown" Functional Units: ATO and RTB

Oftentimes when a paper relating to a complex project is written
collaboratively by a group of people, details are lost. Still, it appeared
rather humorous that of the six design engineers and architects from the CELL
processor project present at Tuesday evening's chat session, no one could
recall what the acronyms ATO and RTB stood for. ATO and RTB are functional
blocks labeled in the floorplan of the SPE. However, neither the
functionality of these blocks nor the meaning of the acronyms was noted on
the floorplan, explained in the paper, or mentioned in the technical
presentation. In an effort to cover all the corners, this author placed the
question on a list of questions to be asked of the CELL project team members.
Hilarity thus ensued as slightly embarrassed CELL project members stared
blankly at each other in an attempt to recall the functionality or definition
of the acronyms.

In all fairness, since the SPE was presented on Monday and the CELL processor
itself was presented on Tuesday, CELL project members responsible for the SPE
were not present for Tuesday evening's chat sessions. As a result, the team
members responsible for the overall CELL processor and internal system
interconnects were asked to recall the meaning of acronyms of internal
functional units within the SPE. Hence, the task was unnecessarily
complicated by the absence of key personnel that would have been able to
provide the answer faster than the CELL processor can rotate a million
triangles by 12 degrees about the Z axis.

After some discussion (and more wine), it was determined that the ATO unit is
most likely the Atomic (memory) unit responsible for coherency
observation/interaction with dataflow on the EIB. Then, after the injection
of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most
likely stood for some sort of Register Translation Block whose precise
functionality was unknown to those outside of the SPE. However, this theory
would turn out to be incorrect.

Finally, after a sufficient number of hydrocarbon bonds had been broken down
into H-OH, on Wednesday a member of the CELL processor team tracked down the
relevant information and wrote:

The R in RTB is an internal 1 character identifier that denotes that the RTB
block is a unit in the SPE. The TB in RTB stands for "Test Block". It
contains the ABIST (Array Built In Self Test) engines for the Local Store and
other arrays in the SPE, as well as other test related control functions for
the SPE.

Element Interconnect Bus

The element interconnect bus is the on chip interconnect that ties together
all of the processing, memory, and I/O elements on the CELL processor. The
EIB is implemented as a set of four concentric rings that is routed through
portions of the SPE, where each ring is a 128 bit wide interconnect. To
reduce coupling noises, the wires are arranged in groups of four and
interleaved with ground and power shields. To further reduce coupling noises,
the direction of data flow alternates between each adjacent ring pair. Data
travels on the EIB through staged buffer/repeaters at the boundaries of each
SPE. That is, data is driven by one set of staged buffers and latched by the
buffers at the next stage every clock cycle. Data moving from one SPE through
other SPEs requires the use of repeaters in the intermediary SPEs for the
duration of the transfer. Independently from the buffer/repeater elements,
separate data on/off ramps exist in the BIU of the SPE, as data targeted for
the LS unit of a given SPE can be off-loaded at the BIU. Similarly, outgoing
data can be placed onto the EIB by the BIU.

Figure 8 - Counter rotational rings of the EIB - 4 SPEs shown

The design of the EIB is specifically geared toward the scalability of the
CELL processor. That is, signal path lengths on the EIB do not change
regardless of the number of SPEs in a given CELL processor configuration.
Since the data travels no more than the width of one SPE in each clock cycle,
more SPEs on a given CELL processor simply means that the data transport
latency increases by the number of additional hops through those SPEs. Data
transfer through the EIB is controlled by the EIB controller, which works
with the DMA engine and the channel controllers to reserve the buffer drivers
for a certain number of cycles for each data transfer request. The data
transfer algorithm works by reserving channel capacity for each data
transfer, thus providing support for real time applications. Finally, the
design and implementation of the EIB has a curious side effect in that it
limits the current version of the CELL processor to expand only along the
horizontal axis. Thus, the EIB enables the CELL processor to be highly
configurable: SPEs can be quickly and easily added or removed along the
horizontal axis, and the maximum number of SPEs that can be added is set by
the maximum width of the chip allowable by the reticle size of the
fabrication equipment.
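
The effect of adding SPEs on transport latency can be illustrated with a toy
model of counter rotating rings. The sketch below is a simplification made
for illustration only: it assumes the ring stops are numbered 0 to N-1 around
a closed loop, that data advances one stop per clock cycle, and that there is
no contention. The real number of stops (which would also include the PPE,
MIC and I/O elements) and the arbitration policy are not described in this
article, and the function names are hypothetical.

```python
def ring_hops(src, dst, num_stops, clockwise=True):
    """Hops from src to dst on a unidirectional ring with num_stops stops."""
    if clockwise:
        return (dst - src) % num_stops
    return (src - dst) % num_stops


def best_ring_hops(src, dst, num_stops):
    """Fewest hops when both a clockwise and a counter clockwise ring exist."""
    return min(ring_hops(src, dst, num_stops, clockwise=True),
               ring_hops(src, dst, num_stops, clockwise=False))


def worst_case_hops(num_stops):
    """Largest best-direction hop count between any two stops."""
    return max(best_ring_hops(0, dst, num_stops) for dst in range(num_stops))


if __name__ == "__main__":
    for stops in (4, 8, 16):
        print(f"{stops} stops: worst case {worst_case_hops(stops)} hops")
```

In this model the per hop wire length is constant, and the worst case hop
count grows as half the number of stops, which matches the description of
latency increasing with the number of additional hops.
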

The POWERPC Processing Element

Neither microarchitectural details nor the performance characteristics of the
POWERPC Processing Element were disclosed by IBM during ISSCC 2005. However,
what is known is that the PPE processor core is a new core that is fully
compliant with the POWERPC instruction set, the VMX instruction set extension
inclusive. Additionally, the PPE core is described as a two issue, in-order,
64 bit processor that supports 2 way SMT. The L1 caches of the PPE are
reported to be 32 KB each, and the unified L2 cache is 512 KB in size.
Furthermore, the lineage of the PPE can be traced to a research project
commissioned by IBM to examine high speed processor design with aggressive
circuit implementations. The results of this research project were published
by IBM first in the Journal of Solid State Circuits (JSSC) in 1998, then
again in ISSCC 2000.

The paper published in JSSC in 1998 described a processor implementation that
supported a subset of the POWERPC instruction set, and the paper published in
ISSCC 2000 described a processor that supported the complete POWERPC
instruction set and operated at 1 GHz on a 0.25µm process technology. The
microarchitecture of the research processor was disclosed in some detail in
the ISSCC 2000 paper. However, that processor was a single issue processor
whose design goal was to reach high operating frequency by limiting pipestage
delay to 13 FO4, and power consumption limitations were not considered. For
the PPE, several major changes in the design goal dictated changes in the
microarchitecture from the research processor disclosed at ISSCC in 2000.
Firstly, to further increase frequency, the per stage circuit delay design
target was lowered from 13 FO4 to 11 FO4. Secondly, limiting power
consumption and minimizing leakage current were added as high priority design
goals for the PPE. Collectively, these changes limited the per stage logic
depth, and the pipeline was lengthened as a result. The addition of SMT and
the two issue design goal completed the metamorphosis of the research
processor to the PPE. The result is a processing core that operates at a high
frequency with relatively low power consumption, and perhaps relatively
poorer scalar performance compared to the beefy POWER5 processor core.

Rambus XDR Memory System

Figure 9 - The two channel XDR Memory System

To provide machine balance and support the peak rating of more than 256 SP
GFlops (or 25-30 DP GFlops), the CELL processor requires an enormously
capable memory system. For that reason, two channels of Rambus XDR memory are
used to obtain 25.6 GB/s of memory bandwidth. In the XDR memory system, each
channel can support a maximum of thirty-six devices connected to the same
command and address bus. The data bus for each device connects to the memory
controller through a set of bi-directional point-to-point connections. In the
XDR memory system, addresses and commands are sent on the address and command
bus at a rate of 800 Mbits per second (Mbps), and the point-to-point
interface operates at a data rate of 3.2 Gbps. Using DRAM devices with 16 bit
wide data busses, each channel of XDR memory can sustain a maximum bandwidth
of 102.4 Gbps (2 x 16 x 3.2), or 12.8 GB/s. The CELL processor can thus
achieve a maximum bandwidth of 25.6 GB/s with a 2 channel, 4 device

The obvious advantage of the XDR memory system is the bandwidth that it
provides to the CELL processor. However, in the configuration illustrated in
Figure 9, the maximum of 4 DRAM devices means that the CELL processor is
limited to 256 MB of memory, given that the highest capacity XDR DRAM device
is currently 512 Mbits. Fortunately, XDR DRAM devices could in theory be
reconfigured in such a way so that more than 36 XDR devices can be connected
to the same 36 bit wide channel and provide 1 bit wide data bus each to the
36 bit wide point-to-point interconnect. In such a configuration, a two
channel XDR memory can support upwards of 16 GB of ECC protected memory with
256 Mbit DRAM devices or 32 GB of ECC protected memory with 512 Mbit DRAM
devices. As a result, the CELL processor could in theory address a large
amount of memory if the price premium of XDR DRAM devices could be minimized.
IBM did not release detailed information about the configuration of the XDR
memory system. One feature to watch for in the future is ECC support in the
DRAM memory system. Since ECC support is clearly not a requirement of a
processor to be used in a game machine, the presence of ECC support would
likely indicate IBM's ambition to promote the use of CELL processors in
applications that require superior reliability, availability and
serviceability, such as HPC, workstation or server systems.
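
The bandwidth and capacity figures for the 2 channel, 4 device XDR
configuration can be checked with a few lines of Python. The product 2 x 16 x
3.2 Gbps is 102.4 Gbps, which is 12.8 GB/s per channel after dividing by 8
bits per byte, or 25.6 GB/s for two channels. The sketch also derives a
bytes per flop ratio as one simple way to express machine balance; that
metric and the function names are illustrative choices, not something taken
from IBM or Rambus.

```python
BITS_PER_BYTE = 8


def channel_bandwidth_gbytes(devices, device_width_bits, pin_rate_gbps):
    """Peak bandwidth of one XDR channel in GB/s."""
    return devices * device_width_bits * pin_rate_gbps / BITS_PER_BYTE


def capacity_mbytes(devices, device_mbits):
    """Total capacity in MB for a number of DRAM devices of a given density."""
    return devices * device_mbits / BITS_PER_BYTE


def bytes_per_flop(bandwidth_gbytes, peak_gflops):
    """Memory bytes available per peak floating point operation."""
    return bandwidth_gbytes / peak_gflops


if __name__ == "__main__":
    per_channel = channel_bandwidth_gbytes(2, 16, 3.2)  # 2 devices per channel
    total = 2 * per_channel                             # 2 channels
    print(f"per channel: {per_channel:.1f} GB/s, total: {total:.1f} GB/s")
    print(f"capacity with 4 x 512 Mbit devices: {capacity_mbytes(4, 512):.0f} MB")
    print(f"balance: {bytes_per_flop(total, 256.0):.2f} bytes per SP flop")
```

At roughly 0.1 bytes of memory bandwidth per peak single precision flop, the
SPEs must get most of their operands from the Local Store rather than from
DRAM to approach the peak rating.
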

Incidentally, Toshiba is a manufacturer of XDR DRAM devices. Presumably it
brought the XDR memory controller and memory system design expertise to the
table, and could ramp up production of XDR DRAM devices as needed.

FlexIO System Interface

At ISSCC 2005, Rambus presented a paper on the FlexIO interface used on the
CELL processor. However, the presentation was limited to describing the
physical layer interconnect. Specifically, the difficulties of implementing
the Redwood Rambus ASIC Cell on IBM's 90nm SOI process were examined in some
detail. While circuit level issues regarding the challenges of designing high
speed I/O interfaces on an SOI based process are in their own right extremely
intriguing topics, the focus of this article is geared toward the
architectural implications of the high bandwidth interface. As a result, the
circuit level details will not be covered here. Interested readers are
encouraged to seek out details on Rambus's Redwood technology separately.

What is known about the system interface of the CELL processor is that the
FlexIO consists of 12 byte lanes. Each byte lane is a set of 8 bit wide,
source synchronous, unidirectional, point-to-point interconnects. The FlexIO
makes use of differential signaling to achieve the data rate of 6.4 Gb per
second per signal pair, and that data rate in turn translates to 6.4 GB/s per
byte lane. The 12 byte lanes are asymmetric in configuration. That is, 7 byte
lanes are outbound from the CELL processor, while 5 byte lanes are inbound to
the CELL processor. The 12 byte lanes thus provide 44.8 GB/s of raw outbound
bandwidth and 32 GB/s of raw inbound bandwidth for total I/O bandwidth of
76.8 GB/s. Furthermore, the byte lanes are arranged into two groups of ports:
one group of ports is dedicated to non-coherent off-chip traffic, while the
other group of ports is usable for coherent off-chip traffic. It seems clear
that Sony itself is unlikely to make use of a coherent, multiple CELL
processor configuration for Playstation 3. However, the fact that the PPE and
the SPEs can snoop traffic transported through the EIB, and that coherency
traffic can be sent to other CELL processors via a coherent interface, means
that the CELL processor can indeed be an interesting processor. If nothing
else, the CELL processor should enable startups that propose to build FlexIO
based coherency switches to garner immediate interest from venture

The CELL processor presents an intriguing alternative in its pursuit of
performance. It seems to be a foregone conclusion that the CELL processor
will be an enormously successful product, and that millions of CELL
processors will be sold as the processors that power the next generation Sony
Playstation. However, IBM has designed some features into the CELL processor
that clearly reveal its ambition in seeking new applications for the CELL
processor. At ISSCC 2005, much fanfare was generated by the rating of
256 GFlops @ 4 GHz for the CELL processor. However, it is the little
mentioned double precision capability and the as yet undisclosed system level
coherency mechanism that appear to be the most intriguing aspects that could
enable the CELL processor to find success not just inside the Playstation,
but outside of it as well.

[1] J. Silberman et al., "A 1.0-GHz Single-Issue 64-Bit PowerPC Integer
Processor", IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov. 1998.
[2] P. Hofstee et al., "A 1 GHz Single-Issue 64b PowerPC Processor",
International Solid-State Circuits Conference Technical Digest, Feb. 2000.
[3] N. Rohrer et al., "PowerPC in 130nm and 90nm Technologies", International
Solid-State Circuits Conference Technical Digest, Feb. 2004.
[4] B. Flachs et al., "A Streaming Processing Unit for A CELL Processor",
International Solid-State Circuits Conference Technical Digest, Feb. 2005.
[5] D. Pham et al., "The Design and Implementation of a First-Generation CELL
Processor", International Solid-State Circuits Conference Technical Digest,
Feb. 2005.
[6] J. Kuang et al., "A Double-Precision Multiplier with Fine-Grained
Clock-Gating Support for a First-Generation CELL Processor", International
Solid-State Circuits Conference Technical Digest, Feb. 2005.
[7] S. Dhong et al., "A 4.8 GHz Fully Pipelined Embedded SRAM in the
Streaming Processor of a CELL Processor", International Solid-State Circuits
Conference Technical Digest, Feb. 2005.
[8] K. Chang et al., "Clocking and Circuit Design for a Parallel I/O on a
First-Generation CELL Processor", International Solid-State Circuits
Conference Technical Digest, Feb. 2005.

Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net