[Beowulf] The death of CPU scaling: From one core to many — and why we’re still stuck

Eugen Leitl eugen at leitl.org
Thu Feb 9 03:12:12 PST 2012


The death of CPU scaling: From one core to many — and why we’re still stuck

By Joel Hruska on February 1, 2012 at 2:31 pm

It’s been nearly eight years since Intel canceled Tejas and announced its
plans for a new multi-core architecture. The press wasted little time in
declaring conventional CPU scaling dead — and while the media has a tendency
to bury products, trends, and occasionally people well before their
expiration date, this is one declaration that’s stood the test of time.

To understand the magnitude of what happened in 2004 it may help to consult
the following chart. It shows transistor counts, clock speeds, power
consumption, and instruction-level parallelism (ILP). The doubling of
transistor counts every two years is known as Moore’s law, but over time,
assumptions about performance and power consumption were also made and shown
to advance along similar lines. Moore got all the credit, but he wasn’t the
only visionary at work. For decades, microprocessors followed what’s known as
Dennard scaling. Dennard predicted that oxide thickness, transistor length,
and transistor width could all be scaled by a constant factor. Dennard
scaling is what gave Moore’s law its teeth; it’s the reason the
general-purpose microprocessor was able to overtake and dominate other types
of computers.
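The constant-field scaling rules Dennard described can be sketched numerically. The snippet below is a minimal illustration (the factor k and the starting values are arbitrary, not from the article): shrinking dimensions and voltage by k cuts per-transistor power by k², while area also shrinks by k², so power density stays flat even as clocks rise.

```python
# Classic Dennard (constant-field) scaling: shrink every linear
# dimension and the supply voltage by the same factor k.
def dennard_scale(length, voltage, frequency, k):
    """Return (length, voltage, frequency, power_density_ratio) after one shrink."""
    new_length = length / k        # transistor dimensions shrink by k
    new_voltage = voltage / k      # supply voltage drops by k
    new_frequency = frequency * k  # gate delay falls, so clocks can rise by k
    # Power per transistor ~ C * V^2 * f ~ (1/k) * (1/k)^2 * k = 1/k^2,
    # while transistor area shrinks by k^2 -> power *density* is unchanged.
    power_density_ratio = ((1 / k) * (1 / k) ** 2 * k) / (1 / k ** 2)
    return new_length, new_voltage, new_frequency, power_density_ratio

_, _, _, pd = dennard_scale(90e-9, 1.2, 3.0e9, 1.4)
print(round(pd, 6))  # -> 1.0: constant power density, the "free lunch"
```

That constant power density is exactly what broke in the mid-2000s, as the next section describes.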

CPU Scaling [1]
CPU scaling showing transistor density, power consumption, and efficiency.
Chart originally from The Free Lunch Is Over: A Fundamental Turn Toward
Concurrency in Software [2]

The original 8086 drew ~1.84W and the P3 1GHz drew 33W, meaning that CPU
power consumption increased by 17.9x while CPU frequency improved by 125x.
Note that this doesn’t include the other advances that occurred over the same
time period, such as the adoption of L1/L2 caches, the invention of
out-of-order execution, or the use of superscalar execution and pipelining to improve
processor efficiency. It’s for this reason that the 1990s are sometimes
referred to as the golden age of scaling. This expanded version of Moore’s
law held true into the mid-2000s, at which point the power consumption and
clock speed improvements collapsed. The problem at 90nm was that gate oxides
had become too thin to prevent current from leaking into the substrate.
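The article's ratios can be checked against its own figures. The 8 MHz clock below is an assumption (the text gives only the 125x ratio, which implies the commonly cited 8 MHz 8086):

```python
# Sanity-check the article's figures: 8086 vs. 1 GHz Pentium III.
p_8086, p_p3 = 1.84, 33.0  # watts, as given in the text
f_8086, f_p3 = 8e6, 1e9    # hertz; 8 MHz is assumed, not stated

print(round(p_p3 / p_8086, 1))  # -> 17.9 (power grew ~18x)
print(round(f_p3 / f_8086))     # -> 125 (frequency grew 125x)
```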

Intel and other semiconductor manufacturers have fought back with innovations
[3] like strained silicon, high-k metal gate, FinFET, and FD-SOI — but none of
these has re-enabled anything like the scaling we once enjoyed. From 2007 to
2011, maximum CPU clock speed (with Turbo Mode enabled) rose from 2.93GHz to
3.9GHz, an increase of 33%. From 1994 to 1998, CPU clock speeds rose by 300%.

Next page: The multi-core swerve [4]

The multi-core swerve

For the past seven years, Intel and AMD have emphasized multi-core CPUs as
the answer to scaling system performance, but there are multiple reasons to
think the trend towards rising core counts is largely over. First and
foremost, there’s the fact that adding more CPU cores never results in
perfect scaling. In any parallelized program, performance is ultimately
limited by the amount of serial code (code that can only be executed on one
processor). This is known as Amdahl’s law. Other factors, such as the
difficulty of maintaining concurrency across a large number of cores, also
limit the practical scaling of multi-core solutions.
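Amdahl's law is easy to state as a formula: with a fraction p of a program parallelizable, the speedup on n cores is 1 / ((1 − p) + p/n). A short sketch (the 90%-parallel figure is an illustrative assumption, not from the article) shows how quickly the returns diminish:

```python
# Amdahl's law: speedup is capped by the serial fraction of the code.
def amdahl_speedup(p, n):
    """Speedup on n cores when fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a program that is 90% parallel can never exceed 10x speedup,
# no matter how many cores you throw at it:
for n in (2, 4, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
# -> 2 1.82, 4 3.08, 8 4.71, 64 8.77, 1000000 10.0
```

Note how doubling from 4 to 8 cores buys barely a 1.5x gain here; this is the "never results in perfect scaling" problem in concrete terms.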

Amdahl's Law [5]

AMD’s Bulldozer is a further example of how bolting more cores together can
result in a slower end product [6]. Bulldozer was designed to share logic and
caches in order to reduce die size and allow for more cores per processor,
but the chip’s power consumption badly limits its clock speed while slow
caches hamstring instructions per cycle (IPC). Even if Bulldozer had been a
significantly better chip, it wouldn’t change the long-term trend towards
diminishing marginal returns. The more cores per die, the lower the chip’s
overall clock speed. This leaves the CPU ever more reliant on parallelism to
extract acceptable performance. AMD isn’t the only company to run into this
problem; Oracle’s new T4 processor is the first Niagara-class chip to focus
on improving single-thread performance rather than pushing up the total
number of threads per CPU.

Rage Jobs [7]

The difficulty of software optimization is a further reason why adding more
CPU cores doesn’t help much. Game developers have made progress in using
multi-core systems, but the rate of advance has been slow. Games like Rage
[8] and Battlefield 3 — two high-profile titles that use multiple cores —
both utilized new engines designed from the ground up with multi-core scaling
as a primary goal.

The bottom line is that it’s been easier for Intel and AMD to add cores than
for software to take advantage of them. Seven years after the
multi-core era began, it’s already morphing into something different.

Next page: The rise (and limit) of Many-Core [9]

The rise (and limit) of Many-Core

In this context, we’re using the term “many-core” to refer to a wide range of
programmable hardware. GPUs from AMD and Nvidia are both “many-core”
products, as are chips from companies like Tilera. Intel’s Knights Corner
[10] is a many-core chip.

The death of conventional scaling has sparked a sharp increase in the number
of companies researching various types of specialized CPU cores. Prior to
that point, general-purpose CPU architectures, exemplified by Intel’s x86,
had eaten through the high-end domains of add-in boards and co-processors at
a ferocious rate. Once that trend slammed into the brick wall of physics,
more specialist architectures began to appear.

Many-core Scaling [11]
Note: Three exclamation points don’t actually mean anything, despite the
fondest wishes of AMD’s marketing department

Despite what some companies like to claim, specialized many-core chips don’t
“break” Moore’s law in any way and are not exempt from the realities of
semiconductor manufacturing. What they offer is a tradeoff — a less general,
more specialized architecture that’s capable of superior performance on a
narrower range of problems. They’re also less encumbered by socket power
constraints — Intel’s CPUs top out at 140W TDP; Nvidia’s upper-range GPUs are
in the 250W range.

Intel’s upcoming Many Integrated Core (MIC) architecture is partly an attempt
to capitalize on the benefits of having a separate interface and giant PCB
for specialized, ultra-parallel data crunching. AMD, meanwhile, has focused
on consumer-side applications and the integration of CPU and GPU via what it
calls Graphics Core Next [12]. Regardless of market segmentation, all three
companies are talking about integrating specialized co-processors that excel
at specific tasks, one of which happens to be graphics.

AMD's many-core strategy [13]

Unfortunately, this isn’t a solution. Incorporating a specialized many-core
processor on-die or relying on a discrete solution to boost performance is a
bid to improve efficiency per watt, but it does nothing to address the
underlying problem that transistors can no longer be counted on to scale the
way they used to. The fact that transistor density continues to scale while
power consumption and clock speed do not has given rise to a new term: dark
silicon. It refers to the percentage of silicon on a processor that can’t be
powered up simultaneously without breaching the chip’s TDP.
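The dark silicon squeeze can be illustrated with a toy model (all numbers below are illustrative assumptions, not from the article or the report): density doubles per node, but without Dennard scaling the switching energy per transistor falls by much less than half, so under a fixed TDP the fraction of the die that can be lit at once keeps shrinking.

```python
# Toy model of dark silicon under a fixed power budget (TDP).
def lit_fraction(node, density_gain=2.0, energy_scale=0.7, budget=1.0):
    """Fraction of the die that can switch at once, n nodes after a baseline."""
    transistors = density_gain ** node  # relative transistor count
    energy = energy_scale ** node       # relative energy per transistor switch
    full_power = transistors * energy   # power if every transistor is active
    return min(1.0, budget / full_power)

for node in range(5):
    print(node, round(lit_fraction(node), 2))
# -> 0 1.0, 1 0.71, 2 0.51, 3 0.36, 4 0.26
```

Four nodes in, roughly three-quarters of the transistors in this sketch must sit dark at any given moment, which is the mismatch the researchers quantify below.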

A recent report on dark silicon and the future of multi-core devices
describes the future in stark terms. The researchers considered both
transistor scaling as forecast by the International Technology Roadmap for
Semiconductors (ITRS) and by a more conservative amount; they factored in the
use of APU-style combinations, the rise of so-called “wimpy” cores [14], and
the future scaling of general-purpose multiprocessors. They concluded:

    Regardless of chip organization and topology, multicore scaling is power
limited to a degree not widely appreciated by the computing community… Given
the low performance returns… adding more cores will not provide sufficient
benefit to justify continued process scaling. Given the time-frame of this
problem and its scale, radical or even incremental ideas simply cannot be
developed along typical academic research and industry product cycles… A new
driver of transistor utility must be found, or the economics of process
scaling will break and Moore’s Law will end well before we hit final
manufacturing limits.

Over the next few years scaling will continue to slowly improve. Intel will
likely meander up to 6-8 cores for mainstream desktop users at some point,
quad-cores will become standard at every product level, and we’ll see much
tighter integration of CPU and GPU. Past that, it’s unclear what happens
next. The gap between present-day systems and DARPA’s exascale computing
initiative [15] will diminish only marginally with each successive node;
there’s no clear understanding of how — or if — classic Dennard scaling can
be re-initiated.

This is part one of a two-part story. Part two will deal with how Intel is
addressing the problem through what it calls the “More than Moore” approach
and its impact on the mobile market.


    [1] : http://www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
    [2] The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software: http://www.gotw.ca/publications/concurrency-ddj.htm
    [3] fought back with innovations:
    [4] The multi-core swerve:
    [5] : http://www.extremetech.com/wp-content/uploads/2012/02/Amdahl.png
    [6] a slower end product:
    [7] : http://www.extremetech.com/wp-content/uploads/2012/02/Rage-Jobs.jpg
    [8] Rage:
    [9] The rise (and limit) of Many-Core:
    [10] Knights Corner:
    [11] : http://www.extremetech.com/wp-content/uploads/2012/02/Scaling1.jpg
    [12] Graphics Core Next:
    [13] : http://www.extremetech.com/wp-content/uploads/2012/02/ManyCoreAMD.jpg
    [14] “wimpy” cores:
    [15] DARPA’s exascale computing initiative:

More information about the Beowulf mailing list