[Beowulf] CPU Startup Combines CPU+DRAM—And A Whole Bunch Of Crazy

Eugen Leitl eugen at leitl.org
Mon Jan 23 05:45:10 PST 2012

(Old idea, makes sense, will they be able to pull it off?)


CPU Startup Combines CPU+DRAM—And A Whole Bunch Of Crazy

Sunday, January 22, 2012 - by Joel Hruska

The CPU design firm Venray Technology announced a new product design this
week that it claims can deliver enormous performance benefits by combining
CPU and DRAM onto a single piece of silicon. We spent some time earlier this
fall discussing the new TOMI (Thread Optimized Multiprocessor) with company
CTO Russell Fish, but while the idea is interesting, its presentation is
marred by crazy conceptualizing and deeply suspect analytics.

The Multicore Problem:

Three factors, or walls, limit the scaling of modern microprocessors. First,
there's the memory wall, defined as the gap between CPU and DRAM clock
speeds. Second, there's the ILP (Instruction Level
Parallelism) wall, which refers to the difficulty of decoding enough
instructions per clock cycle to keep a core completely busy. Finally, there's
the power wall--the faster a CPU is and the more cores it has, the more power
it consumes.

Attempting to compensate for one wall often risks running afoul of the other
two. Adding more cache to decrease the impact of the CPU/DRAM speed
discrepancy adds die complexity and draws more power, as does raising CPU
clock speed. Combined, the three walls are a set of fundamental
constraints--improving architectural efficiency and moving to a smaller
process technology may make the room a bit bigger, but they don't remove the
walls themselves.

TOMI attempts to redefine the problem by building a very different type of
microprocessor. The TOMI Borealis is built using the same transistor
structures as conventional DRAM; the chip trades clock speed and performance
for ultra-low leakage. Its design is, by necessity, extremely simple. Not
counting the cache, TOMI is a 22,000 transistor design, as compared to 30,000
transistors for the original ARM2. The company's early prototypes, built on
legacy DRAM technology, ran at 500MHz on a 110nm process.

Instead of surrounding a CPU core with a substantial amount of L2 and L3
cache, Venray inserted a CPU core directly into a DRAM design. A TOMI
Borealis IC connects eight TOMI cores to a 1Gbit DRAM, with a total of 16
ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM.
Because they're built using ultra-low-leakage processes and are so small,
such cores cost very little to build and consume vanishingly small amounts of
power (Venray claims power consumption is as low as 23mW per core at 500MHz).
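
The per-DIMM figures above follow from simple multiplication. As a rough
sketch (the function and constant names here are illustrative, not Venray's,
and the power input is the company's unverified best-case claim), the
arithmetic can be checked like this:

```python
# Figures as quoted in the article; the power number is Venray's own claim.
CORES_PER_IC = 8          # TOMI cores per Borealis IC
ICS_PER_DIMM = 16         # ICs per DIMM
GBIT_PER_IC = 1           # DRAM capacity per IC, in Gbit
CLAIMED_MW_PER_CORE = 23  # claimed best-case draw per core at 500MHz


def dimm_totals(cores_per_ic=CORES_PER_IC, ics_per_dimm=ICS_PER_DIMM,
                gbit_per_ic=GBIT_PER_IC, mw_per_core=CLAIMED_MW_PER_CORE):
    """Return (cores, capacity in GB, core power in W) for one DIMM."""
    cores = cores_per_ic * ics_per_dimm
    capacity_gb = ics_per_dimm * gbit_per_ic / 8  # 8 bits per byte
    core_power_w = cores * mw_per_core / 1000     # cores only, not the DRAM arrays
    return cores, capacity_gb, core_power_w


cores, capacity_gb, core_power_w = dimm_totals()
print(cores, capacity_gb, round(core_power_w, 3))
```

If the 23mW claim held for every core, the 128 cores on a 2GB DIMM would
draw a little under 3W combined. That is a derived figure, not one Venray
has published, and it excludes the DRAM itself.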

It's an interesting idea.

The Bad:

When your CPU has fewer transistors than an architecture that debuted in
1986, there's a good chance that you left a few things out--like an FPU, branch
prediction, pipelining, or any form of speculative execution. Venray may have
created a chip with power consumption an order of magnitude lower than
anything ARM builds and more memory bandwidth than Intel's highest-end Xeons,
but it's an ultra-specialized, ultra-lightweight core that trades 25 years of
flexibility and performance for scads of memory bandwidth.

The last few years have seen a dramatic surge in the number of low-power,
many-core architectures being floated as the potential future of computing,
but Venray's approach relies on the manufacturing expertise of companies that
have no experience in building microprocessors and don't normally serve as
foundries. This imposes fundamental restrictions on the CPU's ability to
scale; DRAM is manufactured using a three-layer mask rather than the 10-12
layers Intel and AMD use for their CPUs. Venray already acknowledges that
these conditions imposed substantial limitations on the original TOMI design.

Of course, there's still a chance that the TOMI uarch could be effective in
certain bandwidth-hungry scenarios--but that's where the Venray Crazy Train
goes flying off the track.

The Disingenuous and Crazy

Let's start here. In a graph like this, you expect the two bars to represent
the same systems being compared across three different characteristics.
That's not the case. When we spoke to Russell Fish in late November, he
pointed us to this publicly available document and claimed that the results
came from a customer with 384 2.1GHz Xeons. There's no such thing as an S5620
Xeon, and even if we grant that he meant the E5620 CPU, that's a 2.4GHz chip.

The "Power consumption" graphs show Oracle's maximum power consumption for a
system with 10x Xeon E7-8870s, 168 dedicated SQL processors, 5.3TB (yes, TB)
of Flash and 15x 10,000 RPM hard drives. It's not only a worst-case figure,
it's a figure utterly unrelated to the workload shown in the Performance
comparison. Furthermore, given that each Xeon E7-8870 has a 130W TDP, ten of
them only come out to 1.3kW--Oracle's 17.7kW figure means that the
overwhelming majority of the cabinet's power consumption is driven by
components other than its CPUs.
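
To put a number on "overwhelming majority", here is a back-of-the-envelope
sketch using only the figures quoted above. The helper name is made up for
illustration, and TDP is a thermal design rating rather than a measured
draw, so treat the result as an approximation:

```python
def cpu_share(cpu_count, tdp_watts, system_watts):
    """Return (total CPU TDP in W, CPU fraction of the system power figure)."""
    cpu_watts = cpu_count * tdp_watts
    return cpu_watts, cpu_watts / system_watts


# 10x Xeon E7-8870 at 130W TDP each, against Oracle's 17.7kW maximum.
cpu_watts, share = cpu_share(cpu_count=10, tdp_watts=130, system_watts=17_700)
print(f"{cpu_watts} W of CPU TDP is {share:.1%} of the cabinet figure")
```

On these inputs the ten Xeons account for roughly 7% of the quoted maximum,
leaving more than 90% attributable to everything else in the cabinet.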

From here, things rapidly get worse. Fish makes his points about power walls
by referring to unverified claims that prototype 90nm Tejas chips drew 150W
at 2.8GHz back in 2004. That's like arguing that Ford can't build a decent
car because the Edsel sucked.

After reading about the technology, you might think Venray was planning to
market a small chip to high-end HPC niche markets... and you'd be wrong. The
company expects the following to occur as a result of this revolutionary
architecture (organized by least-to-most creepy):

    Computer speech will be so common that devices will talk to other devices
in the presence of their users.

    Your cell phone camera will recognize the face of anyone it sees and scan
the computer cloud for background red flags as well as six degrees of

    Common commands will be reduced to short verbal cues like clicking your
tongue or sucking your lips

    Your personal history will be displayed for one and all to see...women
will create search engines to find eligible, prosperous men. Men will create
search engines to qualify women. Criminals will find their jobs much more
difficult because their history will be immediately known to anyone who
encounters them.

    TOMI Technology will be built on flash memories creating the elemental
unit of a learning machine... the machines will be able to self organize,
build robust communicating structures, and collaborate to perform tasks.

    A disposable diaper company will give away TOMI enabled teddy bears that
teach reading and arithmetic. It will be able to identify specific
children... and from time to time remind Mom to buy a product. The bear will
also diagnose a raspy throat, a cough, or runny nose.


Fish has spent decades in the microprocessor industry--together with Chuck H.
Moore, he invented the first CPU to use a clock multiplier--but his
vision of the future is crazy enough to scare mad dogs and Englishmen.

His idea for a CPU architecture is interesting, even underneath the
obfuscation and false representation, but too practically limited to ever
take off. Google, an enthusiastic and dedicated proponent of
energy-efficient, multi-core research, said it best in a paper titled "Brawny
cores still beat wimpy cores, most of the time."

 "Once a chip’s single-core performance lags by more than a factor to two or
so behind the higher end of current-generation commodity processors, making a
business case for switching to the wimpy system becomes increasingly
difficult... So go forth and multiply your cores, but do it in moderation, or
the sea of wimpy cores will stick to your programmers’ boots like clay."
