[Beowulf] Chinese Chip Wins Energy-Efficiency Crown

Eugen Leitl eugen at leitl.org
Wed May 4 03:33:46 PDT 2011


Chinese Chip Wins Energy-Efficiency Crown 

Though slower than competitors, the energy-saving Godson-3B is destined 
for the next Chinese supercomputer

By Joseph Calamia  /  May 2011

The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in
the third quarter of 2011, will have something quite different under its
hood. Unlike its forerunners, which employed American-born chips, this
machine will harness the country's homegrown high-end processor, the
Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than
its competitors' wares, at least one of which operates at more than 5 GHz,
but the chip still turns heads with its record-breaking energy efficiency. It
can execute 128 billion floating-point operations per second using just 40
watts—double or more the performance per watt of competitors.
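
As a quick check of the efficiency figure, the arithmetic can be sketched as below. The function name and the derived competitor ceiling are illustrative assumptions; only the 128 gigaflops and 40 watts come from the article.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Performance per watt, in billions of floating-point operations per second per watt."""
    return gflops / watts


# Figures quoted in the article for the Godson-3B.
godson_3b = gflops_per_watt(128.0, 40.0)

# "Double or more" implies competitors sit at or below half of that figure.
# This ceiling is inferred from the article's wording, not a measured value.
implied_competitor_ceiling = godson_3b / 2

print(f"Godson-3B: {godson_3b:.1f} GFLOPS/W")
print(f"Implied competitor ceiling: {implied_competitor_ceiling:.1f} GFLOPS/W")
```

On these numbers the Godson-3B comes out at 3.2 gigaflops per watt, so "double or more" would put competing chips at roughly 1.6 gigaflops per watt or less.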

The Godson has an eccentric interconnect structure—for relaying messages
among multiple processor cores—that also garners attention. While Intel and
IBM are commercializing chips that will shuttle communications between cores
merry-go-round style on a "ring interconnect," the Godson connects cores
using a modified version of the gridlike interconnect system called a mesh
network. The processor's designers, led by Weiwu Hu at the Chinese Academy of
Sciences, in Beijing, seem to be placing their bets on a new kind of layout
for future high-end computer processors.

A mesh design goes hand in hand with saving energy, says Matthew Mattina,
chief architect at the San Jose, Calif.–based Tilera Corp., a chipmaker now
shipping 36- and 64-core processors using on-chip mesh interconnects.

Imagine a ring interconnect as a traffic roundabout. Getting to some exits
requires you to drive nearly around the entire circle. Traveling away from
your destination before getting there, says Mattina, requires more transistor
switching and therefore consumes more energy. A mesh network is more like a
city's crisscrossed streets. "In a mesh, you always traverse the minimum
amount of wire—you're never going the wrong way," he says.
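
The roundabout-versus-streets picture can be made concrete by counting hops. The sketch below is only an illustration under stated assumptions: a one-way ring, a square mesh with cores numbered row by row, and messages that travel along rows and then columns. None of these details come from the article, and the two layouts would not place the same cores next to each other on a real chip.

```python
def ring_hops(src: int, dst: int, n: int) -> int:
    """Hops from src to dst on a one-way ring with n stops (the roundabout)."""
    return (dst - src) % n


def mesh_hops(src: int, dst: int, width: int) -> int:
    """Hops on a square mesh, cores numbered row by row (assumed numbering)."""
    src_col, src_row = src % width, src // width
    dst_col, dst_row = dst % width, dst // width
    # Moving only toward the destination gives the Manhattan distance.
    return abs(src_col - dst_col) + abs(src_row - dst_row)


# 16 cores: core 1 sending to core 0, the stop just "behind" it.
print(ring_hops(1, 0, 16))   # one-way ring: almost a full lap
print(mesh_hops(1, 0, 4))    # 4 x 4 mesh: one step to the neighbor
print(mesh_hops(0, 15, 4))   # 4 x 4 mesh: corner to opposite corner
```

In this toy model the ring message takes 15 hops to reach the stop just behind it, while the mesh message takes 1; even the mesh's worst case, corner to corner, is 6 hops.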

On the 8-core Godson chip, 4 cores form a tightly bound unit—each core sits
on a corner of a square of interconnects, as in a usual mesh. Godson
researchers have also connected each corner to its opposite, using a pair of
diagonal interconnects to form an X through the square's center. A "crossbar"
interconnect then serves as an overpass, linking this 4-core neighborhood to
a similar 4-core setup nearby.
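
To see what the diagonal links buy inside one 4-core unit, the layout can be modeled as a small graph. This is a minimal sketch, not the Godson designers' own model: the corner numbering is an assumption, and the crossbar between the two 4-core units is left out because the article does not say how it attaches.

```python
from collections import deque


def max_hops(nodes, edges):
    """Largest shortest-path hop count between any two nodes (breadth-first search)."""
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    worst = 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


corners = [0, 1, 2, 3]                         # numbered clockwise (assumed)
square = [(0, 1), (1, 2), (2, 3), (3, 0)]      # the usual mesh links
diagonals = [(0, 2), (1, 3)]                   # the X through the center

print(max_hops(corners, square))               # plain square
print(max_hops(corners, square + diagonals))   # square plus diagonals
```

With only the square's sides, opposite corners are 2 hops apart; adding the two diagonals puts every core 1 hop from every other core in the unit.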

Godson developers believe that their modified mesh's scalability will prove a
key advantage, as chip designers cram more cores onto future chips. Yunji
Chen, a Godson architect, says that competitors' ring interconnects may have
trouble squeezing in more than 32 cores.

Indeed, one of the ring's benefits could prove its future liability. Linking
new cores to a ring is fairly easy, says K.C. Smith, an emeritus professor of
electrical and computer engineering at the University of Toronto. After all,
there's only one path to send information—or two in a bidirectional ring. But
sharing a common communication path also means that each additional core adds
to the length of wire that messages must travel and increases the demand for
that path. With a large number of cores, "the timing around this ring just
gets out of hand," Smith says. "You can't get service when you need it."

Of course, adding more cores in a mesh also stresses the system. Even if you
have a grid of paths providing multiple communication channels, more cores
increase the demand on the network, and more demand makes traveling long
distances difficult: Try driving across New York City at rush hour. Still,
the bandwidth scaling of a mesh interconnect is superior to that of a ring,
Tilera's Mattina says. He notes that the total bandwidth available with a
mesh interconnect increases as you add cores, but with a ring interconnect,
the total bandwidth remains constant even as the core count increases.
Latency—the time it takes to get a message from one core to another—is also
more favorable in a mesh design, Chen says. In a ring interconnect, latency
increases linearly with the core count, he says, while in a mesh design it
increases with the square root of the number of cores.
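
The scaling claims above can be illustrated with a simple model. The sketch below assumes a two-way ring and a square mesh with an even number of cores per side, and it uses two stand-in measures that are not from the article: worst-case hop count as a proxy for latency, and the number of links cut when the chip is split in half as a proxy for total bandwidth.

```python
import math


def _side(n: int) -> int:
    """Side length of a square mesh with n cores; rejects non-square counts."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("core count must be a perfect square")
    return side


def ring_worst_hops(n: int) -> int:
    """Farthest stop on a two-way ring is halfway around."""
    return n // 2


def mesh_worst_hops(n: int) -> int:
    """Farthest core on a square mesh is the opposite corner."""
    return 2 * (_side(n) - 1)


def ring_bisection_links(n: int) -> int:
    """Cutting a ring in half always severs exactly two links."""
    return 2


def mesh_bisection_links(n: int) -> int:
    """Cutting an even-sided square mesh in half severs one link per row."""
    return _side(n)


print("cores  ring_hops  mesh_hops  ring_links  mesh_links")
for n in (16, 64, 256):
    print(f"{n:5d}  {ring_worst_hops(n):9d}  {mesh_worst_hops(n):9d}  "
          f"{ring_bisection_links(n):10d}  {mesh_bisection_links(n):10d}")
```

In this model, going from 16 to 256 cores raises the ring's worst case from 8 hops to 128 but the mesh's only from 6 to 30, while the links across the middle stay at 2 for the ring and grow from 4 to 16 for the mesh.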

Reid Riedlinger, a principal engineer at Intel, points out that a ring
interconnect has its own scalability benefits. Intel's recently unveiled
8-core Poulson design employs a ring not only to add more cores but also to
add easy-to-access on-chip memory, or cache. As long as the chip has the
power and the space, Riedlinger says, a ring makes it easy to add each core
and cache as a module—a move that would require more complicated validity
studies and logic modification in a mesh. "Adding the additional ring stop
has a very small impact on latency, and the additional cache capacity will
provide performance benefits for many applications," he says.

For those who are not building a national supercomputer, Riedlinger also
points out that a ring setup is more easily scalable in a different
direction. "You might start with an 8-core design," he says, "and then, to
suit a different market segment, you might chop 4 cores out of the middle and
sell it as a different product."

This article originally appeared in print as "China's Godson Gamble".
