[Beowulf] Nehalem EX

Kilian CAVALOTTI kilian.cavalotti.work at gmail.com
Thu Sep 24 04:56:23 PDT 2009


On Thu, Sep 24, 2009 at 12:40 PM, Hearns, John <john.hearns at mclaren.com> wrote:
> Internal ring buses? How long till you lot are benchmarking them and
> claiming your code is taking too long because the data is moving round
> the bus in the wrong direction :-)

Well, if you add cache coloring
(http://en.wikipedia.org/wiki/Cache_coloring) to the mix, you can
pretty much have the whole DC metro running in your cores. :)

> I thought understanding L1, 2 and 3 caches was hard enough, without
> having to think about rings.

Since creating a monolithic 24MB L3 cache would have made it slow as a
slug, they basically added a second level of L2 cache, local to each
core, and connected them together with a bidirectional bus, so that
"if any core needs a byte from any other cache, it is no more than 4
ring hops to the right cache slice."
It looks a bit like HT Assist
(http://www.bit-tech.net/news/hardware/2009/06/01/amd-launches-6-core-istanbul-opteron-proces/1),
except that it's on-chip rather than between CPUs, and it's supposed
to behave like a large shared L3.
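
To make the "no more than 4 ring hops" figure concrete, here is a rough
sketch in Python. It assumes 8 ring stops (one per cache slice) and that
a request always travels in the shorter direction; the function name
ring_hops is made up for illustration and is not anything from Intel.

```python
def ring_hops(src: int, dst: int, n_stops: int = 8) -> int:
    """Fewest hops between two stops on a bidirectional ring.

    Assumption: stops are numbered 0..n_stops-1 around the ring and a
    request may travel in either direction.
    """
    clockwise = (dst - src) % n_stops
    counter_clockwise = n_stops - clockwise
    return min(clockwise, counter_clockwise)


# Worst case over all destinations, starting from stop 0.
worst_bidirectional = max(ring_hops(0, d) for d in range(8))

# For comparison: on a one-way ring the worst case is n_stops - 1.
worst_unidirectional = max((d - 0) % 8 for d in range(8))

print(worst_bidirectional, worst_unidirectional)
```

With 8 stops the worst case is 4 hops going both ways around, against 7
on a one-way ring, which is presumably why the bus is bidirectional.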

> Ah well.  Toroids on chip next?

Further down, in the posted article:
"The transistor count of 2.3 billion backs that up. To make it all
work, the center of the chip has a block called the router. It is a
crossbar switch that connects all internal and external channels, up
to eight at a time."

The chip itself is becoming a NUMA-like system, with its own internal
network, a crossbar switch and its own internal topology. At some
point, if the number of cores continues to grow, it wouldn't be that
surprising to see some locality emerge, in the form of local clusters
of cores, tightly coupled on a ring bus, and interconnected to other
clusters of cores through QPI links (intra- or inter-chip). Network
architectures as we see them today at the Infiniband interconnect
level could very well make their way into the chips.

So yes, toroids, why not? :)
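
For what it's worth, a torus is just a ring in every dimension, so the
hop count is the sum of the per-dimension ring distances. A minimal
sketch in Python, assuming a hypothetical 4x4 on-chip torus (no such
chip is described in the article; the names and sizes are illustrative
only):

```python
def ring_hops(a: int, b: int, n: int) -> int:
    """Fewest hops between positions a and b on a bidirectional ring of n stops."""
    d = (b - a) % n
    return min(d, n - d)


def torus_hops(src, dst, dims):
    """Fewest hops between two nodes of a torus.

    src, dst: coordinate tuples; dims: size of each dimension.
    Each dimension wraps around, so it is treated as its own ring.
    """
    return sum(ring_hops(s, d, n) for s, d, n in zip(src, dst, dims))


dims = (4, 4)  # hypothetical 4x4 torus of cores
nodes = [(x, y) for x in range(dims[0]) for y in range(dims[1])]
diameter = max(torus_hops((0, 0), node, dims) for node in nodes)
print(diameter)
```

In this made-up 4x4 case, 16 nodes are never more than 4 hops apart,
the same worst case as the 8-stop ring.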

"With 4 QPI links, 8 memory channels, 8 cores, 8 cache slices, 2
memory controllers, 2 cache agents, 2 home agents and a pony, this
chip is getting quite complex."

You bet!

Cheers,
-- 
Kilian


