[Beowulf] A look at the 100-core Tilera Gx
gerry.creager at tamu.edu
Wed Nov 4 06:03:28 PST 2009
I think it was the recent IEEE Spectrum, where they talk about using the
Tilera 100-core chips for HPC, tuned to a specific problem using FPGA
for optimizing the chips to the problem. The argument is to use a
lower-power system with huge numbers of cores and efficient on-chip
switching, to replace the common x86(_64) architecture we've come to
know and hate for its energy consumption and heat generation.
Personal take: conventional systems will win this battle (ask SiCortex:
<sigh> a great idea overwhelmed by investors who couldn't see its
longer-term benefits), but we just might see a shift to slower but
more efficient cores. Via Epia-10k comes to mind, as do the Atom and
several other variants. A somewhat slower switching fabric (gigabit), along
with some changes to the core thinking of integration designers, will be
required, but I think we could make that 20kcore Atom system using
gigabit work pretty well compared to a 4k core Nehalem with QDR.
The big thing is reworking our thinking: It costs a LOT (we've said this
all before) to create the power and cooling infrastructure for serious
HPC, and I'll posit now that "serious" requires at least 4k x86_64 cores
in today's logic. If the cost of powering and cooling all this stuff is
considered, it's a huge expense, but then, a lot of us are at academic
institutions, and don't have to consider infrastructure... or didn't
until recently. Example: I have no place to expand our HPC, since we've
maxed out power and cooling in the machine room we're currently in.
And, in the only reasonable space I can build out to expand into,
power's $90K and cooling another $100K to expand, allowing an additional
20 racks. Of x86_64 and QDR. In fact, while I'll gain 20 racks of
space, I'm not sure I can get 20 racks of cooling in place for that.
I'm reasonably sure I can power the stuff for the $90K figure and even
add sufficient generator capacity to keep critical elements running (cooling
at a reduced level; HPC generally has no requirement for running during a
power failure) until a clean shutdown or until power's restored.
I like what I've read on the Tilera. I think it's got some potential,
but I think it's time we consider taking our breed of HPC toward the
Maker side of things, and begin hacking minimalist motherboards,
adopting low-power devices, and generally reinventing the hardware stack as
we knew it.
Eugen Leitl wrote:
> A look at the 100-core Tilera Gx
> It's all about the network(s)
> by Charlie Demerjian
> October 29, 2009
> TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and
> while this is true, the real interesting technology is in the interconnects.
> The overall chip is quite a marvel, and it is unlike any mainstream CPU you
> have ever heard of.
> Making a lot of cores on a chip isn't very hard. Larrabee for example has 32
> Pentium (P54) cores, heavily modified, as the basis of the GPU. If Intel
> wanted to, it could put hundreds of cores on a die, that part is actually
> quite easy. Keeping those cores fed is the most important problem of modern
> chipmaking, and that part is not easy.
> Large caches, wide memory busses, ring busses on chip, stacking, and optical
> interfaces all are attempts to feed the beast. Everyone thought Intel's
> Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago,
> was about packing cores onto a die. It wasn't, it was a test of routing
> algorithms and structures. Routing is where the action is now, packing cores
> in is not a big deal.
> Routing is where Tilera shines. It has put a great deal of thought into
> getting data from core to core with minimal latency and problems. Its rather
> unique approach involves five different interconnect networks, programmable
> partitioning, accelerators, and simply tons of I/O. Together, these allow
> Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without
> choking on congestion. They may not have the same single-threaded performance
> of a Nehalem or Shanghai core, but they make up for it with volume.
> [Figure: Tilera 100 core chip diagram]
> The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10,
> each connected via five (5) on-chip networks, and flanked by some very
> interesting accelerators. The cores themselves are a proprietary 32-bit ISA
> in the first two generations of Tilera chips, and in the Gx, it is extended
> to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and
> the memory controller now sees 64 bits as well.
> In previous generations, there was no floating-point (FP) hardware in Tilera
> products. The company strongly recommended against using FP code because it
> had to be emulated taking hundreds or thousands of cycles. With the new Gx
> series chips, FP code is still frowned upon, but there is some FP hardware to
> catch the odd instruction without a huge speed hit. The 100 core part can do
> 50 GigaFLOPS of FP which may sound like a large number, but that is only
> about 1/50th of what an ATI Cypress HD5870 chip can do.
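The "1/50th" comparison above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the 50 GigaFLOPS figure is the article's, while the Cypress peak is my assumption (1600 stream processors, 2 FLOPs per cycle for a multiply-add, 850 MHz, single precision) and is not stated in the article.

```python
# Back-of-envelope check of the FP comparison quoted above.
# Assumption (not from the article): Cypress HD5870 single-precision peak
# is 1600 stream processors x 2 FLOPs (multiply-add) x 0.85 GHz.
TILE_GX100_GFLOPS = 50.0   # figure quoted in the article
TILE_GX100_CORES = 100

cypress_gflops = 1600 * 2 * 0.85
ratio = cypress_gflops / TILE_GX100_GFLOPS
per_core_gflops = TILE_GX100_GFLOPS / TILE_GX100_CORES

print(f"Cypress peak (assumed): {cypress_gflops:.0f} GFLOPS")
print(f"Cypress / Gx-100 ratio: {ratio:.1f}")
print(f"Gx-100 per core:        {per_core_gflops:.2f} GFLOPS")
```

Under that assumption the ratio comes out in the mid-50s, consistent with "about 1/50th", and each tile contributes roughly half a GigaFLOPS.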
> The majority of the new instructions are aimed at what the Tilera chips do
> best, integer calculations. Things like shuffle and DSP-like
> multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where
> these new chips shine. Basically, the Gx moves information around very
> quickly while twiddling bits here and there with integer functions.
> While the cores might not be overly complex, the on-chip busses are. Each Gx
> core has 64K of L1 cache, 32K data and 32K instruction, along with a unified
> 8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent,
> and the cache subsystem can reorder requests to other caches or DRAM. On top
> of this, the core supports cache pinning to keep often used data or
> instructions in cache. On the 100 core model, the Gx has 32MB of cache.
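The 32MB total above looks like the per-tile L1 and L2 sizes summed over 100 tiles. A quick check, assuming "K" means 1024 bytes and that no other cache is counted (neither is stated in the article):

```python
# Sum per-tile cache over the 100-core part, using the article's sizes.
# Assumption: 1 KB = 1024 bytes, and only L1 + L2 are counted.
KB = 1024
CORES = 100
L1_PER_CORE = 64 * KB    # 32K data + 32K instruction
L2_PER_CORE = 256 * KB   # unified L2

total_bytes = CORES * (L1_PER_CORE + L2_PER_CORE)
total_kb = total_bytes // KB
total_mib = total_bytes / (1024 * KB)

print(f"{total_kb} KB total, i.e. {total_mib} MiB")
```

That gives 32,000 KB, which is 31.25 MiB; the article's "32MB" is plausibly the same number rounded.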
> Tiles are the name Tilera uses for a basic unit of repetition. The 16
> core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core,
> the L1 and L2 caches, and something Tilera calls the Terabit Switch. More
> than anything, this switch is the heart of the chip.
> [Figure: a Tilera tile]
> Remember when we said that cramming 100 cores on a die is not a big problem,
> but feeding them is? The Terabit Switch is how Tilera solves the problem, and
> it is a rather unique solution. Instead of one off-core bus, there are five.
> Each of them has a dedicated purpose, and that not only gives huge bandwidth,
> it also goes a fair way towards minimizing contention. Cache traffic will
> never be stepped on by user data, and so on.
> The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two
> generations of Tilera chips, all of these networks were 32 bits wide, but on
> the Gx, the widths vary to give each one more or less bandwidth depending on
> their functions.
> QDN is called the reQuest Dynamic Network, and it is used for memory and
> cache. QDN is 64 bits wide. RDN is Response Dynamic Network, and it is used
> to feed memory reads back to the chips. RDN is 112 bits wide, an odd number,
> 64 + 48 from the look of it.
> FDN is the widest at 128 bits, and it is used for cache to cache transfers
> and cache coherency. Given the critical nature of cache transactions like
> this, the width is no surprise. The last two IDN and UDN are both 32 bits
> wide. IDN is I/O Dynamic Network, and passes data on and off the chip. With a
> dedicated channel for off-chip transfers, you can see that reaching
> theoretical numbers was a priority at Tilera.
> The last network UDN is for User Dynamic Network, basically the one users get
> to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping,
> they work in the background. If you want to send things from point A to point
> B, you send it across the UDN.
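To put the five widths side by side, here is a small sketch that tabulates them and estimates raw bits per second on one mesh link in one direction. The widths are the article's; the assumptions that every network moves one full-width word per cycle and that the clock is 1.5GHz (the top speed grade listed later in the article) are mine.

```python
# Widths in bits of the five Gx on-chip networks, as given in the article.
NETWORKS = {"QDN": 64, "RDN": 112, "FDN": 128, "IDN": 32, "UDN": 32}

total_bits = sum(NETWORKS.values())

def link_gbps(clock_ghz):
    """Raw Gbps over all five networks, one link, one direction.

    Assumption: each network transfers one full-width word per cycle.
    """
    return total_bits * clock_ghz

for name, width in NETWORKS.items():
    print(f"{name}: {width:3d} bits")
print(f"combined: {total_bits} bits/cycle, "
      f"{link_gbps(1.5):.0f} Gbps at 1.5GHz")
```

Only the 32-bit UDN share of that is directly available to user code; the rest is the housekeeping traffic described above.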
> Although Tilera didn't explicitly state it, each hop from router to router
> takes one cycle. This means that in a pathological case, corner core to
> memory on the far corner, it could take 19 cycles to go from request to
> memory, plus the memory round trip time, and then another 19 cycles to get
> back. That is what you call a long time in computer speak. Even in an
> 'average' case, you have a 10 cycle latency, which is very long as well.
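One reading that reproduces both the 19-cycle and the 10-cycle figures is sketched below. It assumes a 10x10 mesh, hop counts equal to Manhattan distance (as with dimension-ordered routing), one cycle per hop, a memory controller hanging off a corner tile, and one extra hop to leave the mesh. None of that is confirmed by the article beyond "one cycle per hop".

```python
from itertools import product

def hops(a, b):
    """Manhattan distance between two (row, col) tiles on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def corner_stats(n, extra_hop=1):
    """Worst and mean one-way cycles from any tile to a corner exit.

    Assumptions: n x n mesh, one cycle per hop, and extra_hop cycles
    to step off the mesh into the memory controller.
    """
    corner = (0, 0)
    dists = [hops(corner, tile) + extra_hop
             for tile in product(range(n), repeat=2)]
    return max(dists), sum(dists) / len(dists)

worst, mean = corner_stats(10)
print(worst, mean)  # prints: 19 10.0
```

The tile-to-tile worst case without the extra hop is 18, so the article's 19 presumably includes getting on or off the mesh.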
> To be fair, the Tilera architecture is not made to run general purpose code.
> As it was described when the first generation came out, workloads are meant
> to be chunked up, so a single tile does a function, then the data gets passed
> to the next tile for more work, and so on and so forth. If your program has
> 20 steps, you use 20 tiles and pipeline the work.
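The "one function per tile" idea can be sketched in software as a chain of stages, each standing in for a tile that does one job and hands its result to the next. The stage names below (decode, scale, tag) are made up for illustration, and Python runs the stages lazily in one thread rather than in parallel as real tiles would.

```python
def make_pipeline(stages):
    """Chain stage functions so each item flows through them in order."""
    def run(items):
        stream = iter(items)
        for stage in stages:
            # Each map() wraps the previous stream, like one tile
            # feeding the next.
            stream = map(stage, stream)
        return stream
    return run

# Hypothetical three-step workload; a 20-step program would list 20 stages.
def decode(x):
    return x + 1

def scale(x):
    return x * 2

def tag(x):
    return f"pkt-{x}"

run = make_pipeline([decode, scale, tag])
print(list(run([1, 2, 3])))  # prints: ['pkt-4', 'pkt-6', 'pkt-8']
```

On the real chip the point is that each stage stays put and only the data moves, so neighbouring stages are a hop or two apart.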
> This solves many of the problems with variable latency and multi-hop traffic.
> The other more elegant solution is the ability to section off chunks of the
> chip into sub-units. There is a hypervisor that can partition each Gx chip
> into programmable blocks.
> [Figure: sub-sections of tiles]
> As you can see in the diagram above, each Gx is broken up into sub-chips in
> software. You can give each process as much CPU power as it needs, and
> arrange it so the output of one block feeds into the input of the next in a
> single clock. This example has two Apache web server instances, an intrusion
> prevention system (IPS), a secure sockets layer (SSL) stack, a network stack
> and a few other processes running next to each other.
> The Apache instances have their own memory controller, as do the IPS and the
> SSL stack. The network stack is sitting on top of the memory controller for
> decreased latency. Basically, the programmer can choose where to put each
> process to minimize latency. It doesn't take much to figure out how to apply
> these concepts to a database plus web server scenario, or a three-tiered
> SAP-like workload.
> Basically, Tilera allows you to explicitly place the data and compute
> resources where, when and how you need them. The chunks are done at roughly
> the same level as hardware VMs are in x86 CPUs, running below the level that
> a process can affect. This creates hardware walls to segregate data
> transfers, cache coherency traffic, and other tile to tile transfers. If done
> correctly, it can minimize latency a lot in addition to keeping processes
> from stepping on each other.
> Now that you know how the cores work, talk, and are partitioned, what about
> the 'uncore'? Talk about that starts with the memory controllers - four
> DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core
> models. For the keen eyed out there, this means Tilera has two different
> socket configurations, one for the 64 and 100 core chips, and another one for
> the 16 and 36 core chips.
> DDR3-2133MHz memory is very fast, hugely fast in fact. The math says 17GBps
> per controller. Basically, this chip has a lot of available bandwidth. As you
> might imagine, on the 16 and 36 core variants, there are only half the
> controllers, so half the bandwidth.
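The 17GBps figure falls out of the transfer rate if each controller drives a standard 64-bit (8-byte) DDR3 channel. That channel width is my assumption, since the article does not give it.

```python
# Peak DDR3-2133 bandwidth per controller and per chip.
# Assumption: each controller drives one 64-bit (8-byte) channel.
TRANSFERS_PER_SEC = 2133e6
BYTES_PER_TRANSFER = 8

def peak_gbytes_per_sec(controllers):
    """Theoretical peak in GB/s (10^9 bytes) for a controller count."""
    return controllers * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9

for cores, controllers in ((100, 4), (64, 4), (36, 2), (16, 2)):
    print(f"{cores:3d} cores: {controllers} controllers, "
          f"{peak_gbytes_per_sec(controllers):.1f} GB/s peak")
```

So roughly 17 GB/s per controller, about 68 GB/s on the 64 and 100 core parts, and about 34 GB/s on the 16 and 36 core parts, all theoretical peaks.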
> In addition, you have a generic controller for USB, UARTs, JTAG and I2C
> controllers. Given that Tilera chips are basically embedded, these are not
> likely to be used for much more than booting and diagnostics.
> On the core diagram above, there are two other blocks, the orange MiCA and
> mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic'
> happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is
> short for multicore Programmable Intelligent Packet Engine. If it isn't
> blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O.
> The mPIPE does a lot of interesting things, all supposedly at wire speed. It
> has a programmable packet classification engine, said to be usable at 80Gbps
> or 120M packets per second. It can twiddle headers and do other evil things
> that would make Comcast drool with the potential for 'network management'
> extortion payments.
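The 80Gbps and 120M packets per second figures line up if the packet rate is quoted for minimum-size Ethernet frames. A sketch, assuming 64-byte frames plus the usual 20 bytes of per-frame wire overhead (7 bytes preamble, 1 byte start-of-frame delimiter, 12 bytes inter-frame gap); the article does not say which frame size Tilera used.

```python
# Packets per second at line rate for a given Ethernet frame size.
# Assumption: the 120M pps claim refers to minimum-size frames.
LINE_RATE_BPS = 80e9
WIRE_OVERHEAD_BYTES = 7 + 1 + 12   # preamble + SFD + inter-frame gap

def packets_per_sec(frame_bytes, line_rate_bps=LINE_RATE_BPS):
    """Maximum frames per second that fit on the wire."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire

pps_min = packets_per_sec(64)
print(f"64-byte frames:   {pps_min / 1e6:.1f} Mpps")
print(f"1518-byte frames: {packets_per_sec(1518) / 1e6:.1f} Mpps")
```

Minimum-size frames give about 119 Mpps, which rounds to the quoted 120M; with full-size frames the classification engine has far fewer packets to look at.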
> In addition, it can also load balance across the various I/O lanes, and
> redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of
> that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep
> latencies low. Think of it as a programmable housekeeping offload engine.
> The most interesting bit is that the mPIPE can tag a packet with a 32 bit
> header before it sends it onto the internal network. This is where the
> programmable part shines. You can set up fields in the I/O packet itself to
> pass along pre-decode information and other time-saving tidbits. Since I/O is
> fully virtualizable, you could theoretically tag the packets with VM data, or
> just about anything else a bored programmer can think of.
> The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto
> offload engines. They can work either 'inline' or as full blown offload
> engines, that is up to the programmer. The MiCA can pull data directly from
> caches or main memory without CPU overhead, basically fire and forget.
> If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi
> and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA,
> DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a
> true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be
> encrypted along with any other text that uses correct grammar. RLY.
> Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full
> duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the
> 100 core, 20K keys per second for the 36 core. Not bad at all. In addition,
> the MiCA supports a hardware compression engine that uses the tried and true
> Deflate algorithm.
> The last piece of the puzzle is something that Tilera calls external
> acceleration interfaces. This could be as simple as plugging in a PCIe card,
> but that lacks elegance. The interesting part is a field programmable gate
> array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the
> FPGA to the serial deserial unit (SerDes) to enable basically direct and low
> latency 32Gbps transfers. Direct transfers to cache and multiple contexts are
> supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip.
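The 32Gbps number matches eight PCIe lanes at second-generation signalling. This sketch assumes PCIe 2.0 (5 GT/s per lane with 8b/10b encoding), which the article does not state explicitly.

```python
# Usable bandwidth of a PCIe link, one direction.
# Assumption: PCIe 2.0 signalling, 5 GT/s per lane, 8b/10b line coding.
GT_PER_SEC_PER_LANE = 5.0
PAYLOAD_BITS, CODED_BITS = 8, 10

def payload_gbps(lanes):
    """Gbps left after line-coding overhead, before protocol overhead."""
    return lanes * GT_PER_SEC_PER_LANE * PAYLOAD_BITS / CODED_BITS

print(payload_gbps(8))  # prints: 32.0
```

Packet headers and flow control take a further cut, so the FPGA would see somewhat less than this in practice.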
> In the end, you have a monster chip for I/O and packet processing. It doesn't
> do single-threaded applications all that fast, but it really isn't meant to.
> The chip itself is not out yet, nor is there even silicon yet. The first
> version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core
> later in Q4 or possibly Q1 of 2011. These both share the same socket
> configuration and a 35*35mm package.
> In Q1 of 2011, the 100 core chip will come out on a new socket and in a
> 45*45mm package. A bit after that, the 64 core will hit the market. Power
> ranges from 10W for the 16 core to 55W for the 100 core, but you can get
> power optimized variants that will only suck 35W. Given the programmability
> of the parts, power use is likely more dependent on the programs running on
> The last bit of information is clock speeds. The 64 and 100 core models will
> come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much
> there is to synchronize and keep going. The 36 core models will come in
> 1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in
> 1.0GHz or 1.25GHz versions. Given the core count, internal interconnections,
> memory and I/O capabilities, Tilera will pack a lot of power into these small