[Beowulf] Rackable / SGI

Sat Apr 4 13:46:06 PDT 2009

Eugen Leitl wrote:
> On Fri, Apr 03, 2009 at 01:32:13PM -0700, Greg Lindahl wrote:
> 
>>> Will have to do with embedded memory or stacked 3d memory a la
>>> http://www.cc.gatech.edu/~loh/Papers/isca2008-3Ddram.pdf
>> We've been building bigger and bigger SMPs for a long time, making
>> changes to improve the memory system as needed. How is multicore any
> 
> Off-die memory bandwidth and latency are limited, so many codes
> start running into memory bottlenecks even at moderate number
> of cores (quad-cores seem to be a sweet spot).

Limited?  Seems pretty constant to me.  10 years ago 1 GB/sec per core/CPU was
available on the high end (like the i7 is today).  Over the last 3 doublings
of threads per socket, the memory bandwidth per thread/core has stayed in the
1 to 3 GB/sec range.  Today's i7 is just south of 3 GB/sec per thread (8
threads and 22 GB/sec per socket).
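
As a quick sanity check on that per-thread figure, here is a minimal sketch of
the arithmetic.  The function name is made up for illustration; the only
inputs are the i7 numbers quoted above (22 GB/sec across 8 threads).

```python
def bandwidth_per_thread(total_gb_per_sec, threads):
    """Return the memory bandwidth available per hardware thread, in GB/sec."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return total_gb_per_sec / threads


# i7 figures from the text: 22 GB/sec shared across 8 threads.
i7 = bandwidth_per_thread(22.0, 8)
print(f"{i7:.2f} GB/sec per thread")  # prints "2.75 GB/sec per thread"
```

That lands inside the 1 to 3 GB/sec range, just under the top of it.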

So because it is difficult to use more than 1-3 GB/sec with a single thread
(branch mispredictions, memory latency, and related issues), the market isn't
willing to pay for more bandwidth: the number of applications that benefit
shrinks as the bandwidth / thread ratio grows.

In other markets that have found value in more bandwidth, like GPUs,
consumers can buy, in qty 1, a video card with 1 GB of RAM, a GPU, a
motherboard, and video out, with a 160 GB/sec memory system, for $375.

So for the next few doublings in cores/threads per socket I'd expect CPUs to
follow the GPUs.  The low-hanging fruit would seem to be GDDR5, which has
double the bandwidth per pin.  Fortunately GPUs are leading the way; they
already have the next 3 doublings in bandwidth mapped out for us.  Certainly
at some point it will require the CPU and RAM to share the same die so that
you can have an 8 kbit wide memory bus.
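
To give a rough feel for what a bus that wide buys you, here is a small sketch
of the usual peak-bandwidth arithmetic (width times per-pin rate, divided by 8
bits per byte).  The per-pin rate below is purely an assumption picked for
round numbers, not a figure for any real part, and I am taking "8 kbit" to
mean 8192 bits.

```python
def peak_bandwidth_gb_per_sec(bus_width_bits, gbit_per_sec_per_pin):
    """Peak bandwidth in GB/sec for a bus of the given width and per-pin rate."""
    return bus_width_bits * gbit_per_sec_per_pin / 8.0  # 8 bits per byte


# Hypothetical per-pin data rate in Gbit/sec; an assumption, not from the text.
ASSUMED_PIN_RATE = 1.0

# "8 kbit wide" taken as 8 * 1024 = 8192 bits (also an assumption).
wide = peak_bandwidth_gb_per_sec(8 * 1024, ASSUMED_PIN_RATE)
print(f"{wide:.0f} GB/sec")  # prints "1024 GB/sec"
```

Even at that modest assumed pin rate, the width alone puts the peak several
times above the 160 GB/sec video card mentioned earlier.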