[Beowulf] Is there really a need for Exascale?

Eugen Leitl eugen at leitl.org
Fri Nov 30 12:25:24 PST 2012


On Fri, Nov 30, 2012 at 01:57:27PM -0500, Mark Hahn wrote:

> well, cuda exposes a quite different programming model;
> I'm not sure they can do much about the generational differences
> they expose.  (say, generational differences in what kind of atomic
> operations are supported by the hardware.)  many of the exposures
> are primarily tuning knobs (warp width, number of SMs, cache sizes,
> ratio of DP units per thread.)  a very high-level interface like OpenACC 
> doesn't expose that stuff - is that more what you're looking for?
> there's no denying that you get far less expressive power...

Absolutely. CUDA is a lot like assembler in that way, and assembler
has been almost completely displaced by low-level but hardware-independent
languages like C.

You can't tune as much in OpenCL, but on the other hand, you
don't have to. The achievable performance is lower, but more
uniform across diverse platforms. The JIT knows the hardware,
so that you don't have to. Parallelism is hard enough as
it is, so there's nothing wrong with a little set of training wheels.

>>> stacking is great, but not that much different from MCMs, is it?
>>
>> Real memory stacking a la TSV has smaller geometries, way more
>> wire density, lower power burn, and seems to boost memory bandwidth
>> by one order of magnitude
>
> sorry, do you have some reference for this?  what I'm reading is that  

Not really. I'm increasingly out of my depth here, and happy
to be able to learn from you. I have theoretical reasons to
believe that TSV is the next best thing to real 3d integration,
though it is already off-Moore, since it is assembled from discrete
components, which are themselves on-Moore (but Moore's law has
recently ended, anyway).

This is corroborated by ad hoc Googling, so I'm happy if
we'll eventually be able to get to that promised density of ~10^5
vias per die. There's really no other way to feed the kilocores
per die we'll be getting than a very wide bus to memory.

Eventually, if MRAM can be deposited directly on top of
logic, you'll effectively have a >10^9-wide bus to your memory.
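To put a rough number on what such a bus would mean, here is a naive
back-of-envelope sketch; the 1 GHz per-wire signalling rate is purely
an assumption, not something claimed above:

```python
# Naive bandwidth of a 10^9-wide memory bus.
# ASSUMPTION: every wire toggles at a modest 1 GHz; real on-die
# MRAM-over-logic signalling rates could differ substantially.

bus_width_bits = 10**9   # ">10^9 wide bus" from the post
clock_hz = 1e9           # assumed per-wire rate (hypothetical)

bits_per_s = bus_width_bits * clock_hz
bytes_per_s = bits_per_s / 8
print(f"{bytes_per_s / 1e15:.0f} PB/s")  # three orders beyond today's stacks
```

Even at this conservative clock, that is ~125 PB/s, roughly six orders
of magnitude beyond a 128 GByte/s stacked-DRAM part.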

> TSV and chip-on-chip stacking is fine, but not dramatically different
> from chip-bumps (possibly using TSV) connecting to interposer boards.

The A6 has about 8.5 GByte/s of memory bandwidth, while Micron has
demonstrated 128 GByte/s. The latter will already feed a reasonably
powerful GPU, so it should be more than enough for these ARM GPUs.
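A quick sanity check on those two figures; the bytes-per-flop number
below is an assumed requirement for a bandwidth-bound streaming kernel,
not a property of any particular GPU:

```python
# Back-of-envelope: how much compute can each memory system feed?

a6_bw_gbs = 8.5       # A6 memory bandwidth, GB/s (from the post)
micron_bw_gbs = 128.0 # Micron's demonstrated stacked-DRAM bandwidth, GB/s

bytes_per_flop = 0.5  # ASSUMED traffic of a streaming kernel (hypothetical)

for name, bw in [("A6", a6_bw_gbs), ("Micron stack", micron_bw_gbs)]:
    gflops = bw / bytes_per_flop
    print(f"{name}: {bw} GB/s sustains ~{gflops:.0f} GFLOP/s "
          f"at {bytes_per_flop} B/flop")

print(f"bandwidth ratio: {micron_bw_gbs / a6_bw_gbs:.1f}x")
```

At that assumed intensity, the stacked part sustains ~256 GFLOP/s of
streaming work, a ~15x step over the A6's package.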

> obviously, attaching chips to fine, tiny, low-impedance, wide-bus  
> interposers gives a lot of flexibility in designing packages.
>
>> http://nepp.nasa.gov/workshops/etw2012/talks/Tuesday/T08_Dillon_Through_Silicon_Via.pdf
>
> that's useful, thanks.  it's a bit high-end-centric - no offence, but  
> NASA and high-volume mass production are not entirely aligned ;)
>
> it paints 2.5d as quite ersatz, but I didn't see a strong data argument.
> sure, TSVs will operate on a finer pitch than solder bumps, but the 
> xilinx silicon interposer also seems very attractive.  do you actually 
> get significant power/speed benefits from pure chip-chip
> contacts versus an interposer?  I guess not: that the main win is  
> staying in-package.

I've seen ~um-scale polymer fibers which can easily link
adjacent dies spaced ~10 um apart at >5 TBit/s, so there's plenty
of air in the interconnect space still.

http://www.heise.de/newsticker/meldung/Optische-Chip-zu-Chip-Verbindung-mit-Polymerfasern-1715192.html

> it is interesting to think, though: if you can connect chips with  
> extremely wide links, does that change your architecture?  for instance,
> dram is structured as 2d array of bit cells that are read out into a 1d 
> slice (iirc, something like 8kbits).  cpu r/w requests are satisfied
> from within this slice faster since it's the readout from 2d that's  
> expensive.  but suppose a readout pumped all 8kb to the cpu - sort of a 
> cache line 16x longer than usual.  considering the proliferation
> of 128-512b-wide SIMD units, maybe this makes perfect sense.  this would  
> let you keep vector fetches from flushing all the non-vector stuff out
> of your normal short-line caches...
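The arithmetic behind that "16x longer" cache line in the quoted
paragraph is worth spelling out (the 64-byte line and 512-bit SIMD
width are conventional values, assumed here for illustration):

```python
# DRAM row-buffer arithmetic from the quoted paragraph.

row_bits = 8 * 1024          # one DRAM row readout: ~8 kbit (per the quote)
row_bytes = row_bits // 8    # = 1 KB
cache_line_bytes = 64        # conventional CPU cache line (assumed)

lines_per_row = row_bytes // cache_line_bytes
print(f"one row = {row_bytes} B = {lines_per_row} cache lines")

# Pumping the whole row to the CPU behaves like a cache line 16x longer
# than usual -- a decent match for wide SIMD, which consumes 16-64 bytes
# per vector load:
simd_bits = 512
print(f"a {simd_bits}-bit SIMD load consumes {simd_bits // 8} B per fetch")
```

So a full-row transfer is exactly 16 conventional lines, or 16 maximal
512-bit vector fetches, which is where the quoted proposal gets its
appeal.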

A long time ago, when I was young and even more stupid than I am
today, I wrote http://www.enlight.ru/docs/arch/uliw.txt 
which I think has aged quite well.
