[Beowulf] Is there really a need for Exascale?

Fri Nov 30 10:57:27 PST 2012

>> absent (along with AMD :( ).  it's not clear how generally applicable
>> Cuda's SIMT programming model is.  and having it as a separate ISA
>> (versus traditional cores) is a problem, complexity-wise.
>
> In general CUDA seems to hide the hardware poorly, so
> it probably doomed long-term.

well, cuda exposes a quite different programming model;
I'm not sure they can do much about the generational differences
they expose.  (say, generational differences in what kind of atomic
operations are supported by the hardware.)  many of the exposures
are primarily tuning knobs (warp width, number of SMs, cache sizes,
ratio of DP units per thread.)  a very high-level interface like 
OpenACC doesn't expose that stuff - is that more what you're looking for?
there's no denying that you get far less expressive power that way...

>> stacking is great, but not that much different from MCMs, is it?
>
> Real memory stacking a la TSV has smaller geometries, way more
> wire density, lower power burn, and seems to boost memory bandwidth
> by one order of magnitude

sorry, do you have some reference for this?  what I'm reading is that 
TSV and chip-on-chip stacking is fine, but not dramatically different
from chip-bumps (possibly using TSV) connecting to interposer boards.
obviously, attaching chips to fine, tiny, low-impedance, wide-bus 
interposers gives a lot of flexibility in designing packages.

> http://nepp.nasa.gov/workshops/etw2012/talks/Tuesday/T08_Dillon_Through_Silicon_Via.pdf

that's useful, thanks.  it's a bit high-end-centric - no offence, but 
NASA and high-volume mass production are not entirely aligned ;)

it paints 2.5d as quite ersatz, but I didn't see a strong data argument.
sure, TSVs will operate on a finer pitch than solder bumps, but 
the xilinx silicon interposer also seems very attractive.  do you 
actually get significant power/speed benefits from pure chip-chip
contacts versus an interposer?  I'd guess not, and that the main win is 
staying in-package.

it is interesting to think, though: if you can connect chips with 
extremely wide links, does that change your architecture?  for instance,
dram is structured as 2d array of bit cells that are read out into a 
1d slice (iirc, something like 8kbits).  cpu r/w requests are satisfied
from within this slice faster since it's the readout from 2d that's 
expensive.  but suppose a readout pumped all 8kbits to the cpu - sort of
a cache line 16x longer than the usual 64 bytes.  considering the proliferation
of 128-512 bit wide SIMD units, maybe this makes perfect sense.  this would 
let you keep vector fetches from flushing all the non-vector stuff out
of your normal short-line caches...
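
to put rough numbers on that, here is a minimal python sketch.  the 8kbit
row and the 128-512 bit SIMD widths are the figures above; the 64-byte
line, the 64-line LRU cache, the 32-line scalar working set and the
1024-line vector stream are all assumptions picked purely for illustration,
and the names (lines_per_row, vectors_per_row, LRUCache, scalar_hits) are
hypothetical.  it's a toy model, not a claim about any real memory hierarchy.

```python
from collections import OrderedDict

ROW_BITS = 8 * 1024             # dram row ("1d slice"), ~8 kbits as in the post
LINE_BITS = 64 * 8              # assumption: conventional 64-byte cache line
SIMD_WIDTHS = (128, 256, 512)   # SIMD widths in bits, as in the post


def lines_per_row(row_bits=ROW_BITS, line_bits=LINE_BITS):
    """How many conventional cache lines fit in one dram row."""
    return row_bits // line_bits


def vectors_per_row(row_bits=ROW_BITS, widths=SIMD_WIDTHS):
    """How many SIMD registers of each width one row readout would fill."""
    return {w: row_bits // w for w in widths}


class LRUCache:
    """Toy fully-associative LRU cache that tracks line tags only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, tag):
        """Touch a line; return True on a hit, False on a miss."""
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)          # mark most recently used
        else:
            self.lines[tag] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)   # evict least recently used
        return hit


def scalar_hits(stream_through_cache, cache_lines=64, hot_lines=32,
                stream_lines=1024, rounds=4):
    """Count hits on a small scalar working set, interleaved with a vector stream.

    If stream_through_cache is False, the stream is assumed to bypass the
    short-line cache entirely (e.g. delivered as whole rows on a wide link).
    """
    cache = LRUCache(cache_lines)
    hits = 0
    next_stream = 0
    for _ in range(rounds):
        for i in range(hot_lines):
            hits += cache.access(("scalar", i))
        if stream_through_cache:
            for _ in range(stream_lines):
                cache.access(("vector", next_stream))
                next_stream += 1
    return hits


if __name__ == "__main__":
    print("cache lines per row:", lines_per_row())
    print("SIMD vectors per row:", vectors_per_row())
    print("scalar hits, stream shares the cache: ", scalar_hits(True))
    print("scalar hits, stream bypasses the cache:", scalar_hits(False))
```

in this toy setup the streamed data evicts the whole scalar working set
every round when it shares the cache, while the bypass case only takes
the initial cold misses.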