[Beowulf] core diameter is not really a limit

Tue Jun 18 04:10:32 PDT 2013

On Mon, Jun 17, 2013 at 05:49:25PM -0400, Mark Hahn wrote:
> On Mon, 17 Jun 2013, Eugen Leitl wrote:
> <interesting dreams of nanocomputers elided - Charles Stross's novels are 
> entertaining extrapolations based on this kind of thing...>

Well, molecular circuitry is a lot closer now than in the
1970s, especially now that Moore's law has run 
into cost limits (and soon physical limits) and the 
only way forward is to go up -- into the third dimension, 
first as multilayer, and then as real autoassembled 
crystal from virus capsid-sized components.

Actually, Moore's cost limits will bite especially for
embedded and also exascale, given that it becomes a
matter not just of power dissipation, but also of
cost per unit, if you have millions and billions of
SoC nodes.

> >> everywhere.  OTOH, the idea of putting processors into memory has always
> >> made a lot of sense to me, though it certainly changes the programming
> >> model.  (even in OO, functional models, there is a "self" in the program...)
> >
> > Memory has been creeping into the CPU for some time. Parallella e.g.
> > has embedded memory in the DSP cores on-die.
> 
> well, they have a tiny bit of memory per core - essentially software-managed,
> globally-addressed cache-per-core.  it shouldn't make you sit up and go

It doesn't make me sit up, I've been waiting for that kind of thing for
many years. It's still nice that it happens, even if the designers still think
it's a gamble.

I don't think it's a gamble long-term: silicon real estate is limited,
and we *will* have to move the data and instructions to where the
actual processing happens.

By eliminating coherent cache you get a memory-mapped space
that sits somewhere between register and cache in access latency, and
can have considerable native bus and ALU width since it is on-die.

You don't have quite this leverage with TSV stacked memories.

The lack of hardware management is a plus, actually, since the
OS knows at all times what's up, and what needs to be done.
There needs to be awareness at the compiler level, as few
compilers assume several k or M of registers, especially really wide ones.
Looks like a good fit for OpenCL here.

> "Hmm!".  I think it's more interesting to ponder the fact that there have 
> always been some small experiments with putting (highly data-parallel)
> processing onto the dram chip itself.  I mean, dram is fundamental: chips 

Yes, I'm aware, and I think it's an intermediate stage on the way to real
cellular hardware.

> will be planar for a long time, therefore density demands a 2D storage array.
> so a row decoder will read out a few Kb.  why not perform some data-parallel
> operations row-wise, on the dram chip itself: you've got the row there anyway.

You can move at least some wide-bus non-float (integer vector) processing
into DRAM at very little additional cost. Integrated corrective
integrity checking would be nice, as would built-in BitBlt-like processing,
parallel searches, distributed GC, and some GPU-like processing. All
these things are doable, assuming the processing model and standard APIs
follow.
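
To make the "parallel searches" case concrete, here is a rough sketch in
Python of what a row-wise search could look like from the software side.
It is a toy model, not a description of any shipping part: the row size,
the word size, and the function names row_search and array_search are all
assumptions made for illustration.

```python
# Toy model of row-wise search inside a DRAM array.
# ASSUMPTIONS (not from any real device): 1 KiB rows, 4-byte words,
# and that the row logic can compare every word of the open row to a key.
ROW_BYTES = 1024
WORD_BYTES = 4


def row_search(row, key):
    """Return the word offsets in one open row whose value equals key."""
    needle = key.to_bytes(WORD_BYTES, "little")
    return [
        i // WORD_BYTES
        for i in range(0, len(row), WORD_BYTES)
        if bytes(row[i:i + WORD_BYTES]) == needle
    ]


def array_search(rows, key):
    """Open each row in turn and collect (row, word) hits.

    In the imagined hardware only these hits would cross the memory bus,
    not the full contents of every row.
    """
    return [(r, w) for r, row in enumerate(rows) for w in row_search(row, key)]


def demo():
    rows = [bytearray(ROW_BYTES) for _ in range(4)]
    rows[2][8:12] = (0xDEADBEEF).to_bytes(WORD_BYTES, "little")
    return array_search(rows, 0xDEADBEEF)
```

The point of the sketch is the data movement: the comparison touches a
whole row at a time, while the result is a short list of coordinates.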

> > Hybrid memory cube is
> > about putting memory on top of your CPU.
> 
> this is just a slight power optimization: drive shorter wires.
> I'm looking forward to 2.5D integration, but it's evolutionary...

Technically, current CPUs are already multilayer enough that
they almost qualify for 2.5D; that's the reason it's hard to make
really brilliant memories. 

> > is mixing memory/CPU, even though that is currently problematic in
> > the current fabrication processes.
> 
> I'm not sure how much blame can be attributed to the nature of processes
> specialized to cpu vs dram.  at one time this was obvious: cpus on fast but 
> high-leakage process being almost the perfect opposite of low-leakage dram.

I understand there are still considerable, and growing, process complexity
differences between DRAM and CPU production.

> but leakage has been a cpu issue for a long time now.  there even appears
> to be some interesting convergence, with 3d/finfet transistor tech being 
> used for dram arrays.  my guess is that preferences for say, doping levels

So they're increasing complexity in DRAM as well, due to space 
constraints. Interesting. I wonder how complex the APU processes
are getting.

> or oxide thickness do *not* form permanently conflicting fab constraints.

I would really like to see an MRAM/CPU hybrid, with fully reconfigurable
logic, even at runtime.

> > The next step is something like
> > a cellular FPGA,
> 
> yeah, no.  I don't actually think things will go in that direction, at least 
> not for a long time, mainstream-wise.  but will we see systems that look like 
> big grids of dimm-like pieces?  yes: processor-in-memory, not merely memory
> organs supporting a distant, separate processor "brain"...

We've got FPGA with attached ARM cores in SoCs already (Parallella again),
but we still haven't got smart memories shipping. The mainstream at times
takes decades to follow up on a promising path.

> in some sense, the real question is how much of your system state is active
> at any time.  computers are traditionally based on the assumption that most 
> data is passively stored most of the time, and that we occasionally take out
> some bits, mutate them, possibly store new versions.  Eugen is talking about

I think that model will become less important in future, simply because
you can no longer grow your number of switches at will as was possible
before, so you need a larger fraction of them cranking in order to make
faster systems. Ideally, all switches can flip the next instant, which is
where you've arrived in the crystalline hardware model. However, that would
probably be power-dissipation prohibitive in CMOS, so we have to wait for
spintronics for that (which doesn't burn energy until you need a bit
flipped, and it's a great long way to the Landauer limit yet).
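
For a sense of scale, the Landauer limit is k*T*ln(2) per erased bit. The
short Python sketch below computes it and compares it with a switching
energy. The 1 fJ figure used in the usage note is a hypothetical round
number chosen for illustration, not a measured value for any process, and
the function names are mine.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K, exact by SI definition


def landauer_limit_joules(temperature_kelvin=300.0):
    """Minimum energy to erase one bit at the given temperature."""
    return K_BOLTZMANN * temperature_kelvin * math.log(2)


def headroom(switch_energy_joules, temperature_kelvin=300.0):
    """How many times above the Landauer limit a given switching energy is."""
    return switch_energy_joules / landauer_limit_joules(temperature_kelvin)
```

At 300 K the limit comes out near 2.9e-21 J per bit, so an assumed 1 fJ
per switching event, headroom(1e-15), would sit roughly five orders of
magnitude above it.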

> more of a stream-processing model, where there is limited passive state - 
> ie, other than the state interlock between pipeline/cellular stages.  I think
> we'll continue to have lots of passive, non-dynamic state, so our
> architectures will still be based on random access to big arrays.
> (dram, disk, flash, whatever.)

I hope disks will die. I still wonder why we're not getting
cheap PCIe flash memory directly mmapped into the address space.
That SATA/SAS thing is no longer helping us there.
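
What "directly mmapped" means from the programmer's side can be sketched
with Python's mmap module. Here an ordinary file stands in for the flash
device, which is an assumption made for the sake of a runnable example;
nothing below is specific to PCIe flash, and mmap_roundtrip is a
hypothetical helper name.

```python
import mmap


def mmap_roundtrip(path, size=4096):
    """Write through a memory mapping, then read the bytes back from the file.

    ASSUMPTION: a regular file stands in for a flash device.
    """
    # Create a backing file of the requested size (an empty file cannot be mapped).
    with open(path, "wb") as f:
        f.truncate(size)

    # Map it and store bytes with plain slice assignment: no read()/write() calls.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as m:
            m[0:5] = b"hello"
            m.flush()  # push dirty pages back to the backing store

    # Confirm the store reached the file.
    with open(path, "rb") as f:
        return f.read(5)
```

The store is an ordinary memory write; persistence is a matter of flushing
pages rather than issuing block I/O commands through a disk protocol.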