FYI: superlinear speedups in GROMACS (fwd)
Eugene.Leitl at lrz.uni-muenchen.de
Sat Mar 9 01:29:04 PST 2002
On Fri, 8 Mar 2002, W Bauske wrote:
> That statement makes me curious. Do you mean embedded memory on chip or
> what? If it's on chip, how is it any better than cache? If not on chip,
> elaborate please on what you're describing.
This is off-topic, but with on-die memory, even DRAM has cache
characteristics, without the overhead. The idea is to put the CPU into
your memory, not to bring memory to where your CPU is. Ideally there
would be no off-die memory at all. The CPU would have to be stripped
down and modified (e.g., segmented) to profit from the symmetries
available on the die (e.g., the ability to directly address and
manipulate kBit-wide words, and to do SIMD on very long registers).
You'd have to interconnect the dies with a fast serial bus, running a
packet-switched protocol in hardware. Given short distances and small
geometries (in extremis, on-wafer) you could achieve message latencies
comparable to what it currently takes to address a word in memory.
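To make the "SIMD on very long registers" idea concrete, here is a small
sketch in Python, whose unbounded integers can stand in for a kBit-wide
word. It uses the classic SWAR (SIMD-within-a-register) trick to add 128
byte-wide lanes with a handful of wide bitwise operations; all names and
the 1024-bit word size are illustrative choices, not anything from a real
embedded-memory part.

```python
# SWAR sketch: one 1024-bit "register" holding 128 independent 8-bit
# lanes, added lane-wise in a constant number of wide operations.
LANES = 128
WORD_BITS = LANES * 8

# Mask with the high bit of every 8-bit lane set (0x80 repeated),
# and its complement (the low 7 bits of every lane).
HI = int.from_bytes(b"\x80" * LANES, "big")
LO_MASK = ~HI & ((1 << WORD_BITS) - 1)

def pack(values):
    """Pack a list of LANES byte values (0..255) into one wide word."""
    return int.from_bytes(bytes(values), "big")

def unpack(word):
    """Split a wide word back into its LANES byte values."""
    return list(word.to_bytes(LANES, "big"))

def swar_add(a, b):
    """Lane-wise 8-bit add, modulo 256, with no carry leaking between
    lanes: add the low 7 bits of every lane, then fold the high bits
    back in with XOR."""
    low = (a & LO_MASK) + (b & LO_MASK)
    return low ^ ((a ^ b) & HI)

xs = pack([i % 256 for i in range(LANES)])
ys = pack([3] * LANES)
assert unpack(swar_add(xs, ys)) == [(i % 256 + 3) % 256 for i in range(LANES)]
```

The same pattern scales to any lane width; on the hypothetical hardware
above, one such instruction would touch an entire memory row at once.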
Infineon is doing R&D into embedded-memory processors (at least, according
to the job offers they used to post before they were slashed by the
current hardware slump), but I think currently only IBM is attempting to
build a high-performance architecture around them (Blue Gene). However,
embedded-memory processors are intrinsically unsuitable for good float
performance, as the whole CPU plus router would have to fit into the
silicon resources currently occupied by a single float ALU. But they're
very effective for parallel operations on arrays of short-to-long
integers and sequence operations (bioinformatics comes to mind, also
cryptography, lattice gas stuff, embarrassingly parallel stuff,
simulation, etc.).
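A concrete example of the integer/sequence workloads mentioned above is
bit-parallel string matching, a staple of bioinformatics. The Shift-And
algorithm (Baeza-Yates/Gonnet) turns an m-symbol pattern into an m-bit
state word and advances all m partial matches per text symbol with one
shift, one AND and one OR; on a machine with kBit registers, m could be
enormous. This is a generic textbook sketch, not code for any particular
embedded-memory processor.

```python
# Shift-And exact matching: each bit i of `state` says "pattern[0..i]
# matches the text ending here", so the whole search front advances
# with a few wide bitwise operations per text symbol.
def shift_and(text, pattern):
    """Return the 0-based positions where pattern occurs in text."""
    m = len(pattern)
    # Per-symbol masks: bit i is set iff pattern[i] == symbol.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    hits, state, goal = [], 0, 1 << (m - 1)
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & goal:                # pattern complete at position j
            hits.append(j - m + 1)
    return hits

assert shift_and("GATTACAGATTACA", "TTACA") == [2, 9]
```

The per-symbol masks are built once; the scan itself is exactly the kind
of long-register bitwise streaming an in-memory processor would do well.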
A big problem with purely embedded processors is that the memory grain
size is only a few MBytes, for yield reasons. Pure Linux has too much
redundancy for this, but of course you could use L4/Fiasco-like
nanokernels on such architectures, adding a Linux wrapper where necessary.
> I'd like to see a P4 with a GB or so of memory all on the chip. Would
> make an interesting node for what I do.
Think rather 100 nodes with 32 MBytes each, in a desktop box.
The reason this is not being done is that it's a high risk venture, as
there will be very little software for it, unless it behaves as a vanilla
cluster (even so, mainstream hasn't discovered clustering yet).