[Beowulf] Is there really a need for Exascale?

Mark Hahn hahn at mcmaster.ca
Thu Nov 29 22:14:58 PST 2012

>>> At some point, light speed becomes the limiting factor, and for that,
>>> reducing physical size is important.
>> we're quite a way away from that.  I don't see a lot of pressure to
>> improve fabrics below 1 us latency (or so), ie, 1000 light-feet.
> I disagree. On-die/wafer meshes should not add more than 1-5 ns
> at each hop. Sending a message should take on the order of a
> memory access. Off-die will be slower, but not much slower.

I think we're talking about different things.  I'm simply saying that 
inter-node (offboard) interconnect tends to be O(1us) and doesn't 
seem to be showing a lot of pressure to get smaller.  and what I was
trying to get at is that there's no reason to expect inter-node,
intra-node and intra-package fabrics to have the same interface.
specifically, people like shared memory, at least as long as it 
doesn't sap a lot of performance.  there's no reason to think SHM 
would be terrible on a chip that uses an on-chip mesh - cache 
coherency scales well there, but is latency-sensitive, which makes 
it hard to do well over inter-chip and inter-node fabrics.  not to 
mention that off-chip fabrics run hot if you try to drive them 
low-latency.
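to put numbers on the "1000 light-feet" figure above, a quick 
back-of-envelope sketch (the constant is vacuum light speed; real 
copper/fiber propagates at ~0.6-0.7c, so the actual distance budget 
is tighter):

```python
# Back-of-envelope check of the "1000 light-feet" figure.
# Assumes vacuum light speed (~0.984 ft/ns); real media are slower.
C_FT_PER_NS = 0.983571  # speed of light, feet per nanosecond

def light_feet(latency_ns):
    """Distance light covers in latency_ns nanoseconds, in feet."""
    return latency_ns * C_FT_PER_NS

print(light_feet(1000))  # 1 us fabric latency -> ~984 feet
print(light_feet(5))     # 5 ns on-die hop     -> ~5 feet
```

so a 1 us fabric is nowhere near light-speed-limited at rack scale, 
while a 1-5 ns on-die hop very much is.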

> This is the only way to make megacore and gigacore work.

yes, meshes scale inter-node, but I don't see any reason to expect
SHM to be dropped as the on-chip and inter-chip fabric.
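for scale, a rough sketch of worst-case hop counts on a square 2D 
mesh, using the 1-5 ns/hop figure quoted above (the square layout 
and per-hop costs are assumptions for illustration, not anyone's 
real chip):

```python
# Worst-case (corner-to-corner) Manhattan distance on an n x n mesh
# is 2*(n-1) hops; per-hop cost taken from the quoted 1-5 ns range.
import math

def worst_case_ns(cores, ns_per_hop):
    n = math.isqrt(cores)            # assume a square n x n mesh
    return 2 * (n - 1) * ns_per_hop

print(worst_case_ns(1_000_000, 5))   # "megacore": 9990 ns corner to corner
print(worst_case_ns(1024, 2))        # 1K cores:   124 ns
```

which shows why a flat mesh alone doesn't make megacore latency 
small - diameter grows with sqrt(cores).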

> Lightweight message passing directly in hardware, with

not to go all dualistic, but a store is a lightweight send...
which aspects of store would you generalize to get to something 
more conventionally MP-like?  variable-length (vs cacheline-sized)?
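to make that concrete, here's a toy software model of "store as 
send" - one cacheline-sized payload plus a valid flag the receiver 
polls.  everything here (the class, the flag) is hypothetical 
illustration, not real hardware:

```python
# Toy model of "store is a lightweight send": the producer stores a
# cacheline-sized payload into memory the receiver polls.  On real
# hardware the buffer would be remote-mapped; the flag would be a
# flag word or a line-state transition.  All names are made up.
CACHELINE = 64  # bytes; the fixed "message size" a plain store moves

class StoreChannel:
    def __init__(self):
        self.line = bytearray(CACHELINE)
        self.valid = False                   # "message arrived" flag

    def send(self, payload: bytes):
        assert len(payload) <= CACHELINE, "one store moves one line"
        self.line[:len(payload)] = payload   # the "store"
        self.valid = True                    # release: make it visible

    def recv(self) -> bytes:
        while not self.valid:                # receiver spins on the flag
            pass
        self.valid = False
        return bytes(self.line)

ch = StoreChannel()
ch.send(b"hello")
print(ch.recv()[:5])  # b'hello'
```

the fixed line size is exactly the limitation in question: going 
variable-length is one obvious generalization.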

>>> Consumer gear is heading smaller, in
>>> general (viz PC mobos getting smaller over the years),
>> mainly due to integration, not anything else.  intel put cache onchip
>> because it made performance sense, not because it freed up a few
>> sq inches of motherboard.
> I've been waiting for cache to die and be substituted by
> on-die SRAM or MRAM. Yet to happen, but if it happens,
> it will be with embedded-like systems.

I'm not so sure.  caches are a very effective use of CAM;
throwing away CAM seems like a radical step.  I'm also not sure 
there's that much power savings: any memory array needs fast 
decoders; a cache adds only a tag array and comparators, plus the
cost of extra ways.  with a ~64b tag and 64B line, the space
overhead is modest.
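the arithmetic behind "modest", using the figures above:

```python
# Per-line metadata relative to the data it tracks,
# using the ~64b tag and 64B line from the text.
TAG_BITS = 64      # ~64 bits of tag + state per line (generous)
LINE_BYTES = 64    # 64B cacheline

overhead = (TAG_BITS / 8) / LINE_BYTES
print(f"{overhead:.1%}")   # 12.5% metadata per line
```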

perhaps I'm wrong, but I think people sometimes ignore that caches 
are a powerful way to add semantics to memory.  for instance, it's 
easy to imagine a Parallella-like chip that has per-core memory 
mapped into a global address space (alongside dram and other dies).
a remote fetch could bring back a data packet containing tags stating
that the line is write-through, or has a particular lease,
or should go into L3 instead of L1, or that upon being evicted
from L1, it goes to L3 instead of all the way back to the home node.
since fetch is dual to receive, how about an ISA that lets you 
set up pending fetches that, when completed, do something like a 
thread fork?
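in software terms, that hypothetical primitive might look something 
like this sketch - issue_fetch, its latency, and the address are all 
made up, and a thread-pool callback stands in for the hardware fork:

```python
# Software sketch of "pending fetches that fork a thread on
# completion".  The hypothetical ISA primitive is modeled with a
# thread pool: issue_fetch() returns immediately, and on_complete
# runs like a forked thread once the (simulated) remote line arrives.
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor()

def issue_fetch(addr, on_complete):
    """Start an async 'fetch'; run on_complete when the line arrives."""
    def fetch():
        time.sleep(0.001)          # stand-in for remote-memory latency
        return f"line@{addr:#x}"   # stand-in for the returned cacheline
    fut = pool.submit(fetch)
    fut.add_done_callback(lambda f: on_complete(f.result()))
    return fut

results = []
f = issue_fetch(0x1000, results.append)
f.result()                         # block so the demo can print
time.sleep(0.01)                   # let the callback finish
print(results)                     # ['line@0x1000']
```

the point of doing it in hardware, of course, would be to skip all 
of this software machinery.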
