[Beowulf] ARM cpu's and development boards and research

Mark Hahn hahn at mcmaster.ca
Tue Nov 27 23:17:37 PST 2012


> What Bill has just described is known as an "Amdahl-balanced system",
> and is the design philosophy behind the IBM Blue Genes and also
> SiCortex. In my opinion, this is the future of HPC. Use lower power,
> slower processors, and then try to improve network performance to reduce
> the cost of scaling out.

"small pieces tightly connected", maybe.  these machines offer very nice
power-performance for those applications that can scale efficiently to 
say, tens of thousands of cores.  (one rack of BGQ is 32k cores.)

we sometimes talk about "embarrassingly parallel" - meaning a workload
with significant per-core computation requiring almost no communication. 
but if you have an app that scales to 50k cores, you must have a very, 
very small serial portion (in Amdahl's-law terms).  obviously, 
they put that 5d torus in a BGQ for a reason, 
not just to permit fast launch of EP jobs.
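
to put rough numbers on that, here's a back-of-envelope Amdahl calculation
(python, with made-up serial fractions - purely illustrative):

    # back-of-envelope Amdahl's law: speedup on n cores with serial fraction s
    def speedup(s, n):
        return 1.0 / (s + (1.0 - s) / n)

    for s in (1e-2, 1e-3, 1e-4):
        print("serial fraction %.4f -> speedup on 50k cores: %.0f (ideal 50000)"
              % (s, speedup(s, 50000)))
    # even a 0.01% serial fraction tops out around 8300x, i.e. ~17%
    # efficiency on 50k cores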

I don't think either Gb or IB is a good match for the many/little
approach being discussed.  SiCortex was pretty focused on providing 
an appropriate network, though the buying public didn't seem to 
appreciate the nuance.

IB doesn't seem like a great match for many/little: a lot of cores 
will have to share an interface to amortize the cost.  do you provide 
a separate intra-node fabric, or rely on cache-coherence within a node?
Gb is obviously a lot cheaper, but at least as normally operated is 
a non-starter latency-wise.  (and it's important to realize that latency
becomes even more important as you scale up the node count, giving each
node less work to do...)
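
to illustrate (numbers invented, just to show the shape of the problem):

    # toy strong-scaling model with invented numbers: fixed total work per
    # timestep, spread over more nodes, against a fixed per-message latency
    work_per_step   = 1e10    # flops per timestep (assumed)
    node_flops      = 1e10    # sustained flops per node (assumed)
    gbe_latency_us  = 30.0    # rough GbE round trip; IB is more like 1-2 us

    for nodes in (1000, 10000, 50000):
        compute_us = work_per_step / (nodes * node_flops) * 1e6
        print("%5d nodes: %7.1f us compute/step vs %4.1f us per GbE message"
              % (nodes, compute_us, gbe_latency_us))
    # by 50k nodes the compute slice per step is smaller than one GbE
    # round trip, so latency, not bandwidth, is what stops the scaling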

> Essentially, you want the processors to be
> *just* fast enough to keep ahead of the networking and memory, but no
> faster to optimize energy savings.

interconnect is the sticking point.

I strongly suspect that memory is going to become a non-issue.  shock!
from where I sit, memory-per-core has been fairly stable for years now
(for convenience, let's say 1GB/core), and I really think dram is going
to get stacked or package-integrated very soon.  suppose your building
block is 4 fast cores, 256 "SIMT" gpu-like cores, and 4GB of very wide dram?
if you dedicated all your pins to power and links to 4 neighbors, your
basic board design could just tile a bunch of these.  say 8x8 chips on 
a 1U system.
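
quick totals for that hypothetical tile (all the per-chip numbers above
are assumptions, of course):

    # totals for the hypothetical tile above: per chip, 4 fast cores,
    # 256 SIMT cores and 4 GB of stacked dram; 8x8 chips per 1U board
    chips = 8 * 8
    print("per 1U: %d fast cores, %d SIMT cores, %d GB dram"
          % (4 * chips, 256 * chips, 4 * chips))
    # -> 256 fast cores, 16384 SIMT cores, 256 GB: still ~1 GB per fast core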

> The Blue Genes do this incredibly well, so did SiCortex, and Seamicro
> appears to be doing this really well, too, based on all the press
> they've been getting.

has anyone seen anything useful/concrete about the next-gen system
interconnect fabrics everyone is working on?  latency, bandwidth,
message-throughput, topology?

> With the DARPA Exascale report saying we can't get
> to Exascale with current power consumption profiles, you can bet this
> will be a hot area of research over the next few years.

as heretical as it sounds, I have to ask: where is the need for exaflop?
I'm a bit skeptical about the import of the extreme high end of HPC - 
or to put it another way, I think much of the real action is in jobs
that are only a few teraflops in size.  that's O(1000) cores, but you'd
size a cluster in the 10-100 Tf range...
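
a quick sanity check on that sizing, assuming a 2012-ish core sustains
a few gflops (an assumption, not a measurement):

    # sanity check on "a few Tf is O(1000) cores"
    gflops_per_core = 4.0    # assumed sustained rate per core
    for tf in (2, 5, 10, 100):
        print("%4d Tf -> roughly %6.0f cores" % (tf, tf * 1000 / gflops_per_core))
    # a few Tf is indeed O(1000) cores; a 10-100 Tf cluster is a few
    # thousand to a few tens of thousands - nowhere near exaflop territory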


