[Beowulf] commercial clusters

Fri Sep 29 10:49:48 PDT 2006

> very well taken.  There are an enormous number of people who could use "big 
> computation" if it were "easy to use" and "cheap enough".  $10K is a

maybe.  to me, if dell started selling $10k windows-cluster-in-a-box
that was really at the windows-drooler level, it would be a huge shame.
vast amounts of truly crappy jobs would be run, and vast amounts of 
cycles would be burned on the screen-saver.  I'm not arguing against
either the dumbing down into compute-appliances (where appropriate),
or against people spending their money as they see fit.  I just think
there is massive value in centralized compute facilities because users
can get programming help, professionally managed hardware, cheaper cycles 
and efficient interleaving of multiple users' bursts of demand.

admittedly, that is precisely the model I've spent my last 5 years on,
but I think it still makes sense.  it's not the only model, but it has 
some real advantages over others such as the one mentioned above, or 
the grid fantasy (fungible computation too cheap to meter).

but one of the other points in this thread was the current fad of
multi-cores.  let's face it - it's a fad, which doesn't imply that 
there's no substance driving it, or that it will vanish without a trace.
CPU designers are facing an embarrassment of transistors at 65 or 45nm,
and a relatively no-thought way to use em up is just to replicate the 
same old design.  I _do_ mean to imply that this is a cop-out, and I 
really do believe that once we get to 4-core chips, someone will embarrass
the other chip vendors by implementing a genuinely thoughtful 
microarchitectural response to the transistor surplus.  caches are great,
but scaling them to just use up the chip area is not smart.  relying 
on that approach is saying: I bet no one else in the industry is smart
enough to think of something better.

come on!  I'm not even a chip designer and I can think of lots of smarter
things to do.  create a "load-history cache" which, like a branch-history
cache, tries to figure out whether there's a predictable stream of loads
coming from one instruction.  provide an instruction which lets the 
programmer/compiler hint how many times to speculatively unroll a loop
(literature says there is plenty of useful speculation, and that the 
trick is to avoid drowning in it).  figure out an on-chip fabric that 
lets you have lots of independent register files without dumbly 
partitioning the chip into cores, since static partitioning always leads 
to fragmentation and poor utilization.  have a smarter cacheline
that will notice that it only gets re-used an average of 3 times in the 
57 clocks after it is installed, and so shifts itself into L3 proactively.
or notice that some code sequences are relatively urgent (dependent)
and others are pretty slack (speculatively unrolled iterations of a loop,
perhaps), so schedule them smarter.  how about a miss-history buffer that 
notices when you write a value to a non-owned line that later gets moved
to other cores and becomes shared, and so preemptively updates them?
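
to make the "load-history cache" idea a bit more concrete, here is a toy software model of one way it might behave: a per-instruction stride predictor. this is only a sketch under my own assumptions, not a description of any real hardware. the class name, the table layout (last address, stride, confidence) and the confidence threshold of 2 are all hypothetical choices made for illustration.

```python
class LoadHistoryCache:
    """Toy model of a per-instruction load stride predictor.

    Everything here is hypothetical: a real design would use a small,
    fixed-size, tagged table rather than an unbounded dict.
    """

    def __init__(self, confidence_threshold=2):
        # How many times in a row a stride must repeat before we trust it.
        self.threshold = confidence_threshold
        # Maps instruction address (pc) -> [last_addr, stride, confidence].
        self.table = {}

    def observe(self, pc, addr):
        """Record a load issued by instruction `pc` to address `addr`.

        Returns the predicted address of the next load from the same
        instruction, or None if no confident prediction exists yet.
        """
        entry = self.table.get(pc)
        if entry is None:
            # First time we see this instruction: nothing to predict from.
            self.table[pc] = [addr, 0, 0]
            return None

        last_addr, stride, confidence = entry
        new_stride = addr - last_addr
        if new_stride == stride:
            confidence += 1            # same stride again: trust it more
        else:
            stride, confidence = new_stride, 0   # pattern changed: start over

        self.table[pc] = [addr, stride, confidence]
        if confidence >= self.threshold and stride != 0:
            return addr + stride       # candidate address to prefetch
        return None


if __name__ == "__main__":
    lhc = LoadHistoryCache()
    # One instruction walking an array of 8-byte elements.
    for a in (1000, 1008, 1016, 1024, 1032):
        print(a, "->", lhc.observe(0x400, a))
```

with the default threshold, the regular stream above only starts producing predictions on the fourth load, and an irregular stream never produces any, which is the "avoid drowning in speculation" part of the tradeoff.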

most of these ideas are crazy in one way or another, but they're a lot 
more interesting than more cookie-cutter chips...

and fundamentally, Amdahl's law argues against too-rabid multi-coring.
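
for reference, the usual statement of Amdahl's law is that if a fraction p of the work parallelizes perfectly over n cores, the speedup is 1 / ((1 - p) + p / n). a minimal sketch of the arithmetic follows; the function name and the example fractions are just illustrative assumptions, not measurements of any real code.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the runtime (0..1) that scales with cores.
    cores: number of cores (>= 1) applied to the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical job that is 90% parallel: watch the returns diminish.
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:3d} cores: {amdahl_speedup(0.9, n):.2f}x")
```

the serial 10% in this made-up example caps the speedup below 10x no matter how many cores get stamped onto the die.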