[Beowulf] how Google warps your brain

Thu Oct 21 15:04:44 PDT 2010

> parallel jobs on massive datasets when you have a simple interface like
> MapReduce at your disposal. Forget about complex shared-memory or message
> passing architectures: that stuff doesn't scale, and is so incredibly brittle
> anyway (think about what happens to an MPI program if one core goes offline).

this is a bit unfair - the more honest comment would be that for
data-parallel workloads, it's relatively easy to replicate the work a bit,
and gain substantially in robustness. you _could_ replicate the work in
a traditional HPC application (CFD, chem/MD, etc.), but it would take a lot
of extra bookkeeping because the dataflow patterns are complex and iterative.

> The other Google technologies, like GFS and BigTable, make large-scale
> storage essentially a non-issue for the developer. Yes, there are tradeoffs:

well, I think storage is the pivot here: it's because disk storage is so
embarrassingly cheap that Google can replicate everything (3x?). once you've
replicated your data, replicating work almost comes along for free.
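
a toy sketch of why that is, in Python. everything here is made up for
illustration (the worker count, the round-robin placement, the names
DEAD_WORKERS, placement, run_chunk, run_job); it is not how GFS or MapReduce
actually schedule work, and the replica count of 3 is only the guess above.
the point is just that when chunks are independent and every chunk already
lives on several nodes, surviving a dead node is a retry loop rather than a
bookkeeping problem:

```python
# toy model: each data chunk is stored on REPLICAS workers, so a task that
# lands on a dead worker can simply be rerun where another copy lives.
# all names and numbers are hypothetical.

NUM_WORKERS = 4
REPLICAS = 3            # assumed replication factor
DEAD_WORKERS = {1}      # pretend worker 1 has gone offline


def placement(chunk_id):
    """Workers holding a copy of this chunk (simple round-robin layout)."""
    return [(chunk_id + i) % NUM_WORKERS for i in range(REPLICAS)]


def run_chunk(worker, chunk):
    """Process one chunk on one worker; fails if that worker is offline."""
    if worker in DEAD_WORKERS:
        raise RuntimeError(f"worker {worker} is offline")
    return sum(chunk)


def run_job(chunks):
    """Run every chunk, falling back to the next replica on failure."""
    partials = []
    for chunk_id, chunk in enumerate(chunks):
        for worker in placement(chunk_id):
            try:
                partials.append(run_chunk(worker, chunk))
                break
            except RuntimeError:
                continue  # try the next worker holding this chunk
        else:
            raise RuntimeError(f"all replicas of chunk {chunk_id} failed")
    return sum(partials)


chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
total = run_job(chunks)
```

chunk 1 is first tried on the dead worker and quietly reruns on the next
replica. an iterative, tightly-coupled code has no such independent chunks
to rerun, which is where the extra bookkeeping comes from.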

> So, printf() is your friend. Log everything your program does, and if
> something seems to go wrong, scour the logs to figure it out. Disk is cheap,
> so better to just log everything and sort it out later if something seems to

this is OK for data-parallel, low-logic kinds of workflows (like Google's).
it's a long way from being viable for any sort of traditional HPC, where
there's far too much communication and everything runs too long to log
everything. interestingly, logging might work if the norm for HPC clusters
were something like gigabit-connected uni-core nodes, each with 4x 3TB disks.
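
a back-of-envelope check on that, as a Python sketch. the assumptions are
mine: gigabit taken as 1e9 bits/s, decimal terabytes, every byte that crosses
the wire gets logged, and the disks can keep up with the link. the function
name hours_to_fill is hypothetical.

```python
# how long can a node log all of its network traffic to local disk?
# assumptions: link runs flat out, disks keep up, decimal units throughout.

LINK_BITS_PER_S = 1e9   # gigabit ethernet
DISKS_PER_NODE = 4
DISK_BYTES = 3e12       # 3 TB per disk


def hours_to_fill(link_bits_per_s, disks, disk_bytes):
    """Hours until local disk is full when logging at full link rate."""
    bytes_per_s = link_bits_per_s / 8
    return disks * disk_bytes / bytes_per_s / 3600


hours = hours_to_fill(LINK_BITS_PER_S, DISKS_PER_NODE, DISK_BYTES)
print(f"{hours:.1f} hours of traffic fit on local disk")
```

that works out to a bit over a day of saturated gigabit traffic per node.
feed the same function a 40e9 bits/s link (a hypothetical comparison point
for a faster interconnect) and the same disks fill in well under an hour,
which is the gulf in a nutshell.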

so in a sense we're talking across a cultural gulf: 
disk/data-oriented vs compute/communication-oriented.