[Beowulf] Interesting google server design

Lux, James P james.p.lux at jpl.nasa.gov
Sat Apr 4 08:00:54 PDT 2009




On 4/2/09 3:41 AM, "Robert G. Brown" <rgb at phy.duke.edu> wrote:

> On Wed, 1 Apr 2009, Ellis Wilson wrote:
> 
> Because there is zero penalty to node failure other than this fifteen
> minutes of human time and maybe a spare part distributed along an
> assembly line that handles (I'm guessing) tens of failures an hour,
> there is absolutely no advantage to them in using tier 1 parts.  All
> they care about is that the stack of parts they reach for is at the
> sweet spot of MTBF per dollar spent per unit of "service" delivered by
> the device.  All beowulfers should take note -- this is a perfect
> exemplar of a principle all cluster builders should use, although of
> course for different problems the optimization landscape will differ as
> well (some problems are NOT tolerant of single node failure:-).
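
To make rgb's "MTBF per dollar per unit of service" sweet spot concrete,
here is a back-of-the-envelope sketch in Python -- every number in it is
made up purely for illustration, not taken from Google or anyone else:

    # Toy model: minimize dollars per unit of service delivered, folding
    # the repair labour for a failure into the cost of the part.
    def cost_per_service_unit(part_cost, mtbf_hours, repair_minutes,
                              tech_rate_per_hour, service_units_per_hour):
        repair_cost = (repair_minutes / 60.0) * tech_rate_per_hour
        return (part_cost + repair_cost) / (mtbf_hours * service_units_per_hour)

    # Hypothetical tier-1 node vs. commodity node (all figures invented):
    tier1 = cost_per_service_unit(4000, 60000, 15, 60, 1.0)
    cheap = cost_per_service_unit(1200, 30000, 15, 60, 1.0)
    print("tier-1:    $%.4f per service unit" % tier1)
    print("commodity: $%.4f per service unit" % cheap)

With those invented numbers the commodity box wins by a wide margin even
though it fails twice as often, which is exactly the point: the fifteen
minutes of human time barely moves the total cost.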

Is this not one of the big challenges in writing/designing good parallelized
applications? If you're so tightly coupled that a hiccup anywhere throws off
the timing, what you've built is basically a systolic array or a flavor of
SIMD, and you've put yourself in a category of "brittleness."

I mean, if there's a bug in the underlying rgbbot code, then if /dev/rgbbot0
dies, so does /dev/rgbbot1, etc.
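
The distinction shows up even in a toy sketch -- a caricature, assuming
nothing about anyone's real code; NodeDown and FakeNode are invented purely
for the example:

    class NodeDown(RuntimeError):
        pass

    class FakeNode:
        # A "dead" node raises on any work request.
        def __init__(self, alive=True):
            self.alive = alive
        def compute(self, x):
            if not self.alive:
                raise NodeDown("node failed")
            return x * x

    def lockstep_step(nodes, chunks):
        # Tightly coupled: every node must contribute every step,
        # so a single failure kills (or stalls) the whole calculation.
        return [n.compute(c) for n, c in zip(nodes, chunks)]

    def task_farm(nodes, tasks):
        # Loosely coupled: a task that lands on a dead node is simply
        # retried on another node; nothing is lost but a little time.
        results = []
        for task in tasks:
            for n in nodes:
                try:
                    results.append(n.compute(task))
                    break
                except NodeDown:
                    continue
            else:
                raise RuntimeError("all nodes are down")
        return results

    nodes = [FakeNode(alive=False), FakeNode(), FakeNode()]
    print(task_farm(nodes, [1, 2, 3, 4]))      # completes: [1, 4, 9, 16]
    # lockstep_step(nodes, [1, 2, 3]) would die with NodeDown

Real tightly coupled codes can't just retry a rank like that, which is the
point: the coupling is baked into the algorithm, not the scheduler.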




