Beowulf: A theoretical approach

Robert G. Brown rgb at phy.duke.edu
Thu Jun 22 12:08:10 PDT 2000


On Thu, 22 Jun 2000, Lyle Bickley wrote:

> Thanks Robert for all your comments, but especially those regarding
> fault tolerance.

You're more than welcome.

> Cost/benefit analysis is a very difficult issue.  How many Beowulf runs
> that take days to complete fail?  What is the cost?  I wish I had a
> better handle on this.  It's a LOT easier to understand the cost of the
> NY Stock Exchange going down for 20 minutes than a Beowulf failure after
> three days....

I'm hoping to tackle this in a chapter in the eternal book I'm working
on.  Part of the answer is objective, and that part can be explained.
In fact, it is mathematically described by e.g. game theory or insurance
company actuarial statistics -- one is selecting a strategy to optimize
some expected return (maximize benefit or minimize cost) based on your
best guess of certain probabilities and cost weights.  There are even
ways to create a feedback correction cycle and tune to a global optimum
based on observed rates of failures and observed costs instead of
guesses, if one gets very fancy and it matters.
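
To make that concrete, here is a toy sketch of the kind of expected-cost
comparison I mean -- in Python, with completely made-up numbers for the
failure probability, costs, and overheads, just to show the shape of the
calculation:

    # Toy expected-cost comparison of two strategies (all numbers are
    # illustrative assumptions, not measurements).
    p_fail_per_day = 0.02         # guessed chance a run dies on any given day
    run_length_days = 3           # length of the computation
    cost_per_lost_day = 8.0       # guessed cost (hours of rework) per lost day
    chunk_overhead_per_day = 0.1  # guessed cost of saving results once a day

    # Strategy A: no protection -- a failure on day d loses all d days so far.
    p_alive = 1.0
    expected_cost_none = 0.0
    for d in range(1, run_length_days + 1):
        p_fail_today = p_alive * p_fail_per_day
        expected_cost_none += p_fail_today * d * cost_per_lost_day
        p_alive *= 1.0 - p_fail_per_day

    # Strategy B: save results daily -- a failure loses at most one day,
    # but the saving overhead is paid every day.
    expected_cost_daily = (run_length_days * chunk_overhead_per_day
                           + run_length_days * p_fail_per_day * cost_per_lost_day)

    print(f"expected cost, no protection : {expected_cost_none:.2f}")
    print(f"expected cost, daily saves   : {expected_cost_daily:.2f}")

Plug in observed failure rates and real costs instead of guesses and you
have the feedback loop I mentioned.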

The other part of the answer, as you note, is subjective.  What's the
"cost" of a beowulf failure after three days?  Probably very little, if
you are in the middle of a six month project and it doesn't happen
again.  On the other hand, if you have a publication deadline in two
days and needed just one more hour to complete the three day run that
would finish things off in time to write them up...

I worry about the same thing here during the academic year.  During the
bulk of the semester a server failure in the physics department is an
annoyance, but probably isn't "critical".  Every semester, though, there
is a ten day or so period where a server failure could literally be a
disaster -- when I'm writing my final exams (on the computer) and so is
everybody else, or evaluating my gradebooks (on the computer) and so is
everybody else.  If those go away right before I was going to print out
an exam or tally up the grades, the entire academic Universe comes to an
ugly end: students can't be given the final exam in their one and only
final exam slot, or a failing grade doesn't get in until after they've
graduated.  Heads roll.  Angry students storm your office carrying
torches.

So, we do what we can to guard against this -- keep good backups,
architect things so there is a replacement box that could be turned into
the primary server in a few hours.  This costs some money and time but
is worth it.  On the other hand, what if there is a fire?  Can't say
that our measures are adequate for that.  Insurance for that would
involve off-site storage, and in fact I tend to do just that and try to
keep my entire CVS tree sync'd between home and work so if a (small:-)
meteor landed on the physics building tonight (when I wasn't there) my
sources and writings and papers and so forth would survive.  Even this
wouldn't help if there was a hurricane like Fran -- electricity itself
went away for more than a week at my house, and my laptop won't run that
long and I can't afford an adequate solar recharger...;-)

Backup strategy (the underlying reasoning) is basically the same as
failover strategy -- you determine the amount of work you are
willing to lose given scenario X and work cost/value Y, and take
preventative measures on that interval.  You then cross your fingers
concerning scenario Z that you can't afford to deal with.  After all,
even Tandems will go down if they are vaporized in a nuclear blast.
Unless perhaps they are failover protected at sites separated by (say)
several tens of miles and a lot of EMP protection.  Military scenarios
probably require failover protection at even this level, but most of the
rest of us don't.

A lot of people doing beowulfish calculations do failover protection of
sorts without even knowing consciously that that is what they are doing.
For example, why is it a "three day run"?  In most cases, one can pick a
(scientific) calculation size that will run in an hour, or a day, or a
week, or a year (and all would yield interesting results).  You pick a
size that you can afford and that finishes in a "reasonable" amount of
time.  Larger sizes wait for Moore's Law to catch up to them.  What's
reasonable?  A size that you're pretty sure will finish before a system
is likely to fail, which may be as low as the interval between area
thunderstorms in the summer (this was the case at my house before I
installed UPS on everything).
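
Put crudely (and with a guessed mean time between failures -- the real
number is whatever you actually observe), that "reasonable size" is just
the longest run that still has good odds of finishing before something
kills it.  A minimal sketch, assuming failures arrive more or less at
random (exponentially distributed gaps):

    import math

    mtbf_days = 30.0     # guessed mean time between run-killing events
    target_odds = 0.9    # assumed: want ~90% odds of finishing untouched

    def p_complete(run_days, mtbf=mtbf_days):
        # Probability a run of run_days finishes before the next failure,
        # assuming exponentially distributed gaps between failures.
        return math.exp(-run_days / mtbf)

    size = 1
    while p_complete(size + 1) >= target_odds:
        size += 1
    print(f"longest 'reasonable' run: {size} day(s), "
          f"P(finish) = {p_complete(size):.2f}")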

In many cases one can do better -- for example, it may be possible to
do a year's worth of calculation safely by breaking it into chunks
completed a day at a time, or a week at a time, without having to really
"checkpoint" the code.  In Monte Carlo, for example, one can just run a
large number of independent simulations and do stats to recombine the
results.  One even gains from doing this as the variance of the truly
independent runs is an absolutely reliable measure of error in the mean
(which isn't generally the case for the variance generated by importance
sampling a single Markov chain with internal autocorrelation times, but
I digress:-).
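
In code, the recombination is almost embarrassingly simple.  A minimal
sketch with a toy integrand (a real simulation would replace one_run, but
the bookkeeping is the same):

    import random
    import statistics

    def one_run(n_samples, seed):
        # One independent Monte Carlo run: estimate the mean of x**2 for x
        # uniform on [0, 1] (exact answer 1/3).  Stands in for a day's work.
        rng = random.Random(seed)
        return sum(rng.random() ** 2 for _ in range(n_samples)) / n_samples

    # Ten independent runs -- one per node or per day; distinct seeds keep
    # them statistically independent.
    run_means = [one_run(100_000, seed) for seed in range(10)]

    grand_mean = statistics.mean(run_means)
    # The scatter of the independent run means gives the error in the mean,
    # reliable precisely because there is no autocorrelation between runs.
    std_err = statistics.stdev(run_means) / len(run_means) ** 0.5

    print(f"estimate = {grand_mean:.5f} +/- {std_err:.5f}  (exact = 0.33333)")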

I personally try to time things so that chunk completion times are on
the order of one day, because I'm always willing to lose a day's worth
of compute time as long as it doesn't happen too often.  Sometimes I've
gone as high as a week.  I basically NEVER do three-week-long runs if
there is any way to rearrange things so I don't have to -- systems don't
break, Linux rarely fails, but somehow "something" (lightning, human
error, power fluctuations, somebody tripping over a cord) not
infrequently intervenes somewhere within the timeframe of months.  This
very coarse chunking of work is all the "failover checkpointing" that I
(or, I suspect most beowulf folks) do, and it works quite effectively,
although I'm sure that it isn't always possible to coarsely chunk like
this without writing a lot of nasty code to save a truly restartable
checkpoint state...
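
For the cases where it *is* possible, the pattern is about as simple as
it gets -- each chunk writes its own result file, and a restarted job
just skips the chunks that already finished.  A sketch (the file names
and the per-chunk work are placeholders):

    import os
    import pickle

    def run_chunk(i):
        # Placeholder for roughly a day's worth of computation for chunk i.
        return {"chunk": i, "result": i * i}

    N_CHUNKS = 30      # say, a month of work in day-sized pieces
    OUTDIR = "chunks"  # hypothetical output directory
    os.makedirs(OUTDIR, exist_ok=True)

    for i in range(N_CHUNKS):
        outfile = os.path.join(OUTDIR, f"chunk_{i:03d}.pkl")
        if os.path.exists(outfile):  # finished before the last crash/reboot
            continue
        with open(outfile, "wb") as f:
            pickle.dump(run_chunk(i), f)
    # After a failure, just rerun the script: at most one chunk is lost.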

> > At a guess, this is the kind of problem that will -- eventually -- be at
> > least partly addressed by work being done at a number of places.  I
> > believe that there is at least one group working on certain core pieces
> > of software that will build beowulf support directly into the kernel,
> > where it can benefit from increases in speed and efficiency and where
> > one can BEGIN to think about issues like fault tolerance at a lower
> > level than program design.  This is the kind of thing the "true beowulf"
> > computer science groups think about.
> > 
> 
> I have been considering the possibility of a single Tandem like system
> which is TRULY fault tolerant, bringing "true" fault tolerance to an
> entire Beowulf cluster via heartbeats, progress monitoring, process
> checkpoints, etc.
> 
> But who would buy such a critter??  Is there really a need??  What
> percentage of the total cost of a Beowulf would be a reasonable cost for
> such a beast??

Well, Tandem systems do sell, of course, so there is a market for this
kind of fault tolerance.  The military might even need it on a small
scale -- a tank might be made more robust if its battle computer was
really a fault tolerant beowulf networked to four or five hard sites
within the tank.  A non-fatal hit might take out one or two nodes, but
not the whole thing.  Ditto the space program (plagued with failures
already and with a very high cost of failure).  Financial markets and 
webservice markets both have a high cost of failure.  Something like an
EMS computer system supporting a 911 center cannot afford to go down in
any dimension, even during a natural or unnatural disaster.

In many of these cases, the people buying the fault tolerance have DEEP
pockets and the cost of failure is VERY high.  However, their needs are
also very, very specific, so one has to basically simultaneously
engineer the system and the software to match.  The one thing bringing
this sort of fault tolerance to beowulfery (at the systems level, with
open source components and COTS hardware) would do is significantly
lower the cost of the dedicated/custom software development.  I think
that is the goal of some of the folks working on the problem.

A very interesting subject, I agree.  Go for it.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






