updating the Linux kernel

Mon Jun 12 10:40:57 PDT 2000

On Mon, 12 Jun 2000, David Lombard wrote:

> Crutcher Dunnavant wrote:
> > 
> > Now, I might completly miss something here, but shouldn't all *distibuted*
> > parallel programs assume that a node may not return. After all, what do you
> > assume about hardware failures? ...
> 
> Um, no.
> 
> It all depends upon the software.  PVM does provide the ability to
> recover from a node failure, while an MPI program will just tank.

> > ... So, while it may not be a *good* way to do it,
> > In a properlly paralized application, shouldn't you be able to take down any
> > random node other than the job allocation node, AT ANY TIME, and have that job
> > reallocated.> reallocated (Yeah, you lose the local work, but those tasks should be
> > checkpointed frequently)...
> 
> As for checkpointing, that too is an "it depends" answer. 
> Application-level checkpointing may be available to varying degrees --
> it can be a non-trivial task.  System-level checkpointing generally
> can't handle sockets, and that rules out both PVM and MPI.

Depends indeed, and on hardware as much (or even more) than the software
per se.  In many, if not most cases, checkpointing in particular (or any
other means of ensuring a fully robust capability of a parallel program
to recover from a node failure) can be mathematically shown to
>>increase<< on the average, the time to completion, possibly
considerably.  This is therefore a decision that should be left up to
the developer after considering their own particular task(s), parallel
environment, and associated scaling properties.

This is purely a statistical cost-benefit problem.  In many/most cases
the additional (expectation value of the) costs of developing and
running a fully node-robust parallel application will outweigh the
(expectation value of the) benefits and it should NOT be done;
checkpointing code (parallel or otherwise) is not something done either
lightly or as standard practice because "usually" the probability of a
failure that might be "rescued" by the checkpoint is small and the cost
of the checkpoint(s) large.  

In others, either nonlineariites in the perceived benefits (for example,
in a distributed database supporting 911 operations for a
police/fire/emergency group where the "cost" of failure might be a human
life) or a statistical analysis of the expected time of completion make
checkpointing desirable, or even essential if you with to complete the
calculation at all (have an expected time of completion much less than
infinity).

To do the decision analysis (checkpoint/don't checkpoint -- make code
node robust or not) correctly, one needs a set of parameters.  From
these parameters one can see roughly how to make the decision.  One
parameter -- the additional time required to develop a robust
application -- is semi-heuristic. Making an application robust is silly
if it takes you weeks to develop it and you plan to use it only one time
and it will complete in at most a few days even if you have to deal with
a few node failures.  

The others are (as a function of number of nodes N):

  a) The time required for the calculation to complete normally on N
nodes T_calc(N).  
  b) The time required to checkpoint the calculation (e.g.  store all
active memory onto a RAID server) T_save(N)
  c) The time interval selected between checkpoints T_check (which is a
parameter chosen optimally as a function of N, but which doesn't depend
directly on N).
  d) The probability density (probability per unit time) of node failure
p.  With luck one can assume that node failures are independent events
and that all nodes have the same probability density (presuming that
node design is homogeneous and that nodes have independent UPS and so
forth).

In a nutshell, if one massages these numbers (several of which are
themselves highly nonlinear functions of e.g. the number of nodes N and
other hardware design parameters) and guestimates that the probability
of overall node-based failure of a calculation is very small in the time
T_calc(N), the estimated time of completion is unlikely to be
significantly reduced by checkpointing and indeed may be very
significantly increased if the time cost of checkpointing itself,
T_save(N)*T_calc(N)/T_check, is at all comparable to T_calc(N).

If the overall probability of a node-based failure is "high" (above some
threshold clearly related to the additional time cost of checkpointing)
checkpointing is worth it if the additional time required to develop the
code isn't unreasonable compared to the amount of time one expects to
use the program.  In some cases (for example, on a cluster based on an
unstable and unnamed operating system that might have one node crash per
day in a thirty-node cluster, while the parallel calculation takes at
least a week to complete) code that is NOT checkpointed will simply
"never" complete.

I would therefore NOT make auto-migration, robustness under node failure
or checkpointing a "required feature of well-written parallel programs",
or even the much weaker "advised/nice/suggested feature".  Sometimes
this sort of bet-hedging is appropriate.  Sometimes it is essential.
Sometimes it would be idiotic.  Only the programmer/designer, with a
clear picture of the scaling properties and probabilities of failure for
their own application and cluster hardware, is in a position to
determine which category >>their<< particular application/cluster falls
into.

BTW, this sort of tradeoff occurs in a lot of other areas in
computational optimization design -- e.g. TCP (robust against packet
loss) vs UDP (not so robust against packet loss but somewhat faster).
There are also plenty of analogs in quality control in manufacturing and
so forth.  Statistical cost/benefit analysis rules, and good programmers
should do (and hence should be able to do) at least seat-of-the-pants
estimates as standard practice in code design. IMHO, anyway...;-)

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu