Beowulf Questions

Sun Jan 5 13:13:52 PST 2003

On 5 Jan 2003, Randall Jouett wrote:
> On Sat, 2003-01-04 at 11:58, Donald Becker wrote:
> > 
> > Not at all!  MPI does not handle faults.  Most MPI applications just
> > fail when a node fails.  A few periodically write checkpoint files, and
> > a subset ;-) of those can be re-run from the last checkpoint.
> 
> Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte

Application-specific checkpoint files are sometimes the only effective
way to handle node crashes.

> Off the top of my head, why couldn't you just plug in an old
> 10Base-T card to each node. Add a server node that specifically

The problem isn't just detecting that a node has failed (which is either
trivial or impossible, depending on your criteria).  The problems are
  - handling a system failure during a multiple-day run,
  - handling partially completed work issued to a node, and
  - killing processes/nodes that you think have failed, lest they
    complete their work later.
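
As a rough illustration of the last two points, here is a minimal Python
sketch of a master that hands out work with a per-assignment token and
refuses results from an assignment it has since revoked.  The WorkTracker
class and its method names are hypothetical, invented for this example; it
only ignores late results and does not actually kill the remote process,
which is the harder part.

```python
class WorkTracker:
    """Hypothetical master-side bookkeeping for issued work units."""

    def __init__(self):
        self._next_token = 0
        self._owner = {}    # task_id -> (node, token) of the live assignment
        self.results = {}   # task_id -> accepted result

    def assign(self, task_id, node):
        # Each assignment gets a fresh token, so a re-issued task can be
        # told apart from the earlier, revoked assignment.
        self._next_token += 1
        self._owner[task_id] = (node, self._next_token)
        return self._next_token

    def declare_failed(self, node):
        # Revoke every unfinished task held by the node and return the
        # task ids so the caller can re-issue them elsewhere.
        orphaned = [t for t, (n, _) in self._owner.items() if n == node]
        for t in orphaned:
            del self._owner[t]
        return orphaned

    def submit(self, task_id, node, token, result):
        # Accept a result only from the current owner of the task.  A node
        # that was declared dead but finishes later is rejected here.
        if self._owner.get(task_id) != (node, token):
            return False
        self.results[task_id] = result
        del self._owner[task_id]
        return True
```

If "node-a" is declared failed and its frame is re-issued to "node-b", a
late submit() from "node-a" with its old token returns False and the
result from "node-b" is the one kept.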

> BTW, has anyone bothered to calculate all the wasted cycles
> used up by check-point files? :^).

Checkpointing is very expensive, and most of the time the checkpoint
isn't used. This is why only application-specific checkpointing makes
sense: the application writer knows which information is critical, and
when everything is consistent.  Machines that save the entire memory
space have been known to take the better part of an hour to roll out a
large job.
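
To make "application-specific" concrete, here is a minimal Python sketch,
assuming a toy job (a running sum of squares) whose only critical state is
a step counter and an accumulator.  The function names, the JSON format,
and the stop_after parameter (used only to simulate a crash) are all
assumptions for illustration.  The checkpoint is written to a temporary
file and renamed into place, so a crash during the write leaves the
previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temp file plus rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # push the data to disk before renaming
        os.replace(tmp, path)      # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Sum i*i for i in range(total_steps), resuming from path if present."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None            # simulated crash: unsaved work is lost
        state["acc"] += state["step"] ** 2
        state["step"] += 1
        done_this_run += 1
        # Save only at a point where step and acc agree with each other.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["acc"]
```

The state saved here is a few bytes, regardless of how much memory the
job uses, which is the whole argument for letting the application decide
what goes into the checkpoint.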

> > Although the program completes the rendering, there is still much
> > ugliness when a partially-failed MPI program tries to finish.
> 
> Hmmm. Why aren't folks flagging the node as dead and ignoring
> any other output until the node is back up and saying it's
> ready to run. This would have to be verified by the sysadmin,
> of course.

The issue is the internal structure of the MPI implementation: there is
no way to say "I'm exiting successfully even though I know some processes
might be dead."  Instead what happens is that the library call waits
around for the dead children to return.

This brings us back to the liveness test for compute nodes.  When do we
decide that a node has failed?  If it doesn't respond for a second?  A
transient Ethernet failure might take up to three seconds to restart
the link (typical is 10 msec.)  Thirty seconds? A machine running a 2.4
kernel before 2.4.17 might take minutes to respond when recovering from
a temporary memory shortage, but run just fine later.
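
The trade-off shows up even in the simplest timeout-based liveness check.
Below is a minimal Python sketch, assuming nodes report in by calling a
hypothetical heartbeat() method; the HeartbeatMonitor class is invented
for this example.  Note that it can only ever report a node as
"suspected": a node that was merely slow comes back as soon as its next
heartbeat arrives, and whatever timeout_s you pick will be wrong for one
of the cases above.

```python
import time


class HeartbeatMonitor:
    """Hypothetical timeout-based failure suspector."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock        # injectable so the logic can be tested
        self._last_seen = {}       # node name -> time of last heartbeat

    def heartbeat(self, node):
        # Record that the node was heard from just now.
        self._last_seen[node] = self._clock()

    def suspected(self):
        # Nodes silent for longer than timeout_s.  This is a suspicion,
        # not proof: the node may only be slow or briefly unreachable.
        now = self._clock()
        return sorted(
            n for n, t in self._last_seen.items()
            if now - t > self.timeout_s
        )
```

With timeout_s=1.0 a three-second link restart gets a healthy node
flagged; with timeout_s=30.0 a machine stuck for minutes in a memory
shortage is still flagged, even though it would have recovered.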

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993