Two heads are better than one! :)

Fri Nov 1 04:29:24 PST 2002

On Fri, 2002-11-01 at 00:01, Donald Becker wrote:

[...]

> >   It is more complex than that, in that you would need to preserve state
> > changes over the length of the program, and PVM/MPI/et al do not
> > preserve this state information.
> 
> One rule of thumb: people that think application-independent
> checkpointing is possible haven't actually considered the
> implementation and implications.  In real life the most practical way
> to handle the issue is
>   - having the system handle checkpoint signal support
>   - making it easy to write, gather and restore checkpoint files, and
>   - providing examples of application-supported checkpointing

Agreed.  This is how the SGI checkpointing worked, which was (IIRC)
modeled on the Cray checkpointing.  Not everything could be checkpointed
though, and the system code walked through its checkoff list of items to
see if the program was indeed checkpointable.
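
To make the three points quoted above concrete, here is a minimal
sketch of application-supported checkpointing in Python.  It is not how
the SGI or Cray code worked.  The signal choice (SIGUSR1, so POSIX
only), the file location, the JSON format and all function names are
assumptions made for illustration.  The handler only sets a flag; the
application writes its own state at a point it knows to be safe and
reloads it on startup.

```python
import json
import os
import signal
import tempfile

# Hypothetical checkpoint location, chosen only for this sketch.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "demo_checkpoint.json")

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: record the request, do no I/O here."""
    global checkpoint_requested
    checkpoint_requested = True


def write_checkpoint(state, path=CHECKPOINT_PATH):
    """Write to a temporary file, then rename it over the old checkpoint,
    so a crash in mid-write never leaves a truncated file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def restore_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(n_steps, path=CHECKPOINT_PATH):
    """Toy computation that resumes from the last checkpoint, if any."""
    global checkpoint_requested
    state = restore_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        # Safe point: the state is consistent between iterations.
        if checkpoint_requested:
            write_checkpoint(state, path)
            checkpoint_requested = False
    return state


if __name__ == "__main__":
    # Assumes a POSIX system; SIGUSR1 is not available on Windows.
    signal.signal(signal.SIGUSR1, request_checkpoint)
    print(run(10))
```

Sending SIGUSR1 to the running process (kill -USR1 <pid>) requests a
checkpoint at the next safe point.  Note that only the application's
own variables are saved here, which is exactly why this approach is
tractable where the application-independent one is not.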

It is not just program state that needs to be maintained, but dynamic
system resources (pipes, open files, sockets) that need to be examined,
torn down and rebuilt.  Only for certain subsets of these can you
successfully rebuild after a teardown, which is why the checkpointing
only worked in some cases.
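
As a rough illustration of such a checkoff pass, the Python sketch
below classifies a set of open file descriptors and reports whether
all of them are of a kind assumed to be rebuildable.  The actual
SGI/Cray checkoff list is not reproduced here: the REBUILDABLE set
(regular files only) and the function names are assumptions chosen to
show the idea, and the code assumes a POSIX system.

```python
import os
import stat


def classify_fd(fd):
    """Return a coarse description of what an open descriptor refers to."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "character device"
    return "other"


# Assumption for this sketch only: a regular file can be reopened and
# repositioned after a restart, while pipes and sockets depend on a
# peer process whose state we do not control.
REBUILDABLE = {"regular file"}


def checkoff(fds):
    """Return (checkpointable, report) for the given descriptors.

    report maps each descriptor to its kind; checkpointable is True
    only if every descriptor is of a rebuildable kind.
    """
    report = {fd: classify_fd(fd) for fd in fds}
    blockers = [fd for fd, kind in report.items() if kind not in REBUILDABLE]
    return (not blockers, report)
```

A process holding only regular files passes the check; add a pipe or a
socket to the list and it fails, which mirrors the "only worked in some
cases" behavior described above.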

> 
> > The folks at LANL had a fault tolerant MPI at one point, but I haven't
> > heard much of it recently.
> 
> I would like to see a paper on the real-life result.  I'm guessing that
> the overhead overwhelms any possible saving even with frequent node
> failure.  That's exactly the sort of result that makes for a useful
> paper -- "You must have a much better idea than this, or it won't work."

They had something published on their web site about a year ago.  If I
find it again, I'll post the URL.

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615