[Beowulf] Checkpointing using flash

Fri Sep 21 14:45:24 PDT 2012

On Fri, Sep 21, 2012 at 02:49:32PM +0000, Hearns, John wrote:
> http://www.theregister.co.uk/2012/09/21/emc_abba/
> 
> Frequent checkpointing will of course be vital for exascale, given the MTBF of individual nodes.

Individual nodes have very good MTBF.  It's /system/ scale that causes
problems for system MTBF.
Take a look at Christian Engelmann's presentation at
http://www.csm.ornl.gov/~engelman/publications/engelmann10resilience.ppt.pdf
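
As a rough illustration of the scaling argument: if node failures are
assumed independent and exponentially distributed, and any single node
failure takes down the job, system MTBF is roughly node MTBF divided by
node count. The 5-year node MTBF and the node counts below are
hypothetical figures chosen for the sketch, not numbers from the
presentation.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a system that fails when any one node fails.

    Assumes independent, exponentially distributed node failures,
    so the failure rates of the nodes simply add up.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


# Hypothetical: each node has a 5-year MTBF, which is very good for one node.
node_mtbf = 5 * HOURS_PER_YEAR

for n in (1_000, 100_000):
    print(f"{n:>7} nodes: system MTBF ~ {system_mtbf_hours(node_mtbf, n):.2f} hours")
```

Under these assumptions the per-node figure stays the same while the
system-level figure shrinks linearly with node count.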

Our primary approach today is recovery-based resilience, a.k.a.,
checkpoint-restart (C/R). I'm not convinced we can continue to rely on that
at exascale.

Having written that, we can clearly improve on C/R overheads with various
techniques, including NVM. A number of papers have discussed the use of
NVM to reduce overheads so that we can continue to rely on C/R. See,
for example:
http://dl.acm.org/citation.cfm?id=1654117
http://dl.acm.org/citation.cfm?id=1845215
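
To see why a faster checkpoint target helps, here is a minimal sketch
using Young's first-order approximation for the checkpoint interval,
tau = sqrt(2 * C * M), where C is the time to write one checkpoint and
M is the system MTBF. This is a textbook model that only holds when C
is small relative to M; it is not taken from the papers above, and the
600 s, 60 s, and 24 h figures are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(checkpoint_cost_s: float, mtbf_s: float, interval_s: float) -> float:
    """First-order fraction of run time lost to C/R.

    checkpoint_cost_s / interval_s : time spent writing checkpoints
    interval_s / (2 * mtbf_s)      : expected rework after a failure
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


mtbf = 24 * 3600.0  # hypothetical system MTBF: one day

# Hypothetical checkpoint write times: slow parallel file system vs. NVM.
for label, cost in (("disk", 600.0), ("NVM", 60.0)):
    tau = young_interval(cost, mtbf)
    print(f"{label}: checkpoint every {tau:.0f} s, "
          f"waste ~ {waste_fraction(cost, mtbf, tau):.1%}")
```

In this model the waste at the optimal interval is sqrt(2 * C / M), so
cutting the checkpoint cost helps, but a shrinking system MTBF pushes
the overhead back up.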

> However how accurate is this statement:
> 
> HPC jobs involving half a million compute cores ... have a series of checkpoints set up in their code with the entire memory state stored at each checkpoint in a storage node.
> 
We're not concerned about the "entire memory state". Application-level
checkpointing only saves an application-dependent portion of the
program's data. Granted, this could still be a /large/ fraction of
system memory.

Storing the checkpoint in persistent storage, but *not* "a storage node",
is one current approach. Storing in other nodes' memory, e.g., diskless
checkpointing, is another.
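
One variant of diskless checkpointing described in the literature keeps
an XOR parity of the nodes' checkpoints in another node's memory, so
the checkpoint of a single failed node can be rebuilt from the
survivors. The toy sketch below assumes equal-length checkpoints and a
single failure; the function names are made up for the example and do
not correspond to any particular library.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_lost(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing checkpoint from the survivors and the parity."""
    return xor_blocks(survivors + [parity])


# Hypothetical in-memory checkpoints from three compute nodes.
checkpoints = [b"node0-state", b"node1-state", b"node2-state"]

# The parity would be held in a fourth node's memory, not on disk.
parity = xor_blocks(checkpoints)

# Node 1 fails; rebuild its checkpoint from nodes 0 and 2 plus the parity.
rebuilt = recover_lost([checkpoints[0], checkpoints[2]], parity)
print(rebuilt == checkpoints[1])
```

A single parity block only tolerates one lost node per group; surviving
more simultaneous failures needs a stronger code or more copies.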

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.