[Beowulf] Checkpointing using flash
David N. Lombard
dnlombar at ichips.intel.com
Fri Sep 21 14:45:24 PDT 2012
On Fri, Sep 21, 2012 at 02:49:32PM +0000, Hearns, John wrote:
> Frequent checkpointing will of course be vital for exascale, given the MTBF of individual nodes.
Individual nodes have very good MTBF. It's /system/ scale that causes
problems for system MTBF.
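To put rough numbers on that, here is a minimal sketch. It assumes node failures are independent and exponentially distributed, so the system MTBF is approximately the node MTBF divided by the node count. The 5-year node MTBF and the node counts are hypothetical, chosen only for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Approximate system MTBF, assuming independent, exponentially
    distributed node failures (failure rates simply add up)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


# Hypothetical figure: each node fails on average once every 5 years.
NODE_MTBF_HOURS = 5 * 365 * 24

for nodes in (1, 1_000, 100_000):
    hours = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
    print(f"{nodes:>7} nodes: system MTBF ~ {hours:.3f} hours")
```

Under these assumptions, a node that looks very reliable on its own still yields a system-level interruption every few hours or less once enough nodes are combined.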
Take a look at Christian Engelmann's presentation at
Our primary approach today is recovery-based resilience, a.k.a.,
checkpoint-restart (C/R). I'm not convinced we can continue to rely on that
Having written that, we can clearly improve on C/R overheads with various
techniques, including NVM. A number of papers have discussed the use of
NVM to reduce overheads so that we can continue to rely on C/R. See
these for example
> However how accurate is this statement:
> HPC jobs involving half a million compute cores ... have a series of checkpoints set up in their code with the entire memory state stored at each checkpoint in a storage node.
We're not concerned about the "entire memory state". Application-level
checkpointing only saves an application-dependent portion of the
program's data. Granted, this could still be a /large/ fraction of
Storing the checkpoint in persistent storage, but *not* "a storage node",
is one current approach. Storing in other nodes' memory, e.g., diskless
checkpoint, is another.
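As a rough illustration of the diskless idea, here is a single-process sketch of one common variant: each node keeps its checkpoint in memory, and an XOR parity block is held elsewhere, so any one lost checkpoint can be rebuilt from the survivors. This is an assumption-laden toy, not any particular library's implementation. The function names and the byte strings standing in for per-node checkpoint data are hypothetical, and a real system would exchange these buffers over the interconnect.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Parity block that would be held in a spare node's memory."""
    return reduce(xor_bytes, checkpoints)


def recover_lost(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from survivors plus parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints, one per compute node (equal length).
checkpoints = [b"node0-state!", b"node1-state!", b"node2-state!"]
parity = make_parity(checkpoints)

# Simulate losing node 1 and rebuilding its checkpoint.
lost_index = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost_index]
rebuilt = recover_lost(survivors, parity)
print(rebuilt == checkpoints[lost_index])
```

Simple XOR parity like this tolerates only one lost checkpoint per parity group; surviving multiple simultaneous failures needs a stronger code or multiple copies.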
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.