[Beowulf] Checkpointing using flash
Ellis H. Wilson III
ellis at cse.psu.edu
Fri Sep 21 09:29:17 PDT 2012
On 09/21/12 12:13, Lux, Jim (337C) wrote:
> I would suggest that some scheme of redundant computation might be more
> effective.. Rather than try to store a single node's state on the node,
> and then, if any node hiccups, restore the state (perhaps to a spare), and
> restart, means stopping the entire cluster while you recover.
I am not 100% sure about the nitty-gritty here, but I do believe there are
schemes already in place to deal with single node failures. What I do
know for sure is that checkpoints are used as a last line of defense
against full cluster failure due to overheating, power failure, or
excessive numbers of concurrent failures -- not for just one node going
belly up.
The LANL clusters I was learning about only checkpointed every 4-6 hours
or so, if I remember correctly. With hundred-petascale clusters and
beyond seeing failures not every few hours but every few minutes,
checkpointing obviously cannot be the first resort for failure
recovery.
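As a rough back-of-the-envelope illustration of that point, here is a
minimal sketch using Young's first-order approximation for the
checkpoint interval, sqrt(2 * C * M), where C is the time to write one
checkpoint and M is the mean time between failures. The function names
and all numbers below (a 10-minute checkpoint write, a 24-hour versus a
10-minute MTBF) are illustrative assumptions, not figures from LANL or
any real machine.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order fraction of wall time lost to checkpointing.

    The first term is time spent writing checkpoints; the second is the
    expected rework after a failure (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    checkpoint_cost = 600.0  # assumed: 10 minutes to write one checkpoint
    # Assumed MTBF values: 24 hours versus 10 minutes.
    for label, mtbf in (("24 h MTBF", 86400.0), ("10 min MTBF", 600.0)):
        tau = young_interval(checkpoint_cost, mtbf)
        lost = overhead_fraction(tau, checkpoint_cost, mtbf)
        print(f"{label}: interval ~{tau / 60:.0f} min, "
              f"estimated overhead ~{lost:.2f} of wall time")
```

Under these assumed numbers, the estimated overhead exceeds 1 once the
MTBF drops to the same order as the checkpoint write time, meaning that
in this simple model the job makes no forward progress on checkpointing
alone. The model is only a first-order approximation and breaks down in
exactly that regime, so treat the output as a sanity check, not a
prediction.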
If I find some of the nitty-gritty I'm currently forgetting about how
smaller, isolated failures are handled now, I'll report back.
Nevertheless, great ideas, Jim!
Best,
ellis