[Beowulf] Checkpointing using flash

Fri Sep 21 09:29:17 PDT 2012

On 09/21/12 12:13, Lux, Jim (337C) wrote:
> I would suggest that some scheme of redundant computation might be more
> effective.. Rather than try to store a single node's state on the node,
> and then, if any node hiccups, restore the state (perhaps to a spare), and
> restart, means stopping the entire cluster while you recover.

I am not 100% sure about the nitty-gritty here, but I do believe there
are schemes already in place to deal with single-node failures.  What I do 
know for sure is that checkpoints are used as a last line of defense 
against full cluster failure due to overheating, power failure, or 
excessive numbers of concurrent failures -- not for just one node going 
belly up.

The LANL clusters I was learning about only checkpointed every 4-6 hours 
or so, if I remember correctly.  With hundred-petascale clusters and 
beyond seeing failures on the order of minutes rather than hours, 
checkpointing obviously cannot be the first resort for failure 
recovery.
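
To put rough numbers on that, here is a back-of-the-envelope sketch. 
It is not anything from LANL: it assumes failures arrive independently 
at a constant rate (a Poisson process), and the MTBF values and the 
function names are made up for illustration.  It just asks how likely 
a 4-hour checkpoint interval is to finish with no failure at all.

```python
import math


def expected_failures(interval_hours, mtbf_hours):
    """Mean number of failures in one checkpoint interval."""
    return interval_hours / mtbf_hours


def prob_interval_survives(interval_hours, mtbf_hours):
    """Chance of zero failures in one interval.

    Assumes exponentially distributed time between failures,
    which is a simplification, not a measured property.
    """
    return math.exp(-interval_hours / mtbf_hours)


if __name__ == "__main__":
    interval = 4.0  # hours between checkpoints (low end of the 4-6 above)
    # Hypothetical system-wide MTBF values: one day, one hour, ten minutes.
    for mtbf_minutes in (24 * 60, 60, 10):
        mtbf = mtbf_minutes / 60.0
        print(
            f"MTBF {mtbf_minutes:5d} min: "
            f"{expected_failures(interval, mtbf):6.2f} failures/interval, "
            f"P(no failure) = {prob_interval_survives(interval, mtbf):.3g}"
        )
```

Under those assumptions, if every failure forced a rollback to the last 
checkpoint, an MTBF of minutes would mean essentially no interval ever 
completes, which is why something cheaper has to catch the isolated 
failures first.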

If I track down the details I'm currently forgetting about how 
smaller, isolated failures are handled today, I'll report back.

Nevertheless, great ideas, Jim!

Best,

ellis