[Beowulf] Checkpointing using flash

Justin YUAN SHI shi at temple.edu
Sat Sep 22 03:42:35 PDT 2012


Ellis:

If we go into the nitty-gritty details, you will see that transient
faults are the ultimate enemy of exascale computing. The problem stems
from a mismatch between what MPI assumes and what the OSI model
actually promises.

To be exact, OSI layers 1-4 can defend against packet loss and
corruption caused by transient hardware and network failures. Layers
5-7 provide no such protection. MPI sits on top of layer 7, and it
assumes that every transmission succeeds (this is why we have to use
checkpointing in the first place) -- a reliability guarantee that the
OSI model has never promised.

In other words, any transient fault while executing code in layers 5-7
(including the MPI calls themselves) can halt the entire application.
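
To make that concrete, here is a minimal sketch -- my own toy example,
not taken from any real application -- of what the default contract
looks like from the application side. Out of the box, MPI_COMM_WORLD
carries the MPI_ERRORS_ARE_FATAL error handler, so an error surfacing
from a failed transmission aborts every rank; switching to
MPI_ERRORS_RETURN only hands you an error code, it does not make the
transport reliable.

#include <mpi.h>
#include <stdio.h>

/* Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Default error handler on MPI_COMM_WORLD is MPI_ERRORS_ARE_FATAL:
     * if a transient fault kills this transfer, the whole job aborts.
     * No retry happens at the MPI layer. */
    int payload = 42;   /* arbitrary illustration value */
    if (rank == 0)
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    /* The most MPI lets you do is ask for error codes instead of an
     * abort; recovering the lost data is still the application's
     * problem, which is why checkpoint/restart exists. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    if (rank == 0) {
        int rc = MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS)
            fprintf(stderr, "send failed (rc=%d); recovery is up to us\n", rc);
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}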

Justin



On Fri, Sep 21, 2012 at 12:29 PM, Ellis H. Wilson III <ellis at cse.psu.edu> wrote:
> On 09/21/12 12:13, Lux, Jim (337C) wrote:
>> I would suggest that some scheme of redundant computation might be more
>> effective.  Trying to store a single node's state on the node, and
>> then, if any node hiccups, restoring the state (perhaps to a spare) and
>> restarting, means stopping the entire cluster while you recover.
>
> I am not 100% sure about the nitty-gritty here, but I do believe there are
> schemes already in place to deal with single node failures.  What I do
> know for sure is that checkpoints are used as a last line of defense
> against full cluster failure due to overheating, power failure, or
> excessive numbers of concurrent failures -- not for just one node going
> belly up.
>
> The LANL clusters I was learning about only checkpointed every 4-6 hours
> or so, if I remember correctly.  With hundred-petaflop clusters and
> beyond hitting failures not every few hours but every few minutes,
> checkpointing is obviously not the go-to first attempt at failure
> recovery.
>
> If I track down the nitty-gritty I'm currently forgetting about how
> smaller, isolated failures are handled these days, I'll report back.
>
> Nevertheless, great ideas Jim!
>
> Best,
>
> ellis
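
Picking up Ellis's point above about checkpoint intervals versus
failure rates: Young's old approximation, t_opt ~ sqrt(2 *
checkpoint_cost * MTBF), makes the trend easy to see. The numbers
below are made up for illustration (a 10-minute checkpoint write), not
measurements from LANL or anywhere else:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double checkpoint_cost_s = 600.0;        /* assume 10 min to write a checkpoint */
    const double mtbf_values_s[]   = { 86400.0 * 7,  /* one week     */
                                       86400.0,      /* one day      */
                                       3600.0,       /* one hour     */
                                       300.0 };      /* five minutes */

    /* Young's approximation: optimal interval between checkpoints. */
    for (int i = 0; i < 4; i++) {
        double t_opt = sqrt(2.0 * checkpoint_cost_s * mtbf_values_s[i]);
        printf("MTBF %8.0f s -> optimal interval %7.0f s (checkpoint write itself takes %.0f s)\n",
               mtbf_values_s[i], t_opt, checkpoint_cost_s);
    }
    return 0;
}

Once the system-wide MTBF falls to minutes, the "optimal" interval
collapses to roughly the time it takes to write the checkpoint itself,
which is exactly why checkpoint/restart cannot be the first line of
defense at that scale.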


