[Beowulf] Checkpointing using flash

Mon Sep 24 05:34:50 PDT 2012

Regardless of how low the MPI stack goes, it has never "punched" through the packet retransmission layer. Therefore, the OSI model still serves as a template to illustrate the point under discussion.

Justin

On Sep 22, 2012, at 10:34 AM, "Lux, Jim (337C)" <james.p.lux at jpl.nasa.gov> wrote:

> I see MPI as sitting much lower (network or transport, perhaps)
> 
> Maybe for this (as in many other cases) the OSI model is not an
> appropriate one.
> That is, most practical systems have more blending between layers, and
> outright punching through. There are a variety of high level
> protocols/algorithms that can make effective use of current (and
> predicted) low level behavior to optimize the high level behavior.
> 
> In any case, OSI is more of a conceptualization tool when looking at a
> system.
> 
> 
> But I agree that transient faults (whether a failure that succeeds on retry
> (temporal redundancy) or a failure that prompts use of a redundant unit
> (spatial redundancy)) are what you need to deal with.
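
A minimal Python sketch of the two kinds of redundancy named above. Everything here is hypothetical and for illustration only: `TransientFault`, `make_flaky`, and the two wrapper functions are not from any library or from the thread. Temporal redundancy retries the same unit; spatial redundancy moves on to a different unit.

```python
class TransientFault(Exception):
    """Hypothetical error type for a fault that may clear on retry."""


def with_temporal_redundancy(op, attempts=3):
    """Redundancy in time: call the same operation again after a fault."""
    last_error = TransientFault("no attempts made")
    for _ in range(attempts):
        try:
            return op()
        except TransientFault as err:
            last_error = err
    raise last_error


def with_spatial_redundancy(units):
    """Redundancy in space: fail over to the next redundant unit."""
    last_error = TransientFault("no units available")
    for unit in units:
        try:
            return unit()
        except TransientFault as err:
            last_error = err
    raise last_error


def make_flaky(fail_times, result):
    """Build a test operation that faults `fail_times` times, then succeeds."""
    state = {"remaining": fail_times}

    def op():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientFault("transient fault")
        return result

    return op


if __name__ == "__main__":
    # The same unit succeeds on its third call.
    print(with_temporal_redundancy(make_flaky(2, "ok"), attempts=3))
    # The first unit keeps faulting, so the spare answers instead.
    print(with_spatial_redundancy([make_flaky(5, "primary"), make_flaky(0, "spare")]))
```

Either way the caller pays for the redundancy: extra time in the first case, an extra unit in the second.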
> 
> There is a very large literature on this, for all sorts of scenarios. The
> Byzantine Generals problem is a classic example.  Synchronization of time
> in a system is another.
> 
> The challenge is in devising generalized software approaches that can
> effectively make use of redundancy (in whatever form).  By effective, I
> mean that the computation runs in the same amount of time (or consumes the
> same amount of some other resource) regardless of the occurrence of some
> number of failures. From an information theory standpoint, I think that
> means you MUST have redundancy, and the trick is efficient use of that
> redundancy.
> 
> For communications channels, we are at the point where coding can get you
> within hundredths of a dB of the Shannon limit.
> 
> For algorithms, not so much.  We've got good lossless compression
> algorithms, which is a start.  You remove redundancy from the input data
> stream, reducing the user data rate, and you can make use of the "extra"
> bandwidth to do effective coding to mitigate errors in the channel.
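
A rough Python sketch of the compress-then-code idea described above, under loud assumptions: the error-correcting code here is a plain triple-repetition code with a bitwise majority vote, chosen only because it is short. It is nowhere near the Shannon-limit codes mentioned earlier, and the function names `encode` and `decode` are hypothetical.

```python
import zlib


def encode(data: bytes) -> bytes:
    """Remove redundancy with zlib, then add controlled redundancy back."""
    compressed = zlib.compress(data)
    # Assumption: three back-to-back copies stand in for a real ECC.
    return compressed * 3


def decode(coded: bytes) -> bytes:
    """Majority-vote the three copies bit by bit, then decompress."""
    n = len(coded) // 3
    a, b, c = coded[:n], coded[n:2 * n], coded[2 * n:3 * n]
    # A bit is set in the result if it is set in at least two copies.
    voted = bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return zlib.decompress(voted)


if __name__ == "__main__":
    message = b"checkpoint " * 50
    channel = bytearray(encode(message))
    channel[5] ^= 0xFF  # corrupt one byte of the first copy
    assert decode(bytes(channel)) == message
    print(len(message), "bytes of user data sent as", len(channel), "coded bytes")
```

Note that `decode` cannot start until all three copies have arrived, which is a small-scale version of the latency cost that block coding brings.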
> 
> However, while this compress/error correcting code/decompress is more
> reliable/efficient, it does have longer latency (Shannon just sets the
> limit, and assumes you have infinite memory on both ends of the link).
> 
> So in a computational scenario, that latency might be a real problem.
> 
> 
> 
> 
> On 9/22/12 3:42 AM, "Justin YUAN SHI" <shi at temple.edu> wrote:
> 
>> Ellis:
>> 
>> If we look at the nitty-gritty details, you will see that
>> transient faults are the ultimate enemies of exascale computing. The
>> problem stems from a mismatch between the MPI assumptions and what
>> the OSI model promises.
>> 
>> To be exact, OSI layers 1-4 can defend against packet data loss and
>> corruption caused by transient hardware and network failures. Layers
>> 5-7 provide no protection. MPI sits on top of layer 7, and it assumes
>> that every transmission must be successful (this is why we have to use
>> checkpoints in the first place) -- a reliability assumption that the
>> OSI model has never promised.
>> 
>> In other words, any transient fault while processing the codes in
>> layers 5-7 (and MPI calls) can halt the entire app.
>> 
>> Justin
>> 
>> 
>> 
>> On Fri, Sep 21, 2012 at 12:29 PM, Ellis H. Wilson III <ellis at cse.psu.edu>
>> wrote:
>>> On 09/21/12 12:13, Lux, Jim (337C) wrote:
>>>> I would suggest that some scheme of redundant computation might be more
>>>> effective. Trying to store a single node's state on the node,
>>>> and then, if any node hiccups, restoring the state (perhaps to a spare)
>>>> and restarting, means stopping the entire cluster while you recover.
>>> 
>>> I am not 100% sure about the nitty-gritty here, but I do believe there are
>>> schemes already in place to deal with single node failures.  What I do
>>> know for sure is that checkpoints are used as a last line of defense
>>> against full cluster failure due to overheating, power failure, or
>>> excessive numbers of concurrent failures -- not for just one node going
>>> belly up.
>>> 
>>> The LANL clusters I was learning about only checkpointed every 4-6 hours
>>> or so, if I remember correctly.  With hundred-petascale clusters and
>>> beyond seeing failures at intervals of not even hours but
>>> minutes, obviously checkpointing is not the go-to first attempt at
>>> failure recovery.
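
A back-of-the-envelope Python sketch of why checkpointing stops scaling as failures move from hours to minutes. It uses Young's first-order approximation for the checkpoint interval, sqrt(2 * checkpoint cost * MTBF), which is not mentioned in the thread and is only reasonable when the checkpoint cost is much smaller than the MTBF. The 10-minute checkpoint cost and both MTBF values are made-up illustrative numbers, not figures from LANL or any real cluster.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead(checkpoint_cost_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints (failures ignored)."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


if __name__ == "__main__":
    cost = 600.0  # assumption: one checkpoint takes 10 minutes to write
    for label, mtbf in [("MTBF of one day", 86400.0), ("MTBF of ten minutes", 600.0)]:
        interval = young_interval(cost, mtbf)
        overhead = checkpoint_overhead(cost, interval)
        print(f"{label}: checkpoint every {interval / 60:.0f} min, "
              f"{overhead:.0%} of time spent checkpointing")
```

With the short MTBF the suggested interval shrinks to roughly the cost of a single checkpoint, so the approximation itself breaks down and most of the machine's time would go to saving state rather than computing.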
>>> 
>>> If I find some of the nitty-gritty I'm currently forgetting about how
>>> smaller, isolated failures are handled now, I'll report back.
>>> 
>>> Nevertheless, great ideas Jim!
>>> 
>>> Best,
>>> 
>>> ellis
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>