I see MPI as sitting much lower (network or transport, perhaps).

Maybe for this (as in many other cases) the OSI model is not an
appropriate one.
That is, most practical systems have more blending between layers, and
outright punching through. There are a variety of high level
protocols/algorithms that can make effective use of current (and
predicted) low level behavior to optimize the high level behavior.

In any case, OSI is more of a conceptualization tool when looking at a

But I agree that transient faults (whether a failure that succeeds on retry
(temporal redundancy), or a failure that prompts use of a redundant unit
(spatial redundancy)) are what you need to deal with.
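
To make the two kinds of redundancy concrete, here is a minimal Python
sketch. Everything in it (TransientFault, flaky, with_retry, with_failover)
is a made-up name for illustration, not part of MPI or any real library,
and it assumes a fault shows up as an exception rather than as silent
corruption.

```python
class TransientFault(Exception):
    """Hypothetical exception standing in for a transient failure."""


def flaky(fail_times, value):
    """Build an operation that fails `fail_times` times, then returns `value`."""
    state = {"left": fail_times}

    def op():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientFault("transient")
        return value

    return op


def with_retry(op, attempts=3):
    """Temporal redundancy: repeat the same operation on the same unit."""
    last = TransientFault("no attempts made")
    for _ in range(attempts):
        try:
            return op()
        except TransientFault as exc:
            last = exc  # remember the failure and try again
    raise last


def with_failover(units):
    """Spatial redundancy: on failure, move on to the next redundant unit."""
    last = TransientFault("no units available")
    for unit in units:
        try:
            return unit()
        except TransientFault as exc:
            last = exc  # this unit failed, fall through to the spare
    raise last
```

Note that both helpers cost extra time only when a fault actually occurs,
which is exactly why neither gives you a fixed run time.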

There is a huge literature on this, for all sorts of scenarios. The
Byzantine Generals problem is a classic example.  Synchronization of time
in a system is another.

The challenge is in devising generalized software approaches that can
effectively make use of redundancy (in whatever form).  By effective, I
mean that the computation runs in the same amount of time (or consumes the
same amount of some other resource) regardless of the occurrence of some
number of failures. From an information theory standpoint, I think that
means you MUST have redundancy, and the trick is efficient use of that
redundancy.
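
One simple (and not very efficient) way to get a run time that does not
depend on whether a failure occurred is to run replicas and vote on the
results. The sketch below is only an illustration under that assumption;
vote and run_redundant are hypothetical names, the replicas are plain
Python callables run one after another, and a real system would run them
in parallel on separate hardware.

```python
from collections import Counter


def vote(results):
    """Return the strict-majority value among replica results."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_redundant(replicas, x):
    """Run every replica on the same input, then vote on the outputs.

    The same work is done whether or not a replica misbehaves, so the
    cost is fixed up front instead of being paid on recovery.
    """
    return vote([replica(x) for replica in replicas])
```

With three replicas this masks one wrong result at three times the
resource cost, which is the inefficiency a better scheme would need to
reduce.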

For communications channels, we are at the point where coding can get you
within hundredths of a dB of the Shannon limit.

For algorithms, not so much.  We've got good lossless compression
algorithms, which is a start.  You remove redundancy from the input data
stream, reducing the user data rate, and you can make use of the "extra"
bandwidth to do effective coding to mitigate errors in the channel.

However, while this compress/error correcting code/decompress chain is
more reliable/efficient, it does have longer latency (Shannon just sets
the limit, and assumes you have infinite memory on both ends of the link).
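
As a toy illustration of that chain, here is a Python sketch that
compresses with zlib and then protects the result with a 3x repetition
code decoded by bitwise majority. The repetition code is my stand-in for
illustration only; it is nowhere near the Shannon limit, and encode and
decode are hypothetical names.

```python
import zlib


def encode(payload: bytes) -> bytes:
    """Compress, then send three copies of the compressed block."""
    compressed = zlib.compress(payload)
    return compressed * 3


def decode(coded: bytes) -> bytes:
    """Majority-vote the three copies bit by bit, then decompress."""
    n = len(coded) // 3
    a, b, c = coded[:n], coded[n:2 * n], coded[2 * n:3 * n]
    # A bit is 1 in the output if it is 1 in at least two of the copies.
    voted = bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return zlib.decompress(voted)
```

The latency cost is visible in decode: nothing can be decompressed until
all three copies of the whole block have arrived.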

So in a computational scenario, that latency might be a real problem.

>If we go to a little nitty-gritty detail view,  you will see that
>transient faults are the ultimate enemies of exascale computing. The
>problem, if we really go to the nitty-gritty details, stems from a
>mismatch between the MPI assumptions and what the OSI model promises.
>To be exact, the OSI layers 1-4 can defend packet data losses and
>corruptions against transient hardware and network failures. Layers
>5-7 provide no protection. MPI sits on top of layer 7. And it assumes
>that every transmission must be successful (this is why we have to use
>checkpoint in the first place) -- a reliability assumption that the
>OSI model has never promised.
>In other words, any transient fault while processing the codes in
>layers 5-7 (and MPI calls) can halt the entire app.
>>> I would suggest that some scheme of redundant computation might be more
>>> effective. Storing a single node's state on the node and then, if any
>>> node hiccups, restoring the state (perhaps to a spare) and restarting
>>> means stopping the entire cluster while you recover.
>> I am not 100% sure about the nitty-gritty here, but I do believe there are
>> schemes already in place to deal with single node failures.  What I do
>> know for sure is that checkpoints are used as a last line of defense
>> against full cluster failure due to overheating, power failure, or
>> excessive numbers of concurrent failures -- not for just one node going
>> belly up.
>> The LANL clusters I was learning about only checkpointed every 4-6 hours
>> or so, if I remember correctly.  With hundred-petascale clusters and
>> beyond hitting failure rates on the frequency of not even hours but
>> minutes, obviously checkpointing is not the go-to first attempt at
>> failure recovery.
>> If I find some of the nitty-gritty I'm currently forgetting about how
>> smaller, isolated failures are handled now I'll report back.
>> Nevertheless, great ideas Jim!
>> Best,
>> ellis
