[Beowulf] Checkpointing using flash
andrew.holway at gmail.com
Sat Sep 22 04:02:14 PDT 2012
> To be exact, OSI layers 1-4 can protect packet data against loss and
> corruption caused by transient hardware and network failures. Layers
> 5-7 provide no protection. MPI sits on top of layer 7, and it assumes
> that every transmission must be successful (this is why we have to use
> checkpointing in the first place) -- a reliability assumption that the
> OSI model has never promised.
I've been playing around with GFS and Gluster a bit recently, and this
has got me thinking... Given a fast enough, low enough latency network,
might it be possible to have a Gluster-like or GFS-like memory space?
The GFS-like approach would involve an external box of memory with
connections to, let's say for argument, 5 + 1 nodes. In the event of a
failure, the "hot spare" could take over processing for the failed node.
Perhaps we can stop thinking about nodes and instead think about
clusters of processors and memory within clusters?
"Glusterlike" would work quite like Gluster but for memory. I can kind
of see it in my head but am having problems describing it :)
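To make the idea a little more concrete, here is a toy Python sketch of what a Gluster-like replicated memory might look like. This is purely illustrative: every name in it is invented, and a real distributed shared memory would need coherence protocols, write ordering, and an actual network transport. Pages are written synchronously to N replica "nodes", and a read falls back to a surviving replica when a node dies.

```python
# Toy sketch of replicated "memory pages" across nodes, in the spirit
# of Gluster's replicated volumes. Illustrative only.

class Node:
    def __init__(self, name):
        self.name = name
        self.pages = {}   # page_id -> bytes
        self.alive = True

class ReplicatedMemory:
    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def _replica_set(self, page_id):
        # Deterministic placement: consecutive nodes starting at a hash.
        start = hash(page_id) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def write(self, page_id, data):
        # Synchronous write to every live replica.
        targets = [n for n in self._replica_set(page_id) if n.alive]
        if not targets:
            raise IOError("no live replica for page %r" % page_id)
        for n in targets:
            n.pages[page_id] = data

    def read(self, page_id):
        # Read from the first surviving replica that holds the page.
        for n in self._replica_set(page_id):
            if n.alive and page_id in n.pages:
                return n.pages[page_id]
        raise IOError("page %r lost" % page_id)

nodes = [Node("node%d" % i) for i in range(6)]   # the 5 + 1 layout
mem = ReplicatedMemory(nodes, replicas=2)
mem.write("p0", b"state of rank 3")
mem._replica_set("p0")[0].alive = False          # kill one replica
print(mem.read("p0"))                            # still readable
```

The interesting design point is the same one Gluster faces: with replicas=2 the system survives any single node failure, but write latency is bounded by the slowest replica -- which is why the "fast enough, low enough latency network" caveat matters so much.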
As far as I can understand (which is very limited), this whole Exascale
thing is not going to happen with traditional Beowulf -- just a bunch of
nodes and so on. Machines are going to have to become far more tightly
coupled, with clever hardware tricks to protect against failed memory.
I am almost sure this is why Intel recently bought chunks of QLogic and
Cray tech. Obviously you're quite limited in what you can reasonably do
with copper over distance, but perhaps optical interconnects could
provide some kind of answer...
> In other words, any transient fault while processing the codes in
> layers 5-7 (and MPI calls) can halt the entire app.
> On Fri, Sep 21, 2012 at 12:29 PM, Ellis H. Wilson III <ellis at cse.psu.edu> wrote:
>> On 09/21/12 12:13, Lux, Jim (337C) wrote:
>>> I would suggest that some scheme of redundant computation might be more
>>> effective. Trying to store a single node's state on the node, and then,
>>> if any node hiccups, restoring that state (perhaps to a spare) and
>>> restarting, means stopping the entire cluster while you recover.
>> I am not 100% sure about the nitty-gritty here, but I do believe there
>> are schemes already in place to deal with single node failures. What I
>> do know for sure is that checkpoints are used as a last line of defense
>> know for sure is that checkpoints are used as a last line of defense
>> against full cluster failure due to overheating, power failure, or
>> excessive numbers of concurrent failures -- not for just one node going
>> belly up.
>> The LANL clusters I was learning about only checkpointed every 4-6 hours
>> or so, if I remember correctly. With hundred-petascale clusters and
>> beyond hitting failures not every few hours but every few minutes,
>> checkpointing is obviously not the go-to first attempt at failure
>> recovery.
>> If I find some of the nitty-gritty I'm currently forgetting about how
>> smaller, isolated failures are handled now I'll report back.
>> Nevertheless, great ideas Jim!
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
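The 4-6 hour checkpoint intervals mentioned above are in the right ballpark for Young's classic first-order approximation of the optimal checkpoint interval, tau = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the system MTBF. A quick back-of-the-envelope sketch (the C and M values below are made-up examples, not LANL figures):

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal checkpoint interval:
    tau = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: a 10-minute checkpoint write on a system
# with a one-day mean time between failures.
tau = young_interval(600.0, 86400.0)
print("checkpoint every %.1f hours" % (tau / 3600.0))   # ~2.8 hours

# As MTBF drops toward minutes, the optimal interval shrinks toward
# the checkpoint cost itself -- the machine would spend much of its
# time checkpointing, which is the exascale worry in this thread.
tau_short = young_interval(600.0, 1800.0)   # 30-minute MTBF
print("short-MTBF case: checkpoint every %.1f minutes" % (tau_short / 60.0))
```

With a one-day MTBF the formula lands in the same few-hours regime as the intervals quoted above; shrink the MTBF to half an hour and the optimal interval drops to roughly 25 minutes against a 10-minute checkpoint cost, which is why pure checkpoint/restart stops scaling.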