[Beowulf] Checkpointing using flash

Sun Sep 23 09:51:30 PDT 2012

On 9/23/12 6:57 AM, "Andrew Holway" <andrew.holway at gmail.com> wrote:

>2012/9/21 David N. Lombard <dnlombar at ichips.intel.com>:
>> Our primary approach today is recovery-based resilience, a.k.a.
>> checkpoint-restart (C/R). I'm not convinced we can continue to rely on
>> that at exascale.
>
>- Snapshotting seems to be an ugly and inelegant way of solving the
>problem. For me it is especially laughable considering the general
>crappiness of academic codes. It pushes too much onus on the
>users who, let's face it, are great at science but generally suck at
>the art of coding :). Having said that, maybe there will be some kind of
>super elegant snapshotting library that makes it all work really well.
>But I doubt it will be universally sexy and, to my ear, it sounds like it
>would bind us to a particular coding paradigm. I might have completely
>got the wrong end of the stick, however.

Snapshot/Checkpoint *is* a brute force way, particularly for dealing with
hardware failures.  We used to do it to deal with power interruptions on
exhaustive search algorithms that took days. But it might be the only way
to do an "algorithm-blind" approach.
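
As a concrete illustration of the exhaustive-search case, here is a minimal
application-level sketch in Python. It is not the "algorithm-blind" variant
(which would snapshot the whole process without knowing what it computes);
it only shows the basic save-and-resume idea. The names `exhaustive_search`,
`save_checkpoint`, `ckpt_path` and `every` are hypothetical, chosen for this
example.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_index, found):
    """Write to a temp file, then rename, so a crash mid-write
    leaves the previous checkpoint intact."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index, "found": found}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the old checkpoint


def exhaustive_search(candidates, is_solution, ckpt_path, every=1000):
    """Scan candidates in order, saving progress every `every` items
    so that a restarted run resumes from the last checkpoint."""
    start, found = 0, []
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
        start, found = state["next_index"], state["found"]

    for i in range(start, len(candidates)):
        if is_solution(candidates[i]):
            found.append(candidates[i])
        if (i + 1) % every == 0:
            save_checkpoint(ckpt_path, i + 1, found)

    save_checkpoint(ckpt_path, len(candidates), found)
    return found
```

After an interruption, calling `exhaustive_search` again with the same
`ckpt_path` redoes at most `every - 1` candidates. The cost is that the
program has to know what its own state is, which is exactly the onus on
the user that the quoted text complains about.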

>
>2012/9/22 Lux, Jim (337C) <james.p.lux at jpl.nasa.gov>:
>> But isn't that basically the old multiport memory or crossbar switch
>> kind of thing? (Giant memory shared by multiple processors).
>>
>> Aside from things like cache coherency, it has scalability problems
>> (for physical distance reasons: propagation time, if nothing else)
>
>- Agreed. Doing distributed memory where processor 1 tries to access
>the memory of processor 1000, which might be several tens of meters
>away, would (I think) be a non-starter because of the propagation and
>signaling rate versus distance problem. The beardy gods gave us MPI
>for this :)

The problem (such as it is) is that devising computational algorithms that
are aware of (or better, make use of) propagation delays is *hard*.  Think
about the old days (before my time) when people used to optimize for
placement on the drum.  That was an easy problem.
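
For a rough sense of scale on the "several tens of meters" in the quoted
text, here is a back-of-the-envelope sketch in Python. The velocity factor
of 0.7 and the 2.5 GHz clock are assumptions picked for illustration, not
figures from this thread, and the function names are hypothetical. It
counts only propagation time, ignoring switching and protocol overhead.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def one_way_delay_ns(distance_m, velocity_factor=0.7):
    """Propagation time over a link, in nanoseconds.
    velocity_factor is the assumed signal speed as a fraction of c."""
    return distance_m / (C * velocity_factor) * 1e9


def round_trip_cycles(distance_m, clock_hz=2.5e9, velocity_factor=0.7):
    """Clock cycles that elapse while a request goes out and a reply returns."""
    round_trip_s = 2 * one_way_delay_ns(distance_m, velocity_factor) * 1e-9
    return round_trip_s * clock_hz


print(one_way_delay_ns(30.0))   # roughly 143 ns one way over 30 m
print(round_trip_cycles(30.0))  # roughly 700 cycles for the round trip
```

Under those assumptions a single remote access across 30 m costs several
hundred cycles before any real work happens, which is why an algorithm
that ignores placement pays so heavily.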

Dealing with errors: the Feynman story of running simulations on punch
card equipment with different colored cards is an ad hoc, specialized
solution.

Maybe a similar one is optimizing for vector machines and pipelined
processing or systolic arrays. Systolic array approaches definitely can
deal with the speed-of-light problem: latency through the system is longer
than 1/(computation rate), yet throughput is unaffected once the pipeline
is full; but it's hard to find a generalized approach.
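
The latency-versus-rate point can be seen in a toy simulation. This is a
minimal sketch of a linear pipeline, the simplest systolic arrangement, in
Python; `run_pipeline` and the example stages are hypothetical, and inputs
are assumed never to be `None` (which marks an empty cell here). Each cell
talks only to its immediate neighbor, so wire length per hop stays short
no matter how long the array is.

```python
def run_pipeline(inputs, stages):
    """Simulate a linear systolic pipeline.

    Each stage does one step per tick and hands its result to the next
    stage. Returns a list of (tick, result) pairs in output order.
    """
    regs = [None] * len(stages)  # value held at the output of each stage
    feed = list(inputs)
    out = []
    tick = 0
    while len(out) < len(inputs):
        tick += 1
        # Update from the last stage backwards so that each value
        # advances exactly one cell per tick.
        for s in range(len(stages) - 1, 0, -1):
            prev = regs[s - 1]
            regs[s] = stages[s](prev) if prev is not None else None
        regs[0] = stages[0](feed.pop(0)) if feed else None
        if regs[-1] is not None:
            out.append((tick, regs[-1]))
    return out


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
for tick, result in run_pipeline([0, 1, 2], stages):
    print(tick, result)
```

With four stages the first result appears at tick 4, and after that one
result comes out every tick: the latency is four ticks but the rate is one
result per tick. The hard part, as noted above, is recasting an arbitrary
algorithm into this neighbor-to-neighbor form.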

>