[Beowulf] Checkpointing using flash

Andrew Holway andrew.holway at gmail.com
Sun Sep 23 06:57:24 PDT 2012


2012/9/21 David N. Lombard <dnlombar at ichips.intel.com>:
> Our primary approach today is recovery-base resilience, a.k.a.,
> checkpoint-restart (C/R). I'm not convinced we can continue to rely on that
> at exascale.

- Snapshotting seems an ugly and inelegant way of solving the
problem. For me it is especially laughable considering the general
crappiness of academic codes. It pushes too much onus onto the
users who, let's face it, are great at science but generally suck at
the art of coding :). That said, maybe there will be some kind of
super elegant snapshotting library that makes it all work really well.
But I doubt it will be universally sexy and, to my ear, it sounds like
it would bind us to a particular coding paradigm. I might have
completely got the wrong end of the stick, however.

2012/9/22 Lux, Jim (337C) <james.p.lux at jpl.nasa.gov>:
> But isn't that basically the old multiport memory or crossbar switch kind
> of thing? (Giant memory shared by multiple processors).
>
> Aside from things like cache coherency, it has scalability problems (from
> physical distance reasons: propagation time, if nothing else)

- Agreed. Doing distributed memory where processor 1 tries to access
the memory of processor 1000, which might be several tens of meters
away, would (I think) be a non-starter because of propagation delay
and the signaling rate versus distance problem. The beardy gods gave
us MPI for this :)
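
To put rough numbers on the distance problem, here is a
back-of-the-envelope sketch. The figures are assumptions for
illustration, not measurements: signals travelling at about two thirds
of the speed of light in cable or fibre, 30 m standing in for "several
tens of meters", and a 3 GHz core clock.

```python
# Back-of-the-envelope propagation delay; all figures are illustrative assumptions.
C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum
VELOCITY_FACTOR = 2 / 3          # assumed: typical for copper cable or fibre
DISTANCE_M = 30.0                # assumed: "several tens of meters"
CLOCK_HZ = 3e9                   # assumed: 3 GHz core clock


def round_trip_ns(distance_m, velocity_factor=VELOCITY_FACTOR):
    """Round-trip wire time in nanoseconds, ignoring switches and serialization."""
    speed = C_VACUUM_M_PER_S * velocity_factor
    return 2 * distance_m / speed * 1e9


rtt_ns = round_trip_ns(DISTANCE_M)
cycles = rtt_ns * 1e-9 * CLOCK_HZ
print(f"round trip: {rtt_ns:.0f} ns, about {cycles:.0f} cycles at 3 GHz")
```

Under those assumptions the wire time alone is roughly 300 ns per
round trip, i.e. several hundred core cycles before any switching or
protocol overhead is counted.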

I started a new thread on RAIM. It does look a bit crossbar-like, I'll grant you :)


