[Beowulf] Checkpointing using flash
deadline at eadline.org
Fri Sep 21 14:10:19 PDT 2012
> I would suggest that some scheme of redundant computation might be more
> effective.. Rather than try to store a single node's state on the node,
> and then, if any node hiccups, restore the state (perhaps to a spare), and
> restart, means stopping the entire cluster while you recover.
> Or, if you can factor your computation to make use of extra processing
> nodes, you can just keep on moving. Think of this as a higher level
> scheme than, say, Hamming codes for memory protection: use 11 bits to
> store 8, and you're still synchronous.
One similar avenue I have thought about is what I call dynamic redundancy.
It requires a top-level, divide-and-conquer-like approach where
independent "parts" can fail without causing the others to fail, because
the assumption is something will fail.
Depending on the resource load you can dial up how much redundancy you
want so that a range of the "parts" will be running redundantly; when one
or some of them fail, the others take over. At one end of the dial everything is
redundant and execution is slower. At the other end nothing is redundant and
execution is fastest. In between you would be betting that running every N
parts redundantly will increase your odds of hitting a failure on a
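To put rough numbers on that dial, here is a minimal sketch. It assumes each run of a part fails independently with the same probability p_fail, which is a simplification and not something claimed in this thread. The function names (p_part_lost, p_job_stalls, slots_needed) are made up for illustration.

```python
def p_part_lost(p_fail, copies):
    """Probability that every copy of one part fails (independence assumed)."""
    return p_fail ** copies


def p_job_stalls(p_fail, parts, redundant_parts, copies=2):
    """Probability that at least one part loses all of its copies.

    redundant_parts is the dial: 0 means nothing is redundant,
    parts means everything runs with `copies` copies.
    """
    if not 0 <= redundant_parts <= parts:
        raise ValueError("redundant_parts must be between 0 and parts")
    single = parts - redundant_parts
    p_all_ok = ((1 - p_part_lost(p_fail, 1)) ** single
                * (1 - p_part_lost(p_fail, copies)) ** redundant_parts)
    return 1 - p_all_ok


def slots_needed(parts, redundant_parts, copies=2):
    """Processing slots consumed at a given dial setting."""
    return (parts - redundant_parts) + redundant_parts * copies


if __name__ == "__main__":
    # Sweep the dial for 10 parts with a 10% per-run failure chance.
    for k in (0, 5, 10):
        print(k, slots_needed(10, k), round(p_job_stalls(0.1, 10, k), 3))
```

Turning the dial up buys a lower chance of stalling at the price of more slots, which is the bet described above.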
If you choose no redundancy, the program could end up waiting at
communication points for the failed part to respawn, complete
and then continue at the exchange point. Worst case would be failure just
before a "part" completes. If a "part" failed halfway through its run,
it would only be halfway behind the others and if everyone else is
waiting, respawn the failed "part(s)" with redundancy to ensure they get
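The arithmetic for the no-redundancy case can be sketched as follows. This assumes a failed part restarts from scratch and that all parts take the same time to run; both are assumptions made for the example, and the names finish_time and wait_at_exchange are hypothetical.

```python
def finish_time(run_time, fail_fraction, respawn_overhead=0.0):
    """Wall time until a part that fails once finally completes.

    fail_fraction is how far through its run the part got before failing
    (0.0 = at the start, 1.0 = just before completion). The work done
    before the failure is lost and the part reruns from the beginning.
    """
    if not 0.0 <= fail_fraction <= 1.0:
        raise ValueError("fail_fraction must be between 0 and 1")
    return fail_fraction * run_time + respawn_overhead + run_time


def wait_at_exchange(run_time, fail_fraction, respawn_overhead=0.0):
    """How long the healthy parts sit idle at the exchange point."""
    return finish_time(run_time, fail_fraction, respawn_overhead) - run_time


if __name__ == "__main__":
    print(wait_at_exchange(100.0, 0.5))   # failure halfway through the run
    print(wait_at_exchange(100.0, 1.0))   # failure just before completion
```

A failure halfway through costs everyone else half a run of waiting, and a failure just before completion costs close to a full run, which is the worst case.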
You could also have schemes where the grain size of the parallelism could
be used to adjust the redundancy, e.g. if there are idle resources then why
not use them for redundancy, just in case? There are lots of interesting ways to keep
things moving if you run in a dynamic fashion.
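One way this could look, as a toy sketch only: idle workers are handed out as extra copies of parts, and the first copy of a part to succeed supplies the result. Threads stand in for cluster nodes, failures are injected through a hypothetical `failing` set of (part, copy) pairs, and allocate_replicas, run_part and run_with_redundancy are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def allocate_replicas(parts, workers):
    """Spread workers over parts; leftover (idle) workers become extra copies."""
    if parts < 1 or workers < parts:
        raise ValueError("need at least one worker per part")
    base, extra = divmod(workers, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def run_part(part_id, copy_id, failing):
    """Stand-in for real work; copies listed in `failing` simulate a node failure."""
    if (part_id, copy_id) in failing:
        raise RuntimeError(f"part {part_id} copy {copy_id} failed")
    return part_id * part_id


def run_with_redundancy(parts, workers, failing=frozenset()):
    """Run every part with its allocated copies; keep the first success per part.

    Parts missing from the returned dict lost all their copies and
    would have to be respawned.
    """
    copies = allocate_replicas(parts, workers)
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(run_part, p, c, failing): p
            for p in range(parts)
            for c in range(copies[p])
        }
        for fut in as_completed(futures):
            part = futures[fut]
            if fut.exception() is None:
                results.setdefault(part, fut.result())
    return results


if __name__ == "__main__":
    # 6 workers for 4 parts: parts 0 and 1 get a second copy.
    print(allocate_replicas(4, 6))
    print(run_with_redundancy(4, 6, failing={(0, 0)}))
```

A coarser grain means fewer parts for the same number of workers, so allocate_replicas hands out more copies per part; a finer grain leaves less slack for redundancy.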
Furthermore, I think an Erlang-like runtime system will be needed so that
you can change code while the program is running. In general, I find this
to be an interesting exercise - design parallel codes that have a range
of messaging times, from almost instant to never.
--snipped the rest--