[Beowulf] Checkpointing using flash

Fri Sep 21 09:21:16 PDT 2012

Or, if you can factor your computation to make use of extra processing
nodes, you can just keep on moving.  Think of this as a higher level
scheme than, say, Hamming codes for memory protection:  use 12 bits to
store 8, and you're still synchronous.
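
To make the Hamming analogy concrete, here is a minimal sketch in Python of a single-error-correcting code over one byte. For 8 data bits a Hamming code needs 4 check bits, so this sketch uses a 12-bit codeword. The function names and bit layout are my own choices for illustration, not anything from a particular memory controller.

```python
# Positions 1..12; powers of two hold check bits, the rest hold data bits.
DATA_POSITIONS = [p for p in range(1, 13) if p & (p - 1)]

def hamming_encode(byte):
    """Encode 8 data bits into a list of 12 bits (codeword positions 1..12)."""
    code = [0] * 13  # index 0 is unused so list indices match positions
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (byte >> i) & 1
    for check in (1, 2, 4, 8):
        # Each check bit covers every position whose index has that bit set.
        for pos in range(1, 13):
            if pos != check and pos & check:
                code[check] ^= code[pos]
    return code[1:]

def hamming_decode(bits):
    """Return (byte, syndrome); a nonzero syndrome is the position that was flipped."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos
    if 1 <= syndrome <= 12:
        code[syndrome] ^= 1  # repair the single flipped bit in place
    byte = 0
    for i, pos in enumerate(DATA_POSITIONS):
        byte |= code[pos] << i
    return byte, syndrome
```

The point of the analogy is that the decoder repairs the word and carries on without stopping the machine. This sketch only handles a single flipped bit per word; detecting double errors would need one more overall parity bit.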

Jim, you are smarter than me!
I was going to air the idea of pairs of nodes in lock-step, with either node being able to STONITH the other if
either there is a machine check event, or the other node does not keep up with reporting results.
Then signal to the cluster management that "There's been a failure here - but let's keep trucking to the end of the run,
when you can come along and replace my buddy and me."

The obvious drawback being you get half an exaflop for your money!
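
A toy simulation of the buddy-pair idea, in Python, might look like the sketch below. Everything in it is hypothetical: the Node class, the fail_at knob standing in for a machine check or a missed deadline, and the event strings standing in for the message to cluster management. Real fencing would go through a power switch or BMC, not a flag.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    fail_at: Optional[int] = None  # hypothetical: step at which this node faults
    alive: bool = True

    def step(self, i, x):
        """Return the result for step i, or None if the node faulted or stalled."""
        if not self.alive or (self.fail_at is not None and i >= self.fail_at):
            return None
        return x * x + i  # stand-in for one step of the real computation

def run_pair(a, b, steps):
    """Run both nodes in lock-step; a survivor fences its silent buddy and keeps going."""
    events = []
    x = 1
    for i in range(steps):
        ra, rb = a.step(i, x), b.step(i, x)
        for node, result, buddy in ((a, ra, b), (b, rb, a)):
            if node.alive and result is None:
                node.alive = False  # the buddy "shoots" the node that went quiet
                events.append(f"step {i}: {buddy.name} fenced {node.name}; pair degraded, continuing")
        survivors = [r for r in (ra, rb) if r is not None]
        if not survivors:
            raise RuntimeError("both nodes of the pair failed")
        x = survivors[0] % 1000003
    return x, events
```

With Node("a") and Node("b", fail_at=4), the run finishes with the same answer as a healthy pair and reports one fencing event. Note that the sketch never compares the two results against each other: with only two nodes, a silent disagreement cannot tell you which one is wrong, which is why the trigger is a machine check or a missed report.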
