[Beowulf] Checkpointing using flash
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Fri Sep 21 09:34:49 PDT 2012
On 9/21/12 9:21 AM, "Hearns, John" <john.hearns at mclaren.com> wrote:
>
>Or, if you can factor your computation to make use of extra processing
>nodes, you can just keep on moving. Think of this as a higher level
>scheme than, say, Hamming codes for memory protection: use 11 bits to
>store 8, and you're still synchronous.
>
>Jim, you are smarter than me!
>I was going to air the idea of pairs of nodes in lock-step, with either
>node being able to STONITH the other if either there is a machine check
>event, or the other node does not keep up with reporting results.
>Then signal to the cluster management that "There's been a failure here -
>but let's keep trucking to the end of the run,
>when you can come along and replace my buddy and me"
>
>The obvious drawback being you get half an exaflop for your money!
>
I was assuming that you'd figure out a Hamming-esque way to get 8/11ths of
an exaflop for an exaflop's worth of horsepower.
It might actually be an OK trade even without the future "Hearns Code",
though. Can you get computers with double the failure rate for less than
half the cost (all in, capex and opex)? Given that we are inevitably
moving this way, maybe "design for perfect" isn't an appropriate paradigm.
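
To put rough numbers on the trade discussed above, here is a minimal Python sketch comparing how much of the raw capacity is left for real work under each scheme. The function names are made up for illustration, and the model is deliberately crude: it only counts redundant units and ignores failure rates and cost. One caveat on the figures: 8-of-11 is taken from this thread as an illustrative ratio; a textbook single-error-correcting Hamming code needs 4 check bits for 8 data bits, so 12 bits in total.

```python
from fractions import Fraction


def useful_fraction(data_units: int, total_units: int) -> Fraction:
    """Share of raw capacity doing real work when total_units are
    spent to protect data_units (hypothetical helper, not from the thread)."""
    if not 0 < data_units <= total_units:
        raise ValueError("need 0 < data_units <= total_units")
    return Fraction(data_units, total_units)


def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1, the usual bound for a
    single-error-correcting Hamming code."""
    if data_bits < 1:
        raise ValueError("data_bits must be positive")
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


# Lock-step pairs: two nodes do the work of one.
lockstep_pairs = useful_fraction(1, 2)

# The illustrative ratio used in this thread.
thread_ratio = useful_fraction(8, 11)

# Textbook Hamming overhead for 8 data bits.
check_bits = hamming_check_bits(8)
textbook_ratio = useful_fraction(8, 8 + check_bits)

for name, frac in [
    ("lock-step pairs", lockstep_pairs),
    ("8-of-11 (thread)", thread_ratio),
    ("Hamming, 8 data bits", textbook_ratio),
]:
    print(f"{name}: {frac} of raw capacity is useful")
```

Either coded ratio beats the one half you get from plain pairing, which is the point of looking for something Hamming-like at the node level. Whether such a factoring exists for a given computation is a separate question this sketch does not address.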
In the space biz, this is a HUGE issue. For all we spend trying to make
things perfect, we don't get there, so is it time to bite the bullet and
"design for failure"? I think it is, but there are those with beards grayer
than mine (and mine has a fair amount of gray in it) who don't.
>