[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Glen Gardner
Glen.Gardner at verizon.net
Wed Nov 3 15:31:48 PST 2004
You will probably end up having to use blocking message passing to make
the processes wait at each checkpoint. The end result is that you lose a
significant amount of performance to waiting for all the programs to get
to an appropriate checkpoint and wait for some kind of validation.
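
A rough sketch of that cost, for illustration only. It uses Python threads
and threading.Barrier as a stand-in for a blocking collective in a real
message passing library; run_workers and the step times are made up for
the example and are not taken from any real job.

```python
import threading
import time


def run_workers(step_times, n_checkpoints=3):
    """Run one thread per worker; every worker blocks at each checkpoint.

    step_times holds hypothetical per-worker compute times (seconds)
    between checkpoints. Returns the total time each worker spent blocked.
    """
    barrier = threading.Barrier(len(step_times))
    waited = [0.0] * len(step_times)

    def worker(rank):
        for _ in range(n_checkpoints):
            time.sleep(step_times[rank])   # stand-in for real computation
            t0 = time.perf_counter()
            barrier.wait()                 # block until every worker arrives
            waited[rank] += time.perf_counter() - t0
            # a real job would save its state here, after everyone has arrived

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(len(step_times))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited


if __name__ == "__main__":
    for rank, w in enumerate(run_workers([0.01, 0.02, 0.05])):
        print(f"worker {rank} waited {w:.3f}s at checkpoints")
```

The fastest worker sits idle for most of each interval, so the whole job
runs at the pace of the slowest process between checkpoints.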
Glen Gardner
Reuti wrote:
> Jeff Moyer wrote:
>
>> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels
>> for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
>>
>> [snip]
>>
>> brian.dobbins> Background: The reason we're looking for a
>> brian.dobbins> checkpoint/restart option has more to do with
>> brian.dobbins> preempting a running job (of a lower priority) by
>> brian.dobbins> checkpointing it than it does with saving the state
>> brian.dobbins> in case of a crash. While functionally these may be
>> brian.dobbins> pretty close or the same, if that gives rise to another
>> brian.dobbins> solution, I'd like to hear it. In essence, we have some
>> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
>> brian.dobbins> 24-7 for many months, but we want to be able to submit a
>> brian.dobbins> high priority CFD code that will take over, run for a few
>> brian.dobbins> days or so, and then have the system automagically
>> brian.dobbins> restart the MC sim.
>>
>> How about sending the process a SIGSTOP followed by a SIGCONT when you
>> are ready to resume execution? So long as the memory footprints of the
>> two apps won't exhaust physical RAM + swap, this should be okay. This
>> assumes a great deal about the robustness of your long running job,
>> though.
>>
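
A minimal single-node sketch of the SIGSTOP / SIGCONT idea, assuming a
POSIX system and that the controlling script is the parent of the job it
pauses (os.waitpid only works on child processes). The suspend and resume
helpers and the sleeping child are hypothetical stand-ins; on a cluster
the signals would have to reach every process of the job on every node.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Send SIGSTOP to a child and block until the kernel reports it stopped."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume(pid):
    """Send SIGCONT to a stopped child and block until it is running again."""
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)


if __name__ == "__main__":
    # Stand-in for the long-running, low-priority job.
    low = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        print("stopped:", suspend(low.pid))
        # the high-priority job would run here
        print("resumed:", resume(low.pid))
    finally:
        low.terminate()
        low.wait()
```

The stopped process keeps its memory, so it only gets out of the way of
the other job if the kernel can page it out to swap.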
>
> For parallel jobs this will lead to timing problems (depending on the
> parallel libs used - you have to adjust at least any timeout for
> missing communication, which may arise in the libs). - Reuti
--
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner at verizon.net
http://members.bellatlantic.net/~vze24qhw/index.html