[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Reuti
reuti at staff.uni-marburg.de
Wed Nov 3 09:36:17 PST 2004
Jeff Moyer wrote:
> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
>
> [snip]
>
> brian.dobbins> Background: The reason we're looking for a checkpoint/restart
> brian.dobbins> option has more to do with preempting a running job (of a lower
> brian.dobbins> priority) by checkpointing it than it does with saving the
> brian.dobbins> state in case of a crash. While functionally these may be
> brian.dobbins> pretty close or the same, if that gives rise to another
> brian.dobbins> solution, I'd like to hear it. In essence, we have some
> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
> brian.dobbins> 24-7 for many months, but we want to be able to submit a
> brian.dobbins> high priority CFD code that will take over, run for a few
> brian.dobbins> days or so, and then have the system automagically restart
> brian.dobbins> the MC sim.
>
> How about sending the process a SIGSTOP followed by a SIGCONT when you are
> ready to resume execution? So long as your memory footprints of the two
> apps won't exhaust physical ram + swap, this should be okay. This assumes
> a great deal about the robustness of your long running job, though.
>
For parallel jobs this will lead to timing problems, depending on the
parallel libs used. At the very least you have to adjust any timeouts
for missing communication that may arise in the libs. - Reuti
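
For reference, the SIGSTOP/SIGCONT approach quoted above could be sketched roughly as follows in Python. The helper names suspend() and resume() are made up for illustration, and the sketch assumes you already know the PIDs of the job's processes on the local node. For a parallel job every rank on every node would need the same treatment, and the library timeouts mentioned above still apply.

```python
import os
import signal


def suspend(pids):
    """Send SIGSTOP to each PID; the kernel stops the process where it is.

    SIGSTOP cannot be caught or ignored, so the job gets no chance to
    react. Its memory stays allocated and may be pushed out to swap.
    """
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def resume(pids):
    """Send SIGCONT to each PID so it continues from where it was stopped."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)
```

A scheduler hook would call suspend() on the low-priority job's PIDs before starting the high-priority job and resume() once it has finished. Nothing is written to disk, so this is not a checkpoint: a reboot or node crash while the job is stopped still loses its state.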