[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Wed Nov 3 09:36:17 PST 2004

Jeff Moyer wrote:
> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
> 
> [snip]
> 
> brian.dobbins> Background: The reason we're looking for a checkpoint/restart 
> brian.dobbins> option has more to do with preempting a running job (of a lower
> brian.dobbins> priority) by checkpointing it than it does with saving the
> brian.dobbins> state in case of a crash.  While functionally these may be
> brian.dobbins> pretty close or the same, if that gives rise to another
> brian.dobbins> solution, I'd like to hear it.  In essence, we have some
> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
> brian.dobbins> 24-7 for many months, but we want to be able to submit a
> brian.dobbins> high priority CFD code that will take over, run for a few
> brian.dobbins> days or so, and then have the system automagically restart
> brian.dobbins> the MC sim.
> 
> How about sending the process a SIGSTOP followed by a SIGCONT when you are
> ready to resume execution?  So long as your memory footprints of the two
> apps won't exhaust physical ram + swap, this should be okay.  This assumes
> a great deal about the robustness of your long running job, though.
> 

For parallel jobs this will lead to timing problems, depending on the 
parallel libs used: at a minimum you have to adjust any timeouts for 
missing communication that may arise in the libs. - Reuti
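
For reference, the SIGSTOP / SIGCONT approach quoted above could be sketched
roughly as follows. This is only a minimal illustration for processes on a
single node, assuming the PIDs of the low-priority job are already known; the
helper names suspend() and resume() are made up for this example and are not
part of any scheduler or MPI library.

```python
import os
import signal


def suspend(pids):
    """Send SIGSTOP to each PID. SIGSTOP cannot be caught or ignored,
    so the processes are stopped without any cooperation from the job."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def resume(pids):
    """Send SIGCONT to each PID so the stopped processes run again."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)
```

A stopped process keeps its memory, so both jobs together still have to fit
into physical RAM + swap. For a parallel job every rank on every node would
need to be signalled, and the communication timeouts mentioned above can still
expire while the ranks are stopped.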