[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Wed Nov 3 06:46:35 PST 2004

==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:

[snip]

brian.dobbins> Background: The reason we're looking for a checkpoint/restart
brian.dobbins> option has more to do with preempting a running job (of a lower
brian.dobbins> priority) by checkpointing it than it does with saving the
brian.dobbins> state in case of a crash.  While functionally these may be
brian.dobbins> pretty close or the same, if that gives rise to another
brian.dobbins> solution, I'd like to hear it.  In essence, we have some
brian.dobbins> Monte Carlo sims which are highly parallel, and could run
brian.dobbins> 24-7 for many months, but we want to be able to submit a
brian.dobbins> high priority CFD code that will take over, run for a few
brian.dobbins> days or so, and then have the system automagically restart
brian.dobbins> the MC sim.

How about sending the process a SIGSTOP to suspend it, followed by a
SIGCONT when you are ready to resume execution?  So long as the combined
memory footprint of the two apps won't exhaust physical RAM + swap, this
should be okay.  This assumes a great deal about the robustness of your
long-running job, though.
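
A rough sketch of the idea in Python, in case it helps.  The helper names
(suspend, resume, demo) are mine, and the sleeping child is only a stand-in
for the real job.  For a parallel run you would have to signal every process
of the job on every node (or each process group), which this does not
attempt.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Freeze a process. SIGSTOP cannot be caught, blocked, or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def demo() -> bool:
    """Suspend and resume a placeholder child; return True if it stopped."""
    # Hypothetical stand-in for the long-running low-priority job.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        suspend(child.pid)
        # WUNTRACED makes waitpid report a stopped (not only exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        # ... the high-priority job would run here ...
        resume(child.pid)
        return stopped
    finally:
        # Clean up the placeholder; a real job would be left running.
        child.terminate()
        child.wait()


if __name__ == "__main__":
    print("child was stopped:", demo())
```

Note that a stopped process still holds all of its memory, so it only
frees the CPUs; its pages get pushed to swap if the new job needs the RAM.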

-Jeff