[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Jeff Moyer jmoyer at redhat.com
Wed Nov 3 06:46:35 PST 2004
==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:

[snip]
brian.dobbins> Background: The reason we're looking for a checkpoint/restart
brian.dobbins> option has more to do with preempting a running job (of a lower
brian.dobbins> priority) by checkpointing it than it does with saving the
brian.dobbins> state in case of a crash. While functionally these may be
brian.dobbins> pretty close or the same, if that gives rise to another
brian.dobbins> solution, I'd like to hear it. In essence, we have some
brian.dobbins> Monte Carlo sims which are highly parallel, and could run
brian.dobbins> 24-7 for many months, but we want to be able to submit a
brian.dobbins> high priority CFD code that will take over, run for a few
brian.dobbins> days or so, and then have the system automagically restart
brian.dobbins> the MC sim.

How about sending the process a SIGSTOP, followed by a SIGCONT when you are
ready to resume execution? So long as the combined memory footprints of the
two apps won't exhaust physical RAM + swap, this should be okay. This assumes
a great deal about the robustness of your long running job, though.

-Jeff
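
The SIGSTOP / SIGCONT suggestion can be sketched in a few lines of Python. This is only an illustration for a single process on one node: the helper names `suspend` and `resume` are made up for this example, and a real parallel job would need the same signals delivered to every process on every node, typically by the scheduler or launcher.

```python
import os
import signal


def suspend(pid: int) -> None:
    """Freeze a process in place. SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    import subprocess
    import sys

    # Stand-in for the long-running low-priority job (hypothetical workload).
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])

    suspend(job.pid)   # low-priority job stops consuming CPU
    # ... the high-priority job would run here ...
    resume(job.pid)    # low-priority job picks up again

    job.wait()
```

A stopped process keeps its memory image, so its pages stay in RAM or get pushed out to swap while the other job runs. That is why the combined footprint of the two jobs has to fit in physical RAM plus swap.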
