[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Glen Gardner Glen.Gardner at verizon.net
Wed Nov 3 15:31:48 PST 2004
You will probably end up having to use blocking message passing to make the processes wait at each checkpoint. The end result is that you lose a significant amount of performance to waiting for all the programs to get to an appropriate checkpoint and wait for some kind of validation.

Glen Gardner

Reuti wrote:

> Jeff Moyer wrote:
>
>> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels
>> for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
>>
>> [snip]
>>
>> brian.dobbins> Background: The reason we're looking for a checkpoint/restart
>> brian.dobbins> option has more to do with preempting a running job (of a lower
>> brian.dobbins> priority) by checkpointing it than it does with saving the
>> brian.dobbins> state in case of a crash. While functionally these may be
>> brian.dobbins> pretty close or the same, if that gives rise to another
>> brian.dobbins> solution, I'd like to hear it. In essence, we have some
>> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
>> brian.dobbins> 24-7 for many months, but we want to be able to submit a
>> brian.dobbins> high priority CFD code that will take over, run for a few
>> brian.dobbins> days or so, and then have the system automagically restart
>> brian.dobbins> the MC sim.
>>
>> How about sending the process a SIGSTOP followed by a SIGCONT when you are
>> ready to resume execution? So long as your memory footprints of the two
>> apps won't exhaust physical RAM + swap, this should be okay. This assumes
>> a great deal about the robustness of your long running job, though.
>>
>
> For parallel jobs this will lead to timing problems (depending on the
> parallel libs used - you have to adjust at least any timeout for
> missing communication, which may arise in the libs).
>
> - Reuti
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner at verizon.net
http://members.bellatlantic.net/~vze24qhw/index.html
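
To make the cost of waiting at each checkpoint concrete, here is a rough Python sketch of the blocking approach. It is only an illustration under stated assumptions: threads stand in for the cluster processes, `threading.Barrier` stands in for a blocking collective such as an MPI barrier, and the names `run_workers`, `step_times`, `idle`, and `checkpoints` are hypothetical, not taken from any checkpointing package.

```python
import threading
import time


def run_workers(step_times, n_steps=3):
    """Run one thread per entry in step_times.

    Each worker does n_steps of simulated work and blocks at a barrier
    before every checkpoint. Returns (idle, checkpoints), where idle[r]
    is the total time worker r spent blocked waiting for the others.
    """
    n = len(step_times)
    barrier = threading.Barrier(n)   # assumption: stand-in for a blocking MPI-style barrier
    idle = [0.0] * n                 # time each worker spent waiting to checkpoint
    checkpoints = []                 # (step, rank) records of "saved" state
    lock = threading.Lock()

    def worker(rank):
        for step in range(n_steps):
            time.sleep(step_times[rank])        # stand-in for real computation
            t0 = time.perf_counter()
            barrier.wait()                      # block until every worker reaches the checkpoint
            idle[rank] += time.perf_counter() - t0
            with lock:
                checkpoints.append((step, rank))  # stand-in for writing state to disk
            barrier.wait()                      # "validation": nobody resumes until all have saved

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return idle, checkpoints


if __name__ == "__main__":
    idle, checkpoints = run_workers([0.01, 0.1], n_steps=3)
    for rank, waited in enumerate(idle):
        print(f"worker {rank} spent {waited:.3f}s blocked at checkpoints")
```

In this sketch the faster worker accumulates nearly all of the idle time, because every checkpoint proceeds at the pace of the slowest participant. That is the performance loss described above.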
