[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Jeff Moyer jmoyer at redhat.com
Wed Nov 3 06:46:35 PST 2004


==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:

[snip]

brian.dobbins> Background: The reason we're looking for a checkpoint/restart
brian.dobbins> option has more to do with preempting a running job (of a lower
brian.dobbins> priority) by checkpointing it than it does with saving the
brian.dobbins> state in case of a crash.  While functionally these may be
brian.dobbins> pretty close or the same, if that gives rise to another
brian.dobbins> solution, I'd like to hear it.  In essence, we have some
brian.dobbins> Monte Carlo sims which are highly parallel, and could run
brian.dobbins> 24-7 for many months, but we want to be able to submit a
brian.dobbins> high priority CFD code that will take over, run for a few
brian.dobbins> days or so, and then have the system automagically restart
brian.dobbins> the MC sim.

How about sending the process a SIGSTOP, followed by a SIGCONT when you are
ready to resume execution?  So long as the combined memory footprint of the
two apps won't exhaust physical RAM + swap, this should be okay.  This assumes
a great deal about the robustness of your long-running job, though.
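
A minimal sketch of the SIGSTOP/SIGCONT idea on a single node, in Python.
The helper names (suspend, resume, run_with_preemption) are made up for
illustration and are not part of any scheduler.  It assumes both jobs are
started from the same script; a real parallel job would need the signals
delivered to every process on every node, which this does not attempt.

```python
import os
import signal
import subprocess


def suspend(pid: int) -> None:
    """Stop a process; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously stopped process continue running."""
    os.kill(pid, signal.SIGCONT)


def run_with_preemption(low_cmd, high_cmd):
    """Hypothetical driver: start the low-priority job, stop it, run the
    high-priority job to completion, then resume the low-priority job.

    Returns (high_exit_code, low_exit_code).
    """
    low = subprocess.Popen(low_cmd)
    suspend(low.pid)  # low-priority job stays in memory (or swap)
    try:
        high_rc = subprocess.call(high_cmd)
    finally:
        # Resume even if the high-priority job could not be launched.
        resume(low.pid)
    low_rc = low.wait()
    return high_rc, low_rc
```

The stopped job keeps its memory, open files, and sockets, so anything with
a timeout (network connections, license checkouts) may not survive a stop
lasting several days.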

-Jeff
