[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?


Glen Gardner Glen.Gardner at verizon.net
Wed Nov 3 15:31:48 PST 2004


You will probably end up having to use blocking message passing to make 
the processes wait at each checkpoint. The end result is that you lose a 
significant amount of performance to waiting for all the programs to get 
to an appropriate checkpoint and wait for some kind of validation.
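
As a rough illustration of that cost, here is a minimal sketch in Python. It is not MPI code: threading.Barrier stands in for a blocking exchange (something like a barrier or a blocking send/receive pair), and the function name run_workers, the step times, and the checkpoint count are all made up for the example. It only shows that every worker ends up waiting for the slowest one at each checkpoint.

```python
import threading
import time


def run_workers(step_times, n_checkpoints=3):
    """Simulate workers that must all reach a checkpoint before any continues.

    step_times: seconds of "work" each worker does between checkpoints
                (hypothetical values, chosen only for illustration).
    Returns the total seconds each worker spent blocked at the barrier.
    """
    n = len(step_times)
    barrier = threading.Barrier(n)   # stand-in for a blocking MPI-style sync
    waited = [0.0] * n

    def worker(rank):
        for _ in range(n_checkpoints):
            time.sleep(step_times[rank])          # stand-in for computation
            t0 = time.perf_counter()
            barrier.wait()                        # block until all workers arrive
            waited[rank] += time.perf_counter() - t0
            # A real job would write its state out at this point.

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited


if __name__ == "__main__":
    # Two fast workers and one slow one: the fast ones idle at every checkpoint.
    for rank, w in enumerate(run_workers([0.01, 0.01, 0.1])):
        print(f"worker {rank} waited {w:.3f}s at checkpoints")
```

The fast workers accumulate nearly all of the wait time, which is the performance loss described above; how large it is in practice depends on how unevenly the real job's work is distributed.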


Glen Gardner


Reuti wrote:

> Jeff Moyer wrote:
>
>> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels 
>> for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
>>
>> [snip]
>>
>> brian.dobbins> Background: The reason we're looking for a checkpoint/restart
>> brian.dobbins> option has more to do with preempting a running job (of a lower
>> brian.dobbins> priority) by checkpointing it than it does with saving the
>> brian.dobbins> state in case of a crash. While functionally these may be
>> brian.dobbins> pretty close or the same, if that gives rise to another
>> brian.dobbins> solution, I'd like to hear it. In essence, we have some
>> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
>> brian.dobbins> 24-7 for many months, but we want to be able to submit a
>> brian.dobbins> high priority CFD code that will take over, run for a few
>> brian.dobbins> days or so, and then have the system automagically restart
>> brian.dobbins> the MC sim.
>>
>> How about sending the process a SIGSTOP followed by a SIGCONT when you are
>> ready to resume execution? So long as your memory footprints of the two
>> apps won't exhaust physical ram + swap, this should be okay. This assumes
>> a great deal about the robustness of your long running job, though.
>>
>
> For parallel jobs this will lead to timing problems (depending on the 
> parallel libs used - you have to adjust at least any timeout for 
> missing communication, which may arise in the libs). - Reuti
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
>
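
For reference, the SIGSTOP / SIGCONT suggestion quoted above could be tried on a single process with something like the following Python sketch. The helper names suspend, resume, and demo are invented for the example, and the child here is just a sleeping Python interpreter standing in for the long-running job. It assumes a Unix-like system, and os.waitpid only works because the target is a direct child of the script; a scheduler stopping someone else's job would need another way to confirm the state.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Send SIGSTOP and wait until the kernel reports the child as stopped."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume(pid):
    """Send SIGCONT and wait until the kernel reports the child as continued."""
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)


def demo():
    # Stand-in for the low-priority job: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        stopped = suspend(child.pid)
        # ... the high-priority job would run here ...
        resumed = resume(child.pid)
    finally:
        child.terminate()   # clean up the stand-in job
        child.wait()
    return stopped, resumed


if __name__ == "__main__":
    print(demo())
```

This only covers one process on one node. As noted in the quoted replies, a stopped process still holds its memory, and the other ranks of a parallel job may hit communication timeouts while one of them is stopped.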

-- 
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner at verizon.net


http://members.bellatlantic.net/~vze24qhw/index.html





