[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Andrew Wang
andrewxwang at yahoo.com.tw
Wed Nov 3 15:38:18 PST 2004
Have you looked at SGE with Berkeley Lab Checkpoint/Restart (BLCR)?
This is the HOWTO:
http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf
And also LAM/MPI with Berkeley Lab Checkpoint/Restart? See the paper
"The LAM/MPI Checkpoint/Restart Framework: System-Initiated
Checkpointing":
http://www.lam-mpi.org/papers/lacsi2003/
Andrew.
--- Brian Dobbins <brian.dobbins at yale.edu> wrote:
> Hi guys,
>
> I have just begun looking into a checkpoint/restart capability for
> clusters, but looking through the archives here and doing a search has
> shown few viable solutions. Some, like CHPOX (1), appear to be written
> only for the 2.4 series kernels, and I recall seeing one product that
> seemed to indicate it had full support for these operations, but it
> was a commercial product.
>
> What solutions have people on this list used for this functionality?
> Am I restricted to going back to the 2.4 series? (I'd prefer to run
> 2.6 on the AMD64 hardware I've got.)
>
> Additionally, though this is a much wider question (and one tackled
> before!), what are people's pros and cons of the various queuing
> systems? I've played with OpenPBS before, and 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of the heavy hitters
> on this list prefer.
>
> Background: The reason we're looking for a checkpoint/restart option
> has more to do with preempting a running job (of a lower priority) by
> checkpointing it than it does with saving the state in case of a
> crash. While functionally these may be pretty close or the same, if
> that gives rise to another solution, I'd like to hear it. In essence,
> we have some Monte Carlo sims which are highly parallel, and could run
> 24/7 for many months, but we want to be able to submit a high-priority
> CFD code that will take over, run for a few days or so, and then have
> the system automagically restart the MC sim.
>
> Any advice would be great!
>
> Thanks very much for your time,
> - Brian
>
> Brian Dobbins
> Yale Mechanical Engineering
>
> --
> Brian Dobbins <brian.dobbins at yale.edu>
>
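The preemption workflow described in the quoted background (checkpoint
the low-priority job, let the high-priority job run, then restart the
first one) can be illustrated with a toy simulation. This is only a
sketch under stated assumptions: the names Job, checkpoint, restart and
simulate are hypothetical, the "checkpoint" is just the job's progress
counter serialized to JSON, and nothing here calls a real checkpointing
tool or queuing system. On a real cluster the checkpoint step would be
performed by something like Berkeley Lab Checkpoint/Restart and the
policy by the scheduler.

```python
import heapq
import json
from dataclasses import dataclass, asdict


@dataclass
class Job:
    name: str
    priority: int        # assumption: larger number = more important
    total_steps: int     # amount of work, in scheduler ticks
    done_steps: int = 0  # progress so far; this is the "state" we save


def checkpoint(job):
    """Serialize the job's state (stand-in for a real checkpoint file)."""
    return json.dumps(asdict(job))


def restart(blob):
    """Rebuild a job from a saved checkpoint."""
    return Job(**json.loads(blob))


def simulate(arrivals, ticks):
    """Run one job at a time; arrivals maps tick -> Job submitted then."""
    waiting = []   # heap of (-priority, sequence number, checkpoint blob)
    running = None
    log = []
    seq = 0
    for t in range(ticks):
        new = arrivals.get(t)
        if new is not None:
            if running is None:
                running = new
                log.append((t, "start", new.name))
            elif new.priority > running.priority:
                # Preempt: save the lower-priority job, then switch.
                heapq.heappush(waiting, (-running.priority, seq, checkpoint(running)))
                seq += 1
                log.append((t, "checkpoint", running.name))
                running = new
                log.append((t, "start", new.name))
            else:
                heapq.heappush(waiting, (-new.priority, seq, checkpoint(new)))
                seq += 1
                log.append((t, "queue", new.name))
        if running is None and waiting:
            # Machine is free: resume the most important saved job.
            _, _, blob = heapq.heappop(waiting)
            running = restart(blob)
            log.append((t, "restart", running.name))
        if running is not None:
            running.done_steps += 1
            if running.done_steps >= running.total_steps:
                log.append((t, "finish", running.name))
                running = None
    return log


if __name__ == "__main__":
    events = simulate({0: Job("mc", 1, 6), 2: Job("cfd", 10, 3)}, ticks=10)
    for event in events:
        print(event)
```

In this run the hypothetical "mc" job is checkpointed at tick 2 when
"cfd" arrives, and is restarted from its saved progress once "cfd"
finishes, rather than starting over from step zero.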