[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Chris Samuel
csamuel at vpac.org
Wed Nov 3 17:53:42 PST 2004
On Thu, 4 Nov 2004 04:36 am, Reuti wrote:
> For parallel jobs this will lead to timing problems (depending on the
> parallel libs used - you have to adjust at least any timeout for missing
> communication, which may arise in the libs). - Reuti
My understanding is that the latest version of LAM/MPI supports checkpointing
of parallel jobs.
Their page http://www.lam-mpi.org/about/overview/ says:
----8< quote 8<----
Checkpoint/Restart
MPI applications running under LAM/MPI can be checkpointed to disk and
restarted at a later time. LAM requires a 3rd party single-process
checkpoint/restart toolkit for actually checkpointing and restarting a single
MPI process - LAM takes care of the parallel coordination. Currently, the
Berkeley Labs Checkpoint/Restart package (Linux only) is supported. The
infrastructure allows for easy addition of new checkpoint/restart packages.
----8< quote 8<----
The Berkeley Lab package they mention (http://ftg.lbl.gov/checkpoint) is a
kernel module, not a kernel patch, and currently supports only 2.4 series
kernels on IA32. They have an open bug report in their Bugzilla about porting
it to the 2.6 series (https://mantis.lbl.gov/bugzilla/show_bug.cgi?id=748).
Opteron support is bug 749 and depends on the 2.6 support.
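
To make the division of labour in that quote a bit more concrete, here is a
toy Python sketch of the idea: a pluggable single-process checkpointer does
the saving and restoring, and a coordinator handles the parallel side. None of
this is LAM's or the Berkeley package's actual code or API. All the names
(SingleProcessCheckpointer, PickleCheckpointer, Coordinator) are made up for
illustration, and draining in-flight messages before saving is my assumption
about one reasonable way to coordinate, not a description of LAM's internals.

```python
import pickle
from typing import Protocol


class SingleProcessCheckpointer(Protocol):
    """Hypothetical plug-in interface: saves and restores ONE process."""

    def save(self, rank: int, state: dict) -> bytes: ...

    def load(self, blob: bytes) -> dict: ...


class PickleCheckpointer:
    """Toy stand-in for a real single-process checkpoint toolkit."""

    def save(self, rank: int, state: dict) -> bytes:
        return pickle.dumps(state)

    def load(self, blob: bytes) -> dict:
        return pickle.loads(blob)


class Coordinator:
    """Toy parallel layer: quiesce messages, then checkpoint every rank."""

    def __init__(self, backend: SingleProcessCheckpointer, states: dict):
        self.backend = backend
        self.states = states   # rank -> process state (a plain dict here)
        self.in_flight = []    # (dest_rank, payload) sent but not delivered

    def send(self, dest: int, payload) -> None:
        self.in_flight.append((dest, payload))

    def drain(self) -> None:
        # Deliver everything still "on the wire" so that no message is
        # lost between the checkpoint and a later restart.
        for dest, payload in self.in_flight:
            self.states[dest]["inbox"].append(payload)
        self.in_flight.clear()

    def checkpoint(self) -> dict:
        self.drain()
        return {rank: self.backend.save(rank, state)
                for rank, state in self.states.items()}

    def restart(self, images: dict) -> None:
        self.states = {rank: self.backend.load(blob)
                       for rank, blob in images.items()}
        self.in_flight = []
```

Swapping in a different single-process toolkit would then only mean supplying
another object with the same save/load methods, which is roughly what "easy
addition of new checkpoint/restart packages" suggests to me.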
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia