[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Chris Samuel csamuel at vpac.org
Wed Nov 3 17:53:42 PST 2004


On Thu, 4 Nov 2004 04:36 am, Reuti wrote:

> For parallel jobs this will lead to timing problems (depending on the
> parallel libs used - you have to adjust at least any timeout for missing
> communication, which may arise in the libs). - Reuti

My understanding is that the latest version of LAM/MPI supports checkpointing
of parallel jobs.

Their page http://www.lam-mpi.org/about/overview/ says:

 ----8< quote 8<----

Checkpoint/Restart
 MPI applications running under LAM/MPI can be checkpointed to disk and 
restarted at a later time. LAM requires a 3rd party single-process 
checkpoint/restart toolkit for actually checkpointing and restarting a single 
MPI process - LAM takes care of the parallel coordination. Currently, the 
Berkeley Labs Checkpoint/Restart package (Linux only) is supported. The 
infrastructure allows for easy addition of new checkpoint/restart packages. 

 ----8< quote 8<----
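
To illustrate the division of labour described in that quote, here is a rough
Python sketch. It is only a toy model under my own assumptions: it is not
LAM/MPI's or BLCR's actual API, and every name in it (InMemoryCheckpointer,
Rank, Coordinator, demo) is made up for the illustration. One object knows how
to save and restore a single process, and a separate coordinator pauses all
ranks, checkpoints each one, and resumes them.

```python
import copy
from typing import Dict, List


class InMemoryCheckpointer:
    """Hypothetical stand-in for a single-process checkpoint/restart toolkit.

    It only knows how to save and restore ONE process's state.
    """

    def save(self, state: Dict[str, int]) -> Dict[str, int]:
        return copy.deepcopy(state)

    def restore(self, image: Dict[str, int]) -> Dict[str, int]:
        return copy.deepcopy(image)


class Rank:
    """Toy model of one parallel process: its whole state is a step counter."""

    def __init__(self, rank_id: int) -> None:
        self.state = {"rank": rank_id, "step": 0}
        self.paused = False

    def compute(self) -> None:
        if self.paused:
            raise RuntimeError("rank is quiesced for a checkpoint")
        self.state["step"] += 1


class Coordinator:
    """Toy model of the parallel coordination layer (hypothetical)."""

    def __init__(self, ranks: List[Rank], checkpointer: InMemoryCheckpointer) -> None:
        self.ranks = ranks
        self.checkpointer = checkpointer

    def checkpoint(self) -> List[Dict[str, int]]:
        # 1. Quiesce every rank so all images describe the same point in time.
        #    (A real layer would also have to deal with in-flight messages.)
        for rank in self.ranks:
            rank.paused = True
        try:
            # 2. Delegate the per-process work to the single-process toolkit.
            images = [self.checkpointer.save(rank.state) for rank in self.ranks]
        finally:
            # 3. Let the job continue.
            for rank in self.ranks:
                rank.paused = False
        return images

    def restart(self, images: List[Dict[str, int]]) -> None:
        if len(images) != len(self.ranks):
            raise ValueError("need exactly one image per rank")
        for rank, image in zip(self.ranks, images):
            rank.state = self.checkpointer.restore(image)


def demo() -> List[int]:
    ranks = [Rank(i) for i in range(4)]
    coordinator = Coordinator(ranks, InMemoryCheckpointer())
    for _ in range(3):              # three steps of work
        for rank in ranks:
            rank.compute()
    images = coordinator.checkpoint()
    for _ in range(2):              # two more steps, later thrown away
        for rank in ranks:
            rank.compute()
    coordinator.restart(images)     # roll every rank back together
    return [rank.state["step"] for rank in ranks]


if __name__ == "__main__":
    print(demo())
```

The point of the split is that the coordinator never looks inside a process
image, so a different single-process toolkit could be swapped in without
touching the coordination logic.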

The Berkeley Lab package they mention (http://ftg.lbl.gov/checkpoint) is a
kernel module (not a kernel patch) for 2.4 series kernels on IA32. They have
an open bug report about porting it to the 2.6 series in their Bugzilla at
https://mantis.lbl.gov/bugzilla/show_bug.cgi?id=748. Opteron support is
bug 749 and depends on the 2.6 support.

Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
