[Beowulf] Re: checkpointing

Isaac Dooley idooley at isaacdooley.com
Wed Nov 3 14:56:46 PST 2004

Parallel job checkpointing is not easy. If you are running an MPI 
program, perhaps you could use the AMPI implementation available at 
http://charm.cs.uiuc.edu. I work on this project, and it can provide 
checkpointing of MPI programs. The implementation also supports 
dynamic load balancing (process migration) in a few different flavors, 
as well as automatic fault tolerance. AMPI and its underlying 
Charm/Converse system run on a wide range of architectures, from 
workstation clusters to BlueGene. So if you have an MPI program, 
switching to AMPI may be trivial, and using the special load balancing 
features would require a few extra function calls (though it may be 
possible to do asynchronous load balancing as well).

It is also worth knowing exactly why you wish to checkpoint. 
On large systems, say 5000 nodes, running long applications (hours 
or days), checkpointing is generally needed to protect the job when a 
node dies.
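
To put rough numbers on that, here is a back-of-the-envelope sketch. It 
assumes node failures are independent and exponentially distributed, and 
the per-node MTBF of 3 years is a made-up illustrative figure, not a 
measurement from any real cluster.

```python
import math

def system_mtbf_hours(node_mtbf_hours, num_nodes):
    """Mean time between failures anywhere in the machine.

    Assumes independent, exponentially distributed node failures,
    so the failure rates of the nodes simply add up.
    """
    return node_mtbf_hours / num_nodes

def prob_no_failure(node_mtbf_hours, num_nodes, job_hours):
    """Probability that a job using every node sees no node failure."""
    return math.exp(-job_hours / system_mtbf_hours(node_mtbf_hours, num_nodes))

if __name__ == "__main__":
    node_mtbf = 3 * 365 * 24   # hypothetical: 3 years per node, in hours
    nodes = 5000               # system size used as the example above
    print("system MTBF (hours):", system_mtbf_hours(node_mtbf, nodes))
    print("P(24h job sees no failure):", prob_no_failure(node_mtbf, nodes, 24))
```

Under these assumptions the whole machine loses a node roughly every 
five hours, so a day-long job on all 5000 nodes is very unlikely to 
finish without some form of checkpointing or fault tolerance.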

Please send me any questions you may have about Charm.
Isaac Dooley

>>I must say though that from what I know checkpointing/restarting
>>serial codes is OK.
>>Checkpointing parallel jobs is problematic, and from what I've read
>>not recommended (the various processes are passing
>>messages, and how do you checkpoint in a consistent state?).
>I would send a signal from SGE only to the head node of, let's say, an 
>MPI job. This rank 0 process has to set some special fields and 
>broadcast them to the slave processes. The slaves must check this from 
>time to time and send their state to the head node (and shut down 
>properly), which stores the information in a checkpoint location on a 
>shared file system (we may get different nodes the next time). I 
>think it's possible to program it (when it's included in the design of 
>the program), but adding it later to an already existing program is not 
>so easy. - Reuti
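
The application-level scheme Reuti describes can be sketched roughly as 
follows. This is a single-machine simulation, not MPI code: threads stand 
in for ranks, a threading.Event stands in for the SGE signal plus the 
broadcast from rank 0, and a queue stands in for sending state to the 
head node. The names worker and run_job and the JSON checkpoint format 
are hypothetical. The workers here never exchange messages with each 
other, so the sketch sidesteps the consistent-state problem raised above.

```python
import json
import os
import queue
import tempfile
import threading

def worker(state, steps, stop_flag, to_head):
    """One simulated rank: compute in small steps, polling the flag."""
    rank, step, total = state["rank"], state["step"], state["total"]
    while step < steps and not stop_flag.is_set():  # check "from time to time"
        total += rank * step
        step += 1
    # Send resumable state to the head, then shut down cleanly.
    to_head.put({"rank": rank, "step": step, "total": total})

def run_job(num_ranks, steps, ckpt_path, request_checkpoint):
    """Run or resume a job; returns the per-rank states that were saved."""
    if os.path.exists(ckpt_path):
        # Restart: any node that can see the shared file can resume.
        with open(ckpt_path) as f:
            states = json.load(f)
    else:
        states = [{"rank": r, "step": 0, "total": 0} for r in range(num_ranks)]

    stop_flag = threading.Event()
    to_head = queue.Queue()
    threads = [threading.Thread(target=worker,
                                args=(s, steps, stop_flag, to_head))
               for s in states]
    for t in threads:
        t.start()
    if request_checkpoint:
        stop_flag.set()  # stands in for the signal and the broadcast
    for t in threads:
        t.join()

    collected = sorted((to_head.get() for _ in threads),
                       key=lambda s: s["rank"])
    # Write to a temporary file and rename, so a crash while writing
    # never leaves a half-written checkpoint behind.
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(collected, f)
    os.replace(tmp_path, ckpt_path)
    return collected

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    print("interrupted:", run_job(3, 100000, path, request_checkpoint=True))
    print("resumed:    ", run_job(3, 100000, path, request_checkpoint=False))
```

Where each rank stops in the interrupted run depends on thread timing, 
but the resumed run should always end with the same totals as an 
uninterrupted one. As Reuti says, the hard part in a real code is that 
the state to save has to be identified when the program is designed.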
