[Beowulf] Re: checkpointing
idooley at isaacdooley.com
Wed Nov 3 14:56:46 PST 2004
Parallel job checkpointing is not easy. If you are running an MPI
program, perhaps you could use the AMPI implementation available at
http://charm.cs.uiuc.edu. I work on this project, and it can provide
checkpointing of MPI programs. The implementation also allows for
dynamic load balancing (process migration) in a few different flavors, as
well as automatic fault tolerance. AMPI and its underlying
Charm/Converse system run on a wide range of architectures, from
workstation clusters to BlueGene. So if you have an MPI program,
switching to AMPI may be trivial, and using the special load balancing
features would require a few extra function calls (but it may be possible
to do asynchronous load balancing as well).
Also, it is worthwhile to know exactly why you wish to checkpoint.
Generally for large systems, say 5000 nodes, with long-running
applications (hours or days), it is needed to provide protection when a
Please send me any questions you may have about charm.
>>I must say though that from what I know checkpointing/restarting
>>serial codes is OK.
>>Checkpointing parallel jobs is problematic, and from what I've read
>>not recommended (the various processes are passing
>>messages, and how do you checkpoint in a consistent state?).
>I would send a signal from SGE only to the head node of a let's say MPI
>job. This rank 0 job has to set some special fields and broadcast this
>to the slave processes. The slaves must check this from time to time and
>send their state to the head node (and shut down in a proper way), which
>is performing the storing of the information in any checkpointing place
>on a shared file system (maybe we get different nodes the next time). I
>think it's possible to program it (when it's included in the design of
>the program), but adding it later to an already existing program is not
>so easy. - Reuti
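For what it's worth, the scheme Reuti describes can be sketched in a few lines. This is only a toy under stated assumptions, not AMPI code and not real MPI: threads stand in for the MPI ranks, a shared threading.Event stands in for the flag that rank 0 would broadcast after getting the signal from SGE, and the names worker, run, save_checkpoint, load_checkpoint and demo are all made up for illustration. It also sidesteps the hard part raised above, since these workers never exchange messages, so there is no in-flight communication to capture.

```python
import json
import os
import queue
import tempfile
import threading


def worker(rank, state, total_steps, stop_flag, outbox):
    # Stand-in for one slave process; `state` is everything needed to restart it.
    while state["step"] < total_steps:
        if stop_flag.is_set():          # "check this from time to time"
            break
        state["acc"] += (rank + 1) * state["step"]   # placeholder computation
        state["step"] += 1
    outbox.put((rank, state))           # send state to the head node, then exit


def run(states, total_steps, stop_flag):
    # Head node: start one worker per saved state, then gather one state per rank.
    outbox = queue.Queue()
    threads = [
        threading.Thread(target=worker,
                         args=(rank, dict(s), total_steps, stop_flag, outbox))
        for rank, s in enumerate(states)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    gathered = dict(outbox.get() for _ in threads)
    return [gathered[rank] for rank in range(len(states))]


def save_checkpoint(path, states):
    # Write to a temporary file and rename it, so a crash during the write
    # does not leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(states, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


def demo(total_steps=50000, ranks=3):
    fresh = [{"step": 0, "acc": 0} for _ in range(ranks)]
    stop = threading.Event()
    # The timer stands in for the signal sent to the head node.
    threading.Timer(0.005, stop.set).start()
    partial = run(fresh, total_steps, stop)
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    save_checkpoint(path, partial)
    # Restart, possibly on different nodes, from the stored state.
    return run(load_checkpoint(path), total_steps, threading.Event())
```

The point of the sketch is that the restarted run ends in the same state as an uninterrupted one, no matter where each worker happened to be when the flag went up. That only works because the per-rank state was designed to be saved, which is Reuti's remark about building it into the program from the start.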