[Beowulf] Re: checkpointing
Isaac Dooley idooley at isaacdooley.com
Wed Nov 3 14:56:46 PST 2004
Parallel job checkpointing is not easy. If you are running an MPI program, you could try the AMPI implementation available at http://charm.cs.uiuc.edu. I work on this project, and it can checkpoint MPI programs. The implementation also supports dynamic load balancing (process migration) in a few different flavors, as well as automatic fault tolerance. AMPI and its underlying Charm/Converse system run on a wide range of architectures, from workstation clusters to BlueGene.

So if you have an MPI program, switching to AMPI may be trivial, and using the special load balancing features would require a few extra function calls (though it may be possible to do asynchronous load balancing as well).

It is also worth knowing exactly why you wish to checkpoint. Generally, for large systems (say 5000 nodes) with long-running applications (hours or days), checkpointing is needed to protect the job when a node dies.

Please send me any questions you may have about Charm.

Isaac Dooley

>>I must say though that from what I know checkpointing/restarting
>>serial codes is OK.
>>Checkpointing parallel jobs is problematic, and from what I've read
>>not recommended (the various processes are passing
>>messages, and how do you checkpoint in a consistent state?).
>>
>
>I would send a signal from SGE only to the head node of a let's say MPI
>job. This rank 0 job has to set some special fields and broadcast this
>to the slave processes. The slaves must check this from time to time and
>send their state to the head node (and shut down in a proper way), which
>is performing the storing of the information in any checkpointing place
>on a shared file system (maybe we get different nodes the next time). I
>think it's possible to program it (when it's included in the design of
>the program), but adding it later to an already existing program is not
>so easy. - Reuti
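
The application-level scheme Reuti describes in the quoted text can be sketched roughly as follows. This is only a single-machine simulation under stated assumptions, not MPI or AMPI code: threads stand in for ranks, a threading.Event stands in for the broadcast "please checkpoint" field, a queue stands in for sending state to the head node, and a JSON file stands in for the shared file system. The names worker, run_job, and the state fields step and acc are all hypothetical.

```python
import json
import queue
import threading


def worker(rank, state, flag, to_head, total_steps):
    # Simulated slave rank: work in small steps and poll the flag between them.
    while state["step"] < total_steps and not flag.is_set():
        state["acc"] += rank * state["step"]  # stand-in for real computation
        state["step"] += 1
    to_head.put((rank, state))  # report state to the head, then shut down


def run_job(nranks, total_steps, ckpt_path, flag, restart=False):
    # Head-node logic: start (or restore) the ranks, gather state, store it.
    if restart:
        with open(ckpt_path) as f:
            # JSON object keys are strings, so convert ranks back to int.
            states = {int(r): s for r, s in json.load(f).items()}
    else:
        states = {r: {"step": 0, "acc": 0} for r in range(nranks)}

    to_head = queue.Queue()
    threads = [
        threading.Thread(target=worker,
                         args=(r, states[r], flag, to_head, total_steps))
        for r in range(nranks)
    ]
    for t in threads:
        t.start()
    gathered = dict(to_head.get() for _ in range(nranks))
    for t in threads:
        t.join()

    # The head writes every rank's state to the checkpoint location.
    with open(ckpt_path, "w") as f:
        json.dump(gathered, f)
    return gathered
```

A run is interrupted by setting the flag (in Reuti's scheme, in response to the signal from SGE) and resumed later with restart=True and a fresh, unset flag. In a real MPI program the flag would presumably travel via something like MPI_Bcast and the state via point-to-point sends or a gather. Note that the sketch sidesteps the consistency problem raised in the quoted text: these simulated ranks never exchange messages, so each one can stop at any step boundary. Ranks that communicate would also have to agree on a point where no messages are in flight.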
