[Beowulf] Checkpointing MPI applications

Christopher Samuel chris at csamuel.org
Sat Feb 18 20:42:54 UTC 2023


Hi all,

The list has been very quiet recently, and I just posted something to 
the Slurm list in reply to a thread on checkpointing MPI applications, so 
I thought it might interest a few of you here (apologies if you've 
already seen it there).

If you're looking to try checkpointing MPI applications you may want to 
experiment with MANA ("MPI-Agnostic, Network-Agnostic" transparent 
checkpointing), a plugin for the DMTCP checkpoint/restart (C/R) project:

https://github.com/mpickpt/mana

We (NERSC) are collaborating with the developers, and MANA is installed 
on Cori (our older Cray system) for people to experiment with. Our 
documentation for it may be useful to others who'd like to try it out. 
It also has a nice description of how MANA works, which even I, as a 
non-programmer, can understand.

https://docs.nersc.gov/development/checkpoint-restart/mana/

Pay special attention to the caveats in our docs though!

I've not used it myself, though I'm peripherally involved, giving advice 
on system-related issues.

I'm curious whether people out there are using other methods for 
transparent checkpointing of MPI applications.

All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

