[Beowulf] Checkpointing MPI applications

Scott Atchley e.scott.atchley at gmail.com
Sun Feb 19 18:26:16 UTC 2023


Hi Chris,

It looks like MANA tries to checkpoint application state without
checkpointing the application or its libraries (including MPI). I am
curious whether the checkpoint sizes are similar to, or significantly
larger than, the application's typical outputs/checkpoints. If they are
much larger, they will take longer to write and will put more pressure
on storage capacity.

We are looking at SCR for Frontier with the idea that users can store
checkpoints on the node-local drives with replication to a buddy node. SCR
will manage migrating non-defensive checkpoints to Lustre.
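
As a rough illustration of the buddy idea only: the sketch below is not
SCR's code or API. The ring pairing and the names buddy_of and
recoverable are assumptions made up for the example. It shows why a
node-local checkpoint with one buddy copy survives a single node loss
but not the loss of a node together with its buddy.

```python
def buddy_of(node: int, num_nodes: int) -> int:
    """Hypothetical ring pairing: each node replicates to the next node."""
    if num_nodes < 2:
        raise ValueError("buddy replication needs at least two nodes")
    return (node + 1) % num_nodes


def recoverable(failed: set[int], num_nodes: int) -> bool:
    """True if every failed node's buddy (holding its copy) is still up."""
    return all(buddy_of(n, num_nodes) not in failed for n in failed)


if __name__ == "__main__":
    print(recoverable({1}, 4))     # one node lost, its buddy (2) is alive
    print(recoverable({1, 2}, 4))  # node 1 and its buddy 2 both lost
```

In the second case the checkpoint would have to come from a copy that
had already been migrated to the parallel file system.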

Scott

On Sat, Feb 18, 2023 at 3:43 PM Christopher Samuel <chris at csamuel.org>
wrote:

> Hi all,
>
> The list has been very quiet recently, so as I just posted something to
> the Slurm list in reply to the topic of checkpointing MPI applications I
> thought it might interest a few of you here (apologies if you've already
> seen it there).
>
> If you're looking to try checkpointing MPI applications you may want to
> experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin
> for the DMTCP C/R effort here:
>
> https://github.com/mpickpt/mana
>
> We (NERSC) are collaborating with the developers and it is installed on
> Cori (our older Cray system) for people to experiment with. The
> documentation for it may be useful to others who'd like to try it out -
> it's got a nice description of how it works too which even I, as a
> non-programmer, can understand.
>
> https://docs.nersc.gov/development/checkpoint-restart/mana/
>
> Pay special attention to the caveats in our docs though!
>
> I've not used it myself, though I'm peripherally involved to give advice
> on system related issues.
>
> I'm curious if there are other methods that people are using out there
> for transparent checkpointing of MPI applications?
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA