[Beowulf] Checkpointing MPI applications
chris at csamuel.org
Thu Mar 23 19:46:02 UTC 2023
On 2/19/23 10:26 am, Scott Atchley wrote:
> Hi Chris,
> It looks like it tries to checkpoint application state without
> checkpointing the application or its libraries (including MPI). I am
> curious if the checkpoint sizes are similar to or significantly larger
> than the application's typical outputs/checkpoints. If they are much larger,
> the time to write will be higher and they will stress capacity more.
Hmm, I'm not sure (my involvement is relatively peripheral) but I think
we want to see this used with apps that have no existing C/R mechanism.
If you ping me directly I can point you to people who will know more
about this than I do.
> We are looking at SCR for Frontier with the idea that users can store
> checkpoints on the node-local drives with replication to a buddy node.
> SCR will manage migrating non-defensive checkpoints to Lustre.
Interesting. Does it really need local storage, or can it be used with
diskless systems via tricks with loopback filesystems, etc.?
All the best,
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA