[Beowulf] Checkpointing using flash
Ellis H. Wilson III
ellis at cse.psu.edu
Fri Sep 21 04:29:54 PDT 2012
On 09/21/12 10:49, Hearns, John wrote:
> Frequent checkpointing will of course be vital for exascale, given the
> MTBF of individual nodes.
> However how accurate is this statement:
> HPC jobs involving half a million compute cores ... have a series of
> checkpoints set up in their code with the entire memory state stored at
> each checkpoint in a storage node.
Are your concerns about the accuracy of this statement related to the
fact that elReg is claiming that they must dump "the entire memory" or
some concern about flash being used as a temporary checkpointing medium?
If the former -- note that with many, many physics and climate codes the
application data dominates memory. So while it may not be technically
true that the "entire memory" is dumped in the checkpoint (the OS
certainly won't/shouldn't dump its own memory), it is effectively true
because 90% of the memory does end up getting dumped.
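To make that concrete, application-level checkpointing usually means the code serializes its own solver state to stable storage at chosen points, rather than imaging the whole process. A minimal sketch of that pattern (the state layout and file names here are hypothetical, just for illustration):

```python
import os
import pickle

def checkpoint(state, path):
    """Dump application state (not the whole process image) atomically:
    write to a temp file, fsync, then rename so a crash mid-write never
    leaves a torn checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX filesystems

def restore(path):
    """Load the last completed checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical solver state -- the application data that dominates memory.
state = {"step": 1000, "grid": [[0.0] * 4 for _ in range(4)]}
checkpoint(state, "ckpt.dat")
assert restore("ckpt.dat") == state
```

In a real MPI code each rank would write its own shard (or funnel through an I/O library), but the dump-then-atomic-rename shape is the same.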
For what it's worth, flash (or some other reasonably dense medium faster
than disk) being used in exascale machines is an absolute necessity for
checkpointing according to my research and discussions. I was lucky
enough to sit in on a talk by Gary Grider of LANL last Fall (the guy
that basically designs and signs off on the purchase of their largest
clusters, from what I understand) and John Bent (also of LANL, now at
EMC). They explained the nasty costs involved if they went totally disk
or totally flash. A hybrid solution was effectively the only
cost-effective way to do this for them, and I expect we'll see similar
trends in other labs in the near future. I don't think he was even
talking about full exascale -- more like 100 petaflops.
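The back-of-envelope argument for the hybrid tier is simple bandwidth arithmetic: flash absorbs the checkpoint burst quickly, then drains to disk in the background, so the disk tier only has to keep up with the checkpoint *interval*, not the burst. All the numbers below are illustrative assumptions of mine, not figures from the talk:

```python
# Why a flash burst tier helps: compare the compute stall per checkpoint.
mem_tb = 1000.0           # assumed total memory to dump (TB, i.e. 1 PB)
disk_bw_tbs = 1.0         # assumed aggregate disk bandwidth (TB/s)
flash_bw_tbs = 10.0       # assumed aggregate flash bandwidth (TB/s)
ckpt_interval_s = 3600.0  # assumed one checkpoint per hour

stall_disk_s = mem_tb / disk_bw_tbs    # dump straight to disk: 1000 s stall
stall_flash_s = mem_tb / flash_bw_tbs  # dump to flash tier: 100 s stall

# The flash tier then drains to disk asynchronously; that drain takes
# mem_tb / disk_bw_tbs seconds but overlaps with computation, so it only
# has to finish before the next checkpoint arrives.
drain_fits = (mem_tb / disk_bw_tbs) < ckpt_interval_s

print(stall_disk_s, stall_flash_s, drain_fits)
```

Under these assumptions you buy flash sized for one or two checkpoints rather than flash for the whole storage system, which is the cost argument for the hybrid design.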
Disclaimer: Possible Bias -- My research is on flash development and
caching for cluster computing at PSU.