[Beowulf] Checkpointing using flash
Ellis H. Wilson III
ellis at cse.psu.edu
Fri Sep 21 10:09:41 PDT 2012
On 09/21/12 12:58, Lux, Jim (337C) wrote:
> Yes.. If that's the frequency of checkpoints. I was thinking more like 1
> checkpoint per second or 10 seconds.
While I suppose checkpoints that frequent might exist somewhere in the
wild, I've never heard of checkpointing at that short an interval. These
huge cluster checkpoints cover close to the entire memory of each node,
so even today we're talking close to 64 or 128 GB of RAM per node. In ten
years we're talking what, close to if not above a TB of RAM per node? Moreover, they
all tend to write their checkpoint at the same time and the SSDs aren't
on the compute nodes -- they're on some intermediate I/O storage nodes
(akin to BlueGene's intermediate layer). So we're talking about huge
cluster-wide dumps of data to the intermediate flash layer, which then
takes some hours to drain that data down to the more persistent HDDs.
End to end, this takes many minutes at the very least, and hours in the
normal case.
I would not be surprised if the best they could do at exascale was one
checkpoint a day. Again, I don't think these are used as the front line
of defense against failures. That would really suck :D.
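
A back-of-the-envelope sketch of that arithmetic, in Python. Only the
128 GB and 1 TB per-node figures come from the estimate above; the node
count and the aggregate flash and HDD bandwidths are made-up illustrative
numbers, not figures for any real machine, and the function name is
hypothetical.

```python
GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def checkpoint_estimate(nodes, ram_per_node, flash_bw, hdd_bw):
    """Estimate the time for a full-memory, cluster-wide checkpoint.

    nodes        -- number of compute nodes (assumed)
    ram_per_node -- bytes of RAM dumped per node
    flash_bw     -- aggregate write bandwidth into the flash tier, bytes/s (assumed)
    hdd_bw       -- aggregate drain bandwidth from flash to HDD, bytes/s (assumed)
    """
    total_bytes = nodes * ram_per_node
    return {
        "total_bytes": total_bytes,
        "to_flash_s": total_bytes / flash_bw,  # all nodes dump at the same time
        "to_hdd_s": total_bytes / hdd_bw,      # flash tier drains to disk afterwards
    }


# Assumed for illustration only: 1000 nodes, 100 GB/s into flash, 10 GB/s to HDD.
today = checkpoint_estimate(1000, 128 * GB, 100 * GB, 10 * GB)
future = checkpoint_estimate(1000, 1 * TB, 100 * GB, 10 * GB)

for label, est in (("128 GB/node", today), ("1 TB/node", future)):
    print(label,
          "flash: %.0f min" % (est["to_flash_s"] / 60),
          "HDD: %.1f h" % (est["to_hdd_s"] / 3600))
```

With these assumed numbers, the 128 GB/node case takes 1280 seconds
(about 21 minutes) to reach flash and 12800 seconds (about 3.6 hours) to
drain to HDD. The 1 TB/node case, if the bandwidths did not improve,
would take 100000 seconds (about 27.8 hours) to drain, which is in the
same ballpark as one checkpoint a day.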