[Beowulf] Supercomputers face growing resilience problems
mndoci at gmail.com
Fri Nov 23 11:45:21 PST 2012
And this is the bit that concerns me the most. At scale you should only be
making two assumptions: (1) everything breaks all the time, and (2) you will
have network partitions. Checkpoint/restart is a lazy option that has no
place in modern software. Yet there doesn't seem to be any priority on going
beyond checkpoint/restart and rethinking software architecture. I would
argue that's as important as figuring out manycore, if not more so.
On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C)
<james.p.lux at jpl.nasa.gov> wrote:
> a lot of HPC software design
> assumes perfect hardware, or, that the hardware failure rate is
> sufficiently low that a checkpoint/restart (or "do it all over from the
> beginning") is an acceptable strategy.