[Beowulf] Supercomputers face growing resilience problems

Fri Nov 23 11:45:21 PST 2012

And this is the bit that concerns me the most.  At scale you should only be
making two assumptions: (1) everything breaks all the time, and (2) you will
have network partitions.  Checkpoint/restart is a lazy option that has no
place in modern software.  Yet there doesn't seem to be any priority on going
beyond checkpoint/restart and rethinking software architecture.  I would
argue that's as important as, or more important than, figuring out manycore.

On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C)
<james.p.lux at jpl.nasa.gov> wrote:

> a lot of HPC software design
> assumes perfect hardware, or, that the hardware failure rate is
> sufficiently low that a checkpoint/restart (or "do it all over from the
> beginning") is an acceptable strategy.
>