[Beowulf] Supercomputers face growing resilience problems

Greg Lindahl lindahl at pbm.com
Fri Nov 23 16:07:31 PST 2012

On Thu, Nov 22, 2012 at 11:19:51PM -0500, Justin YUAN SHI wrote:
> The fundamental problem rests in our programming API. If you look at
> MPI and OpenMP carefully, you will find that these and all others have
> one common assumption: the application-level communication is always
> successful.


You keep on saying this, but it's simply not true. MPI implementations
typically retry until success, and if the communication network has a
failure that can be fixed by retry or hot swapping, the application
need not fail.

It's _node_ failure that's the problem, not "application-level
communication" failure.
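
To make the distinction concrete, here is a toy sketch in Python. It is
not MPI code and does not reflect how any particular MPI implementation
works internally; every name in it (send_with_retry, make_flaky_link,
dead_node, and the two exception classes) is hypothetical and exists
only for illustration. It assumes a transport that can tell a dropped
message apart from a dead peer.

```python
class TransientLinkError(Exception):
    """A send that failed but may succeed if repeated (assumed category)."""


class NodeDownError(Exception):
    """The destination node is gone; repeating the send cannot help."""


def send_with_retry(send_once, payload, max_attempts=5):
    """Call send_once(payload) until it succeeds or attempts run out.

    Returns the number of attempts used. NodeDownError is deliberately
    not caught: retrying cannot bring a failed node back.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send_once(payload)
            return attempt
        except TransientLinkError:
            continue  # transient fault: try again
    raise TransientLinkError(f"gave up after {max_attempts} attempts")


def make_flaky_link(failures_before_success):
    """Hypothetical link that drops the first N sends, then delivers."""
    remaining = [failures_before_success]

    def send_once(payload):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientLinkError("message dropped")

    return send_once


def dead_node(payload):
    """Hypothetical destination whose node has failed."""
    raise NodeDownError("peer is down")
```

With the flaky link, the caller only ever sees a successful send that
took a few attempts. With the dead node, the error reaches the caller
on the first attempt, and no amount of retrying would change that.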

I've met a lot of people working on fault tolerance for scientific
computing over the past decade and a half, and none of them have been
this unclear when describing the basic issue.

Your description of your proposed solution is also super unclear.

-- greg

