[Beowulf] Supercomputers face growing resilience problems
Greg Lindahl
lindahl at pbm.com
Fri Nov 23 16:07:31 PST 2012
On Thu, Nov 22, 2012 at 11:19:51PM -0500, Justin YUAN SHI wrote:
> The fundamental problem rests in our programming API. If you look at
> MPI and OpenMP carefully, you will find that these and all others have
> one common assumption: the application-level communication is always
> successful.
Justin,
You keep on saying this, but it's simply not true. MPI implementations
typically retry until success, and if the communication network has a
failure that can be fixed by retry or hot swapping, the application
need not fail.
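To make the retry point concrete, here is a minimal sketch of retrying below the application layer. It is a toy model, not how any real MPI library is implemented: FlakyLink, TransientLinkError and send_with_retry are hypothetical names made up for illustration, and the link is simulated in-process.

```python
class TransientLinkError(Exception):
    """Hypothetical error for a recoverable link fault (e.g. a dropped packet)."""


class FlakyLink:
    """Toy stand-in for a network link: fails a fixed number of times, then works."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.delivered = []

    def transmit(self, message):
        # Simulate a transient fault until the failure budget is used up.
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientLinkError("transient fault")
        self.delivered.append(message)


def send_with_retry(link, message):
    """Retry below the application until the message gets through.

    Returns the number of attempts made. There is deliberately no retry
    limit, mirroring "retry until success". If the peer node is dead,
    no amount of retrying helps and this loop would never return.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            link.transmit(message)
            return attempts
        except TransientLinkError:
            continue  # transient fault: try again


# The application makes one call and sees one successful send.
link = FlakyLink(failures_before_success=3)
attempts = send_with_retry(link, "halo exchange data")
print(attempts, link.delivered)
```

The caller never sees the three transient faults, only a completed send. What this loop cannot paper over is the peer going away entirely.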
It's _node_ failure that's the problem, not "application-level
communication" failure.
Over the past decade and a half I've met a lot of people working on
adding fault tolerance to scientific computing, and none of them has
been this unclear when describing the basic issue.
Your description of your proposed solution is also super unclear.
-- greg