[Beowulf] Supercomputers face growing resilience problems

Fri Nov 23 06:44:00 PST 2012

It's not that there aren't solutions available for specific problems.  The
challenge is that some of the solutions don't scale well, or that they are
not general enough to handle the gamut of problems that aren't
embarrassingly parallel (non-EP).

I don't think there will be a silver bullet that fixes everything, but I
think we'll evolve toward classes of solutions that each handle certain
classes of problems.  After all, we don't use the same error correction
codes for memory and for hard disks.

But the basic underlying comment is right:  a lot of HPC software design
assumes either perfect hardware or a hardware failure rate low enough
that checkpoint/restart (or "do it all over from the beginning") is an
acceptable strategy.
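
For what it's worth, checkpoint/restart amounts to something like the
following minimal Python sketch.  The function name, the JSON state file,
and the fail_at knob (which simulates a crash) are all made up for
illustration; a real HPC code would checkpoint through its own I/O layer
and would have far more state to save.

```python
import json
import os
import tempfile


def run_with_checkpoints(n_steps, path, interval=100, fail_at=None):
    """Sum 0..n_steps-1, saving progress to `path` every `interval` steps.

    Hypothetical example: `fail_at` simulates a hardware failure at that step.
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")

        # One unit of "real" work.
        state["total"] += state["step"]
        state["step"] += 1

        if state["step"] % interval == 0:
            # Write to a temp file and rename it into place, so a crash
            # in the middle of a write cannot corrupt the last checkpoint.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)

    return state["total"]
```

If the first run dies at step 250, calling the function again with the
same path picks up at step 200 and redoes only the 50 lost steps.  The
catch is the one above:  this only pays off while the time between
failures is long compared to the checkpoint interval.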

This is fine.  It has been hard enough figuring out how to
parallelize/clusterize the solution (that took some decades).
I'm confident that over the next few decades we'll figure out how to deal
with unreliable hardware and software (because, after all, software bugs
are a problem too).

On 11/23/12 2:29 AM, "Luc Vereecken" <kineticluc at gmail.com> wrote:

>At the same time, there are APIs (e.g. HTCondor) that do not assume
>successful communications or computation; they are used in large
>distributed computing projects (SETI@home, Folding@home, distributed.net
>(though I don't think they have a toolbox available)). For
>embarrassingly parallel workloads, they can be a good match; for tightly
>coupled workloads, not always.
>
>Luc
>
>
>
>On 11/23/2012 5:19 AM, Justin YUAN SHI wrote:
>> The fundamental problem rests in our programming API. If you look at
>> MPI and OpenMP carefully, you will find that these and all others have
>> one common assumption: the application-level communication is always
>> successful.
>>
>> We knew full well that this cannot be true.