[Beowulf] Supercomputers face growing resilience problems

Luc Vereecken kineticluc at gmail.com
Fri Nov 23 02:29:57 PST 2012


At the same time, there are APIs (e.g. HTCondor) that do not assume 
successful communication or computation; they are used in large 
distributed computing projects (SETI@home, Folding@home, distributed.net 
(though I don't think they have a toolbox available)). For 
embarrassingly parallel workloads, they can be a good match; for tightly 
coupled workloads, not always.
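
To make the contrast concrete, here is a minimal Python sketch of a task farm that does not assume a task succeeds: any task whose worker fails goes back on the queue. It is not HTCondor's API; `run_task_farm`, `make_flaky_square` and `max_attempts` are hypothetical names, and the failure is simulated by a worker that fails its first attempt at every task.

```python
from collections import deque


def run_task_farm(tasks, worker, max_attempts=5):
    """Run independent tasks, re-queuing any task whose worker fails.

    Hypothetical helper for illustration only; not part of any real scheduler.
    """
    pending = deque((task, 1) for task in tasks)
    results = {}
    while pending:
        task, attempt = pending.popleft()
        try:
            results[task] = worker(task)
        except RuntimeError:
            # The worker is assumed to signal a lost node with RuntimeError.
            if attempt >= max_attempts:
                raise
            pending.append((task, attempt + 1))
    return results


def make_flaky_square():
    """Return a worker that fails the first time it sees each task."""
    seen = set()

    def worker(x):
        if x not in seen:
            seen.add(x)
            raise RuntimeError("simulated node failure")
        return x * x

    return worker


if __name__ == "__main__":
    print(run_task_farm(range(5), make_flaky_square()))
```

This works only because the tasks are independent; a tightly coupled job cannot simply re-run one piece in isolation.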

Luc



On 11/23/2012 5:19 AM, Justin YUAN SHI wrote:
> The fundamental problem rests in our programming API. If you look at
> MPI and OpenMP carefully, you will find that these and all others have
> one common assumption: the application-level communication is always
> successful.
>
> We know full well that this cannot be true.
>
> Thus, the bigger the application we build, the higher probability of
> failure. This should not be a surprise.
>
> Proposed fault tolerance methods, such as redundant execution, are
> really like "borrow John to pay Paul" where both John and Paul are
> personal friends.
>
> What we need is a truly sustainable solution that can gain performance
> and reliability at the same time as we scale up the application.
>
> This is NOT an impossible dream. The packet-switching network is a
> living example of such an architecture. The missing piece in HPC
> applications is the principle of statistic multiplexed computing. In
> other words, the application architecture should be considered as a
> whole in the design space, not a "glued" together piece using lower
> layers with unsealed semantic "holes". The semantic "holes" between
> the layers are the real evils for all our troubles.
>
> Our research exhibit (booth 3360) demonstrates a prototype data
> parallel system using this idea. The Sustainable HPC Cloud Workshop at
> the end of SC12 (Friday AM) had one paper touching on this topic as
> well.
>
> Justin
>
>
>
> On Thu, Nov 22, 2012 at 5:03 AM, Eugen Leitl <eugen at leitl.org> wrote:
>>
>> http://www.computerworld.com.au/article/442703/supercomputers_face_growing_resilience_problems/
>>
>> Supercomputers face growing resilience problems
>>
>> As the number of components in large supercomputers grows, so does the
>> possibility of component failure
>>
>> Joab Jackson (IDG News Service)
>>
>> 21 November, 2012 21:58
>>
>> As supercomputers grow more powerful, they'll also grow more vulnerable to
>> failure, thanks to the increased amount of built-in componentry. A few
>> researchers at the recent SC12 conference, held last week in Salt Lake City,
>> offered possible solutions to this growing problem.
>>
>> Today's high-performance computing (HPC) systems can have 100,000 nodes or
>> more -- with each node built from multiple components of memory, processors,
>> buses and other circuitry. Statistically speaking, all these components will
>> fail at some point, and they halt operations when they do so, said David
>> Fiala, a Ph.D. student at North Carolina State University, during a talk
>> at SC12.
>>
>> The problem is not a new one, of course. When Lawrence Livermore National
>> Laboratory's 600-node ASCI (Accelerated Strategic Computing Initiative) White
>> supercomputer went online in 2001, it had a mean time between failures (MTBF)
>> of only five hours, thanks in part to component failures. Later tuning
>> efforts had improved ASCI White's MTBF to 55 hours, Fiala said.
>>
>> But as the number of supercomputer nodes grows, so will the problem.
>> "Something has to be done about this. It will get worse as we move to
>> exascale," Fiala said, referring to how supercomputers of the next decade are
>> expected to have 10 times the computational power that today's models do.
>>
>> Today's techniques for dealing with system failure may not scale very well,
>> Fiala said. He cited checkpointing, in which a running program is temporarily
>> halted and its state is saved to disk. Should the program then crash, the
>> system is able to restart the job from the last checkpoint.
>>
>> The problem with checkpointing, according to Fiala, is that as the number of
>> nodes grows, the amount of system overhead needed to do checkpointing grows
>> as well -- and grows at an exponential rate. On a 100,000-node supercomputer,
>> for example, only about 35 percent of the activity will be involved in
>> conducting work. The rest will be taken up by checkpointing and -- should a
>> system fail -- recovery operations, Fiala estimated.
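
For a rough feel of why the overhead climbs as MTBF shrinks, here is a first-order model (Young's approximation) in Python. It is not the model behind the 35 percent figure quoted above. It assumes failures arrive independently at rate 1/MTBF, a fixed checkpoint cost that is small compared with the MTBF, and no restart time; the 10-minute checkpoint cost in the usage lines is a made-up number, while the 55-hour and 5-hour MTBF values are the ASCI White figures from the article.

```python
import math


def optimal_interval(checkpoint_cost, mtbf):
    """Young's approximation for the checkpoint interval that minimizes waste."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(checkpoint_cost, mtbf, interval):
    """First-order fraction of time lost to checkpoints plus recomputation.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : on average half an interval is redone per failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost_min = 10.0  # assumed checkpoint cost in minutes (illustrative)
    for mtbf_hours in (55, 5):
        mtbf_min = mtbf_hours * 60.0
        tau = optimal_interval(cost_min, mtbf_min)
        waste = wasted_fraction(cost_min, mtbf_min, tau)
        print(f"MTBF {mtbf_hours} h: checkpoint every {tau:.0f} min, waste {waste:.1%}")
```

In this model the waste at the optimal interval is sqrt(2 * cost / MTBF), so it rises as the MTBF falls even when checkpoints are scheduled as well as possible.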
>>
>> Because of all the additional hardware needed for exascale systems, which
>> could be built from a million or more components, system reliability will
>> have to be improved by 100 times in order to keep to the same MTBF that
>> today's supercomputers enjoy, Fiala said.
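
The arithmetic behind that factor can be sketched under a textbook assumption that the article does not state: components fail independently at a constant rate, so the system MTBF is the component MTBF divided by the component count. The function names and the figures in the usage lines are illustrative, not measurements.

```python
def system_mtbf(component_mtbf_hours, n_components):
    """System MTBF when any single component failure halts the machine.

    Assumes independent components with identical, constant failure rates.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


def required_component_mtbf(target_system_mtbf_hours, n_components):
    """Component MTBF needed to hold the system MTBF at a given target."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return target_system_mtbf_hours * n_components


if __name__ == "__main__":
    # Illustrative numbers: hold a 100-hour system MTBF at two machine sizes.
    for n in (10_000, 1_000_000):
        print(n, "components need", required_component_mtbf(100, n), "hours each")
```

Under this assumption, growing the component count 100-fold means each component has to be 100 times more reliable just to stand still.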
>>
>> Fiala presented technology that he and fellow researchers developed that may
>> help improve reliability. The technology addresses the problem of silent data
>> corruption, when systems make undetected errors writing data to disk.
>>
>> Basically, the researchers' approach consists of running multiple copies, or
>> "clones" of a program, simultaneously and then comparing the answers. The
>> software, called RedMPI, is run in conjunction with the Message Passing
>> Interface (MPI), a library for splitting running applications across multiple
>> servers so the different parts of the program can be executed in parallel.
>>
>> RedMPI intercepts and copies every MPI message that an application sends, and
>> sends copies of the message to the clone (or clones) of the program. If
>> different clones calculate different answers, then the numbers can be
>> recalculated on the fly, which will save time and resources from running the
>> entire program again.
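
The comparison step can be sketched as a majority vote over the payloads that the clones send for the same logical message. This is only an illustration in Python, not RedMPI's interface; `vote` is a hypothetical helper, and payloads are assumed to be hashable values such as bytes.

```python
from collections import Counter


def vote(replica_payloads):
    """Return the payload sent by a strict majority of replicas.

    Raises ValueError when no payload has a strict majority, in which case
    the mismatch is detected but cannot be corrected by voting alone.
    """
    if not replica_payloads:
        raise ValueError("no replica payloads to compare")
    payload, count = Counter(replica_payloads).most_common(1)[0]
    if count * 2 <= len(replica_payloads):
        raise ValueError("no majority among replicas")
    return payload


if __name__ == "__main__":
    # Three clones, one of which produced a corrupted payload.
    print(vote([b"42", b"42", b"41"]))
```

With two replicas a mismatch can be detected but not corrected, which is why the vote raises; with three, a single corrupted payload is outvoted.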
>>
>> "Implementing redundancy is not expensive. It may be high in the number of
>> core counts that are needed, but it avoids the need for rewrites with
>> checkpoint restarts," Fiala said. "The alternative is, of course, to simply
>> rerun jobs until you think you have the right answer."
>>
>> Fiala recommended running two backup copies of each program, for triple
>> redundancy. Though running multiple copies of a program would initially take
>> up more resources, over time it may actually be more efficient, due to the
>> fact that programs would not need to be rerun to check answers. Also,
>> checkpointing may not be needed when multiple copies are run, which would
>> also save on system resources.
>>
>> "I think the idea of doing redundancy is actually a great idea. [For] very
>> large computations, involving hundreds of thousands of nodes, there certainly
>> is a chance that errors will creep in," said Ethan Miller, a computer science
>> professor at the University of California Santa Cruz, who attended the
>> presentation. But he said the approach may not be suitable given the
>> amount of network traffic that such redundancy might create. He suggested
>> running all the applications on the same set of nodes, which could minimize
>> internode traffic.
>>
>> In another presentation, Ana Gainaru, a Ph.D. student from the University of
>> Illinois at Urbana-Champaign, presented a technique of analyzing log files to
>> predict when system failures would occur.
>>
>> The work combines signal analysis with data mining. Signal analysis is used
>> to characterize normal behavior, so when a failure occurs, it can be easily
>> spotted. Data mining looks for correlations between separate reported
>> failures. Other researchers have shown that multiple failures are sometimes
>> correlated with each other, because a failure with one technology may affect
>> performance in others, according to Gainaru. For instance, when a network
>> card fails, it will soon hobble other system processes that rely on network
>> communication.
>>
>> The researchers found that 70 percent of correlated failures provide a window
>> of opportunity of more than 10 seconds. In other words, when the first sign
>> of a failure has been detected, the system may have up to 10 seconds to save
>> its work, or move the work to another node, before a more critical failure
>> occurs. "Failure prediction can be merged with other fault-tolerance
>> techniques," Gainaru said.
>>
>> Joab Jackson covers enterprise software and general technology breaking news
>> for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's
>> e-mail address is Joab_Jackson at idg.com
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


