[Beowulf] Supercomputers face growing resilience problems
eugen at leitl.org
Thu Nov 22 02:03:19 PST 2012
Supercomputers face growing resilience problems
As the number of components in large supercomputers grows, so does the
possibility of component failure
Joab Jackson (IDG News Service)
21 November, 2012 21:58
As supercomputers grow more powerful, they'll also grow more vulnerable to
failure, thanks to the increased amount of built-in componentry. A few
researchers at the recent SC12 conference, held last week in Salt Lake City,
offered possible solutions to this growing problem.
Today's high-performance computing (HPC) systems can have 100,000 nodes or
more -- with each node built from multiple components of memory, processors,
buses and other circuitry. Statistically speaking, all these components will
fail at some point, and they halt operations when they do so, said David
Fiala, a Ph.D. student at North Carolina State University, during a talk
The problem is not a new one, of course. When Lawrence Livermore National
Laboratory's 600-node ASCI (Accelerated Strategic Computing Initiative) White
supercomputer went online in 2001, it had a mean time between failures (MTBF)
of only five hours, thanks in part to component failures. Later tuning
efforts improved ASCI White's MTBF to 55 hours, Fiala said.
But as the number of supercomputer nodes grows, so will the problem.
"Something has to be done about this. It will get worse as we move to
exascale," Fiala said, referring to how supercomputers of the next decade are
expected to have 10 times the computational power that today's models do.
Today's techniques for dealing with system failure may not scale very well,
Fiala said. He cited checkpointing, in which a running program is temporarily
halted and its state is saved to disk. Should the program then crash, the
system is able to restart the job from the last checkpoint.
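The checkpoint-and-restart cycle described above can be sketched in a few lines of Python. This is a single-process illustration only, not how any particular HPC checkpointing library works: the names `save_checkpoint`, `load_checkpoint` and `run_job`, the pickle file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically swap it in.

    A crash in the middle of a write therefore cannot damage the last
    good checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that simulates a component failure
    just before the given step is executed.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]   # the "real work" of this step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)  # work pauses while state is saved
    return state["total"]
```

If a run fails partway through, calling `run_job` again with the same path resumes from the last saved step rather than from step zero. The work done between the last checkpoint and the crash is lost and has to be repeated.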
The problem with checkpointing, according to Fiala, is that as the number of
nodes grows, the amount of system overhead needed to do checkpointing grows
as well -- and grows at an exponential rate. On a 100,000-node supercomputer,
for example, only about 35 percent of the activity will be involved in
conducting work. The rest will be taken up by checkpointing and -- should a
system fail -- recovery operations, Fiala estimated.
Because of all the additional hardware needed for exascale systems, which
could be built from a million or more components, system reliability will
have to be improved by 100 times in order to keep to the same MTBF that
today's supercomputers enjoy, Fiala said.
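A rough back-of-the-envelope model shows why both effects bite as node counts rise. The sketch below is not Fiala's model and is not expected to reproduce his 35 percent figure. It assumes that nodes fail independently at a constant rate, that any single node failure halts the job, and that the checkpoint interval follows Young's first-order approximation, which is only reasonable when a checkpoint takes much less time than the system MTBF. The function names and the example numbers are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours, num_nodes):
    """MTBF of the whole machine, assuming independent nodes with constant
    failure rates and a job that halts when any one node fails."""
    return node_mtbf_hours / num_nodes


def optimal_checkpoint_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order approximation: sqrt(2 * cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def useful_fraction(checkpoint_cost_hours, mtbf_hours):
    """Approximate share of time spent on real work.

    Waste has two parts: time spent writing checkpoints (cost / interval)
    and work redone after a failure (on average half an interval per
    failure, i.e. interval / (2 * MTBF)). Restart time is ignored.
    """
    tau = optimal_checkpoint_interval(checkpoint_cost_hours, mtbf_hours)
    waste = checkpoint_cost_hours / tau + tau / (2.0 * mtbf_hours)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 5 * 365 * 24   # hypothetical: five years per node
    CHECKPOINT_HOURS = 10 / 60       # hypothetical: ten minutes per checkpoint
    for nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(NODE_MTBF_HOURS, nodes)
        print(nodes, round(m, 2), round(useful_fraction(CHECKPOINT_HOURS, m), 2))
```

Under these assumptions the system MTBF shrinks in direct proportion to the node count, so keeping it constant while adding nodes means making each node proportionally more reliable. The useful fraction falls at the same time, because checkpoints must be taken more often and more work is lost per failure.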
Fiala presented technology that he and fellow researchers developed that may
help improve reliability. The technology addresses the problem of silent data
corruption, in which systems make undetected errors when writing data to disk.
Basically, the researchers' approach consists of running multiple copies, or
"clones," of a program simultaneously and then comparing the answers. The
software, called RedMPI, is run in conjunction with the Message Passing
Interface (MPI), a library for splitting running applications across multiple
servers so the different parts of the program can be executed in parallel.
RedMPI intercepts and copies every MPI message that an application sends, and
sends copies of the message to the clone (or clones) of the program. If
different clones calculate different answers, then the numbers can be
recalculated on the fly, which saves the time and resources that rerunning
the entire program would require.
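The comparison step can be illustrated with a simple majority vote over the values that each clone produces. The article does not describe RedMPI's actual protocol, so the sketch below is only an assumption about what such a comparison might look like; it makes no MPI calls, and the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(replica_values):
    """Compare the values produced by each clone of a program.

    Returns (majority_value, dissenters), where dissenters lists the
    indices of clones whose value differs from the majority. Raises
    ValueError when no value is held by a strict majority, in which
    case the corruption is detected but cannot be corrected.
    Values must be hashable, for example bytes or tuples.
    """
    counts = Counter(replica_values)
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(replica_values):
        raise ValueError("no majority among replicas")
    dissenters = [i for i, v in enumerate(replica_values) if v != value]
    return value, dissenters


if __name__ == "__main__":
    # Three clones send the "same" message; the third has a flipped byte.
    messages = [b"\x01\x02\x03", b"\x01\x02\x03", b"\x01\x02\x07"]
    good, bad = vote(messages)
    print(good, bad)
```

With only two copies a mismatch can be noticed but there is no way to tell which copy is wrong. A third copy lets the two that agree outvote the one that does not.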
"Implementing redundancy is not expensive. It may be high in the number of
core counts that are needed, but it avoids the need for rewrites with
checkpoint restarts," Fiala said. "The alternative is, of course, to simply
rerun jobs until you think you have the right answer."
Fiala recommended running two backup copies of each program, for triple
redundancy. Though running multiple copies of a program would initially take
up more resources, over time it may actually be more efficient, because
programs would not need to be rerun to check answers. Checkpointing may also
be unnecessary when multiple copies are run, which would save further system
resources.
"I think the idea of doing redundancy is actually a great idea. [For] very
large computations, involving hundreds of thousands of nodes, there certainly
is a chance that errors will creep in," said Ethan Miller, a computer science
professor at the University of California Santa Cruz, who attended the
presentation. But he said the approach may not be suitable given the
amount of network traffic that such redundancy might create. He suggested
running all the applications on the same set of nodes, which could minimize
In another presentation, Ana Gainaru, a Ph.D. student from the University of
Illinois at Urbana-Champaign, presented a technique of analyzing log files to
predict when system failures would occur.
The work combines signal analysis with data mining. Signal analysis is used
to characterize normal behavior, so when a failure occurs, it can be easily
spotted. Data mining looks for correlations between separate reported
failures. Other researchers have shown that multiple failures are sometimes
correlated with each other, because a failure in one component may degrade
the performance of others, according to Gainaru. For instance, when a network
card fails, it will soon hobble other system processes that rely on network
The researchers found that 70 percent of correlated failures provide a window
of opportunity of more than 10 seconds. In other words, when the first sign
of a failure has been detected, the system may have more than 10 seconds to
save its work, or move the work to another node, before a more critical failure
occurs. "Failure prediction can be merged with other fault-tolerance
techniques," Gainaru said.
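The two ingredients of this approach, spotting departures from normal behavior and measuring how far apart related events fall, can be sketched roughly as follows. This is not Gainaru's method: the median-based outlier test stands in for the signal analysis, a simple scan for follow-up events stands in for the data mining, and the function names, the threshold and the event labels `nic_error` and `mpi_timeout` are all hypothetical.

```python
from statistics import median


def flag_outliers(counts, threshold=5.0):
    """Return indices of time windows whose log-message count is unusual.

    "Normal" is taken to be the median count; the spread is the median
    absolute deviation (MAD). A window is flagged when it lies more than
    `threshold` MADs from the median.
    """
    med = median(counts)
    mad = median([abs(c - med) for c in counts]) or 1.0  # avoid dividing by 0
    return [i for i, c in enumerate(counts) if abs(c - med) / mad > threshold]


def follow_up_delays(events, first, second, max_gap):
    """Seconds between each `first` event and the next `second` event.

    `events` is a time-sorted list of (timestamp_seconds, event_type).
    Only pairs at most `max_gap` seconds apart are treated as related.
    """
    delays = []
    for i, (t, kind) in enumerate(events):
        if kind != first:
            continue
        for t2, kind2 in events[i + 1:]:
            if t2 - t > max_gap:
                break
            if kind2 == second:
                delays.append(t2 - t)
                break
    return delays


if __name__ == "__main__":
    counts = [10, 11, 9, 10, 12, 10, 95, 10]   # messages per window
    print(flag_outliers(counts))

    log = [(0, "nic_error"), (5, "other"), (12, "mpi_timeout"),
           (100, "nic_error"), (103, "mpi_timeout"), (200, "nic_error")]
    delays = follow_up_delays(log, "nic_error", "mpi_timeout", max_gap=60)
    usable = sum(1 for d in delays if d > 10) / len(delays)
    print(delays, usable)
```

The share of delays longer than 10 seconds corresponds to the window of opportunity described above: the cases in which a first warning sign arrives early enough for the system to save or migrate its work.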
Joab Jackson covers enterprise software and general technology breaking news
for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's
e-mail address is Joab_Jackson at idg.com