[Beowulf] Supercomputers face growing resilience problems
diep at xs4all.nl
Fri Nov 23 04:13:50 PST 2012
On Nov 23, 2012, at 5:19 AM, Justin YUAN SHI wrote:
> The fundamental problem rests in our programming API. If you look at
> MPI and OpenMP carefully, you will find that these and all others have
> one common assumption: the application-level communication is always
> reliable. We knew full well that this cannot be true.
> Thus, the bigger the application we build, the higher the probability
> of failure. This should not be a surprise.
Totally theoretical talk.

The AMD GPUs have, what is it, an 8KB L1 instruction cache.
Practically speaking you must run 2 threads from that small L1.

Xeon Phi has 32KB or so, and the instructions probably are huge (4
bytes or so, I would blindfolded guess),
so maybe that's 8192 instructions or so (I would need to look it up).
Maybe its L2 can store instructions though.
Note, from what I understand, you must run 4 threads from that 32KB at
the same time.

Nvidia gives you the possibility to divide the 64KB L1 cache yourself.
The size of the instructions is not known to me.
Possibly it's not too many bytes per instruction. Somewhere in the
44KB-48KB range is the maximum (leaving you 16KB of L1
for data cache). You must run 2 threads.
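As a back-of-envelope check on the numbers above (the 32KB size, the
4-byte instruction width, and the 4-thread sharing are this post's
guesses, not vendor specs):

```python
# Back-of-envelope estimate of how many instructions fit in an L1
# instruction cache. The sizes used here are guesses from the
# discussion above, not vendor-confirmed figures.

def instructions_in_cache(cache_bytes, bytes_per_instr):
    """Whole instructions a cache of cache_bytes can hold."""
    return cache_bytes // bytes_per_instr

total = instructions_in_cache(32 * 1024, 4)   # Xeon Phi guess: 32KB, 4B/instr
per_thread = total // 4                       # shared by 4 threads
print(total, per_thread)  # 8192 2048
```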
The real problem is writing good code for manycores, and nothing else.
See how few students manage to write good GPU code. There isn't even
a chess program for a GPU that works well,
as the few who can write such code only do it when paid.
You should be busy finding a way to get paid experts to write good
code for science - as the 1 or 2 researchers who know
how to write such code are too few, too little and not enough.
All those years MPI had a similar problem.
See a posting I made here and there, asking for a discussion on how
to do memory migration efficiently with MPI.
There were 0 reactions. The few guys who know - they're paid.
*That* is the problem in general - the simple problem that the
experts, the few who are there, want to get paid.
By modifying an API you won't ever solve that problem - as the
hardware dictates that you write your number-crunching code
for very limited caches.
> Proposed fault tolerance methods, such as redundant execution, are
> really like "borrowing from John to pay Paul" where both John and Paul
> are personal friends.
> What we need is a truly sustainable solution that can gain performance
> and reliability at the same time as we scale up the application.
> This is NOT an impossible dream. The packet-switching network is a
> living example of such an architecture. The missing piece in HPC
> applications is the principle of statistic multiplexed computing. In
> other words, the application architecture should be considered as a
> whole in the design space, not a "glued" together piece using lower
> layers with unsealed semantic "holes". The semantic "holes" between
> the layers are the real evils for all our troubles.
> Our research exhibit (booth 3360) demonstrates a prototype data
> parallel system using this idea. The Sustainable HPC Cloud Workshop at
> the end of SC12 (Friday AM) had one paper touching on this topic as
> well.
> On Thu, Nov 22, 2012 at 5:03 AM, Eugen Leitl <eugen at leitl.org> wrote:
>> Supercomputers face growing resilience problems
>> As the number of components in large supercomputers grows, so does
>> the possibility of component failure
>> Joab Jackson (IDG News Service)
>> 21 November, 2012 21:58
>> As supercomputers grow more powerful, they'll also grow more
>> vulnerable to failure, thanks to the increased amount of built-in
>> componentry. Researchers at the recent SC12 conference, held last
>> week in Salt Lake City, offered possible solutions to this growing
>> problem.
>> Today's high-performance computing (HPC) systems can have 100,000
>> nodes or more -- with each node built from multiple components:
>> memory, buses and other circuitry. Statistically speaking, all these
>> components will
>> fail at some point, and they halt operations when they do so, said
>> David Fiala, a Ph.D. student at North Carolina State University,
>> during a talk
>> at SC12.
>> The problem is not a new one, of course. When Lawrence Livermore
>> National Laboratory's 600-node ASCI (Accelerated Strategic Computing
>> Initiative) White
>> supercomputer went online in 2001, it had a mean time between
>> failures (MTBF)
>> of only five hours, thanks in part to component failures. Later
>> efforts had improved ASCI White's MTBF to 55 hours, Fiala said.
>> But as the number of supercomputer nodes grows, so will the problem.
>> "Something has to be done about this. It will get worse as we move to
>> exascale," Fiala said, referring to how supercomputers of the next
>> decade are
>> expected to have 10 times the computational power that today's
>> models do.
>> Today's techniques for dealing with system failure may not scale
>> very well,
>> Fiala said. He cited checkpointing, in which a running program is
>> halted and its state is saved to disk. Should the program then
>> crash, the
>> system is able to restart the job from the last checkpoint.
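The checkpoint/restart scheme described there can be sketched in a few
lines. The file name and the toy "computation" below are invented for
illustration:

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name

def save_checkpoint(state):
    # Write atomically: dump to a temp file, then rename over the old
    # one, so a crash mid-write never corrupts the last good checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # Resume from the last saved state, or start fresh if none exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0}

state = load_checkpoint()
while state["step"] < 10:
    state["result"] += state["step"]   # stand-in for real computation
    state["step"] += 1
    if state["step"] % 5 == 0:         # checkpoint every 5 steps
        save_checkpoint(state)
print(state["result"])  # 45
```

After a crash, rerunning the same script picks up from the last saved
checkpoint instead of step 0.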
>> The problem with checkpointing, according to Fiala, is that as the
>> number of
>> nodes grows, the amount of system overhead needed to do
>> checkpointing grows
>> as well -- and grows at an exponential rate. On a 100,000-node
>> system, for example, only about 35 percent of the activity will be
>> involved in
>> conducting work. The rest will be taken up by checkpointing and --
>> should a
>> system fail -- recovery operations, Fiala estimated.
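A toy model shows why the useful fraction collapses as nodes are
added. This uses Young's classic approximation for the optimal
checkpoint interval with invented parameters, not Fiala's actual
measurements:

```python
import math

# Toy model, not Fiala's numbers: each node has a fixed MTBF, so the
# whole system's MTBF shrinks in proportion to the node count, while
# writing one checkpoint costs a fixed amount of wall-clock time.

def useful_fraction(nodes, node_mtbf_h=100_000.0, ckpt_cost_h=0.1):
    system_mtbf = node_mtbf_h / nodes
    # Young's approximation for the optimal checkpoint interval.
    interval = math.sqrt(2 * ckpt_cost_h * system_mtbf)
    # Share of each cycle spent on real work (this ignores failure
    # recovery and rework, so the true fraction is even lower).
    return interval / (interval + ckpt_cost_h)

for n in (1_000, 10_000, 100_000):
    print(n, round(useful_fraction(n), 2))
```

The fraction falls monotonically with node count, which is the trend
Fiala describes, even though the exact 35 percent figure depends on
parameters this sketch does not try to reproduce.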
>> Because of all the additional hardware needed for exascale
>> systems, which
>> could be built from a million or more components, system
>> reliability will
>> have to be improved by 100 times in order to keep to the same MTBF
>> that today's supercomputers enjoy, Fiala said.
>> Fiala presented technology that he and fellow researchers
>> developed that may
>> help improve reliability. The technology addresses the problem of
>> silent data
>> corruption, when systems make undetected errors writing data to disk.
>> Basically, the researchers' approach consists of running multiple
>> copies, or
>> "clones" of a program, simultaneously and then comparing the
>> answers. The
>> software, called RedMPI, is run in conjunction with the Message
>> Passing Interface (MPI), a library for splitting running applications
>> across multiple servers so the different parts of the program can be
>> executed in parallel.
>> RedMPI intercepts and copies every MPI message that an application
>> sends, and
>> sends copies of the message to the clone (or clones) of the
>> program. If
>> different clones calculate different answers, then the numbers can be
>> recalculated on the fly, which will save the time and resources of
>> rerunning the entire program.
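A toy sketch of that interception-and-comparison idea (this is not the
actual RedMPI API; the functions below are invented for illustration):

```python
# Toy sketch of the interception idea, not the real RedMPI API: each
# message is sent twice (original plus clone copy), and the receiver
# compares the two to catch silent corruption in flight.

def send_with_clone(value, corrupt_primary=False):
    # corrupt_primary simulates a silent single-bit error in transit.
    primary = value ^ 1 if corrupt_primary else value
    clone = value  # the clone's copy travels on a separate path
    return primary, clone

def receive_checked(primary, clone):
    # A mismatch means at least one copy was silently corrupted.
    if primary != clone:
        raise ValueError("silent data corruption detected")
    return primary

print(receive_checked(*send_with_clone(42)))  # 42
try:
    receive_checked(*send_with_clone(42, corrupt_primary=True))
except ValueError as err:
    print(err)  # silent data corruption detected
```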
>> "Implementing redundancy is not expensive. It may be high in the
>> number of
>> core counts that are needed, but it avoids the need for rewrites with
>> checkpoint restarts," Fiala said. "The alternative is, of course,
>> to simply
>> rerun jobs until you think you have the right answer."
>> Fiala recommended running two backup copies of each program, for
>> redundancy. Though running multiple copies of a program would
>> initially take
>> up more resources, over time it may actually be more efficient,
>> due to the
>> fact that programs would not need to be rerun to check answers. Also,
>> checkpointing may not be needed when multiple copies are run,
>> which would
>> also save on system resources.
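With two backup copies, i.e. three in total as Fiala recommends, a
mismatch can be outvoted rather than merely detected, which is why the
job need not be rerun. A minimal sketch, with invented values:

```python
from collections import Counter

# With one original plus two clones, a single corrupted copy is
# outvoted and the correct value is recovered without a rerun.

def majority_vote(copies):
    """Return the value a strict majority of the copies agrees on."""
    value, count = Counter(copies).most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority: result cannot be recovered")
    return value

# One clone's answer got silently corrupted; the other two outvote it.
print(majority_vote([3.14, 3.14, 9.99]))  # 3.14
```

Two copies can only detect a discrepancy; the third is what makes
on-the-fly recovery possible.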
>> "I think the idea of doing redundancy is actually a great idea.
>> [For] very
>> large computations, involving hundreds of thousands of nodes,
>> there certainly
>> is a chance that errors will creep in," said Ethan Miller, a
>> computer science
>> professor at the University of California Santa Cruz, who attended
>> the presentation. But he said the approach may not be suitable given
>> the amount of network traffic that such redundancy might create. He
>> suggested running all the applications on the same set of nodes,
>> which could reduce internode traffic.
>> In another presentation, Ana Gainaru, a Ph.D. student from the
>> University of
>> Illinois at Urbana-Champaign, presented a technique of analyzing
>> log files to
>> predict when system failures would occur.
>> The work combines signal analysis with data mining. Signal
>> analysis is used
>> to characterize normal behavior, so when a failure occurs, it can
>> be easily
>> spotted. Data mining looks for correlations between separate reported
>> failures. Other researchers have shown that multiple failures are
>> correlated with each other, because a failure with one technology
>> may affect
>> performance in others, according to Gainaru. For instance, when a
>> network card fails, it will soon hobble other system processes that
>> rely on network communication.
>> The researchers found that 70 percent of correlated failures
>> provide a window
>> of opportunity of more than 10 seconds. In other words, when the
>> first sign
>> of a failure has been detected, the system may have up to 10
>> seconds to save
>> its work, or move the work to another node, before a more critical
>> failure occurs. "Failure prediction can be merged with other fault-tolerance
>> techniques," Gainaru said.
>> Joab Jackson covers enterprise software and general technology
>> breaking news
>> for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson.
>> His e-mail address is Joab_Jackson at idg.com
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
More information about the Beowulf mailing list