At the Slurm User Group last year Dona Crawford of LLNL gave the
keynote and, as part of that, talked about some of the challenges of
exascale.

The one everyone thinks about first is power, but the other one she
touched on was reliability and uptime.

Basically if you scale a current petascale system up to exascale you
are looking at an expected full-system uptime of between seconds and
minutes.  For comparison Sequoia, their petaflop BG/Q, has a
systemwide MTBF of about a day.

That causes problems if you're expecting to do checkpoint/restart to
cope with failures, so really you've got to look at fault tolerance
within the applications themselves.  Hands up if you've got (or know
of) a code that can gracefully tolerate nodes going away whilst the
job is running, and meaningfully continue?
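
To put rough numbers on why checkpoint/restart stops working, here is
a back-of-the-envelope sketch.  It is my own illustration, not
something from the keynote.  It assumes failures are independent (so
system MTBF shrinks in proportion to component count), uses Young's
first-order approximation for the checkpoint interval, and picks a
purely hypothetical 10 minute checkpoint cost and a naive 1000x growth
in components.

```python
import math

DAY_S = 24 * 3600.0

# Assumption for illustration only: time to write one full checkpoint.
CHECKPOINT_COST_S = 600.0


def scaled_mtbf(mtbf_s, component_factor):
    """System MTBF after growing the component count by component_factor,
    assuming independent failures at a constant per-component rate."""
    return mtbf_s / component_factor


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the best time between checkpoints.
    Only meaningful when checkpoint_cost_s is much smaller than mtbf_s."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpointing_is_viable(checkpoint_cost_s, mtbf_s):
    """Crude test: a checkpoint has to finish before the next expected
    failure, otherwise the job never makes forward progress."""
    return checkpoint_cost_s < mtbf_s


if __name__ == "__main__":
    cases = [
        ("petascale, MTBF ~1 day", DAY_S),
        ("naive 1000x scale-up", scaled_mtbf(DAY_S, 1000)),
    ]
    for label, mtbf_s in cases:
        if checkpointing_is_viable(CHECKPOINT_COST_S, mtbf_s):
            interval_h = young_interval(CHECKPOINT_COST_S, mtbf_s) / 3600.0
            print(f"{label}: checkpoint roughly every {interval_h:.1f} h")
        else:
            print(f"{label}: MTBF {mtbf_s:.1f} s is shorter than one checkpoint")
```

With those assumed numbers the one-day MTBF case wants a checkpoint
roughly every 2.8 hours, which is fine.  The scaled-up case has an
MTBF of about 86 seconds, so the machine is expected to fail before a
single checkpoint has finished writing.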

The Slurm folks are already looking at this in terms of having some
way of bargaining with the scheduler in case of node failure.  There
are slides up on what they are planning here:


