[Beowulf] Exascale by the end of the year?

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Wed Mar 5 13:22:43 PST 2014

On 3/5/14 7:55 AM, "Douglas Eadline" <deadline at eadline.org> wrote:

>> Hash: SHA1
>> On 05/03/14 13:52, Joe Landman wrote:
>>> I think the real question is would the system be viable in a
>>> commercial sense, or is this another boondoggle?
>> At the Slurm User Group last year Dona Crawford of LLNL gave the
>> keynote and as part of that talked about some of the challenges of
>> exascale.
>> The one everyone thinks about first is power, but the other one she
>> touched on was reliability and uptime.

Shades of ENIAC or the Q7 and continuous replacement of vacuum tubes.

>Indeed, the fact that these issues were not even mentioned
>means to me the project is not very well thought out.
>At exascale (using current tech) failure recovery must be built
>into any design, either software and/or hardware.

Failure tolerance, rather than recovery.  Your algorithms need to expect
failures/errors and just keep on going.
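A toy sketch of what "expect failures and just keep on going" can look
like for a simple reduction; the worker function, chunk layout, and
failure rate here are all made up for illustration, not from any real
code:

```python
import math
import random

FAIL_PROB = 0.2                      # assumed per-task failure rate

def worker_sum(chunk):
    # Simulate a node going away mid-job.
    if random.random() < FAIL_PROB:
        raise RuntimeError("node lost")
    return sum(chunk)

def tolerant_mean(chunks):
    # Estimate the global mean from whichever chunks survive, instead
    # of aborting the whole job on the first failure.
    total, count = 0.0, 0
    for c in chunks:
        try:
            total += worker_sum(c)
        except RuntimeError:
            continue                 # note the loss, keep computing
        count += len(c)
    return total / count if count else math.nan

random.seed(0)
chunks = [[1.0] * 10 for _ in range(20)]
print(tolerant_mean(chunks))         # answer survives the lost chunks
```

The answer degrades gracefully (fewer samples) rather than the job
dying, which is the point of tolerance over recovery.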

It's like network comms.  TCP works great until there's significant
packet loss; then the retries start to pile up, the packet loss
affects the retries, and so on.    If, instead, you do forward error
correction, you give up a bit of performance/bandwidth, but you get
much better-behaved performance under loss/erasures.
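To make the FEC trade concrete, here's a minimal, purely illustrative
single-parity sketch: one extra XOR packet per group buys back any one
lost packet with no retry. Packet contents and group size are invented
for the example.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    # XOR two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    # One parity packet = XOR of all data packets in the group.
    return reduce(xor_bytes, packets)

def recover_missing(received, parity):
    # XOR the parity with everything that did arrive; the residue
    # is exactly the single missing packet.
    return reduce(xor_bytes, received, parity)

packets = [b"exas", b"cale", b"hpc!"]     # toy 4-byte packets
parity = make_parity(packets)
arrived = [packets[0], packets[2]]        # packet 1 was dropped
assert recover_missing(arrived, parity) == packets[1]
```

The bandwidth cost (one parity packet per group) is fixed and paid up
front, which is why the performance stays well-behaved under loss.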

>> Basically if you scale a current petascale system up to exascale you
>> are looking at an expected full-system uptime of between seconds and
>> minutes.  For comparison Sequoia, their petaflop BG/Q, has a
>> systemwide MTBF of about a day.
>I recall that HPL will take about 6 days to run
>on an exascale machine.
>> That causes problems if you're expecting to do checkpoint/restart to
>> cope with failures, so really you've got to look at fault tolerances
>> within applications themselves.   Hands up if you've got (or know of)
>> a code that can gracefully tolerate and meaningfully continue if nodes
>> going away whilst the job is running?
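A back-of-envelope version of those quoted figures. The 1/N MTBF
scaling and the exponential failure model are my assumptions, not
anything from the talk:

```python
import math

# Assumed model: system MTBF scales as 1/N with component count, and
# failures are independent/exponential, so a job of length T finishes
# failure-free with probability exp(-T / MTBF).
petascale_mtbf_h = 24.0              # Sequoia: ~1 day, per the thread
scale = 1000                         # petascale -> exascale, same tech
exascale_mtbf_s = petascale_mtbf_h * 3600 / scale

hpl_s = 6 * 24 * 3600                # the ~6-day HPL run quoted above
p_clean = math.exp(-hpl_s / exascale_mtbf_s)

print(f"exascale system MTBF ~ {exascale_mtbf_s:.0f} s")
print(f"P(6-day HPL, zero failures) ~ {p_clean:.1e}")
```

Under those assumptions the system-wide MTBF lands in the "seconds to
minutes" range quoted, and a checkpoint-free 6-day run is hopeless.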
>I would hate to have my $50B machine give me the wrong answer
>when such large amounts of money are involved. And we all know
>it is going to kick out "42" at some point.

And this is a fundamental problem that many of these applications face.
When error rates are low, a strategy of checksums (that's what double-
entry book-keeping is all about) works pretty well, because the cost of
recovery from a "detection", while large, is a small fraction of the
overall cost.

If a bank makes an error in 1 out of 10 million checking or credit card
accounts per day, that's a few dozen events, where the correction is
localized to a fairly small area of data. The financial system is actually
quite well designed to "unwind" transactions, but it is a manual process.

But if the error rate climbed to 1 out of 1000 accounts in any given day,
the system would collapse.
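For scale (the 500-million-account figure is my own assumption, chosen
only to turn the quoted rates into event counts):

```python
# Hypothetical account count; the post doesn't give one.
accounts = 500_000_000

per_day_low = accounts / 10_000_000   # 1 in 10 million per day
per_day_high = accounts / 1_000       # 1 in 1,000 per day

print(per_day_low)    # -> 50.0: "a few dozen", manual unwinding copes
print(per_day_high)   # -> 500000.0: recovery load swamps the system
```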

