[Beowulf] Exascale by the end of the year?
Joe Landman
landman at scalableinformatics.com
Wed Mar 5 08:07:28 PST 2014
On 03/05/2014 10:55 AM, Douglas Eadline wrote:
>
>>
>> On 05/03/14 13:52, Joe Landman wrote:
>>
>>> I think the real question is whether the system would be viable in
>>> a commercial sense, or whether this is another boondoggle.
>>
>> At the Slurm User Group last year Dona Crawford of LLNL gave the
>> keynote and as part of that talked about some of the challenges of
>> exascale.
>>
>> The one everyone thinks about first is power, but the other one she
>> touched on was reliability and uptime.
>
> Indeed, the fact that these issues were not even mentioned
> suggests to me the project is not very well thought out.
> At exascale (using current tech), failure recovery must be built
> into any design, in software, hardware, or both.
Yes ... such designs must assume that there will be failures, and manage
them. The issue, last I checked, is that most people coding to MPI
can't use, or haven't used, MPI's resiliency features.
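For concreteness, this is roughly what using those features looks like
under the ULFM (User Level Failure Mitigation) proposal. Untested sketch;
the MPIX_* calls come from the ULFM prototype, not the MPI standard, and
the loop body is a stand-in for real application work:

    /* Sketch of surviving rank failure with the ULFM proposal's extensions.
     * MPIX_Comm_revoke / MPIX_Comm_shrink are prototype calls, not standard MPI. */
    #include <mpi.h>
    #include <mpi-ext.h>   /* where the ULFM prototype exposes MPIX_* */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm world;
        MPI_Init(&argc, &argv);
        MPI_Comm_dup(MPI_COMM_WORLD, &world);
        /* Have failures return error codes instead of aborting the whole job. */
        MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

        for (int step = 0; step < 1000; step++) {
            int rc = MPI_Barrier(world);   /* stand-in for the app's collectives */
            if (rc != MPI_SUCCESS) {
                /* A rank died: revoke the damaged communicator, shrink the
                 * dead ranks out, and keep computing with the survivors. */
                MPIX_Comm_revoke(world);
                MPI_Comm survivors;
                MPIX_Comm_shrink(world, &survivors);
                MPI_Comm_free(&world);
                world = survivors;
            }
            /* ... application work on `world`, with data redistributed ... */
        }

        MPI_Comm_free(&world);
        MPI_Finalize();
        return 0;
    }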
Checkpoint/restart (CPR) on this scale is simply not an option, given
that the probability of a failure occurring during CPR very rapidly
approaches unity. CPR is built on the implicit assumption that copy
out/copy back is *absolutely* reliable and will not fail. Ever.
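A back-of-the-envelope number makes the point. Assuming exponentially
distributed failures, the chance of at least one failure while a checkpoint
is being written is 1 - exp(-t_ckpt/MTBF); the times below are illustrative,
not measured:

    /* Rough numbers: probability of at least one system failure while a
     * checkpoint is being written, assuming exponentially distributed
     * failures.  MTBF and checkpoint times are illustrative only. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double ckpt_hours = 0.5;                  /* assumed copy-out time */
        const double mtbf_hours[] = { 24.0, 1.0, 0.1 }; /* roughly BG/Q today, then worse */

        for (int i = 0; i < 3; i++) {
            double p = 1.0 - exp(-ckpt_hours / mtbf_hours[i]);
            printf("MTBF %5.1f h, checkpoint %.1f h -> P(failure during CPR) = %.2f\n",
                   mtbf_hours[i], ckpt_hours, p);
        }
        return 0;
    }

At a one-day MTBF that is a couple of percent per checkpoint; at an hour it
is roughly 40%, and it only gets worse from there.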
One way to circumvent part of the issue is to use SSD-on-DIMM designs to
do very local, "snapshot"-like CPR, and to add erasure coding and other
FEC for the data, so that you can accept some small amount of failure in
the copy out or copy back.
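As a sketch of what the "very local" part might look like (the mount point
and state layout are invented; the cross-node erasure coding / FEC layer is
not shown):

    /* Node-local "snapshot"-style checkpoint: each rank dumps its state to
     * fast local storage (SSD on DIMM / NVDIMM class) rather than a shared
     * parallel filesystem.  Path and state layout are invented for the
     * example; erasure coding / FEC across nodes would sit on top of this. */
    #include <stdio.h>
    #include <stdlib.h>

    static int snapshot_local(int rank, int step, const void *state, size_t nbytes)
    {
        char path[256];
        /* Hypothetical mount point for the local NVDIMM/SSD device. */
        snprintf(path, sizeof(path), "/mnt/nvdimm/ckpt.rank%d.step%d", rank, step);

        FILE *f = fopen(path, "wb");
        if (!f)
            return -1;
        size_t written = fwrite(state, 1, nbytes, f);
        int ok = (written == nbytes) && (fflush(f) == 0);
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void)
    {
        double state[1024] = { 0 };   /* stand-in for real solver state */
        if (snapshot_local(0, 42, state, sizeof(state)) != 0)
            fprintf(stderr, "local snapshot failed; retry or fall back\n");
        return 0;
    }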
>
>>
>> Basically if you scale a current petascale system up to exascale you
>> are looking at an expected full-system uptime of between seconds and
>> minutes. For comparison Sequoia, their petaflop BG/Q, has a
>> systemwide MTBF of about a day.
>
> I recall that HPL will take about 6 days to run
> on an exascale machine.
>
>>
>> That causes problems if you're expecting to do checkpoint/restart to
>> cope with failures, so really you've got to look at fault tolerance
>> within the applications themselves. Hands up if you've got (or know of)
>> a code that can gracefully tolerate, and meaningfully continue after,
>> nodes going away whilst the job is running?
>
> I would hate to have my $50B machine give me the wrong answer
> when such large amounts of money are involved. And we all know
> it is going to kick out "42" at some point.
Or the complete works of Shakespeare
(http://en.wikipedia.org/wiki/Infinite_monkey_theorem), though this
would be more troubling than 42.
>
>
>>
>> The Slurm folks are already looking at this in terms of having some way
>> of negotiating with the scheduler in case of node failure.
>
> As a side point, the Hadoop YARN scheduler allows dynamic resource
> negotiation while the program is running, so if a node or rack dies,
> a job can request more resources. For MR this is rather easy to do
> because of the functional nature of the process.
>
We need to get to that place. Right now, our job scheduling, while
quite sophisticated in its rule sets, is firmly entrenched in ideas from
the 70s and 80s. "New" concepts in schedulers (pub/sub, etc.) are needed
for really huge scale: fully distributed, able to route around failure.
Not merely tolerate it, but adapt to it.
This is going to require that we code to reality, not a fictional
universe where nodes never fail, storage/networking never goes offline ...
I've not done much with MPI in a few years; have they extended it beyond
MPI_Init yet? Can MPI procs just join a "borgified" collective and
preserve state, so that restarts/moves/reschedules of ranks are cheap?
If not, what is the replacement for MPI that will do this?
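For what it's worth, the closest the standard gets today is the dynamic
process management calls (MPI_Open_port / MPI_Comm_accept /
MPI_Comm_connect), which let a separately launched process attach to a
running job. Rough, untested sketch; the server/join split and argument
handling here are made up, and state migration is still entirely the
application's problem:

    /* A separately launched MPI process attaching to a running job via the
     * standard dynamic process management calls.  Sketch only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm inter = MPI_COMM_NULL;

        if (argc > 1 && strcmp(argv[1], "server") == 0) {
            /* The running job: open a port and wait for a newcomer. */
            MPI_Open_port(MPI_INFO_NULL, port);
            printf("newcomers connect to: %s\n", port);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Close_port(port);
        } else if (argc > 2 && strcmp(argv[1], "join") == 0) {
            /* The newcomer: connect to the port printed by the server. */
            MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        }

        if (inter != MPI_COMM_NULL) {
            /* From here the sides could MPI_Intercomm_merge() and redistribute
             * work; restoring a lost rank's state is still up to the app. */
            MPI_Comm_disconnect(&inter);
        }
        MPI_Finalize();
        return 0;
    }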
FWIW, folks on Wall Street use pub/sub and message passing (a la AMPS,
*MQ, ...) to handle some elements of this.
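A toy of the same pattern, using ZeroMQ as a stand-in for the *MQ family
(the endpoints and topic string are invented): a node publishes health
events, and a scheduler-side process subscribes and could reroute work.

    /* Toy pub/sub health channel in ZeroMQ: a compute node publishes a
     * health event, a scheduler-side subscriber picks it up.  Endpoints and
     * the topic string are invented for the example. */
    #include <zmq.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        void *ctx = zmq_ctx_new();

        if (argc > 1 && strcmp(argv[1], "node") == 0) {
            /* Compute node side: publish a health event. */
            void *pub = zmq_socket(ctx, ZMQ_PUB);
            zmq_connect(pub, "tcp://scheduler.example:5556");
            sleep(1);   /* crude: let the connection come up (slow joiner) */
            const char *msg = "health node042 DEGRADED";   /* topic + payload */
            zmq_send(pub, msg, strlen(msg), 0);
            zmq_close(pub);
        } else {
            /* Scheduler side: subscribe to health events and react. */
            void *sub = zmq_socket(ctx, ZMQ_SUB);
            zmq_bind(sub, "tcp://*:5556");
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "health", 6);
            char buf[256];
            int n = zmq_recv(sub, buf, sizeof(buf) - 1, 0);
            if (n >= 0) {
                buf[n] = '\0';
                printf("scheduler saw: %s (route work off that node)\n", buf);
            }
            zmq_close(sub);
        }

        zmq_ctx_destroy(ctx);
        return 0;
    }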
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615