[Beowulf] Exascale by the end of the year?
Prentice Bisbal
prentice.bisbal at rutgers.edu
Wed Mar 5 13:07:00 PST 2014
On 03/05/2014 11:07 AM, Joe Landman wrote:
> On 03/05/2014 10:55 AM, Douglas Eadline wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> On 05/03/14 13:52, Joe Landman wrote:
>>>
>>>> I think the real question is: would the system be viable in a
>>>> commercial sense, or is this another boondoggle?
>>>
>>> At the Slurm User Group last year, Dona Crawford of LLNL gave the
>>> keynote, and as part of it talked about some of the challenges of
>>> exascale.
>>>
>>> The one everyone thinks about first is power, but the other one she
>>> touched on was reliability and uptime.
>>
>> Indeed, the fact that these issues were not even mentioned
>> means to me that the project is not very well thought out.
>> At exascale (using current tech), failure recovery must be built
>> into any design, in software, hardware, or both.
>
> Yes ... such designs must assume that there will be failures, and
> manage them. The issue, last I checked, is that most people coding to
> MPI can't use, or haven't used, MPI's resiliency features.
>
> Checkpoint/restart (CPR) on this scale is simply not an option, given
> that the probability of a failure occurring during CPR very rapidly
> approaches unity. CPR is built on the implicit assumption that
> copy out/copy back is *absolutely* reliable and will not fail. Ever.
>
> One way to circumvent part of the issue is to use SSD-on-DIMM
> designs to do very local "snapshot"-like CPR, and to add erasure
> coding and other FEC for the data, so you can accept some small
> amount of failure in the copy out or copy back.
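
For anyone who hasn't tried the local-snapshot approach, here is a
minimal sketch of the idea: each rank dumps its state to node-local
storage with plain POSIX I/O, then the job agrees globally that the
snapshot is valid before moving on. The mount point, file names, and
the "state" buffer are made up for illustration, and the cross-node
erasure coding / FEC Joe mentions would be layered on top of this.

    /* snapshot.c -- illustrative only: write each rank's state to a
     * node-local SSD and agree globally that the snapshot succeeded.
     * Paths and the "state" buffer are hypothetical. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int write_local_snapshot(const double *state, size_t n,
                                    int step, int rank)
    {
        char path[256];
        /* assumes a node-local mount point such as /local/ssd */
        snprintf(path, sizeof(path), "/local/ssd/ckpt_r%d_s%d.bin",
                 rank, step);
        FILE *f = fopen(path, "wb");
        if (!f) return 0;
        size_t written = fwrite(state, sizeof(double), n, f);
        return (fclose(f) == 0) && (written == n);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t n = 1 << 20;            /* stand-in for real solver state */
        double *state = calloc(n, sizeof(double));

        int ok = write_local_snapshot(state, n, /* step = */ 0, rank);
        int all_ok = 0;
        /* only declare the checkpoint valid if every rank succeeded;
         * otherwise the job would fall back to the previous snapshot */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
        if (rank == 0)
            printf("snapshot %s\n", all_ok ? "committed" : "discarded");

        free(state);
        MPI_Finalize();
        return 0;
    }
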
>
>>
>>>
>>> Basically, if you scale a current petascale system up to exascale,
>>> you are looking at an expected full-system uptime of between seconds
>>> and minutes. For comparison, Sequoia, their petaflop BG/Q, has a
>>> system-wide MTBF of about a day.
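
(Back-of-the-envelope: if node failures are roughly independent, the
system MTBF scales like 1/N in the part count, so getting from
petascale to exascale mainly by adding parts at constant per-node
reliability means roughly 1000x the components and ~1/1000 the MTBF.
Sequoia's ~1 day becomes on the order of 86,400 s / 1000, or about
86 seconds, which is exactly the "seconds to minutes" range above.)
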
>>
>> I recall that HPL will take about 6 days to run
>> on an exascale machine.
>>
>>>
>>> That causes problems if you're expecting to do checkpoint/restart to
>>> cope with failures, so really you've got to look at fault tolerance
>>> within the applications themselves. Hands up if you've got (or know
>>> of) a code that can gracefully tolerate, and meaningfully continue
>>> after, nodes going away whilst the job is running?
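
As far as I know, stock MPI only gets you part of the way there: you
can swap MPI_COMM_WORLD's error handler from the default
MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and check return codes
yourself, but actually continuing without the lost rank needs
something like the ULFM fault-tolerance proposals. A rough sketch of
the part the standard does give you:

    /* ft_probe.c -- sketch of the (limited) hook stock MPI provides:
     * disable abort-on-error and check return codes yourself.
     * Recovering and carrying on without the lost rank still needs
     * extensions along the lines of the ULFM proposal. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* default is MPI_ERRORS_ARE_FATAL: one dead rank kills the job */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank, buf = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: collective failed: %s\n", rank, msg);
            /* an application now has to decide how to carry on: shrink
             * to the survivors, respawn replacements, or roll back to
             * the last snapshot -- none of which plain MPI does for you */
        }

        MPI_Finalize();
        return 0;
    }
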
>>
>> I would hate to have my $50B machine give me the wrong answer
>> when such large amounts of money are involved. And we all know
>> it is going to kick out "42" at some point.
>
> Or the complete works of Shakespeare
> (http://en.wikipedia.org/wiki/Infinite_monkey_theorem), though this
> would be more troubling than 42.
>
>>
>>
>>>
>>> The Slurm folks are already looking at this, in terms of having some
>>> way of bargaining with the scheduler in case of node failure.
>>
>> As a side point, the Hadoop YARN scheduler allows dynamic resource
>> negotiation while the program is running, so if a node or rack dies,
>> a job can request more resources. For MR this is rather easy to do
>> because of the functional nature of the process.
>>
>
> We need to get to that place. Right now, our job scheduling, while
> quite sophisticated in its rule sets, is firmly entrenched in ideas
> from the '70s and '80s. "New" concepts in schedulers (pub/sub, etc.)
> are needed for really huge scale: fully distributed, able to route
> around failure; not merely tolerating it, but adapting to it.
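
Purely to illustrate the primitive (this is not a scheduler, and
ZeroMQ here is just one convenient stand-in for the "*MQ" family):
a monitoring daemon publishes failure events on a topic, and any
number of agents subscribe and react; the real work of routing around
failure would sit on top of primitives like this.

    /* pubsub.c -- toy pub/sub illustration with ZeroMQ (link with -lzmq).
     * A "monitor" publishes node-failure events; a per-job "agent"
     * subscribes to the node.fail topic and would renegotiate resources. */
    #include <zmq.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *ctx = zmq_ctx_new();

        /* publisher side: stand-in for a monitoring daemon */
        void *pub = zmq_socket(ctx, ZMQ_PUB);
        zmq_bind(pub, "inproc://events");

        /* subscriber side: stand-in for a per-job agent, topic-filtered */
        void *sub = zmq_socket(ctx, ZMQ_SUB);
        zmq_connect(sub, "inproc://events");
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "node.fail", strlen("node.fail"));

        /* give the subscription time to propagate (the usual PUB/SUB
         * slow-joiner caveat) before publishing */
        usleep(100 * 1000);

        const char *evt = "node.fail n0142";
        zmq_send(pub, evt, strlen(evt), 0);

        char buf[64] = {0};
        zmq_recv(sub, buf, sizeof(buf) - 1, 0);
        printf("agent saw: %s\n", buf);

        zmq_close(sub);
        zmq_close(pub);
        zmq_ctx_destroy(ctx);
        return 0;
    }
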
>
> This is going to require that we code to reality, not a fictional
> universe where nodes never fail, storage/networking never goes offline
> ...
>
> I've not done much with MPI in a few years; have they extended it
> beyond MPI_Init yet? Can MPI procs just join a "borgified"
> collective and preserve state, so restarts/moves/reschedules of ranks
> are cheap? If not, what is the replacement for MPI that will do this?
I believe that MPI 3.0 allows for dynamically growing or shrinking a
communicator for stuff like this. I have the full standard in book form,
but I haven't had enough insomnia yet to actually read it. Even if it
does, there'd still be a lot of application logic needed to figure out
where the crashed node(s) left off.
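
For what it's worth, the dynamic-process machinery that has been in the
standard since MPI-2 (MPI_Comm_spawn plus MPI_Intercomm_merge) does let
a job grow; it's shrinking past a dead rank that, as far as I can tell,
still needs the fault-tolerance extensions. A rough sketch of the
growing side, where "worker" is a hypothetical binary that calls
MPI_Init and MPI_Comm_get_parent:

    /* spawn.c -- sketch of growing a computation with MPI-2 dynamic
     * processes: spawn two extra copies of a hypothetical "worker"
     * executable and merge them into one intra-communicator. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm children, everyone;
        int errcodes[2];

        /* "worker" is a stand-in for whatever the new ranks should run */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &children, errcodes);

        /* fold parent and children into a single communicator so the
         * enlarged job can use ordinary collectives from here on */
        MPI_Intercomm_merge(children, /* high = */ 0, &everyone);

        int size;
        MPI_Comm_size(everyone, &size);
        printf("communicator now has %d ranks\n", size);

        MPI_Comm_free(&everyone);
        MPI_Comm_free(&children);
        MPI_Finalize();
        return 0;
    }
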
>
> FWIW, folks on Wall Street use pub/sub message passing (a la AMPS,
> *MQ, ...) to handle some elements of this.
--
Prentice