[Beowulf] Exascale by the end of the year?
Prentice Bisbal
prentice.bisbal at rutgers.edu
Wed Mar 5 13:07:00 PST 2014
On 03/05/2014 11:07 AM, Joe Landman wrote:
> On 03/05/2014 10:55 AM, Douglas Eadline wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> On 05/03/14 13:52, Joe Landman wrote:
>>>
>>>> I think the real question is: would the system be viable in a
>>>> commercial sense, or is this another boondoggle?
>>>
>>> At the Slurm User Group last year, Dona Crawford of LLNL gave the
>>> keynote, and as part of it talked about some of the challenges of
>>> exascale.
>>>
>>> The one everyone thinks about first is power, but the other one she
>>> touched on was reliability and uptime.
>>
>> Indeed, the fact that these issues were not even mentioned
>> means to me that the project is not very well thought out.
>> At exascale (using current tech), failure recovery must be built
>> into any design, in software, hardware, or both.
>
> Yes ... such designs must assume that there will be failures, and
> manage them. The issue, last I checked, is that most people coding to
> MPI can't use, or haven't used, MPI's resiliency features.
>
> Checkpoint/restart (CPR) on this scale is simply not an option, given
> that the probability of a failure occurring during CPR very rapidly
> approaches unity. CPR is built on the implicit assumption that
> copy out/copy back is *absolutely* reliable and will not fail. Ever.
>
> One way to circumvent part of the issue is to use SSD-on-DIMM
> designs to do very local "snapshot"-like CPR, and to add erasure
> coding and other FEC for the data, so you can accept some small
> amount of failure in the copy out or copy back.
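
For anyone who hasn't tried the local-snapshot approach, here is a
minimal sketch of the idea: each rank dumps its state to node-local
storage with plain POSIX I/O, then the job agrees globally that the
snapshot is valid before moving on. The mount point, file names, and
the "state" buffer are made up for illustration, and the cross-node
erasure coding / FEC Joe mentions would be layered on top of this.

    /* snapshot.c -- illustrative only: write each rank's state to a
     * node-local SSD and agree globally that the snapshot succeeded.
     * Paths and the "state" buffer are hypothetical. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int write_local_snapshot(const double *state, size_t n,
                                    int step, int rank)
    {
        char path[256];
        /* assumes a node-local mount point such as /local/ssd */
        snprintf(path, sizeof(path), "/local/ssd/ckpt_r%d_s%d.bin",
                 rank, step);
        FILE *f = fopen(path, "wb");
        if (!f) return 0;
        size_t written = fwrite(state, sizeof(double), n, f);
        return (fclose(f) == 0) && (written == n);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t n = 1 << 20;            /* stand-in for real solver state */
        double *state = calloc(n, sizeof(double));

        int ok = write_local_snapshot(state, n, /* step = */ 0, rank);
        int all_ok = 0;
        /* only declare the checkpoint valid if every rank succeeded;
         * otherwise the job would fall back to the previous snapshot */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
        if (rank == 0)
            printf("snapshot %s\n", all_ok ? "committed" : "discarded");

        free(state);
        MPI_Finalize();
        return 0;
    }
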
>
>>
>>>
>>> Basically, if you scale a current petascale system up to exascale,
>>> you are looking at an expected full-system uptime of between seconds
>>> and minutes. For comparison, Sequoia, their petaflop BG/Q, has a
>>> system-wide MTBF of about a day.
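
(Back-of-the-envelope: if node failures are roughly independent, the
system MTBF scales like 1/N in the part count, so getting from
petascale to exascale mainly by adding parts at constant per-node
reliability means roughly 1000x the components and ~1/1000 the MTBF.
Sequoia's ~1 day becomes on the order of 86,400 s / 1000, or about
86 seconds, which is exactly the "seconds to minutes" range above.)
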
>>
>> I recall that HPL will take about 6 days to run
>> on an exascale machine.
>>
>>>
>>> That causes problems if you're expecting to do checkpoint/restart to
>>> cope with failures, so really you've got to look at fault tolerance
>>> within the applications themselves. Hands up if you've got (or know
>>> of) a code that can gracefully tolerate, and meaningfully continue
>>> after, nodes going away whilst the job is running?
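
As far as I know, stock MPI only gets you part of the way there: you
can swap MPI_COMM_WORLD's error handler from the default
MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and check return codes
yourself, but actually continuing without the lost rank needs
something like the ULFM fault-tolerance proposals. A rough sketch of
the part the standard does give you:

    /* ft_probe.c -- sketch of the (limited) hook stock MPI provides:
     * disable abort-on-error and check return codes yourself.
     * Recovering and carrying on without the lost rank still needs
     * extensions along the lines of the ULFM proposal. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* default is MPI_ERRORS_ARE_FATAL: one dead rank kills the job */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank, buf = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: collective failed: %s\n", rank, msg);
            /* an application now has to decide how to carry on: shrink
             * to the survivors, respawn replacements, or roll back to
             * the last snapshot -- none of which plain MPI does for you */
        }

        MPI_Finalize();
        return 0;
    }
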
>>
>> I would hate to have my $50B machine give me the wrong answer
>> when such large amounts of money are involved. And we all know
>> it is going to kick out "42" at some point.
>
> Or the complete works of Shakespeare
> (http://en.wikipedia.org/wiki/Infinite_monkey_theorem), though this
> would be more troubling than 42.
>
>>
>>
>>>
>>> The Slurm folks are already looking at this, in terms of having some
>>> way of bargaining with the scheduler in case of node failure.
>>
>> As a side point, the Hadoop YARN scheduler allows dynamic resource
>> negotiation while the program is running, so if a node or rack dies,
>> a job can request more resources. For MR this is rather easy to do
>> because of the functional nature of the process.
>>
>
> We need to get to that place. Right now, our job scheduling, while
> quite sophisticated in its rule sets, is firmly entrenched in ideas
> from the '70s and '80s. "New" concepts in schedulers (pub/sub, etc.)
> are needed for really huge scale: fully distributed, able to route
> around failure; not merely tolerating it, but adapting to it.
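
Purely to illustrate the primitive (this is not a scheduler, and
ZeroMQ here is just one convenient stand-in for the "*MQ" family):
a monitoring daemon publishes failure events on a topic, and any
number of agents subscribe and react; the real work of routing around
failure would sit on top of primitives like this.

    /* pubsub.c -- toy pub/sub illustration with ZeroMQ (link with -lzmq).
     * A "monitor" publishes node-failure events; a per-job "agent"
     * subscribes to the node.fail topic and would renegotiate resources. */
    #include <zmq.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *ctx = zmq_ctx_new();

        /* publisher side: stand-in for a monitoring daemon */
        void *pub = zmq_socket(ctx, ZMQ_PUB);
        zmq_bind(pub, "inproc://events");

        /* subscriber side: stand-in for a per-job agent, topic-filtered */
        void *sub = zmq_socket(ctx, ZMQ_SUB);
        zmq_connect(sub, "inproc://events");
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "node.fail", strlen("node.fail"));

        /* give the subscription time to propagate (the usual PUB/SUB
         * slow-joiner caveat) before publishing */
        usleep(100 * 1000);

        const char *evt = "node.fail n0142";
        zmq_send(pub, evt, strlen(evt), 0);

        char buf[64] = {0};
        zmq_recv(sub, buf, sizeof(buf) - 1, 0);
        printf("agent saw: %s\n", buf);

        zmq_close(sub);
        zmq_close(pub);
        zmq_ctx_destroy(ctx);
        return 0;
    }
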
>
> This is going to require that we code to reality, not a fictional
> universe where nodes never fail, storage/networking never goes offline
> ...
>
> I've not done much with MPI in a few years; have they extended it
> beyond MPI_Init yet? Can MPI procs just join a "borgified"
> collective and preserve state, so restarts/moves/reschedules of ranks
> are cheap? If not, what is the replacement for MPI that will do this?
I believe that MPI 3.0 allows for dynamically growing or shrinking a
communicator for stuff like this. I have the full standard in book form,
but I haven't had enough insomnia yet to actually read it. Even if it
does, there'd still be a lot of application logic needed to figure out
where the crashed node(s) left off.
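
For what it's worth, the dynamic-process machinery that has been in the
standard since MPI-2 (MPI_Comm_spawn plus MPI_Intercomm_merge) does let
a job grow; it's shrinking past a dead rank that, as far as I can tell,
still needs the fault-tolerance extensions. A rough sketch of the
growing side, where "worker" is a hypothetical binary that calls
MPI_Init and MPI_Comm_get_parent:

    /* spawn.c -- sketch of growing a computation with MPI-2 dynamic
     * processes: spawn two extra copies of a hypothetical "worker"
     * executable and merge them into one intra-communicator. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm children, everyone;
        int errcodes[2];

        /* "worker" is a stand-in for whatever the new ranks should run */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &children, errcodes);

        /* fold parent and children into a single communicator so the
         * enlarged job can use ordinary collectives from here on */
        MPI_Intercomm_merge(children, /* high = */ 0, &everyone);

        int size;
        MPI_Comm_size(everyone, &size);
        printf("communicator now has %d ranks\n", size);

        MPI_Comm_free(&everyone);
        MPI_Comm_free(&children);
        MPI_Finalize();
        return 0;
    }
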
>
> FWIW, folks on Wall Street use pub/sub message passing (a la AMPS,
> *MQ, ...) to handle some elements of this.
--
Prentice