[Beowulf] Exascale by the end of the year?

Wed Mar 5 07:55:31 PST 2014

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 05/03/14 13:52, Joe Landman wrote:
>
>> I think the real question is would the system be viable in a
>> commercial sense, or is this another boondoggle?
>
> At the Slurm User Group last year Dona Crawford of LLNL gave the
> keynote and as part of that talked about some of the challenges of
> exascale.
>
> The one everyone thinks about first is power, but the other one she
> touched on was reliability and uptime.

Indeed, the fact that these issues were not even mentioned
suggests to me the project is not very well thought out.
At exascale (using current tech), failure recovery must be built
into any design, in software, hardware, or both.

>
> Basically if you scale a current petascale system up to exascale you
> are looking at an expected full-system uptime of between seconds and
> minutes.  For comparison Sequoia, their petaflop BG/Q, has a
> systemwide MTBF of about a day.

As I recall, HPL is expected to take about 6 days to run
on an exascale machine, far longer than that uptime.
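
A rough sketch of the arithmetic, assuming independent failures with
exponentially distributed times (my simplification, not something from
the keynote). The 1-day MTBF and the 6-day HPL run are the figures
quoted above; the 50x growth in component count is a hypothetical
input, and more parts or less reliable parts push the result further
toward the seconds end.

```python
import math

def scaled_mtbf_hours(base_mtbf_hours, scale_factor):
    """System MTBF when the component count grows by scale_factor,
    assuming independent, identically reliable components."""
    return base_mtbf_hours / scale_factor

def prob_no_failure(run_hours, mtbf_hours):
    """Probability that a run of run_hours sees no failure (exponential model)."""
    return math.exp(-run_hours / mtbf_hours)

base = 24.0    # Sequoia-like systemwide MTBF, about a day
scale = 50.0   # hypothetical growth in component count (assumption)
mtbf = scaled_mtbf_hours(base, scale)
p = prob_no_failure(6 * 24.0, mtbf)   # 6-day HPL run

print(f"scaled MTBF: {mtbf * 60:.1f} minutes")
print(f"P(6-day run with no failure): {p:.3e}")
```

Under these assumptions the chance of an uninterrupted run is
effectively zero, which is the whole point.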

>
> That causes problems if you're expecting to do checkpoint/restart to
> cope with failures, so really you've got to look at fault tolerance
> within applications themselves.   Hands up if you've got (or know of)
> a code that can gracefully tolerate and meaningfully continue if nodes
> go away whilst the job is running?

I would hate to have my $50B machine give me the wrong answer
when such large amounts of money are involved. And we all know
it is going to kick out "42" at some point.
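
Back on the checkpoint/restart point: a minimal sketch of why it stops
working as MTBF shrinks, using Young's first-order approximation for
the checkpoint interval, sqrt(2 * C * MTBF). The waste model
(checkpoint cost plus expected rework) is only valid when the
checkpoint cost C is much smaller than the MTBF, and the 10-minute
checkpoint cost is a made-up number for illustration.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's approximation for the interval between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def wasted_fraction(checkpoint_cost, mtbf):
    """First-order fraction of machine time lost to writing checkpoints
    and redoing work after a failure, capped at 1.0 (no progress)."""
    tau = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / tau + tau / (2.0 * mtbf)
    return min(waste, 1.0)

checkpoint_minutes = 10.0   # hypothetical checkpoint cost (assumption)
for mtbf_minutes in (1440.0, 30.0, 5.0):
    tau = young_interval(checkpoint_minutes, mtbf_minutes)
    waste = wasted_fraction(checkpoint_minutes, mtbf_minutes)
    print(f"MTBF {mtbf_minutes:7.1f} min: checkpoint every {tau:6.1f} min, "
          f"waste {waste:.0%}")
```

With a day between failures the overhead is tolerable; once the MTBF
gets near the checkpoint cost the model says the job makes no progress
at all.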

>
> The Slurm folks are already looking at this in terms of having some way
> of setting up bargaining with the scheduler in case of node failure

As a side point, the Hadoop YARN scheduler allows dynamic resource
negotiation while the program is running, so if a node or rack dies,
a job can request more resources. For MapReduce this is rather easy to
do because of the functional nature of the process.
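
To illustrate the "functional nature" point, here is a toy sketch in
plain Python. It does not use the YARN or Hadoop APIs; run_on_node,
NodeFailure and the failure rate are all hypothetical names invented
for the example. Because each map task depends only on its input
chunk and has no side effects, a task lost to a dead node can simply
be rerun elsewhere and the final answer is unchanged.

```python
import random

def word_count_map(chunk):
    """Pure map task: output depends only on its input chunk."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_counts(partials):
    """Merge per-chunk counts into one total."""
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total

class NodeFailure(Exception):
    """Raised when the (simulated) node dies before returning a result."""

def run_on_node(task, chunk, rng, fail_rate):
    """Hypothetical worker that fails at random."""
    if rng.random() < fail_rate:
        raise NodeFailure()
    return task(chunk)

def run_job(chunks, fail_rate=0.3, seed=0):
    rng = random.Random(seed)
    partials = []
    for chunk in chunks:
        while True:  # keep asking for a replacement node until the task completes
            try:
                partials.append(run_on_node(word_count_map, chunk, rng, fail_rate))
                break
            except NodeFailure:
                continue  # safe to rerun: the task has no side effects
    return reduce_counts(partials)

chunks = ["a b a", "b c", "a c c"]
result = run_job(chunks)
print(result)
```

A tightly coupled MPI code has no such luxury, since losing one rank
usually loses state the other ranks depend on.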

--
Doug

> - there are slides up on what they are planning here:
>
> http://slurm.schedmd.com/SUG13/nonstop.pdf
>
> cheers,
> Chris
> --
>  Christopher Samuel        Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/      http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlMWoPAACgkQO2KABBYQAh9GiACglcTBFXQt4/3wsL78eRrkILeh
> /U8An07MTFVBsX4nssNq7GXZirWuIDii
> =Ttyf
> -----END PGP SIGNATURE-----
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Mailscanner: Clean
>
