[Beowulf] Exascale by the end of the year?

Tue Mar 4 19:58:40 PST 2014

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/03/14 13:52, Joe Landman wrote:

> I think the real question is would the system be viable in a
> commercial sense, or is this another boondoggle?

At the Slurm User Group last year Dona Crawford of LLNL gave the
keynote and as part of that talked about some of the challenges of
exascale.

The one everyone thinks about first is power, but the other one she
touched on was reliability and uptime.

Basically if you scale a current petascale system up to exascale you
are looking at an expected full-system uptime of between seconds and
minutes.  For comparison Sequoia, their petaflop BG/Q, has a
systemwide MTBF of about a day.
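
As a rough back-of-the-envelope check on those numbers, here is a minimal
sketch. It assumes failures are independent and that the per-component
failure rate stays the same, so the system-wide MTBF shrinks in proportion
to the growth in component count. The factor of 1000 is my own assumption
for a naive petascale-to-exascale scale-up, not a figure from the keynote.

```python
def scaled_mtbf_seconds(base_mtbf_seconds: float, scale_factor: float) -> float:
    """Estimate system-wide MTBF after growing the machine by scale_factor.

    Assumes independent failures and an unchanged per-component failure
    rate, so the system-wide failure rate grows linearly with size.
    """
    if base_mtbf_seconds <= 0 or scale_factor <= 0:
        raise ValueError("MTBF and scale factor must be positive")
    return base_mtbf_seconds / scale_factor


ONE_DAY = 24 * 60 * 60   # roughly the system-wide MTBF quoted for Sequoia
SCALE_FACTOR = 1000      # assumption: naive petascale -> exascale growth

if __name__ == "__main__":
    mtbf = scaled_mtbf_seconds(ONE_DAY, SCALE_FACTOR)
    print(f"Estimated system-wide MTBF: {mtbf:.1f} seconds")
```

Under those assumptions a one-day MTBF drops to under a minute and a half,
which is consistent with the "seconds to minutes" range.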

That causes problems if you're expecting to do checkpoint/restart to
cope with failures, so really you've got to look at fault tolerance
within the applications themselves.   Hands up if you've got (or know of)
a code that can gracefully tolerate nodes going away whilst the job is
running, and still meaningfully continue?
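
To illustrate why checkpoint/restart stops working, here is a hedged
sketch using Young's first-order approximation for the checkpoint
interval (my choice of model, not something from the keynote). The
600 second checkpoint cost and the 90 second exascale MTBF are purely
hypothetical values for illustration, and the approximation is only
meaningful when the checkpoint cost is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of time lost relative to useful compute time.

    Adds the cost of writing checkpoints to the expected rework after a
    failure, both evaluated at Young's interval. A result of 1 or more
    means the model predicts no useful forward progress.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)


CHECKPOINT_COST = 600.0        # hypothetical: 10 minutes to write a checkpoint
PETASCALE_MTBF = 24 * 3600.0   # about a day, as quoted for Sequoia
EXASCALE_MTBF = 90.0           # hypothetical value in the seconds-to-minutes range

if __name__ == "__main__":
    for label, mtbf in (("petascale", PETASCALE_MTBF), ("exascale", EXASCALE_MTBF)):
        print(f"{label}: estimated overhead {wasted_fraction(CHECKPOINT_COST, mtbf):.2f}")
```

With a one-day MTBF the estimated overhead is a modest fraction of the
run, but once the MTBF is shorter than the time to write a checkpoint
the estimate goes above 1, i.e. the job never gets anywhere.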

The Slurm folks are already looking at this in terms of having some way
of negotiating with the scheduler in case of node failure.  There are
slides up on what they are planning here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlMWoPAACgkQO2KABBYQAh9GiACglcTBFXQt4/3wsL78eRrkILeh
/U8An07MTFVBsX4nssNq7GXZirWuIDii
=Ttyf
-----END PGP SIGNATURE-----