[Beowulf] Exascale by the end of the year?
samuel at unimelb.edu.au
Tue Mar 4 19:58:40 PST 2014
On 05/03/14 13:52, Joe Landman wrote:
> I think the real question is would the system be viable in a
> commercial sense, or is this another boondoggle?
At the Slurm User Group last year Dona Crawford of LLNL gave the
keynote and as part of that talked about some of the challenges of
exascale.
The one everyone thinks about first is power, but the other one she
touched on was reliability and uptime.
Basically if you scale a current petascale system up to exascale you
are looking at an expected full-system uptime of between seconds and
minutes. For comparison Sequoia, their petaflop BG/Q, has a
systemwide MTBF of about a day.
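As a rough illustration of why uptime shrinks with scale, here is a minimal sketch. It assumes nodes fail independently with exponentially distributed lifetimes and that any single node failure takes down the whole system, so the system MTBF is simply the node MTBF divided by the node count. The per-node MTBF and the node counts are hypothetical figures, not numbers from the keynote.

```python
def system_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """System MTBF when any single node failure takes down the whole system.

    Assumes independent nodes with exponentially distributed failures,
    so the per-node failure rates simply add up.
    """
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    return node_mtbf_hours / n_nodes


# Hypothetical figure, for illustration only: each node fails once in 5 years.
NODE_MTBF_HOURS = 5 * 365 * 24

for n in (1_000, 100_000, 1_000_000):
    minutes = system_mtbf_hours(NODE_MTBF_HOURS, n) * 60
    print(f"{n:>9} nodes: system MTBF ~ {minutes:.1f} minutes")
```

Real systems have correlated failures and many non-node components, so this only shows the trend, not a prediction.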
That causes problems if you're expecting to do checkpoint/restart to
cope with failures, so really you've got to look at fault tolerance
within applications themselves. Hands up if you've got (or know of)
a code that can gracefully tolerate nodes going away whilst the job is
running, and meaningfully continue?
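To put a rough number on the checkpoint/restart problem, here is a sketch using Young's classic first-order approximation for the checkpoint interval, tau = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the system MTBF. The approximation is only meant for C much smaller than M, and the 10 minute checkpoint time is an assumed value, not one from the talk. The function names are made up for this example.

```python
import math


def young_interval(checkpoint_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


def useful_fraction(checkpoint_s: float, mtbf_s: float) -> float:
    """Rough fraction of wall time spent on useful work.

    Overhead = time writing checkpoints + expected rework after a failure
    (on average half an interval is lost per failure). Clamped at zero,
    since the first-order model breaks down when checkpoints take about
    as long as the MTBF.
    """
    tau = young_interval(checkpoint_s, mtbf_s)
    overhead = checkpoint_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - overhead)


CHECKPOINT_S = 600.0  # assumed: 10 minutes to write a full checkpoint

for label, mtbf_s in (("1 day", 86400.0), ("1 hour", 3600.0), ("5 minutes", 300.0)):
    frac = useful_fraction(CHECKPOINT_S, mtbf_s)
    print(f"MTBF {label:>9}: useful work ~ {frac:.0%}")
```

Once the MTBF drops towards the checkpoint time, the job spends essentially all of its time saving state or redoing lost work, which is why tolerance has to move into the application.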
The Slurm folks are already looking at this in terms of having some way
for a job to negotiate with the scheduler in case of node failure
- there are slides up on what they are planning here:
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545