[Beowulf] Jeff Squyres MPI proposals

Christopher Samuel samuel at unimelb.edu.au
Mon Mar 7 15:01:57 PST 2016


On 08/03/16 00:32, John Hearns wrote:

> Us old style guys are going to have our lunch money stolen by young
> upstarts. Or is that startups?

This presumes that everyone is going to be running massive clusters at
huge scale with completely new codes.

That might be true for a few large labs, but I suspect a lot of other
sites are going to be running older, smaller systems with existing codes
that will never get completely rewritten, and someone will have to keep
them running.

> Seriously - these guys know how to keep things running at scale and how
> to tolerate failures.

As I mentioned in another thread, the Slurm folks are already working on
that issue through their nonstop plugin, which is intended to let jobs
bargain with the scheduler on how to react to failure.

http://slurm.schedmd.com/nonstop.html

Of course the user codes have to know what to do when something breaks
too (and I don't mean SEGV)...

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci


