[Beowulf] Jeff Squyres MPI proposals
Christopher Samuel
samuel at unimelb.edu.au
Mon Mar 7 15:01:57 PST 2016
On 08/03/16 00:32, John Hearns wrote:
> Us old style guys are going to have our lunch money stolen by young
> upstarts. Or is that startups?
This presumes that everyone is going to be running massive clusters at
huge scale with completely new codes.
That might be true for a few large labs, but I suspect a lot of other
sites are going to be running older, smaller systems with existing codes
that will never get completely rewritten and someone will have to keep
them running.
> Seriously - these guys know how to keep things running at scale and how
> to tolerate failures.
As I mentioned in another thread, the Slurm folks are already working on
that issue through their nonstop plugin, which is intended to let jobs
bargain with the scheduler on how to react to failure.
http://slurm.schedmd.com/nonstop.html
Of course the user codes have to know what to do when something breaks
too (and I don't mean SEGV)...
All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci