[Beowulf] Jeff Squyres MPI proposals

Christopher Samuel samuel at unimelb.edu.au
Thu Mar 3 15:30:23 PST 2016

On 04/03/16 06:40, Douglas Eadline wrote:

> Yes, failure needs to be an option.

The Slurm folks have been working on failure management support for a
little while. The idea is that you can have a pool of spare nodes to
pick from, or alternatively bargain with the scheduler for a node that's
currently busy to be added to the job once it comes free, potentially
extending the walltime to make up for the shortfall.
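
As a rough illustration of that idea, here is a toy model in Python. It
is only a sketch and not Slurm's actual interface: the names Job,
replace_failed_node, spares and busy_free_in_min are all made up for
this example, and a real scheduler would have far more to weigh up.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: list          # nodes currently allocated to the job
    walltime_min: int    # walltime limit, in minutes


def replace_failed_node(job, failed, spares, busy_free_in_min):
    """Toy model: swap out a failed node, preferring an idle spare.

    spares           -- list of idle spare node names (hypothetical pool)
    busy_free_in_min -- dict of busy node name -> minutes until it is free
    Returns the name of the replacement node, or None if there is none.
    """
    job.nodes.remove(failed)
    if spares:
        # Cheapest case: an idle spare is available right now.
        replacement = spares.pop(0)
    elif busy_free_in_min:
        # Otherwise wait for whichever busy node frees up soonest and
        # extend the walltime by the time spent waiting for it.
        replacement = min(busy_free_in_min, key=busy_free_in_min.get)
        job.walltime_min += busy_free_in_min.pop(replacement)
    else:
        # Nothing to offer: the job carries on one node short.
        return None
    job.nodes.append(replacement)
    return replacement
```

In this sketch the spare pool is always tried first, so the walltime is
only extended when the job actually had to wait for a busy node.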

A better description from someone with higher caffeination is here:


All the best,
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
