[Beowulf] New member, upgrading our existing Beowulf cluster

Chris Samuel csamuel at vpac.org
Thu Dec 3 18:32:12 PST 2009

----- "Greg Lindahl" <lindahl at pbm.com> wrote:

> That kind of policy has a fairly high opportunity
> cost, even before you factor in linked nodes.

Well, we cannot dictate to our users what they do;
we set a maximum walltime of 3 months and tell users
that they should checkpoint (if they have control of
the application and have coding skills).
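
For anyone wondering what application-level checkpointing
can look like, here is a minimal sketch in Python. It is an
illustration only, not anything specific to our site: the
function names, the JSON state layout and the stop_after
parameter (used to simulate a job being killed) are all
hypothetical. The idea is to write state periodically and
atomically, and to resume from the last good checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot
    corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is an atomic rename when tmp and path are on the same filesystem
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Toy long-running job: sums the integers 0..n_steps-1.

    stop_after is a hypothetical knob that simulates the job
    dying (node failure, walltime limit) at a given step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulated failure: work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    if state["step"] >= n_steps:
        save_checkpoint(path, state)  # record the finished result
    return state
```

A job killed partway through loses only the work done since
the last checkpoint; rerunning it with the same checkpoint
path picks up from there rather than from step 0.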

> E.g. you see a system disk going bad, but the user
> will lose all their output unless the job runs for
> 4 more weeks...

We run SMART tests and the like trying to proactively
spot bad disks (and other hardware) prior to failures,
but yes, that's inevitable.

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
