[Beowulf] New member, upgrading our existing Beowulf cluster
Mark Hahn
hahn at mcmaster.ca
Thu Dec 3 22:30:58 PST 2009
>>> E.g. you see a system disk going bad, but the user
>>> will lose all their output unless the job runs for
>>> 4 more weeks...
until fairly recently (sometime this year), we didn't constrain
the length of jobs. we now have a 1-week limit - generally
justified on the grounds that longer jobs are expected to checkpoint.
we also provide BLCR (Berkeley Lab Checkpoint/Restart) for
serial/threaded jobs.
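as a rough illustration of what "expecting longer jobs to checkpoint"
means in practice (not our actual setup - the file name, state layout
and interval here are invented), a long-running code can periodically
save its own state so that each <1-week job resumes where the previous
one stopped:

import os, pickle

CKPT = "state.ckpt"

def load_state():
    # resume from the last checkpoint if one exists
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0.0}   # fresh start

def save_state(state):
    # write to a temp file then rename, so a crash mid-write
    # can't corrupt the previous checkpoint
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
while state["step"] < 1_000_000:
    state["result"] += 1.0              # stand-in for real work
    state["step"] += 1
    if state["step"] % 10_000 == 0:
        save_state(state)               # checkpoint periodically
save_state(state)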
I have mixed feelings about the limit. the purpose of organizations
providing HPC is to _enable_, not obstruct. in some cases, that
could mean working with a group to find a better alternative than,
say, running a resource-intensive job for weeks with no checkpoints.
our node/power failure rates are pretty low - not high enough to
justify a 1-week limit on their own. but to be honest, the main motive
is probably to increase cluster churn - essentially improving
scheduler fairness.
> It's not inevitable that the policy be that 3 month jobs are allowed.
if a length limit is to be justified by probability-of-failure,
it should scale as ~ 1/nnodes; if by the cost of a failure, as
~ 1/ncpus. unfortunately, the other extreme is a sort of "invisible
hand" where users experimentally derive the failure rate from their
own rate of failed jobs ;(
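to make the scaling concrete, here's a back-of-the-envelope sketch
(assuming independent node failures at a constant rate - the MTBF and
risk threshold are invented numbers, not measurements from our cluster):

import math

MTBF_HOURS = 50_000.0        # assumed per-node mean time between failures

def p_fail(nnodes, hours):
    # probability that at least one of nnodes fails during the job
    return 1.0 - math.exp(-nnodes * hours / MTBF_HOURS)

def max_hours(nnodes, acceptable_p):
    # longest job that keeps failure probability under acceptable_p;
    # note this scales as 1/nnodes
    return -MTBF_HOURS * math.log(1.0 - acceptable_p) / nnodes

for n in (1, 16, 256):
    print(n, round(p_fail(n, 168), 4), round(max_hours(n, 0.05), 1))

with these made-up numbers, a one-week (168-hour) job on 256 nodes has
better than a 50% chance of hitting a failure, while a serial job
barely notices one - which is exactly why a single fixed limit can't
be failure-justified for everyone.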
personally, I think facilities should permit longer jobs, though perhaps
only after discussing the risks and alternatives. an economic approach
might reward checkpointing with a fairshare bonus - merely rewarding
short jobs seems wrong-headed.