[Beowulf] New member, upgrading our existing Beowulf cluster
Mark Hahn hahn at mcmaster.ca
Thu Dec 3 22:30:58 PST 2009
>>> E.g. you see a system disk going bad, but the user
>>> will lose all their output unless the job runs for
>>> 4 more weeks...

until fairly recently (sometime this year), we didn't constrain the length of jobs. we now have a 1 week limit - generally argued on the basis of expecting longer jobs to checkpoint. we also provide blcr for serial/threaded jobs.

I have mixed feelings about this. the purpose of organizations providing HPC is to _enable_, not obstruct. in some cases, this could mean working with a group to find an alternative better than, for instance, not checkpointing a resource-intensive job.

our node/power failure rates are pretty low - not enough to justify a 1-week limit. but to be honest, the main motive is probably to increase cluster churn - essentially improving scheduler fairness.

> It's not inevitable that the policy be that 3 month jobs are allowed.

if a length limit is to be justified based on probability-of-failure, it should be ~ 1/nnodes; if fail-cost-based, 1/ncpus. unfortunately, the other extreme would be a sort of "invisible hand" where users experimentally derive the failure rate by their rate of failed jobs ;(

personally, I think facilities should permit longer jobs, though perhaps only after discussing the risks and alternatives. an economic approach might reward checkpointing with a fairshare bonus - merely rewarding short jobs seems wrong-headed.
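
to make the ~ 1/nnodes scaling concrete, here is a rough sketch, not anything we actually run. it assumes node failures are independent and exponentially distributed, so a job on nnodes nodes fails at rate nnodes/MTBF. the function name, the MTBF figure and the acceptable failure probability are all hypothetical, chosen only for illustration.

```python
import math

def max_walltime_hours(nnodes, node_mtbf_hours, max_fail_prob):
    """Longest run time (hours) for which a job spanning `nnodes` nodes
    fails with probability at most `max_fail_prob`.

    Assumption (not from any real scheduler): node failures are
    independent and exponentially distributed with mean
    `node_mtbf_hours`, so the job survives time T with probability
    exp(-nnodes * T / node_mtbf_hours).
    """
    if nnodes < 1:
        raise ValueError("nnodes must be >= 1")
    if not 0.0 < max_fail_prob < 1.0:
        raise ValueError("max_fail_prob must be strictly between 0 and 1")
    # Solve 1 - exp(-nnodes * T / mtbf) = max_fail_prob for T.
    return -math.log(1.0 - max_fail_prob) * node_mtbf_hours / nnodes

if __name__ == "__main__":
    # Hypothetical numbers: 50,000 h per-node MTBF, 5% tolerated failure risk.
    for n in (1, 8, 64, 512):
        print(n, round(max_walltime_hours(n, 50_000, 0.05), 1))
```

under these assumptions the limit falls off exactly as 1/nnodes: a job on ten times as many nodes gets one tenth the walltime. a cost-based limit would follow the same pattern with ncpus in place of nnodes.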
