Reliability analysis was RE: Windows HPC (@ Cornell)
Greg Lindahl
lindahl at keyresearch.com
Thu Nov 7 20:19:08 PST 2002
On Thu, Nov 07, 2002 at 06:34:47PM -0500, Tim Wait wrote:
> One aspect I haven't seen mentioned in this thread, except for
> Greg's oblique reference to Mosix, is that many (most?)
> of our clusters run parallel apps. Regardless of HA, if you have
> a node fail while running a parallel job, you have just blown your
> (supposed) 5 nines away; in my experience, it takes the user O(12+ hours)
> to restart the job. Is this deteriorating to HA vice beowulf?
It's not that hard for queue systems like PBS to detect and restart
jobs that fail due to machines dying -- this is a major quality of
implementation issue.
It still hurts your utilization, because you have wasted resources. But
at least the user doesn't have to do anything to get their answer;
they just get it later.
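
To put a rough number on the utilization hit, here is a minimal sketch of
the bookkeeping. It is not how PBS or any real queue system is
implemented; the Job class, run_with_requeue, and the failure times are
all made up for illustration. It assumes no checkpointing, so every
requeued attempt starts from scratch and the node-hours spent before the
failure are simply lost.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record, not a real PBS structure."""
    name: str
    nodes: int             # nodes the job occupies
    hours_needed: float    # wall-clock hours for one clean run
    attempts: int = 0
    wasted_node_hours: float = 0.0


def run_with_requeue(job, failures_at, max_attempts=5):
    """Replay a job that is requeued automatically after each node death.

    failures_at lists, for each failed attempt, how many hours into the
    run a node died (illustrative input). Returns the fraction of
    consumed node-hours that contributed to the final answer.
    """
    if len(failures_at) + 1 > max_attempts:
        raise RuntimeError("too many failures; hold the job for the admin")
    for fail_hour in failures_at:
        if not 0 <= fail_hour < job.hours_needed:
            raise ValueError("a failure must occur before the job finishes")
        job.attempts += 1
        # No checkpointing assumed: everything done so far is thrown away.
        job.wasted_node_hours += job.nodes * fail_hour
    job.attempts += 1  # the final, successful attempt
    useful = job.nodes * job.hours_needed
    return useful / (useful + job.wasted_node_hours)


if __name__ == "__main__":
    job = Job(name="example", nodes=16, hours_needed=10.0)
    utilization = run_with_requeue(job, failures_at=[4.0])
    print(f"{job.attempts} attempts, "
          f"{job.wasted_node_hours:.0f} node-hours wasted, "
          f"utilization {utilization:.2f}")
```

In this made-up case a 16-node, 10-hour job that loses a node 4 hours in
burns 64 node-hours for nothing, but the user never has to resubmit.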
-- greg