Reliability analysis was RE: Windows HPC (@ Cornell)

Thu Nov 7 20:19:08 PST 2002

On Thu, Nov 07, 2002 at 06:34:47PM -0500, Tim Wait wrote:

> One aspect I haven't seen mentioned in this thread, except for
> Greg's oblique reference to Mosix, is that many (most?)
> of our clusters run parallel apps. Regardless of HA, if you have
> a node fail while running a parallel job, you have just blown your
> (supposed) 5 nines away; in my experience, it takes the user O(12+ hours)
> to restart the job. Is this deteriorating to HA vice beowulf?

It's not that hard for queue systems like PBS to detect and restart
jobs that fail due to machines dying -- this is a major
quality-of-implementation issue.
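
As a rough illustration of the detect-and-restart step, here is a minimal sketch. It is not how PBS implements it; the Job class, the requeue_failed function and the node names are all hypothetical, and it assumes the scheduler already knows which nodes have died.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: set          # nodes currently assigned to this job (hypothetical model)
    restarts: int = 0   # how many times the job has been put back on the queue


def requeue_failed(running, dead_nodes, queue):
    """Put every running job that uses a dead node back on the queue.

    Returns the list of jobs that are unaffected and keep running.
    """
    still_running = []
    for job in running:
        if job.nodes & dead_nodes:
            # The job lost at least one node: release its nodes and requeue it.
            job.nodes = set()
            job.restarts += 1
            queue.append(job)
        else:
            still_running.append(job)
    return still_running


# Two running jobs; node "n2" dies, so only the job using it is requeued.
running = [Job("cfd", {"n1", "n2"}), Job("md", {"n3"})]
queue = deque()
running = requeue_failed(running, {"n2"}, queue)
print([j.name for j in running], [j.name for j in queue])
```

The point of the sketch is that the requeue happens without the user: the job simply goes back on the queue and runs again when nodes are free.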

It still hurts your utilization, because you have wasted resources. But
at least the user doesn't have to do anything to get their answer;
they just get it later.
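
To put a number on the wasted resources, here is a back-of-the-envelope sketch. The figures are made up for illustration, and it assumes the cluster is otherwise fully busy and that the failed job has no checkpoint, so all of its work before the failure is lost.

```python
def useful_fraction(job_nodes, hours_lost, cluster_nodes, window_hours):
    """Fraction of the cluster's node-hours that did useful work in the window.

    Assumes every node is busy for the whole window and that the only waste
    is the work the failed job had done before its node died.
    """
    wasted = job_nodes * hours_lost          # node-hours thrown away
    total = cluster_nodes * window_hours     # node-hours available
    return 1 - wasted / total


# Hypothetical case: a 32-node job dies 10 hours in, on a 64-node cluster,
# measured over a 24-hour window: 320 of 1536 node-hours are lost.
print(round(useful_fraction(32, 10, 64, 24), 3))
```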

-- greg