Reliability analysis was RE: Windows HPC (@ Cornell)
waitt at saic.com
Thu Nov 7 15:34:47 PST 2002
One aspect I haven't seen mentioned in this thread, except for
Greg's oblique reference to Mosix, is that many (most?)
of our clusters run parallel apps. Regardless of HA, if you have
a node fail while running a parallel job, you have just blown your
(supposed) 5 nines away; in my experience, it takes the user on the order
of 12+ hours to restart the job. Is this thread drifting into HA rather
than Beowulf?
5 nines? Yeah, right ;)
Even those $50k hand-built Cray disks die.
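
To put rough numbers on the five-nines point above, here is a minimal sketch
of the arithmetic. It assumes availability is measured from the job's point
of view over a one-year window, and that the whole 12-hour restart counts as
downtime; the function names are illustrative only and the 12-hour figure is
just the estimate quoted above.

```python
# Back-of-the-envelope availability arithmetic (assumptions, not measurements).
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was usable."""
    return 1.0 - downtime_hours / period_hours


def allowed_downtime_minutes(target, period_hours=HOURS_PER_YEAR):
    """Downtime budget, in minutes, implied by an availability target."""
    return (1.0 - target) * period_hours * 60.0


# Budget implied by "five nines" (99.999%) over one year.
five_nines_budget = allowed_downtime_minutes(0.99999)

# Availability seen by a user who loses 12 hours to a single node failure.
one_restart = availability(12.0)

# How many years of five-nines budget that one restart uses up.
years_of_budget = (12.0 * 60.0) / five_nines_budget

print(f"five-nines budget: {five_nines_budget:.2f} min/year")
print(f"after one 12 h restart: {one_restart:.3%}")
print(f"budget consumed: {years_of_budget:.0f} years' worth")
```

Under those assumptions the five-nines budget is a little over 5 minutes a
year, so one 12-hour restart leaves the user at roughly 99.86% (two nines)
and spends well over a century of budget in one go.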
Tim Wait waitt at saic.com
SAIC - Advanced Systems Group
PO Box 41, Sumerduck VA 22742