Uptime data/studies/anecdotes ... ?

Richard Walsh rbw at ahpcrc.org
Tue Apr 2 10:24:22 PST 2002

On Tue, 2 Apr 2002 10:15:00 Roger Smith wrote:

>We currently run an average of about 75% utilization on our 586 processor
>(293 node)  cluster.  We probably have about one node per week crash and
>hang for various reasons.
>We have occasional problems with memory leaks or PBS hangups which require
>large scale reboots of the cluster. (Actually, PBS just died as I'm typing
>this, but our pbs heartbeat script should restart it automatically in a
>few minutes).  I'd say we have to do a full reboot of the cluster about
>every 3-4 months.
>For a bunch of PC hardware running a free OS, this seems like a pretty
>good number to me.  It's not in the same class as our Sun servers (nor
>even our SGIs!), but then, none of those systems are this large, either.

Thanks for the estimate.  Do you use Scyld or another pseudo-single-system-
image tool? I assume that 75% is a steady-state number ... how long did
it take your group to reach that state?  If a full reboot is required
only every 3-4 months, is single-node failure your main source of
cycle loss? Or are other things, such as inefficient scheduling or the
lack of checkpoint/restart, important as well?
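
As a rough back-of-envelope check on that question, here is a minimal
Python sketch. Only the 293 nodes and the roughly one crash per week come
from your message; the downtime per crash, the size of the job that dies
with the node, and the uncheckpointed run time thrown away are placeholder
guesses of mine, and the function name is just for illustration.

```python
HOURS_PER_WEEK = 7 * 24


def crash_loss_fraction(nodes, crashes_per_week, downtime_hours,
                        job_nodes=1, lost_work_hours=0.0):
    """Fraction of weekly node-hours lost to single-node crashes.

    downtime_hours:  hours a crashed node stays out of service (assumed).
    job_nodes:       nodes in the job that dies with the crashed node (assumed).
    lost_work_hours: uncheckpointed run time discarded per job node (assumed).
    """
    capacity = nodes * HOURS_PER_WEEK
    lost = crashes_per_week * (downtime_hours + job_nodes * lost_work_hours)
    return lost / capacity


# 293 nodes and about one crash per week are from the quoted message;
# every other number below is a placeholder.
repair_only = crash_loss_fraction(293, 1, downtime_hours=24)
with_lost_job = crash_loss_fraction(293, 1, downtime_hours=24,
                                    job_nodes=32, lost_work_hours=12)

print(f"node downtime only:     {repair_only:.4%}")
print(f"plus one lost 32-node job: {with_lost_job:.4%}")
```

With these guesses both figures come out under 1% of the weekly
node-hours, which is why I suspect that most of the 25% goes somewhere
other than single-node crashes. Different downtime or job-size numbers
would of course shift the estimate.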

75% does seem like a reasonably good number.


More information about the Beowulf mailing list