Uptime data/studies/anecdotes ... ?
Richard Walsh rbw at ahpcrc.org
Tue Apr 2 10:24:22 PST 2002
On Tue, 2 Apr 2002 10:15:00 Roger Smith wrote:
>We currently run an average of about 75% utilization on our 586 processor
>(293 node) cluster. We probably have about one node per week crash and
>hang for various reasons.
>
>We have occasional problems with memory leaks or PBS hangups which require
>large scale reboots of the cluster. (Actually, PBS just died as I'm typing
>this, but our pbs heartbeat script should restart it automatically in a
>few minutes). I'd say we have to do a full reboot of the cluster about
>every 3-4 months.
>For a bunch of PC hardware running a free OS, this seems like a pretty
>good number to me. It's not in the same class as our Sun servers (nor
>even our SGIs!), but then, none of those systems are this large, either.
Thanks for the estimate. Do you use SCYLD or another
pseudo-single-system-image tool? I assume that 75% is a steady-state
number ... how long did it take your group to reach that state? If a
full reboot is required only every 3-4 months, then is single-node
failure your main source of cycle loss? Or are other things, such as
inefficient scheduling and lack of checkpoint/restart, important?
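
As a rough back-of-the-envelope way to frame that question, the sketch below turns the quoted figures (293 nodes, about one node failure per week, a full reboot every 3-4 months) into a fraction of node-hours lost per year. The function name `cycle_loss` and both of its arguments are hypothetical: the original message gives no figure for how long a failed node stays down or how long a full reboot takes, so those are inputs to be filled in, not data. Work lost by jobs that were running when a node died is not counted here.

```python
# Figures taken from the quoted message.
NODES = 293
NODE_FAILURES_PER_WEEK = 1.0
FULL_REBOOT_INTERVAL_MONTHS = 3.5  # midpoint of "every 3-4 months"

HOURS_PER_YEAR = 365 * 24


def cycle_loss(node_downtime_hours, reboot_outage_hours):
    """Return (single_node_fraction, full_reboot_fraction) of yearly node-hours lost.

    Both arguments are assumptions supplied by the caller; the thread
    does not state them.
    """
    total_node_hours = NODES * HOURS_PER_YEAR

    # Each single-node failure removes one node for node_downtime_hours.
    failures_per_year = NODE_FAILURES_PER_WEEK * 52
    single_node_loss = failures_per_year * node_downtime_hours

    # Each full reboot removes every node for reboot_outage_hours.
    reboots_per_year = 12 / FULL_REBOOT_INTERVAL_MONTHS
    reboot_loss = reboots_per_year * reboot_outage_hours * NODES

    return single_node_loss / total_node_hours, reboot_loss / total_node_hours


if __name__ == "__main__":
    # Purely illustrative inputs: a failed node is out for a day,
    # and a full reboot costs four hours.
    single, reboot = cycle_loss(node_downtime_hours=24, reboot_outage_hours=4)
    print(f"single-node failures: {single:.4%} of node-hours")
    print(f"full reboots:         {reboot:.4%} of node-hours")
```

With illustrative inputs of that size, both fractions come out well under 1% of available node-hours, which is why scheduling efficiency and the lack of checkpoint/restart are worth asking about separately.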
75% does seem like a reasonably good number.
rbw
More information about the Beowulf mailing list
