How do you keep clusters running....

Cris Rhea crhea at mayo.edu
Wed Apr 3 13:04:12 PST 2002


What are folks doing about keeping hardware running on large clusters?

Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...

Sure seems like every week or two, I notice dead fans (each RS-1200
has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).

My last fan failure was a CPU fan that toasted the CPU and motherboard.

How are folks with significantly more nodes than mine dealing with constant
maintenance on their nodes?  Do you have whole spare nodes sitting around,
ready to be installed if something fails, or do you have a pile of
spare parts?  Did you get the vendor (if you purchased prebuilt systems)
to supply a stockpile of warranty parts?

One of the problems I'm facing is that every time something croaks, 
Racksaver is very good about replacing it under warranty, but getting
the new parts delivered usually takes several days.

For some things like fans, they sent extras for me to keep on-hand.

For my last fan/CPU/motherboard failure, the node pair will be 
down ~5 days waiting for parts.

Comments? Thoughts? Ideas?

Thanks-

--- Cris



----
  Cristopher J. Rhea                      Mayo Foundation
  Research Computing Facility              Pavilion 2-25
  crhea at Mayo.EDU                        Rochester, MN 55905
  Fax: (507) 266-4486                     (507) 284-0587


