How do you keep clusters running....

Roger L. Smith roger at ERC.MsState.Edu
Wed Apr 3 13:23:44 PST 2002


I don't know how to say this without sounding condescending, but we
resolved this problem by purchasing high-quality machines.  We currently
use IBM x330s (although I also had good luck with our SGI 1100s before
SGI discontinued them).  We have enough nodes on hand that IBM has
stocked a couple of spare motherboards, power supplies, etc., but we don't
need them that often.  I've never had a fan failure.

In general, hardware problems are a very minor part of the care and
feeding of our cluster.



On Wed, 3 Apr 2002, Cris Rhea wrote:

>
> What are folks doing about keeping hardware running on large clusters?
>
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
>
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
>
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
>
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
>
> One of the problems I'm facing is that every time something croaks,
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
>
> For some things like fans, they sent extras for me to keep on-hand.
>
> For my last fan/CPU/motherboard failure, the node pair will be
> down ~5 days waiting for parts.
>
> Comments? Thoughts? Ideas?
>
> Thanks-
>
> --- Cris
>
>
>
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587


 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|



