How do you keep clusters running....

Leandro Tavares Carneiro leandro at ep.petrobras.com.br
Thu Apr 4 05:12:54 PST 2002


We run a Beowulf cluster here with 64 production nodes and 128
processors, and we have problems like yours with fans.
Our cluster hardware is very cheap, using motherboards and cases
easily found on the local market, and the problems are critical.
We have 5 spare nodes, but only 3 of them are ready to work. All our
production nodes and the 3 spare nodes which are ready to start are
dual PIII 1GHz; the other 2 spare nodes are dual PIII 800MHz, but those
processors are Slot 1 (SECC2) and we have one node down because we
cannot find coolers for them! The cooler vendors say they are no longer
producing SECC2 coolers, and I am studying how I can adapt other fans
to those coolers... this is sad but true.
We have a lot of problems with memory, hard disks and other parts.
Three months ago, our cluster had one PIII 500MHz per node, and after
the upgrade to dual 1GHz we now have plenty of spare memory and disks.
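Since dead fans seem to be the most common failure on both our clusters, one thing that helps is polling fan RPMs on each node before a dead CPU fan cooks the processor. Below is a minimal sketch that parses `sensors`-style output (from the lm-sensors package) and flags slow or stopped fans; the RPM threshold and the sample labels are made-up values for illustration, and how you collect the output from each node (rsh/ssh from a cron job, for example) is up to your setup.

```python
import re

# Threshold below which a fan is considered failing.
# Assumed value for illustration; tune it for your actual fans.
MIN_RPM = 1000

def find_bad_fans(sensors_output):
    """Scan `sensors`-style output for fan lines with too-low RPM.

    Returns a list of (label, rpm) pairs for fans at or below MIN_RPM.
    """
    bad = []
    # Match lines like "fan1:  5192 RPM" or "CPU Fan:  0 RPM"
    pattern = r"^([^:\n]*[Ff]an[^:\n]*):\s*(\d+)\s*RPM"
    for label, rpm in re.findall(pattern, sensors_output, re.MULTILINE):
        if int(rpm) <= MIN_RPM:
            bad.append((label.strip(), int(rpm)))
    return bad

if __name__ == "__main__":
    # In practice, replace this sample with the real output of
    # running `sensors` on a node.
    sample = (
        "fan1:  5192 RPM  (min = 3000 RPM)\n"
        "fan2:     0 RPM  (min = 3000 RPM)\n"
        "temp1: +38.0 C\n"
    )
    for label, rpm in find_bad_fans(sample):
        print("WARNING: %s reads %d RPM" % (label, rpm))
```

Running something like this from cron and mailing the warnings is cheap insurance compared to replacing a toasted CPU and motherboard.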

I think this kind of problem is inevitable with cheap PC parts, and it
can be reduced with high-quality (and high-price) parts. We are doing a
study to buy a new cluster for another application, and we have called
Compaq and IBM to see what they offer in hardware and software, hoping
for a future with fewer problems...

Regards, and sorry about my poor English; I am Brazilian and speak
Portuguese...

On Wed, 2002-04-03 at 18:04, Cris Rhea wrote:
> 
> What are folks doing about keeping hardware running on large clusters?
> 
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> 
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> 
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
> 
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around,
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
> 
> One of the problems I'm facing is that every time something croaks, 
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
> 
> For some things like fans, they sent extras for me to keep on-hand.
> 
> For my last fan/CPU/motherboard failure, the node pair will be 
> down ~5 days waiting for parts.
> 
> Comments? Thoughts? Ideas?
> 
> Thanks-
> 
> --- Cris
> 
> 
> 
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-- 
Leandro Tavares Carneiro
Support Analyst
EP-CORP/TIDT/INFI
Phone: 2534-1427