disadvantages of a linux cluster

Jim Lux James.P.Lux at jpl.nasa.gov
Wed Nov 6 16:21:32 PST 2002


>
>   b) Uptime, measured as (total time systems are booted into the OS and
>available for numerical tasks/total amount of time ALL systems have been
>around).
>
>This means that if you have 9 systems booted and a hot spare, the best
>you can count for uptime is 90%.  It also means that if a system crashes
>in the middle of the night and you don't get around to fixing it until
>the next day, you lose eight or twelve hours, not the ten minutes it
>eventually takes you to fix it after discovering the crash, pulling the


If the cluster were claimed to have 9 processors' worth of processing
capability, and the OS and scheduler allow transparent use of the hot
spare, then you could get 100% uptime as long as you only had 1 failure
at a time.
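
To make the difference between the two definitions concrete, here is a
rough sketch in Python. Everything in it is an assumption for illustration:
the function names, the 24-hour log, and the idea that the spare takes over
instantly. It compares the quoted metric (booted node-hours over node-hours
of ALL systems) with uptime measured against the 9 nodes of claimed capacity.

```python
def booted_fraction(booted_per_hour, total_nodes):
    """Quoted metric: booted node-hours / node-hours of ALL systems owned."""
    return sum(booted_per_hour) / (total_nodes * len(booted_per_hour))


def capacity_uptime(healthy_per_hour, claimed_nodes):
    """Fraction of hours in which at least `claimed_nodes` were usable."""
    ok_hours = sum(1 for n in healthy_per_hour if n >= claimed_nodes)
    return ok_hours / len(healthy_per_hour)


TOTAL_NODES = 10    # 9 working nodes plus 1 hot spare
CLAIMED_NODES = 9   # capacity the cluster is advertised to deliver

# Hypothetical day: one node dies at hour 2 and is repaired 12 hours later.
# The spare is assumed to take over transparently during that window.
healthy_per_hour = [10] * 2 + [9] * 12 + [10] * 10

# Only the claimed 9 nodes are ever booted for numerical work.
booted_per_hour = [min(n, CLAIMED_NODES) for n in healthy_per_hour]

quoted = booted_fraction(booted_per_hour, TOTAL_NODES)
claimed = capacity_uptime(healthy_per_hour, CLAIMED_NODES)
print(f"quoted metric: {quoted:.0%}, capacity metric: {claimed:.0%}")
```

Under the quoted metric the idle spare always counts against you; under
the capacity metric the overnight crash costs nothing, because 9 nodes'
worth of work was available the whole time.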

One could implement this in two generalized ways: 10 processors each
running at 90% (9 processors' worth of "work", that is), or 9 processors
running full tilt, with one sitting idle.  There are performance and
reliability variations: running full tilt runs hotter, which increases
failure probability, but then the idle unit is relatively cold and
isn't "consuming life".


In an extreme case, say you had mirroring: two complete copies of the
cluster, running in parallel.  It's not efficient (by some metric), but it
is potentially highly reliable.  However, as the Pfail of a given cluster
goes up, it starts to be worth the increased cost (computationally) of a
finer-grained redundancy (i.e. the per-node overhead goes up, but you
compensate by using more nodes).
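
A back-of-the-envelope way to compare the two approaches, as a sketch only:
it assumes independent node failures with the same probability p, assumes a
mirrored copy is lost if any one of its nodes fails, and ignores the
overhead of the redundancy scheme itself. The function names and the values
of p and n are made up for illustration.

```python
from math import comb


def at_least_k_of_n(k, n, p_fail):
    """Probability that at least k of n independent nodes are working."""
    p_ok = 1.0 - p_fail
    return sum(
        comb(n, i) * p_ok**i * p_fail ** (n - i) for i in range(k, n + 1)
    )


def mirrored_cluster(n, p_fail):
    """Two full copies of an n-node cluster; either intact copy suffices.

    Assumption: a copy is lost as soon as any one of its nodes fails.
    """
    p_copy_ok = at_least_k_of_n(n, n, p_fail)
    return 1.0 - (1.0 - p_copy_ok) ** 2


def with_spares(n, spares, p_fail):
    """Node-level redundancy: n nodes needed out of n + spares."""
    return at_least_k_of_n(n, n + spares, p_fail)


p = 0.01  # hypothetical per-node failure probability over some interval
n = 9     # nodes' worth of work required

mirrored = mirrored_cluster(n, p)   # uses 18 nodes
spare = with_spares(n, 1, p)        # uses 10 nodes
print(f"mirrored (18 nodes): {mirrored:.4f}")
print(f"one hot spare (10 nodes): {spare:.4f}")
```

In this toy model the single hot spare comes out ahead of the full mirror
while using far fewer nodes, which is the sense in which finer-grained
redundancy can be worth its overhead. Real failures are often correlated
(power, cooling, network), so treat the numbers as illustrative only.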

Cornell has produced some interesting work in this area (Spinglass, for
instance), but it's unclear whether it's being used in a production
environment.



