[Beowulf] real hard drive failures

Wed Jan 26 01:50:27 PST 2005

hi ya mark

On Tue, 25 Jan 2005, Mark Hahn wrote:

> > i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm
> > disks 
> 
> I'm pretty dubious of this: adding two 50Khour moving parts to 
> improve the airflow around a 1Mhour moving part which only dissipates
> 10W in the first place?  designing the chassis for proper airflow 
> with minimum fanage is obviously smarter and probably safer.

the purpose of the fan is to keep the hdd temp down, as low as possible

- while the disks have a 1M hr MTBF, those disks still fail

- most often, fans fail before anything else, and that creates a chain
  reaction in that the item it was cooling will be next to fail

	- you can detect the fan failure ( tach signal ) and
	replace the fan before the hard disk fails 

	- as a rule of thumb, a disk that runs 10C cooler will
	live about 2x as long before it dies, given the same
	operating conditions

- there are very very few chassis with proper airflow 
	- that cardboard around the cpu heatsink is silly,
	in that if that one fan dies, the cpu will die

	- if you have 2 or 3 separate fans around it, then it will
	not matter that one fan died

	- proper airflow has always been the trick to keeping
	the system running for a year or three
	and "good parts vendors" and "good parts selection"
 	make all the difference in the world
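
to put a number on the "10C cooler = 2x the life" rule of thumb, here
is a minimal sketch. it assumes life doubles for every 10C drop and
halves for every 10C rise, which is only a rough guide and not a vendor
spec. the function names and the 30,000 hr baseline are made up for
illustration.

```python
def life_multiplier(delta_c, doubling_step_c=10.0):
    """Relative lifetime for a part running delta_c degrees C cooler.

    Assumption: life doubles for every doubling_step_c degrees of cooling.
    A negative delta_c (running hotter) shortens life by the same rule.
    """
    return 2.0 ** (delta_c / doubling_step_c)


def expected_life_hours(base_hours, base_temp_c, actual_temp_c):
    """Scale a baseline life (hypothetical figure) to another temperature."""
    return base_hours * life_multiplier(base_temp_c - actual_temp_c)


if __name__ == "__main__":
    # hypothetical disk: 30,000 hr at 50C, compared at a few other temps
    for temp_c in (60, 50, 40, 30):
        print(temp_c, "C ->", expected_life_hours(30000, 50, temp_c), "hr")
```

under that assumption, the fan pays for itself as long as you catch
the dead fan ( tach signal ) before the disk cooks.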

> > 	- if downtime is important, and should be avoidable, than raid
> > 	is the worst thing, since it's 4x slower to bring back up than
> > 	a single disk failure
> 
> eh?  you have a raid which is not operational while rebuilding?

if the raid is in degraded mode ... you do NOT have "raid"

if it's resyncing ... you do NOT have raid ..

if another disk dies while it's operating in degraded mode or
during resync ... you have a very high possibility that the 
whole raid array is toast

it just depends on why and how it failed
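
a rough way to size the risk of a 2nd disk dying during the rebuild
window is sketched below. it assumes independent failures at a constant
rate ( exponential model ) using the quoted MTBF, which is optimistic:
disks from the same batch, in the same hot chassis, on the same power
supply tend to fail together, so treat the result as a lower bound. the
function name and the example numbers are made up for illustration.

```python
import math


def p_second_failure(n_disks, rebuild_hours, mtbf_hours):
    """Chance that at least one surviving disk fails during the rebuild.

    Assumption: each disk fails independently at a constant rate of
    1/mtbf_hours. Correlated failures will make the real risk higher.
    """
    surviving = n_disks - 1
    return 1.0 - math.exp(-surviving * rebuild_hours / mtbf_hours)


if __name__ == "__main__":
    # hypothetical 8-disk array, 24 hr resync, 1M hr MTBF disks
    print(p_second_failure(8, 24, 1_000_000))
    # same array if the disks are really behaving like 50K hr parts
    print(p_second_failure(8, 24, 50_000))
```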

> > 	- raid will NOT prevent your downtime, as that raid box
> > 	will have to be shutdown sooner or later 
> > 	( shutting down sooner ( asap )  prevents data loss )
> 
> huh?  hotspares+hotplug=zero downtime.

you're assuming that "hot plug" works as it's supposed to
	- i usually get the phone calls after the raid didn't
	do its magic for some odd reason

hotspare should work by shutting down (hotremove) the failed disks
and hotadding the previously idle/unused hotspare

or in the case of hw raid.. just pull the disk and plug in a new one
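
since the hotspare doesn't always kick in, it's worth checking for
degraded arrays yourself instead of waiting for the phone call. below
is a minimal sketch, assuming linux software raid ( md ) and the usual
/proc/mdstat layout, where each array has a status line like
"[2/1] [_U]" ( 2 disks wanted, 1 up ). the function name and the
sample text are made up for illustration.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active disks than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # array header, e.g. "md0 : active raid1 sdb1[1] sda1[0](F)"
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # status line, e.g. "      104320 blocks [2/1] [_U]"
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if status and current:
            wanted, up = int(status.group(1)), int(status.group(2))
            if up < wanted:
                degraded.append(current)
            current = None
    return degraded


SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      104320 blocks [2/1] [_U]
md1 : active raid1 sdb2[1] sda2[0]
      2096384 blocks [2/2] [UU]
unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

on a real box you would feed it open("/proc/mdstat").read() from cron
and send mail when the list is not empty.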

> but yes, treating whole servers as your hotspare+hotplug element is 
> a nice optimization, since hotplug ethernet is pretty cheap vs 
> $50 hotplug caddies for each and every disk ;)

i like/require redundancy of the entire system .. not just a disk
	- a complete 2nd independent system 
	( 2nd pw, 2nd mb, 2nd cpu, 2nd memory, 2nd disks, 2nd nic )

		- but if the t1/router/switch goes down .. oh well,
		but that too is cheap to get a 2nd backup

	( it's cheap compared to downtime, where it's important )

i think this covers michael's replies too

c ya
alvin