[Beowulf] Re: failure trends in a large disk drive population (google filing system)

Mon Feb 19 01:00:26 PST 2007

On 2/19/07, matt jones <jamesjamiejones at aol.com> wrote:

> if one fails there
> are still 3, if another there are still 2. i've also read somewhere else
> that if one fails, it can automatically recreate the image from the
> remaining ones on a spare node.

[...]

>this approach is rather ott, but it works and works well.

I'm not sure about the Google gents, but we use a reliability model to
calculate the number of nodes and their physical locations (with
continuous scheduling), so as to meet the expected reliability
coefficient specified by the system operator/deployer/configurator
(for the EE, SW and HW parts).
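
To make the "number of nodes from a reliability target" idea concrete,
here is a minimal sketch. It is not our actual model: it assumes
independent node failures, a single per-node survival probability over
the period of interest, and that the data survives as long as at least
one replica does. The function names and the numbers are hypothetical.

```python
def system_reliability(node_reliability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes survives."""
    return 1.0 - (1.0 - node_reliability) ** replicas


def replicas_needed(node_reliability: float, target: float) -> int:
    """Smallest replica count whose combined reliability meets `target`.

    Assumes independent failures, which real clusters (shared racks,
    shared power) only approximate.
    """
    if not 0.0 < node_reliability < 1.0:
        raise ValueError("node_reliability must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    replicas = 1
    # Add replicas one at a time until the target is reached.
    while system_reliability(node_reliability, replicas) < target:
        replicas += 1
    return replicas


if __name__ == "__main__":
    # Hypothetical inputs: 95% per-node survival, 99.9% target.
    print(replicas_needed(0.95, 0.999))
```

With these made-up inputs the answer is 3 replicas; a less reliable
node or a stricter target pushes the count up, which is why the
operator-supplied coefficient drives the node count.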

An HDD is an unreliable part of the system, with a roughly known
(actually, expected) reliability. Moreover, as we know, most HDDs
expose SMART metrics, which are a good way to correct the live
coefficients in the math model being used. The upshot is to use
adaptive techniques.
So Google is probably doing it the same way - a good company anyhow... ta-da! :)
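
A rough sketch of what "correcting live coefficients" from SMART data
could look like. Everything here is an assumption for illustration: the
exponential failure model, the counter names, and especially the
multiplier values, which are invented and not taken from any study or
from our code.

```python
import math


def adjusted_failure_rate(base_rate: float,
                          smart_counters: dict[str, int],
                          multipliers: dict[str, float]) -> float:
    """Scale a baseline annual failure rate by SMART warning signs.

    Each counter with a non-zero value multiplies the rate by its
    (hypothetical) factor; unknown counters leave the rate unchanged.
    """
    rate = base_rate
    for name, count in smart_counters.items():
        if count > 0:
            rate *= multipliers.get(name, 1.0)
    return rate


def reliability(rate: float, years: float) -> float:
    """Survival probability over `years` under a constant failure rate."""
    return math.exp(-rate * years)


if __name__ == "__main__":
    # Hypothetical multipliers; a real model would fit these from field data.
    factors = {"reallocated_sectors": 4.0, "scan_errors": 10.0}
    # Hypothetical live readings for one drive.
    drive = {"reallocated_sectors": 3, "scan_errors": 0}

    rate = adjusted_failure_rate(0.02, drive, factors)
    print(f"adjusted annual failure rate: {rate:.3f}")
    print(f"1-year reliability: {reliability(rate, 1.0):.4f}")
```

The adjusted per-drive reliability can then be fed back into whatever
model picks the number and placement of nodes, so the plan adapts as
drives age.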

Scal at Grid – http://sgrid.sourceforge.net/

//
(the perfect doc - the amazing work)