[Beowulf] Re: failure trends in a large disk drive population
David Mathog mathog at caltech.edu
Thu Feb 22 12:30:21 PST 2007
Jim Lux wrote:
> >Now we need to know exactly how you defined "failed".
>
> The paper defined failed as "requiring the computer to be pulled"
> whether or not the disk was actually dead.

That was sort of my point: if you're looking for indicators that lead to "failed disk", there should be a precise definition of what "failed disk" is. How am I to know what criteria Google uses for classifying a machine as nonfunctioning? If the system is pulled because the CPU blew up, that's one thing, but if they pulled it for any disk-related reason, we need to know how bad "bad" was.

> I would make the case that it's not worth it to even glance at the
> outside of the case of a dead unit, much less do failure analysis on
> the power supply. FA is expensive, new computers are not. Pitch the
> dead (or "not quite dead yet, but suspect") computer, slap in a new
> one and go on.

Well, they cared enough to do the study!

I think the heart of the problem is that disk failures are a bit like airplane crashes: everything looks great until something snaps, and then the plane goes down shortly thereafter. Similarly, there's just not that much time between the cause of the failure manifesting itself and the final disk failure. Once the disk heads start bouncing off the disk, or some piece of dirt or metal shaving gets between the disks and the heads, it's all over pretty quickly. Until that point there may be a few weak indications that something is wrong, but they may or may not have a relation to the final failure event. For instance, a tiny bit of junk stuck to the surface may cause a few blocks to remap and never do anything else. It might or might not mean that a huge chunk of the same stuff is about to wreak havoc. (Its absence is clearly preferred though, since any remapped blocks can result in data loss.)

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
