Disk reliability (Was: Node cloning)
Josip Loncaric
josip at icase.edu
Wed Apr 11 14:38:15 PDT 2001
Greg Lindahl wrote:
>
> > For IBM drives (IDE or SCSI), one can download and use the Drive Fitness
> > Test utility (see
> > http://www.storage.ibm.com/techsup/hddtech/welcome.htm). This program
> > can diagnose typical problems with hard drives. In many cases, bad
> > blocks can be 'healed' by erasing the drive using this utility (back up
> > your data first, and be prepared for the 'Erase Disk' to take an hour or
> > more). If that fails and your drive is under warranty, the drive ought
> > to be replaced.
>
> NOOOOOOOOOOOOOOOOO!
>
> If a sector returns the wrong result 0.01% of the time, it is bad, but
> testing is unlikely to be intensive enough to detect it (10,000
> reads...) If you "heal" it, it will appear to work at first, but it
> will eventually turn up bad again. So all you're doing is papering
> over the problem. You ought to just replace the disk.
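
The quoted figures can be turned into a rough detection probability. As a minimal sketch, assuming each read of the marginal sector fails independently with probability 0.01% (the independence is an assumption here, not something either message states), the chance that n reads expose the fault at least once is 1 - (1 - p)^n:

```python
def detection_probability(p_fail: float, reads: int) -> float:
    """Chance that at least one of `reads` independent reads shows the error."""
    return 1.0 - (1.0 - p_fail) ** reads


# 0.01% per-read failure rate, taken from the quoted message.
P_FAIL = 1e-4

if __name__ == "__main__":
    for reads in (1, 2, 10, 1000, 10_000):
        prob = detection_probability(P_FAIL, reads)
        print(f"{reads:>6} reads: {prob:.4%} chance of seeing the fault")
```

Under that assumption, one or two test passes catch such a sector only about 0.01% to 0.02% of the time, and even 10,000 reads catch it only about 63% of the time.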

Granted, testing once or twice is not perfect, but if the 'healing' fails, you
can always replace the disk later, which is what IBM suggests. The
suggestion to 'heal' the disk using IBM's Drive Fitness Test comes from
its manual:
http://service.boulder.ibm.com/storage/hddtech/dft32ug.pdf
which says (on p. 27):
"[...] For example if during testing of your hard drive DFT reports a
error code of 0x70 as shown on page 14, this indicates that your hard
disk drive has one or more bad sectors. In most of these cases the
drive can heal itself of these errors. To do this first back-up all
your data from the problem drive (if possible) then run DFT again and
select the Erase Disk option which is under the Utilities heading.
[...] Once erase disk has completed you can then run one of the test
options Quick or Advance to confirm that the drive has been healed. The
result code, which should be displayed, is 0x00 if the test returns
another code then you should check with your drive/system vendor if the
drive can be return for warranty replacement." (sic!)

Sincerely,
Josip

P.S. I'm guessing that the manufacturer's list of bad blocks (written
at the disk drive factory) is the result of very limited testing (a few
passes at most). The drive you return for replacement will be tested
again (still only a few passes), and if it passes (with an updated
list of bad blocks) it will probably be reissued as a refurbished drive
to replace the ones other people sent in. No manufacturer
can afford to perform 10,000 reads of an entire 30GB drive, since that
would take at least 6 months per drive.
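
The six-month figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a sustained read rate of 20 MB/s, which is a hypothetical value chosen for illustration and not stated in this message; the function name is likewise made up here.

```python
SECONDS_PER_DAY = 86_400


def full_read_days(capacity_gb: float, throughput_mb_s: float, passes: int) -> float:
    """Days needed to read the whole drive `passes` times at a steady rate."""
    seconds_per_pass = capacity_gb * 1000 / throughput_mb_s  # 1 GB = 1000 MB
    return seconds_per_pass * passes / SECONDS_PER_DAY


if __name__ == "__main__":
    # 30 GB and 10,000 passes come from the text; 20 MB/s is an assumption.
    days = full_read_days(capacity_gb=30, throughput_mb_s=20, passes=10_000)
    print(f"about {days:.0f} days, or {days / 30:.1f} months")
```

At the assumed 20 MB/s this comes to roughly 174 days for 10,000 passes over 30 GB, the same order as the estimate above; a slower sustained rate pushes the figure well past six months.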
--
Dr. Josip Loncaric, Research Fellow mailto:josip at icase.edu
ICASE, Mail Stop 132C PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134