Disk reliability (Was: Node cloning)

Josip Loncaric josip at icase.edu
Wed Apr 11 14:38:15 PDT 2001

Greg Lindahl wrote:
> > For IBM drives (IDE or SCSI), one can download and use the Drive Fitness
> > Test utility (see
> > http://www.storage.ibm.com/techsup/hddtech/welcome.htm).  This program
> > can diagnose typical problems with hard drives.  In many cases, bad
> > blocks can be 'healed' by erasing the drive using this utility (back up
> > your data first, and be prepared for the 'Erase Disk' to take an hour or
> > more).  If that fails and your drive is under warranty, the drive ought
> > to be replaced.
> If a sector returns the wrong result 0.01% of the time, it is bad, but
> testing is unlikely to be intensive enough to detect it (10,000
> reads...).  If you "heal" it, it will appear to work at first, but it
> will eventually turn up bad again. So all you're doing is papering
> over the problem. You ought to just replace the disk.
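
To put a rough number on the point above: if one assumes that each read
of the flaky sector fails independently with probability 0.01% (an
idealization for illustration, not a measured property of any drive),
the chance that a test catches the sector is 1 - (1 - p)^n for n reads.
A minimal sketch; the function name detection_probability is made up
for this example:

```python
def detection_probability(p_error: float, reads: int) -> float:
    """Chance that at least one of `reads` reads returns a wrong result.

    Assumes every read fails independently with probability `p_error`,
    which is a simplification: real media errors are often correlated.
    """
    return 1.0 - (1.0 - p_error) ** reads


if __name__ == "__main__":
    p = 0.0001  # the 0.01% per-read error rate from the quoted message
    for n in (1, 2, 10, 100, 10_000):
        prob = detection_probability(p, n)
        print(f"{n:>6} reads: {prob:.4%} chance of seeing the error")
```

Under that assumption, one or two test passes will almost never see the
error, and even 10,000 reads of the sector would miss it roughly a third
of the time.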

Granted, testing once or twice is not perfect, but if the 'healed' drive
turns up bad again, you can always replace it later, which is what IBM
suggests.  The suggestion to 'heal' the disk using IBM's Drive Fitness
Test comes from its manual:


which says (on p. 27):

"[...] For example if during testing of your hard drive DFT reports a
error code of 0x70 as shown on page 14, this indicates that your hard
disk drive has one or more bad sectors.  In most of these cases the
drive can heal itself of these errors.  To do this first back-up all
your data from the problem drive (if possible) then run DFT again and
select the Erase Disk option which is under the Utilities heading.
[...]  Once erase disk has completed you can then run one of the test
options Quick or Advance to confirm htat the drive has been healed.  The
result code, which should be displayed, is 0x00 if the test returns
another code then you should check with your drive/system vendor if the
drive can be return for warranty replacement." (sic!)


P.S.  I'm guessing that the manufacturer's list of bad blocks (written
at the disk drive factory) is the result of very limited testing (a few
test runs at most).  The drive you return for replacement will be tested
again (also only a few runs), and if it passes (with an updated list of
bad blocks) it will probably be shipped as a refurbished drive to
replace one that somebody else sent in...  No manufacturer can afford to
perform 10,000 reads of an entire 30 GB drive, since that would take at
least 6 months per drive...
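
The 6-month figure can be checked with back-of-the-envelope arithmetic.
The sketch below assumes 1 GB = 10^9 bytes, a month of about 30.4 days,
and a constant sustained read rate; the 15 and 20 MB/s rates are
illustrative assumptions, not measurements, and the function names are
made up for this example:

```python
SECONDS_PER_DAY = 86_400


def full_read_days(capacity_bytes: float, passes: int,
                   bytes_per_second: float) -> float:
    """Days needed to read the whole drive `passes` times at a fixed rate."""
    total_bytes = capacity_bytes * passes
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


def required_rate(capacity_bytes: float, passes: int, days: float) -> float:
    """Sustained bytes/second needed to finish `passes` full reads in `days`."""
    return capacity_bytes * passes / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    capacity = 30e9   # 30 GB drive, as in the text above
    passes = 10_000   # 10,000 reads of every sector

    for rate_mb in (15, 20):  # assumed sustained rates in MB/s
        days = full_read_days(capacity, passes, rate_mb * 1e6)
        print(f"{rate_mb} MB/s: {days:.0f} days ({days / 30.4:.1f} months)")

    # Rate that would finish the job in exactly half a year (182.5 days).
    rate = required_rate(capacity, passes, 182.5)
    print(f"6 months requires about {rate / 1e6:.1f} MB/s sustained")
```

In other words, 10,000 full passes move 300 TB through the drive, so
"at least 6 months" corresponds to a sustained rate of roughly 19 MB/s
or less.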

Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134
