Disk reliability (Was: Node cloning)

Robert G. Brown rgb at phy.duke.edu
Wed Apr 11 15:23:25 PDT 2001

On Wed, 11 Apr 2001, Josip Loncaric wrote:

> "[...] For example if during testing of your hard drive DFT reports a
> error code of 0x70 as shown on page 14, this indicates that your hard
> disk drive has one or more bad sectors.  In most of these cases the
> drive can heal itself of these errors.  To do this first back-up all
> your data from the problem drive (if possible) then run DFT again and
> select the Erase Disk option which is under the Utilities heading.
> [...]  Once erase disk has completed you can then run one of the test
> options Quick or Advance to confirm htat the drive has been healed.  The
> result code, which should be displayed, is 0x00 if the test returns
> another code then you should check with your drive/system vendor if the
> drive can be return for warranty replacement." (sic!)

The only way I can imagine for this to actually work to heal the disk is
if the drive's low-level formatting is somehow faulty.  There are two
"generic" low-level causes of bad blocks.  One is simply imperfect
plating or physical damage or anything else that results in an area of
the disk that won't hold its ferromagnetic magnetization.  This is the
kind of error that Greg talks about -- erasing or reformatting or
whatever won't fix this -- the only thing that will "fix" it is marking
it out as bad.

The other kind of error is a dynamic mechanical or electrical error -- a
write head starts to write a tiny bit early during a move and overwrites
a track boundary or other "soft" format data that defines and stabilizes
the disk geometry.  In the old days this was pretty common, disks were
awesomely expensive, and most disks came with a "low level format"
utility that would redraw all the tracks and mark out all the bad blocks
(with an optional feature that would look for bad blocks in the event
that you accidentally trashed the bad block list on the disk itself).  I
spent many a happy hour waiting for these utilities to finish, and
sometimes they would even work.

I would assume the "erase" option is really a name for a new low level
reformat that fixes the latter kind of error and MIGHT even help with
the former, if the bad blocks are "bad enough".  However, ferromagnetism
is nastily nonlinear and a bad block can very gradually lose its
information -- be "almost" stable.

Another bad thing is that a disk prone to dynamical errors that screw up
its low level formatting and hence its "blocks" -- one that doesn't
quite synchronize its read/write activity under certain patterns of use,
or (as was the documented case for certain disks some years ago) that
writes before it has fully spun up to speed -- can ALSO "work" after
being "fixed" with a low level format or badblocks run.  The problem is
generally fundamental, though, and will simply come back again later.
Some disks that did this gradually deteriorated until not even badblocks
could repair them.  There are definitely disks that are just plain
"lemons", although IBM disks are admittedly pretty good.

Disk errors make me nervous enough that I'm mostly with Greg on this one
-- if one can get them replaced for free, do it, and if you value your
own time consider spending money (preferably other people's money, of
course:-) to replace them if necessary.  There is always a subset of
possible errors that will pass a CRC test or miss error detection
routines and a bit error in a binary or data file is as undesirable as a
bit error in memory.  In a way you're lucky if it just causes immediate
system failure.  Back when low level recovery tools were ubiquitous (and
disks cost thousands of dollars, which is WHY they were ubiquitous:-) I
certainly used them a lot, but my "three year survival rate" with them
has overall been very low.
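
To make the point about errors that slip past a CRC concrete, here is a
minimal Python sketch.  It uses zlib's CRC-32 purely as a stand-in (real
drives use their own ECC/CRC schemes in firmware, so this illustrates
the principle and is not a model of any particular disk), and the helper
names undetectable_error and xor_bytes are made up for the example.  It
relies on the fact that a CRC is linear over XOR: any bit-flip pattern
whose own "pure" CRC is zero leaves the checksum of the data unchanged.

```python
import zlib


def undetectable_error(seed: bytes, length: int) -> bytes:
    """Build a nonzero bit-flip pattern of `length` bytes that zlib's
    CRC-32 cannot see.  (Hypothetical helper, for illustration only.)"""
    if length < len(seed) + 4:
        raise ValueError("length must be at least len(seed) + 4")
    # zlib.crc32 includes an initial value and a final XOR.  XORing with
    # the CRC of an all-zero buffer of the same length cancels both and
    # leaves the "pure" (linear) CRC of the seed.
    pure = zlib.crc32(seed) ^ zlib.crc32(bytes(len(seed)))
    # Appending the pure CRC (little-endian) drives the pure CRC of the
    # whole pattern to zero; leading zero bytes do not change that.
    pattern = seed + pure.to_bytes(4, "little")
    return bytes(length - len(pattern)) + pattern


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Apply a bit-flip pattern to a block of equal length."""
    return bytes(x ^ y for x, y in zip(a, b))


original = b"sixteen byte blk"
error = undetectable_error(b"bit flips!!!", len(original))
corrupted = xor_bytes(original, error)

# The data has changed, but the checksum has not.
print(corrupted != original)
print(zlib.crc32(corrupted) == zlib.crc32(original))
```

Random errors only rarely land on such a pattern (for a 32-bit CRC,
roughly one random corruption in 2^32 would), but "rarely" is not
"never", which is the worry expressed above.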

In some cases semi-recoverable failure has occurred near the warranty
boundaries and delay has cost me the opportunity to replace under
warranty.  Of course Moore's law for disks has been if anything more
aggressive than Moore's law for other system components (shorter
constant-cost doubling time) so perhaps this isn't a big deal, but
nowadays if a disk
fritzes under warranty I just take it back and get a new one right away.
Often a bigger one, since small disks are discontinued so aggressively.


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
