Disk reliability (Was: Node cloning)

Donald Becker becker at scyld.com
Wed Apr 11 16:58:40 PDT 2001


On Wed, 11 Apr 2001, Robert G. Brown wrote:

> On Wed, 11 Apr 2001, Josip Loncaric wrote:
> 
> > "[...]  In most of these cases the
> > drive can heal itself of these errors."

> The only way I can imagine for this to actually work to heal the disk is
> if the drive's low-level formatting is somehow faulty.  There are two
> "generic" low-level causes of bad blocks.
>    One is simply imperfect plating or physical damage or...
>    The other kind of error is a dynamic mechanical or electrical error...
>
> I would assume the "erase" option is really a name for a new low level
> reformat that fixes the latter kind of error and MIGHT even help with

When they say "heal", they actually mean "remap to substitute disk
blocks reserved for this purpose".  They must have thought that the
concept of remapping disk blocks was too confusing.

Most modern disks use a three-level error control scheme.  A typical
drive works as follows:

  A hardware-based convolutional decoder is applied to the signal
  coming off the read heads.  It picks the most likely value for
  marginal signals based on the surrounding bits.  The correction/error
  level info is usually discarded.

  A block check is applied to the resulting data block.  If an error is
  detected, the drive firmware takes an error-correcting step.  If few
  enough bits have been corrupted, the error is corrected in software
  and the data is perhaps written back to the same location.

  If too many bits have been corrupted to rely on the software error
  correcting code, the drive might return a soft error.  Either the
  driver or the OS then retries the read several times.  If one of the
  re-reads works, the corrected data is written back, perhaps to a newly
  remapped disk block.  If none of the re-reads works, the drive returns
  a hard error and remaps the bad block to a reserved good block.
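
The second and third levels above can be sketched in a few lines of
Python.  This is only an illustration of the decision flow, not real
firmware.  The names ECC_LIMIT, MAX_RETRIES and read_block are made up
for the example, and the numbers are arbitrary.  The first level (the
hardware decoder) is not modelled; the input is simply the number of
bit errors the block check sees on each read attempt.

```python
from enum import Enum


class ReadResult(Enum):
    CLEAN = "clean"
    ECC_CORRECTED = "ecc_corrected"
    RECOVERED_BY_RETRY = "recovered_by_retry"
    HARD_ERROR = "hard_error"


ECC_LIMIT = 4    # hypothetical: most bit errors the firmware ECC can fix
MAX_RETRIES = 3  # hypothetical: re-reads attempted after a soft error


def read_block(bit_errors_per_attempt, spare_blocks):
    """Return (result, remaining spare blocks) for one block read.

    bit_errors_per_attempt lists the bit errors seen by the block check
    on the first read and on each later re-read; it needs at least one
    entry.
    """
    first, *retries = bit_errors_per_attempt
    if first == 0:
        return ReadResult.CLEAN, spare_blocks
    if first <= ECC_LIMIT:
        # Few enough bad bits: corrected by the firmware ECC step.
        return ReadResult.ECC_CORRECTED, spare_blocks
    # Soft error: the driver or OS re-reads a limited number of times.
    for bit_errors in retries[:MAX_RETRIES]:
        if bit_errors <= ECC_LIMIT:
            # A re-read worked.  A real drive might also remap here;
            # this sketch assumes it does not.
            return ReadResult.RECOVERED_BY_RETRY, spare_blocks
    # Every re-read failed: hard error, and one spare block is used up.
    return ReadResult.HARD_ERROR, max(spare_blocks - 1, 0)
```

For example, read_block([9, 9, 1], 10) models a block that fails twice
and then reads well enough to correct, while read_block([9, 9, 9, 9], 10)
models a block that never recovers and costs one spare.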

You can guess what is happening with a drive by using the SMART data.
 If you have plenty of stand-by blocks, you have good disk platters.
 If the number of stand-by blocks is decreasing, something is going
  wrong.  You should think about ordering a replacement drive before you
  see a hard error.
 If the number of stand-by blocks is approaching zero, buy a new drive
  Right Now.  You've been lucky not to have encountered a hard error.  Or
  maybe you've just been lucky not to notice it.
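
The rule of thumb above could be written down roughly as follows.  This
is a sketch under assumptions: assess_drive and the low_water threshold
are invented for the example, and how you get the stand-by block counts
out of the SMART data varies by drive and is not shown.

```python
def assess_drive(spare_history, low_water=10):
    """Classify a drive from successive stand-by block counts.

    spare_history holds the counts from oldest to newest reading and
    needs at least one entry.  low_water is a hypothetical threshold
    for "approaching zero".
    """
    current = spare_history[-1]
    if current <= low_water:
        return "replace now"
    # Any drop between consecutive readings means blocks were remapped.
    pairs = zip(spare_history, spare_history[1:])
    if any(later < earlier for earlier, later in pairs):
        return "order replacement"
    return "healthy"
```

With this, assess_drive([100, 100, 100]) reports a healthy drive,
assess_drive([100, 98, 97]) says to order a replacement, and
assess_drive([100, 40, 5]) says to replace the drive now.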


[[ I fondly remember my first real job, working for Bibb Cain and George
Clark at Harris-ATD.  They wrote one of the important books on Error
Correcting Coding.  It was enlightening to hear people talk about
algorithms and circuits in terms of dB. ]]

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993




