IDE Seek Errors after kernel upgrade
Josip Loncaric
josip at icase.edu
Mon Jun 18 09:42:04 PDT 2001
Mark Hahn wrote:
>
> > It is possible that SeekComplete errors are due to some difficulty that
> > the drive has in tracking the servo signal in a few spots. Not
> > accessing those spots gets around the problem.
>
> no. badcrc's have nothing to do with any disk operation -
> they're strictly a cable/mode/noise problem.
The original post was mine. BadCRC errors and SeekComplete errors are
NOT related. They happened at different times (and also on different
nodes); I only listed them together to save space. Knowing that
cable/mode/noise problems cause BadCRC errors does not say anything
about SeekComplete errors, which are probably due to servo tracking
problems.
Today's drives have extremely high track density, so servo tracking
requires very high precision. The largest source of tracking errors is
runout (deviation from the ideal track shape). The servo control
algorithm estimates the repeatable runout and compensates using a
feedforward signal. Some less-than-ideal designs estimate the
compensation parameters only at power-up, so if such a drive is on for
months at a time, its mechanical parameters could drift away from the
estimates. Unlike SCSI drives, most IDE drives are designed for light
duty (e.g. being on only 11 hrs/day). Using them 24 hrs/day, 365 days/year
can create mechanical problems faster than the manufacturer expected.
As the drive's ball bearings wear, non-repeatable runout (NRRO) can
become an insurmountable problem for the servo tracking algorithm. For
this reason (and to reduce noise and cost) some recent IDE drives use
fluid dynamic bearings, which are expected to reduce NRRO by an order of
magnitude.
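To make the runout argument concrete, here is a toy simulation, not a model of any real drive's servo loop. It assumes the repeatable runout is a single sinusoid per revolution, that NRRO is plain Gaussian noise, and that feedforward simply subtracts a per-sector table estimated by averaging. The sector count, amplitudes, drift and noise levels are all made-up values in arbitrary units.

```python
import math
import random

SECTORS = 64  # servo samples per revolution (hypothetical value)


def rro(amplitude, phase):
    """Repeatable runout: the same offset at each sector on every revolution."""
    return [amplitude * math.sin(2 * math.pi * k / SECTORS + phase)
            for k in range(SECTORS)]


def measure(profile, nrro_sigma, rng):
    """One revolution of position error: repeatable part plus random NRRO."""
    return [p + rng.gauss(0.0, nrro_sigma) for p in profile]


def estimate_rro(profile, nrro_sigma, rng, revolutions=50):
    """Average many revolutions per sector; NRRO averages out, RRO remains."""
    sums = [0.0] * SECTORS
    for _ in range(revolutions):
        for k, e in enumerate(measure(profile, nrro_sigma, rng)):
            sums[k] += e
    return [s / revolutions for s in sums]


def residual_rms(profile, feedforward, nrro_sigma, rng, revolutions=50):
    """RMS tracking error left after subtracting the feedforward table."""
    total = 0.0
    for _ in range(revolutions):
        for e, f in zip(measure(profile, nrro_sigma, rng), feedforward):
            total += (e - f) ** 2
    return math.sqrt(total / (revolutions * SECTORS))


def demo(seed=0):
    rng = random.Random(seed)
    at_power_up = rro(amplitude=1.0, phase=0.0)
    after_drift = rro(amplitude=1.3, phase=0.4)   # assumed mechanical drift
    table = estimate_rro(at_power_up, 0.05, rng)  # estimated once, at power-up
    fresh_table = estimate_rro(after_drift, 0.05, rng)
    zero = [0.0] * SECTORS
    return {
        "uncompensated": residual_rms(at_power_up, zero, 0.05, rng),
        "compensated": residual_rms(at_power_up, table, 0.05, rng),
        "stale_table": residual_rms(after_drift, table, 0.05, rng),
        "fresh_table": residual_rms(after_drift, fresh_table, 0.05, rng),
        "worn_bearings": residual_rms(after_drift, fresh_table, 0.5, rng),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name:14s} {value:.3f}")
```

In this sketch the power-up table removes most of the error while the runout stays the same, a stale table leaves a much larger residual after the assumed drift, and raising the NRRO level (standing in for worn bearings) leaves an error that no table can remove, because it does not repeat from one revolution to the next.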
A few comments regarding hard disk reliability, the way I understand it:
(1) Embedded servo signals are written at the factory using high
precision machines. This process cannot be duplicated by the drive
itself.
(2) Some checking is done and a factory list of bad blocks is
generated. If the drive is within tolerance, it is shipped.
(3) Today's IDE drives can map out a small number of bad blocks
automatically. Once the number of bad blocks exceeds what the drive can
map out, the OS will start to see them.
(4) When bad blocks (or SeekComplete errors) are found, you have three
choices:
  (i) map them out using Linux 'e2fsck -c ...' or 'mkswap -c ...'
  (ii) if you have IBM drives, use IBM's Disk Fitness Test to check
       the drive, map out bad blocks and zero the disk. Afterwards,
       the drive can continue to map out bad blocks as they develop,
       hiding them from the OS for a while.
  (iii) if neither (i) nor (ii) provides a long-term fix, replace the
        drive
(5) When you return a drive under warranty, you'll get a remanufactured
replacement drive. "Remanufactured" probably means that it was
subjected to some testing at the factory, had its factory list of bad
blocks updated, and if it tested within tolerance, was shipped. This
process is similar to what IBM's Disk Fitness Test does, so the
replacement drives have a similar chance of being bad. A bad drive may
need to be replaced several times before a good drive is found.
(6) Finally, the drive(s) might be OK and the problem may lie
elsewhere. If a kernel upgrade degraded drive reliability, most likely
the problem is in software, not hardware.
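The remapping behavior in (3) and (4)(ii) can be pictured with a toy model. This is only a sketch of the idea, not any real drive's firmware: the class name `ToyDrive`, the size of the spare pool and the block numbers are all hypothetical.

```python
class ToyDrive:
    """Toy model of automatic bad-block remapping (not real firmware)."""

    def __init__(self, spare_blocks):
        self.spare_blocks = spare_blocks  # size of the reserve pool (assumed)
        self.remapped = set()             # bad blocks hidden from the OS
        self.visible = set()              # bad blocks the OS will see

    def mark_bad(self, block):
        """Record a newly failed block, remapping it while spares remain."""
        if block in self.remapped or block in self.visible:
            return
        if len(self.remapped) < self.spare_blocks:
            self.remapped.add(block)
        else:
            self.visible.add(block)

    def read(self, block):
        """Return True if the OS sees a successful read of this block."""
        return block not in self.visible


if __name__ == "__main__":
    drive = ToyDrive(spare_blocks=3)
    for block in (10, 20, 30, 40):
        drive.mark_bad(block)
    print(drive.read(30))  # remapped, so the OS never notices
    print(drive.read(40))  # pool exhausted, so the OS sees the error
```

With a pool of three spares, the first three failures stay invisible and the fourth one surfaces to the OS, which is the "hiding them for a while" effect described above.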
Sincerely,
Josip
P.S. http://www.storage.ibm.com/hardsoft/diskdrdl/library/technolo.htm
--
Dr. Josip Loncaric, Research Fellow mailto:josip at icase.edu
ICASE, Mail Stop 132C PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134