Disk reliability (Was: Node cloning)
Josip Loncaric
josip at icase.edu
Mon Apr 9 07:06:49 PDT 2001
Mark Hahn wrote:
>
> > kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> > kernel: hda: read_intr: status=0x5b { DriveReady SeekComplete
> > DataRequest Index Error }
>
> this has NOTHING to do with bad blocks; it's purely a cabling
> (or possibly clocking) issue. a transfer (just the transfer)
> has failed its checksum, and will be retried.
These were three completely different errors occurring at completely
different times. One of them is a BadCRC error (which your comments
address) but the others appear to be due to problems in servo tracking.
BTW, our IDE setup satisfies the IDE guidelines you mentioned (<18",
both ends plugged in, UDMA33, a single (master) drive).
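
For anyone reading these log lines bit by bit, here is a minimal sketch of
how the hex values map to the names in braces. It assumes the classic ATA
status/error register layout and the names the Linux IDE driver prints; the
function names decode_status and decode_error are made up for illustration,
and the bit tables should be checked against the driver source before being
relied on.

```python
# Sketch only: bit tables assume the classic ATA status/error register layout
# and the names printed by the Linux IDE driver. Function names are hypothetical.

STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the bits set in the status register value."""
    return [name for mask, name in STATUS_BITS if status & mask]


def decode_error(err):
    """Return the names of the bits set in the error register value.

    Bit 0x80 is ambiguous: with the abort bit (0x04) also set it is
    reported as BadCRC, otherwise as BadSector (assumed driver behaviour).
    """
    names = []
    if err & 0x04:
        names.append("DriveStatusError")
    if err & 0x80:
        names.append("BadCRC" if err & 0x04 else "BadSector")
    if err & 0x40:
        names.append("UncorrectableError")
    if err & 0x10:
        names.append("SectorIdNotFound")
    if err & 0x02:
        names.append("TrackZeroNotFound")
    if err & 0x01:
        names.append("AddrMarkNotFound")
    return names


if __name__ == "__main__":
    # Values taken from the log lines quoted above.
    print(hex(0x51), decode_status(0x51))
    print(hex(0x84), decode_error(0x84))
    print(hex(0x5B), decode_status(0x5B))
```

Run against the three values in the quoted log, this reproduces the names
shown in braces there.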
> (I have one myself, still in use. early SMART implementations like
> this one don't seem to provide any useful numbers.)
We also noticed that :-(. In fact, no Linux S.M.A.R.T. utility (I tried
two different ones) produced reasonable output with our disks, including
recent IBM models. Fortunately, IBM's Disk Fitness Test tool seems to
work correctly. Basically, it confirms the 'badblocks' diagnosis and
recommends erasing the disk. I found it easier to map out the bad
blocks using 'e2fsck -c ...'.
> have you run badblocks multiple times, and compared the output?
That would make sense, if CRC errors were dominant. However, the CRC
errors are very rare, while SeekComplete errors are seen with regularity
in the same spots.
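
The comparison Mark suggests can be sketched as follows: blocks reported in
every run point to a persistent media problem, while blocks that show up in
only some runs look more like transient transfer errors. This is a rough
illustration, assuming each run's output was saved as plain text with one
block number per line; the helper names and the block numbers in the example
are hypothetical, not taken from our disks.

```python
# Sketch only: assumes each badblocks run was saved as text with one block
# number per line. Helper names and sample block numbers are hypothetical.


def parse_badblocks(text):
    """Parse one run's output into a set of block numbers."""
    return {int(token) for token in text.split()}


def compare_runs(runs):
    """Split reported blocks into (persistent, intermittent) sorted lists.

    persistent:   reported in every run
    intermittent: reported in at least one run, but not in all of them
    """
    sets = [parse_badblocks(run) for run in runs]
    if not sets:
        return [], []
    persistent = set.intersection(*sets)
    seen = set.union(*sets)
    return sorted(persistent), sorted(seen - persistent)


if __name__ == "__main__":
    # Made-up output from three runs, for illustration only.
    runs = [
        "100\n101\n5000\n",
        "100\n101\n",
        "100\n101\n7777\n",
    ]
    persistent, intermittent = compare_runs(runs)
    print("persistent:  ", persistent)
    print("intermittent:", intermittent)
```

A list dominated by persistent entries would match what we see, namely the
same spots failing every time.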
Sincerely,
Josip
--
Dr. Josip Loncaric, Research Fellow mailto:josip at icase.edu
ICASE, Mail Stop 132C PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134