Disk reliability (Was: Node cloning)

Josip Loncaric josip at icase.edu
Mon Apr 9 07:06:49 PDT 2001


Mark Hahn wrote:
> 
> > kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> > kernel: hda: read_intr: status=0x5b { DriveReady SeekComplete
> > DataRequest Index Error }
> 
> this has NOTHING to do with bad blocks; it's purely a cabling
> (or possibly clocking) issue.  a transfer (just the transfer)
> has failed its checksum, and will be retried.

These were three completely different errors occurring at completely
different times.  One of them is a BadCRC error (which your comments
address) but the others appear to be due to problems in servo tracking. 
BTW, our IDE setup satisfies the IDE guidelines you mentioned (<18",
both ends plugged in, UDMA33, a single (master) drive).
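
For reference, here is a minimal sketch of how the status and error bytes
in those log lines break down into the flag names shown.  The bit tables
are an assumption based on the Linux IDE driver's log naming and the ATA
status register layout, not taken from any drive documentation, and the
function names are made up for illustration.

```python
# Bit-to-name tables: assumed to follow the Linux IDE driver's log output.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of all bits set in an IDE status byte."""
    return [name for mask, name in STATUS_BITS if value & mask]


def decode_error(value):
    """Return the names of all bits set in an IDE error byte.

    Bit 0x80 is reported as BadCRC when the abort bit (0x04) is also
    set, and as BadSector otherwise (assumed driver convention).
    """
    names = []
    if value & 0x04:
        names.append("DriveStatusError")
    if value & 0x80:
        names.append("BadCRC" if value & 0x04 else "BadSector")
    if value & 0x40:
        names.append("UncorrectableError")
    if value & 0x10:
        names.append("SectorIdNotFound")
    if value & 0x02:
        names.append("TrackZeroNotFound")
    if value & 0x01:
        names.append("AddrMarkNotFound")
    return names


if __name__ == "__main__":
    for status in (0x51, 0x5B):
        print(hex(status), decode_status(status))
    print(hex(0x84), decode_error(0x84))
```

Decoding 0x51, 0x5b and 0x84 this way reproduces the flag names in the
three kernel messages quoted above.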

> (I have one myself, still in use.  early SMART implementations like
> this one don't seem to provide any useful numbers.)

We also noticed that :-(.  In fact, no Linux S.M.A.R.T. utility (I tried
two different ones) produced reasonable output with our disks, including
recent IBM models.  Fortunately, IBM's Disk Fitness Test tool seems to
work correctly.  Basically, it confirms the 'badblocks' diagnosis and
recommends erasing the disk.  I found it easier to map out the bad
blocks using 'e2fsck -c ...'.

> have you run badblocks multiple times, and compared the output?

That would make sense, if CRC errors were dominant.  However, the CRC
errors are very rare, while SeekComplete errors are seen with regularity
in the same spots.  
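
The comparison suggested above can be sketched as follows, assuming each
'badblocks' pass was saved to a text file with one block number per line.
The helper names and the block numbers in the example are hypothetical.
Blocks that show up in every pass point to a defect at a fixed location,
while blocks that appear only once are more likely transient transfer
errors.

```python
from collections import Counter


def parse_block_list(text):
    """Parse block numbers, one per line, ignoring blank lines."""
    return [int(token) for token in text.split()]


def recurring_blocks(runs, min_runs=None):
    """Return sorted block numbers reported in at least min_runs passes.

    runs is a list of block-number lists, one list per pass.
    By default a block must appear in every pass.
    """
    if min_runs is None:
        min_runs = len(runs)
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count each block once per pass
    return sorted(block for block, n in counts.items() if n >= min_runs)


if __name__ == "__main__":
    # Made-up block numbers, for illustration only.
    pass1 = parse_block_list("1024\n20480\n20481\n")
    pass2 = parse_block_list("20480\n20481\n77777\n")
    print(recurring_blocks([pass1, pass2]))
```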

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134



