IDE disk errors
J. G. LaBounty
jgl at unix.shell.com
Wed Jun 13 08:03:56 PDT 2001
We are being swamped with disk errors. Most of the errors are logged
as follows:
Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7975408, sector=2625696
Jun 12 01:44:40 scf402n kernel: end_request: I/O error, dev 03:08 (hda), sector 2625696
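With this many errors it can help to tally them by device and sector rather than read the logs by eye. The following is a minimal Python sketch, not something we actually run. The regular expression is derived only from the three sample lines above, and the names parse_errors and decode_status are made up for illustration. The status bit table contains only the three bits the kernel names in the sample (0x51 = 0x40 + 0x10 + 0x01).

```python
import re
from collections import Counter

# Status bits exactly as the kernel names them in the sample line (0x51).
STATUS_BITS = {0x40: "DriveReady", 0x10: "SeekComplete", 0x01: "Error"}

# Pattern inferred from the sample log lines only; other kernels may differ.
ERROR_LINE = re.compile(
    r"kernel: (?P<dev>hd[a-z]): dma_intr: error=(?P<err>0x[0-9a-f]+) "
    r"\{ (?P<flags>[^}]*)\}, LBAsect=(?P<lba>\d+), sector=(?P<sector>\d+)"
)


def decode_status(value):
    """Return the names of the known status bits set in value, high bit first."""
    return [name for bit, name in sorted(STATUS_BITS.items(), reverse=True)
            if value & bit]


def parse_errors(lines):
    """Yield (device, flags, lba, sector) for each dma_intr error line."""
    for line in lines:
        m = ERROR_LINE.search(line)
        if m:
            yield (m.group("dev"), m.group("flags").split(),
                   int(m.group("lba")), int(m.group("sector")))


sample = [
    "Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 "
    "{ DriveReady SeekComplete Error }",
    "Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 "
    "{ UncorrectableError }, LBAsect=7975408, sector=2625696",
    "Jun 12 01:44:40 scf402n kernel: end_request: I/O error, "
    "dev 03:08 (hda), sector 2625696",
]

errors = list(parse_errors(sample))
per_device = Counter(dev for dev, _, _, _ in errors)

for dev, flags, lba, sector in errors:
    # lba - sector is the partition start only if "sector" is partition-relative
    # (an assumption; the log itself does not say so).
    print(dev, flags, "LBA", lba, "sector", sector, "offset", lba - sector)
```

For the sample, LBAsect minus sector is 5349712. If the second number is relative to the partition (dev 03:08 would be hda8), that difference is the partition's starting sector. Repeated hits on the same LBA across days would point at a fixed bad spot on the media.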
Everything I can find says this is a media problem. Our typical recovery
procedure is:
1. Run e2fsck -c -v -y /dev/hdX
   We run this after a disk error, but eventually either the system hangs
   or there are so many errors that the check takes too long to complete
   (over 2 hours; with no errors it takes about 45 minutes).
2. If #1 fails, run the IBM DFT utility to reformat the drive. When we
   have run e2fsck -c after reformatting, it has found no errors. If the
   reformat fails, we return the drive for replacement.
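The two steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not our actual tooling: the two-hour limit comes from the timing mentioned in step 1, the function names and the decision policy in next_step are hypothetical, and the exit-code bits are the ones documented in the e2fsck man page (1 = errors corrected, 2 = corrected and reboot needed, 4 = errors left uncorrected, higher bits = operational or usage problems).

```python
import subprocess


def e2fsck_command(device):
    """Build the step 1 invocation: -c bad-block scan, -v verbose, -y answer yes."""
    return ["e2fsck", "-c", "-v", "-y", device]


def next_step(exit_code, timed_out=False):
    """Map an e2fsck result onto a next action (hypothetical policy)."""
    if timed_out or exit_code & 4:
        return "reformat"            # step 2: run the vendor utility
    if exit_code & ~3:
        return "investigate"         # operational, usage, or library error
    if exit_code & 2:
        return "reboot"
    return "return-to-service"       # clean, or errors were corrected


def check_disk(device, limit_seconds=2 * 60 * 60):
    """Run e2fsck with a time limit and return the suggested next action."""
    try:
        result = subprocess.run(e2fsck_command(device), timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # The check ran past the limit, as happens on badly damaged drives.
        return next_step(0, timed_out=True)
    return next_step(result.returncode)
```

Calling check_disk("/dev/hda") on an unmounted filesystem would run the scan and return one of the four action strings. The reformat itself is left out because the IBM utility is run by hand.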
Configuration:
Nodes  Motherboard        CPUs         Disks per node             Age        Failures
34     ASUS P2BD          2 x 600 MHz  2 x Western Digital 26 GB  18 months  6
50     ASUS P2BD          2 x 800 MHz  2 x IBM Deskstar 30 GB     8 months   21
150    Tyan 2500          2 x 800 MHz  2 x IBM Deskstar 45 GB     6 months   104
       Disks are attached to a Promise 100 card
50     Supermicro 370DLE  2 x 1 GHz    2 x IBM Deskstar 60 GB     2 months   28
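To compare the groups on an equal footing, the failure counts can be normalized by drive count and age. The sketch below is illustrative arithmetic only. It assumes each failure is a distinct drive and that every drive in a group has been in service for the full age listed, neither of which the table states. The labels and the function name failure_stats are made up.

```python
# (nodes, disks per node, age in months, failures), copied from the table.
fleet = {
    "WD 26 GB": (34, 2, 18, 6),
    "IBM 30 GB": (50, 2, 8, 21),
    "IBM 45 GB": (150, 2, 6, 104),
    "IBM 60 GB": (50, 2, 2, 28),
}


def failure_stats(nodes, disks_per_node, months, failures):
    """Return (drive count, fraction failed, failures per drive-year)."""
    drives = nodes * disks_per_node
    fraction = failures / drives
    per_drive_year = fraction / months * 12
    return drives, fraction, per_drive_year


for name, row in fleet.items():
    drives, fraction, annual = failure_stats(*row)
    print(f"{name}: {drives} drives, {fraction:.1%} failed, "
          f"{annual:.2f} failures per drive-year")
```

On these assumptions the 45 GB group comes to 104 failures over 300 drives, about 35% in six months, against 6 over 68 drives, about 9% in eighteen months, for the Western Digital group.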
All nodes are running Red Hat 6.2 with a 2.2.16 kernel. DMA is turned on in the
kernel, and the Promise 100 patch is installed.
For some reason most of our failures have been on the root disk. We have
tried running with root and swap on 1 disk and application scratch space on the
second disk. While this seems to reduce the frequency of the error, it does
not eliminate it.
We are also dropping the transfer rate of the device back to a slower speed. We
are using DMA mode. As a last resort, we may try PIO mode but really don't
want to take that performance hit.
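For reference, a transfer mode is usually selected with hdparm's -X option, whose numeric value is 8 plus the mode number for PIO, 32 plus the mode number for multiword DMA, and 64 plus the mode number for UltraDMA (per the hdparm man page). The Python sketch below only builds the command line. The function names are hypothetical, and whether a given mode is accepted depends on the drive, the controller, and the driver.

```python
# Base values for hdparm -X, as documented in the hdparm man page.
XFER_BASE = {"pio": 8, "mdma": 32, "udma": 64}


def hdparm_xfer_value(kind, mode):
    """Return the numeric -X argument for a transfer kind and mode number."""
    return XFER_BASE[kind] + mode


def hdparm_command(device, kind, mode):
    """Build an hdparm command that sets DMA on/off and the transfer mode."""
    dma_flag = "-d0" if kind == "pio" else "-d1"
    return ["hdparm", dma_flag, f"-X{hdparm_xfer_value(kind, mode)}", device]


# Example: step down to UltraDMA mode 2 while keeping DMA enabled.
print(" ".join(hdparm_command("/dev/hda", "udma", 2)))
```

Stepping down one UltraDMA mode at a time keeps DMA enabled, so it costs less than falling back to PIO.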
This may seem like a lot of work for drives under warranty, but IBM no longer makes
the 45 GB drive, and warranty replacements are taking several weeks to arrive.
We have also found that the replacements are no better than the drives that
can be reformatted.
We have looked at moving to SCSI drives of similar size but don't want to take the
price hit. Adding 2 SCSI drives and a controller would raise our base price by
30-50%.
Has anyone else experienced similar problems? Any suggestions as to what we could
try to alleviate the problem?
John