IDE disk errors
J. G. LaBounty
jgl at unix.shell.com
Wed Jun 13 11:04:42 PDT 2001
Thanks for your input. This morning we booted our 50-node Supermicro
cluster with the noapic option. I will post to the group if it solves
our problem.
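
For anyone checking a whole cluster, a small script can confirm that a
node really did boot with the option. This is only a sketch in Python:
the helper names are made up for illustration, and it assumes the kernel
exposes its boot command line in /proc/cmdline.

```python
def has_boot_option(cmdline, option):
    """Return True if `option` appears as a whole word in a kernel command line."""
    # Split on whitespace so that "noapic" does not match e.g. "noapictimer".
    return option in cmdline.split()


def node_booted_with(option, path="/proc/cmdline"):
    """Check the running kernel's command line (path is an assumption)."""
    with open(path) as f:
        return has_boot_option(f.read(), option)
```

Calling node_booted_with("noapic") on each node, for example over rsh or
ssh, would show whether any node was missed.
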
> From: "Michael T. Prinkey" <mprinkey at aeolusresearch.com>
>
> Hi John,
>
> I have encountered similar problems. I solved them by building the
> kernel without APIC, or by running the kernel with the noapic option.
>
> Regards,
>
> Mike Prinkey
> Aeolus Research, Inc.
>
> "J. G. LaBounty" wrote:
> >
> >
> > We are being swamped with disk errors. Most of the errors are logged
> > as follows:
> >
> > Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7975408, sector=2625696
> > Jun 12 01:44:40 scf402n kernel: end_request: I/O error, dev 03:08 (hda), sector 2625696
> >
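
To see which nodes and drives produce errors like the ones quoted above,
the logs can be tallied with a short script. A rough sketch in Python,
assuming each syslog message sits on one unwrapped line with exactly the
layout quoted above; the function name and the regular expression are
illustrative only.

```python
import re
from collections import Counter

# Matches the "dma_intr: error=..." line format quoted above.
DMA_ERROR = re.compile(
    r"(\S+) kernel: (hd[a-z]): dma_intr: error=(0x[0-9a-fA-F]+) "
    r"\{\s*(.*?)\s*\}, LBAsect=(\d+), sector=(\d+)"
)


def tally_dma_errors(lines):
    """Count dma_intr error lines per (host, drive) pair."""
    counts = Counter()
    for line in lines:
        match = DMA_ERROR.search(line)
        if match:
            host, drive = match.group(1), match.group(2)
            counts[(host, drive)] += 1
    return counts
```

Feeding it the lines of the collected log files from all nodes gives a
quick view of whether the errors cluster on particular hosts or on hda
versus the second drive.
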
> > Everything that I can find says this is a media problem. Our typical
> > recovery procedure is to:
> >
> > 1. Run e2fsck -c -v -y /dev/hdX
> >    We run this following a disk error, but eventually the system
> >    hangs, or we get so many errors that it takes too long to
> >    complete (over 2 hours; with no errors it takes about 45 minutes).
> > 2. If #1 fails, we run the IBM DFT utility to reformat the drive.
> >    After reformatting, we have run e2fsck -c and it finds no errors.
> >    If the reformat fails, we return the drive for replacement.
> >
> > Configuration:
> >
> > Number     Motherboard        CPU            Disk per node                   Age        # Failures
> > 34 nodes   ASUS P2BD          2-600MHz cpus  2 Western Digital 26 GB drives  18 months    6
> > 50 nodes   ASUS P2BD          2-800MHz cpus  2 IBM Deskstar 30 GB drives     8 months    21
> > 150 nodes  Tyan 2500          2-800MHz cpus  2 IBM Deskstar 45 GB drives     6 months   104
> >            Disks are attached to a Promise 100 card
> > 50 nodes   Supermicro 370DLE  2-1GHz cpus    2 IBM Deskstar 60 GB drives     2 months    28
> >
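
The configuration figures above can be turned into rough failure rates.
The sketch below is Python and rests on assumptions not stated in the
post: that "# Failures" counts failed drives, that each failure is a
different drive, and that "Age" applies to every drive in the group. The
labels and the function name are illustrative.

```python
# (label, nodes, drives per node, age in months, failures), from the table above
CONFIGS = [
    ("ASUS P2BD / WD 26 GB", 34, 2, 18, 6),
    ("ASUS P2BD / IBM 30 GB", 50, 2, 8, 21),
    ("Tyan 2500 / IBM 45 GB", 150, 2, 6, 104),
    ("Supermicro 370DLE / IBM 60 GB", 50, 2, 2, 28),
]


def failure_rates(nodes, drives_per_node, age_months, failures):
    """Return (fraction of drives failed, fraction failed per month of age)."""
    drives = nodes * drives_per_node
    fraction = failures / drives
    return fraction, fraction / age_months


for label, nodes, per_node, age, failures in CONFIGS:
    fraction, per_month = failure_rates(nodes, per_node, age, failures)
    print(f"{label:32s} {fraction:6.1%} of drives, {per_month:.2%} per month")
```

Under those assumptions the newest group has lost the largest share of
its drives per month in service, which is worth keeping in mind when
comparing the raw failure counts.
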
> > All nodes are running Red Hat 6.2 with a 2.2.16 kernel. DMA is turned
> > on in the kernel, and the Promise 100 patch is installed.
> >
> > For some reason most of our failures have been on the root disk. We
> > have tried running with root and swap on 1 disk and application
> > scratch space on the second disk. While this seems to reduce the
> > frequency of the error, it does not eliminate it.
> >
> > We are also dropping the transfer rate of the device back to a slower
> > speed. We are using DMA mode. As a last resort, we may try PIO mode
> > but really don't want to take that performance hit.
> >
> > This may seem like a lot of work for drives under warranty, but IBM
> > no longer makes the 45 GB drive. Warranty returns are taking several
> > weeks to get the replacements. We have found that the replacements
> > are not any better than the drives that can be reformatted.
> >
> > We have looked at moving to SCSI drives of similar size but don't
> > want to take the price hit. Adding 2 SCSI drives and a controller
> > would bump our base price 30 - 50%.
> >
> > Has anyone else experienced similar problems? Any suggestions as to
> > what we could try to alleviate the problem?
> >
> >
> > John
> >
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
John