IDE disk errors

J. G. LaBounty jgl at unix.shell.com
Wed Jun 13 11:04:42 PDT 2001


Thanks for your input. Just this morning we booted our 50-node Supermicro
cluster with the noapic option. I will post to the group if it solves our
problem.


> From: "Michael T. Prinkey" <mprinkey at aeolusresearch.com>

> 
> Hi John,
> 
> I have encountered similar problems.  I solved them by building the
> kernel without APIC, or by running the kernel with the noapic option.
> 
> Regards,
> 
> Mike Prinkey
> Aeolus Research, Inc.
> 
> "J. G. LaBounty" wrote:
> > 
> > 
> >  We are being swamped with disk errors. Most of the errors are logged
> >  as follows:
> > 
> >  Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> >  Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7975408, sector=2625696
> >  Jun 12 01:44:40 scf402n kernel: end_request: I/O error, dev 03:08 (hda), sector 2625696
> > 
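For anyone triaging logs like the ones quoted above, here is a minimal Python sketch (not part of the original post) that decodes the status= and error= values from dma_intr lines into the same flag names the kernel prints. The bit tables reflect my understanding of the ATA status and error registers, and the names decode, parse_line, STATUS_BITS and ERROR_BITS are made up for this illustration.

```python
import re

# ATA status register bits, using the flag names seen in the kernel log.
# Assumption: this mapping follows the standard ATA register layout.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# ATA error register bits (only meaningful when the Error status bit is set).
ERROR_BITS = {
    0x80: "BadSector",
    0x40: "UncorrectableError",
    0x10: "SectorIdNotFound",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}

# Matches e.g. "hda: dma_intr: status=0x51" or "hda: dma_intr: error=0x40".
LINE_RE = re.compile(
    r"(?P<dev>hd[a-z]): dma_intr: (?P<kind>status|error)=0x(?P<val>[0-9a-fA-F]+)"
)


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def parse_line(line):
    """Return (device, kind, value, flags) for a dma_intr line, else None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    value = int(m.group("val"), 16)
    table = STATUS_BITS if m.group("kind") == "status" else ERROR_BITS
    return m.group("dev"), m.group("kind"), value, decode(value, table)


if __name__ == "__main__":
    sample = ("Jun 12 01:44:40 scf402n kernel: hda: dma_intr: "
              "status=0x51 { DriveReady SeekComplete Error }")
    print(parse_line(sample))
```

Feeding a whole syslog file through parse_line and counting results per device would give a quick per-disk error tally, which may help separate a few bad drives from a cluster-wide pattern.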
> >  Everything that I can find says this is a media problem. Our typical
> >  recovery procedure is to:
> > 
> >  1. Run e2fsck -c -v -y /dev/hdX
> >     We run this after a disk error, but eventually the system hangs or
> >     we get so many errors that the check takes too long to complete
> >     (over 2 hours; with no errors it takes about 45 minutes).
> >  2. If #1 fails, run the IBM DFT utility to reformat the drive. After
> >     reformatting, we have run e2fsck -c and it finds no errors. If the
> >     reformat fails, we return the drive for replacement.
> > 
> >  Configuration:
> >  Nodes  Motherboard        CPUs         Disks per node             Age        Failures
> >  34     ASUS P2BD          2 x 600 MHz  2 x Western Digital 26 GB  18 months  6
> >  50     ASUS P2BD          2 x 800 MHz  2 x IBM Deskstar 30 GB     8 months   21
> >  150    Tyan 2500          2 x 800 MHz  2 x IBM Deskstar 45 GB     6 months   104
> >         (disks are attached to a Promise 100 card)
> >  50     Supermicro 370DLE  2 x 1 GHz    2 x IBM Deskstar 60 GB     2 months   28
> > 
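To compare the groups in the configuration table on an equal footing, here is a rough Python sketch (my addition, not from the original post) that turns the raw counts into failures per drive and failures per drive-year. It assumes every node has exactly two drives as listed, that the failure column counts drive incidents (possibly including repeats on the same drive), and that the age column is the time in service for the whole group. The Group class and its method names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Group:
    label: str
    nodes: int
    drives_per_node: int
    age_months: int
    failures: int

    @property
    def drives(self):
        # Total drives in the group (assumes every node is fully populated).
        return self.nodes * self.drives_per_node

    def failures_per_drive(self):
        # Reported failures divided by the number of drives in service.
        return self.failures / self.drives

    def failures_per_drive_year(self):
        # Normalizes by time in service so young and old groups are comparable.
        drive_years = self.drives * self.age_months / 12
        return self.failures / drive_years


# Figures copied from the configuration table in the quoted message.
GROUPS = [
    Group("ASUS P2BD, WD 26 GB", 34, 2, 18, 6),
    Group("ASUS P2BD, IBM 30 GB", 50, 2, 8, 21),
    Group("Tyan 2500, IBM 45 GB", 150, 2, 6, 104),
    Group("Supermicro 370DLE, IBM 60 GB", 50, 2, 2, 28),
]

if __name__ == "__main__":
    for g in GROUPS:
        print(f"{g.label:30s} {g.failures_per_drive():6.1%} per drive, "
              f"{g.failures_per_drive_year():5.2f} per drive-year")
```

Normalizing by drive-years matters here because the groups range from 2 to 18 months old, so raw failure counts alone understate how quickly the newest drives are failing.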
> >  All nodes are running Red Hat 6.2 with a 2.2.16 kernel. DMA is turned
> >  on in the kernel, and the Promise 100 patch is installed.
> > 
> >  For some reason most of our failures have been on the root disk. We have
> >  tried running with root and swap on 1 disk and application scratch space
> >  on the second disk.  While this seems to reduce the frequency of the
> >  error, it does not eliminate it.
> > 
> >  We are also dropping the transfer rate of the device back to a slower
> >  speed. We are using DMA mode. As a last resort, we may try PIO mode but
> >  really don't want to take that performance hit.
> > 
> >  This may seem like a lot of work for drives under warranty, but IBM no
> >  longer makes the 45 gb drive. Warranty returns are taking several weeks
> >  to get the replacements. We have found that the replacements are not any
> >  better than the drives that can be reformatted.
> > 
> >  We have looked at moving to SCSI drives of similar size but don't want
> >  to take the price hit. Adding 2 SCSI drives and a controller would bump
> >  our base price 30 - 50%.
> > 
> >  Has anyone else experienced similar problems? Any suggestions as to
> >  what we could try to alleviate the problem?
> > 
> > 
> >  John
> > 
> > 


John





