IDE disk errors


J. G. LaBounty jgl at unix.shell.com
Wed Jun 13 11:04:42 PDT 2001


Thanks for your input. Just this morning we booted our 50-node Supermicro
cluster with the noapic option. I will post to the group if it solves our
problem.
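
For anyone repeating this test across many nodes, it can help to confirm that each node really booted with the option. A minimal Python sketch, assuming the nodes expose the running kernel's command line at /proc/cmdline; the function names are illustrative and not part of any existing tool:

```python
def has_boot_option(cmdline: str, option: str = "noapic") -> bool:
    """Return True if `option` appears as a whole word in a kernel command line."""
    # Splitting on whitespace avoids false hits on longer words that
    # merely contain the option name.
    return option in cmdline.split()


def node_has_noapic(path: str = "/proc/cmdline") -> bool:
    """Read the running kernel's command line (assumed path) and look for noapic."""
    with open(path) as fh:
        return has_boot_option(fh.read())
```

Calling node_has_noapic() on each node (for example through whatever remote shell the cluster already uses) gives a quick yes/no per machine.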


> From: "Michael T. Prinkey" <mprinkey at aeolusresearch.com>

> 
> Hi John,
> 
> I have encountered similar problems.  I solved them by building the
> kernel without APIC, or by running the kernel with the noapic option.
> 
> Regards,
> 
> Mike Prinkey
> Aeolus Research, Inc.
> 
> "J. G. LaBounty" wrote:
> > 
> > 
> >  We are being swamped with disk errors. Most of the errors are logged
> >  as follows:
> > 
> >  Jun 12 01:44:40 scf402n kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> >  Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7975408, sector=2625696
> >  Jun 12 01:44:40 scf402n kernel: end_request: I/O error, dev 03:08 (hda), sector 2625696
> > 
> >  Everything that I can find says this is a media problem. Our typical
> >  recovery procedure is to:
> > 
> >  1. run e2fsck -c -v -y /dev/hdX
> >     We run this after each disk error, but eventually the system hangs or
> >     there are so many errors that the check takes too long to complete
> >     (over 2 hours; with no errors it takes about 45 minutes).
> >  2. If #1 fails, we run the IBM DFT utility to reformat the drive. After
> >     reformatting, we have run e2fsck -c and it finds no errors. If the
> >     reformat fails, we return the drive for replacement.
> > 
> >  Configuration:
> >  Nodes  Motherboard        CPUs         Disks per node               Age        Failures
> >     34  ASUS P2BD          2 x 600 MHz  2 x Western Digital 26 GB    18 months         6
> >     50  ASUS P2BD          2 x 800 MHz  2 x IBM Deskstar 30 GB        8 months        21
> >    150  Tyan 2500          2 x 800 MHz  2 x IBM Deskstar 45 GB        6 months       104
> >         (Disks are attached to a Promise 100 card)
> >     50  Supermicro 370DLE  2 x 1 GHz    2 x IBM Deskstar 60 GB        2 months        28
> > 
> >  All nodes are running Redhat 6.2 with a 2.2.16 kernel. DMA is turned on in
> >  the kernel plus the Promise 100 patch is installed.
> > 
> >  For some reason most of our failures have been on the root disk. We have
> >  tried running with root and swap on 1 disk and application scratch space
> >  on the second disk.  While this seems to reduce the frequency of the error,
> >  it does not eliminate it.
> > 
> >  We are also dropping the transfer rate of the device back to a slower
> >  speed. We are using DMA mode. As a last resort, we may try PIO mode but
> >  really don't want to take that performance hit.
> > 
> >  This may seem like a lot of work for drives under warranty, but IBM no
> >  longer makes the 45 GB drive. Warranty returns are taking several weeks to
> >  get the replacements. We have found that the replacements are not any
> >  better than the drives that can be reformatted.
> > 
> >  We have looked at moving to SCSI drives of similar size but don't want to
> >  take the price hit. Adding 2 SCSI drives and a controller would bump our
> >  base price 30 - 50%.
> > 
> >  Has anyone else experienced similar problems? Any suggestions as to what
> >  we could try to alleviate the problem?
> > 
> > 
> >  John
> > 
> > 
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
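
The kernel messages quoted above follow a regular pattern, so errors can be tallied per node and drive from collected syslog files. A rough Python sketch; the regular expression is fitted only to the sample lines in this thread, it assumes each message sits on a single line as syslog writes it, and the function name is made up for illustration:

```python
import re
from collections import Counter

# Matches the "error=" line of each report, for example:
#   Jun 12 01:44:40 scf402n kernel: hda: dma_intr: error=0x40 { UncorrectableError }, ...
# The "status=" and "end_request" lines of the same report are skipped on
# purpose so that each reported error is counted once.
ERROR_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<drive>hd[a-z]):\s+dma_intr:\s+error=0x[0-9a-fA-F]+"
)


def tally_dma_errors(lines):
    """Count dma_intr error reports per (host, drive) pair."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            counts[(match.group("host"), match.group("drive"))] += 1
    return counts
```

Feeding it the lines of /var/log/messages from every node and sorting the result by count would show whether the errors cluster on particular machines or drives.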


John
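
To put the failure counts in the quoted configuration table on a common footing, they can be divided by the number of drives in each group. A small Python sketch, assuming "2 ... drives" means two drives per node and that the failure column counts individual drive failures (it may include repeat failures of the same drive, which the original post does not say):

```python
# (description, nodes, drives per node, reported failures), copied from the
# configuration table in the quoted message.
CONFIGS = [
    ("ASUS P2BD, Western Digital 26 GB", 34, 2, 6),
    ("ASUS P2BD, IBM Deskstar 30 GB", 50, 2, 21),
    ("Tyan 2500, IBM Deskstar 45 GB", 150, 2, 104),
    ("Supermicro 370DLE, IBM Deskstar 60 GB", 50, 2, 28),
]


def failures_per_100_drives(nodes, drives_per_node, failures):
    """Reported failures per 100 installed drives (assumes the drive count above)."""
    return 100.0 * failures / (nodes * drives_per_node)


if __name__ == "__main__":
    for name, nodes, per_node, failures in CONFIGS:
        rate = failures_per_100_drives(nodes, per_node, failures)
        print(f"{name:40s} {nodes * per_node:4d} drives "
              f"{failures:4d} failures {rate:6.1f} per 100 drives")
```

Because the groups have been in service for different lengths of time (2 to 18 months), the rates are not directly comparable without also accounting for age.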





