Unexplained I/O errors

Matthijs van Leeuwen m.vanleeuwen at compusys.co.uk
Tue Jul 17 08:41:31 PDT 2001


We had similar problems with UDMA and the 2.4.X kernels
on this particular type of MB. You could try setting
/sbin/hdparm -c1 -d1 -X32 /dev/hda
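For reference, -c1 turns on 32-bit I/O support, -d1 turns on DMA, and -X sets the drive transfer mode by number. As far as I read the hdparm man page, the number is 8 plus the PIO mode, 32 plus the multiword DMA mode, or 64 plus the UltraDMA mode (16 plus the mode for single-word DMA is my assumption from the ATA numbering). So -X32 should select multiword DMA mode 0, not a UDMA mode. A small sketch to decode a value; the function name xfer_mode is made up for illustration and is not part of hdparm:

```shell
# Decode a numeric hdparm -X argument into a transfer mode name.
# Hypothetical helper for illustration only; it does not touch any drive.
xfer_mode() {
    v=$1
    if [ "$v" -ge 64 ]; then
        echo "udma$((v - 64))"      # UltraDMA modes: 64 + n
    elif [ "$v" -ge 32 ]; then
        echo "mdma$((v - 32))"      # multiword DMA modes: 32 + n
    elif [ "$v" -ge 16 ]; then
        echo "sdma$((v - 16))"      # single-word DMA modes: 16 + n (assumed)
    elif [ "$v" -ge 8 ]; then
        echo "pio$((v - 8))"        # PIO modes: 8 + n
    else
        echo "default"              # 0 and 1 select the drive's default PIO handling
    fi
}

xfer_mode 32
xfer_mode 66
```

Running /sbin/hdparm -c -d /dev/hda with no values afterwards should report the current settings, which is an easy way to confirm the change took effect.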

You can also try running a kernel without any UDMA
support compiled in.

A third option is to compile the kernel so that UDMA is not
enabled by default at boot.

A fourth possibility is that one of your hardware components is
really broken ;-)

Good luck,
Matthijs
____________________________
Dr ir Matthijs van Leeuwen
HPC Specialist
Compusys Plc, 58 Edison Road
Rabans Lane Industrial Estate
Aylesbury, Bucks HP19 8UT, UK
Tel: +44 (0)1296 505143
Fax: +44 (0)1296 424165
Email: m.vanleeuwen at compusys.co.uk
Web: http://www.compusys.co.uk


-----Original Message-----
From: Steven Timm [mailto:timm at fnal.gov]
Sent: Tuesday, July 17, 2001 4:19 PM
To: beowulf at beowulf.org
Subject: Unexplained I/O errors



Hi everyone,

We are currently burning in a new cluster and seeing the following
problem:

We see a number of files, usually contiguous in the same directory,
that ls will list as being there, but ls -l will show Input/output error.
An fsck of the system gets rid of the I/O errors but also gets
rid of the file.  There is no error message on the console, nor
in /var/log/messages, to indicate any disk controller problems.
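A plain ls only reads the directory entries, while ls -l also has to stat each file, which is presumably where the error shows up. A rough sketch for surveying a directory for such entries is below; the function name list_unstatable is made up here, and the test cannot tell an I/O error apart from other reasons a stat might fail:

```shell
# Print directory entries that are listed by name but cannot be stat'ed.
# Hypothetical helper; names containing newlines are not handled.
list_unstatable() {
    dir=$1
    # ls -A reads names only (no -l), so it should not need to stat each file.
    ls -A "$dir" | while IFS= read -r name; do
        # -e follows symlinks and -L checks the link itself; if both fail,
        # the entry could not be stat'ed at all.
        if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}
```

Something like list_unstatable /usr/lib on each node, before running fsck, would give a list of affected files to compare across machines.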

The problem appears to get worse over time: over a period of a few
days, the majority of our 136 machines exhibit these errors.

Our configuration: Supermicro 370DLE motherboard, 2x1000 MHz Pentium III,
512 MB RAM, Seagate system disk (30 GB) and CD-ROM on IDE primary,
2x40 GB IBM drives on IDE secondary.
hda: ST330620A, ATA DISK drive
hdb: CD-ROM 48X/AKH, ATAPI CDROM drive
hdc: IC35L040AVER07-0, ATA DISK drive
hdd: IC35L040AVER07-0, ATA DISK drive

I/O errors happen only on the system disk.

We swapped out a large number of IDE cables for the system disk,
replacing them with a better grade, with no luck.

We have downgraded a few machines to the 2.2.16 kernel, and this
appears to be OK, but it is a bit early to tell.

We have also pulled the CD-ROM drives off a few machines, and these
also appear to be stable, but we need more data.

Any idea what could be causing all of this?

Steve



------------------------------------------------------------------
Steven C. Timm (630) 840-8525  timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations







