Steven Timm timm at fnal.gov
Tue Jul 17 08:19:01 PDT 2001

Hi everyone,

We are currently burning in a new cluster and seeing the following problem:

We see a number of files, usually contiguous in the same directory,
that ls lists as present, but for which ls -l reports "Input/output
error".  Running fsck on the filesystem gets rid of the I/O errors but
also removes the affected files.  There is no error message on the
console, nor in /var/log/messages, to indicate any disk controller
problems.
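
For anyone who wants to check their own nodes for the same symptom, here
is a minimal Python sketch, not the tool we actually used. It walks a
directory tree and reports every entry that the directory listing returns
but that lstat() fails on, which is the "ls works, ls -l fails" case
described above. The function name find_unstatable is made up for this
illustration.

```python
import errno
import os
import stat


def find_unstatable(root):
    """Return a list of (path, errno) for entries that cannot be examined.

    An entry is reported when the directory listing names it but lstat()
    on it raises OSError (for example errno.EIO), or when a directory
    itself cannot be listed.
    """
    bad = []
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            names = os.listdir(directory)
        except OSError as exc:
            # The directory itself could not be read.
            bad.append((directory, exc.errno))
            continue
        for name in names:
            path = os.path.join(directory, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError as exc:
                # Listed by the directory, but its inode cannot be read.
                bad.append((path, exc.errno))
                continue
            if stat.S_ISDIR(info.st_mode):
                stack.append(path)
    return bad
```

Calling find_unstatable("/usr") and filtering the result for
errno.EIO would list the files showing the Input/output error; an empty
list means every listed entry could be examined.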

The problem appears to get worse over time: over a period of a few
days, the majority of our 136 machines exhibit these errors.

Our configuration:  Supermicro 370DLE motherboard, 2x 1000 MHz Pentium III,
512 MB RAM, Seagate system disk (30 GB) and CD-ROM on IDE primary,
2x 40 GB IBM drives on IDE secondary.
hda: ST330620A, ATA DISK drive
hdb: CD-ROM 48X/AKH, ATAPI CDROM drive
hdc: IC35L040AVER07-0, ATA DISK drive
hdd: IC35L040AVER07-0, ATA DISK drive

I/O errors happen only on the system disk.

We swapped out a large number of IDE cables for the system disk,
replacing them with a better grade, with no luck.

We have downgraded a few machines to the 2.2.16 kernel, and these
appear to be OK so far, but it is a bit early to tell.

We have also pulled the CD-ROM drives from a few machines, and these
also appear to be stable, but we need more data.

Any idea what could be causing all of this?


