Unexplained I/O errors
Steven Timm
timm at fnal.gov
Tue Jul 17 08:19:01 PDT 2001
Hi everyone,
We are currently burning in a new cluster and seeing the following
problem:
We see a number of files, usually contiguous in the same directory,
that ls lists as present, but for which ls -l reports "Input/output error".
An fsck of the filesystem gets rid of the I/O errors but also removes
the affected files. There is no error message on the console, nor
in /var/log/messages, to indicate any disk controller problems.
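For anyone wanting to count the affected files on a node before running fsck, here is a rough sketch of a scan. It assumes the symptom is what it looks like from ls versus ls -l: the directory entry can be read but the stat of the file fails. The function name find_unstattable and the script layout are made up for illustration; this is not something we have run across the farm.

```python
import errno
import os
import sys


def find_unstattable(root):
    """Return (path, errno) pairs for entries that are listed but cannot be stat'ed."""
    bad = []

    def on_walk_error(err):
        # A directory that could not be read at all is recorded as well.
        bad.append((err.filename, err.errno))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat is roughly what ls -l needs; it does not follow symlinks.
                os.lstat(path)
            except OSError as err:
                bad.append((path, err.errno))
    return bad


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, code in find_unstattable(root):
        # Mark the Input/output error cases; other errno values are printed as-is.
        label = "EIO" if code == errno.EIO else errno.errorcode.get(code, str(code))
        print(label, path)
```

Run it with a directory as the only argument. On a healthy tree it should print nothing; the walk crosses mount points, so point it at a directory on the system disk.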
The problem appears to get worse over time: over a period of a few
days, the majority of our 136 machines exhibit these errors.
Our configuration: Supermicro 370DLE motherboard, 2x 1000 MHz Pentium III,
512 MB RAM, Seagate system disk (30 GB) and CD-ROM on IDE primary,
2x 40 GB IBM drives on IDE secondary.
hda: ST330620A, ATA DISK drive
hdb: CD-ROM 48X/AKH, ATAPI CDROM drive
hdc: IC35L040AVER07-0, ATA DISK drive
hdd: IC35L040AVER07-0, ATA DISK drive
I/O errors happen only on the system disk.
We swapped out a large number of IDE cables for the system disk,
replacing them with a better grade, with no luck.
We have downgraded a few machines to the 2.2.16 kernel, and this
appears to be OK, but it is a bit early to tell.
We have also pulled the CD-ROM drives off of a few machines, and these
also appear to be stable, but we need more data yet.
Any idea what could be causing all of this?
Steve
------------------------------------------------------------------
Steven C. Timm (630) 840-8525 timm at fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations