Unexplained I/O errors

Donald Becker becker at scyld.com
Tue Jul 17 11:29:17 PDT 2001


On Tue, 17 Jul 2001, Steven Timm wrote:

> We are currently burning in a new cluster and seeing the following
> problem:
> 
> We see a number of files, usually contiguous in the same directory,
> that ls will list as being there, but ls -l will show Input/output error.
> An fsck of the system gets rid of the I/O errors but also gets
> rid of the file.  There is no error message on the console, nor
> in /var/log/messages, to indicate any disk controller problems.

I'm guessing that you are running a 2.4 kernel.
There is a collection of related bugs in the 2.4 kernel IDE and VM
systems.  Note that the VM subsystem in the 'ac' series (ac == Alan Cox)
is substantially different from the one in Linus' kernel, in an attempt
to track this down.

> The problem appears to get worse over time: over a period of a few
> days the majority of our 136 machines exhibit these errors.

One aspect of running clusters is that any kernel problem is
dramatically magnified.  We frequently get questions about switching to
a 2.4 kernel, but they rarely come from people with medium or large
clusters.

> We have downgraded a few machines to the 2.2.16 kernel, and this
> appears to be OK, but it is a bit early to tell.

We are staying with the 2.2 kernel for now.
The 2.4.6 kernel looks pretty good, but it's still too early to tell.

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993




