Donald Becker becker at scyld.com
Tue Jul 17 11:29:17 PDT 2001

On Tue, 17 Jul 2001, Steven Timm wrote:

> We are currently burning in a new cluster and seeing the following
> problem:
> We see a number of files, usually contiguous in the same directory,
> that ls will list as being there, but ls -l will show Input/output error.
> An fsck of the system gets rid of the I/O errors but also gets
> rid of the file.  There is no error message on the console, nor
> in /var/log/messages, to indicate any disk controller problems.
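
The symptom described above (a name that shows up in a directory listing
but cannot be stat'ed) can be checked for across a whole tree before
running fsck.  What follows is only a rough sketch, not a tool anyone in
this thread is using: the function name find_unstatable is hypothetical,
and it assumes a Python interpreter is available on the node.  It walks
a directory and reports every entry that the listing returns but lstat()
fails on, together with the errno (EIO is the "Input/output error" case).

```python
import os


def find_unstatable(root):
    """Return (path, errno) pairs for entries that are listed in a
    directory but whose lstat() call fails.

    Hypothetical helper for illustration; the name and the return
    format are assumptions, not part of any existing tool.
    """
    bad = []

    def note(err):
        # os.walk reports directories it could not read here.
        bad.append((err.filename, err.errno))

    for dirpath, dirnames, filenames in os.walk(root, onerror=note):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat, not stat, so a dangling symlink is not
                # mistaken for a damaged entry.
                os.lstat(path)
            except OSError as err:
                bad.append((path, err.errno))
    return bad
```

Calling find_unstatable("/some/data/dir") on each node and collecting
the results would give a list of affected files per machine, which
could help show whether the damage is spreading over time.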

I'm guessing that you are running a 2.4 kernel.
There is a collection of related bugs in the 2.4 kernel IDE and VM
systems.  Note that the 'ac' series (ac==Alan Cox) VM subsystem is
substantially different from Linus' kernel in an attempt to track this
down.

> The problem appears to get worse over time, over a period of a few
> days the majority of our 136 machines exhibit these errors.

One aspect of running clusters is that any kernel problem is
dramatically magnified.  We frequently get questions about switching to
a 2.4 kernel, but they rarely come from people running medium or large
clusters.

> We have downgraded a few machines to the 2.2.16 kernel, and this
> appears to be OK, but it is a bit early to tell.

We are staying with the 2.2 kernel for now.
The 2.4.6 kernel looks pretty good, but it's still too early to tell.

