[tulip-bug] driver failure under high NFS load

Donald Becker becker@scyld.com
Sun May 12 12:25:01 2002

On Sat, 11 May 2002, Greg Wooledge wrote:

> I'm running Linux 2.2.20 on a K6-2 333 MHz (320 MB RAM) with tulip.c:v0.93
> (as a module).  My NIC is reported by lspci -v as:
> 00:08.0 Ethernet controller: Linksys Network Everywhere Fast Ethernet 10/100 model NC100 (rev 11)
>         Subsystem: Linksys: Unknown device 0574
>         Flags: bus master, medium devsel, latency 64, IRQ 9
> I'm loading the module with parameters "debug=1 options=13".

Why are you forcing the speed/duplex?  That's normally not needed,
especially forcing half duplex.

> This machine is both an NFS server (kernel NFS) and NFS client, but it
> does a lot more client operations than server.  Sometimes, when I'm
> doing a lot of NFS reads and writes (e.g., ripping CDs and encoding
> the resulting files to Vorbis on an NFS mounted file system), the NIC
> will stop working altogether.

What does 'tulip-diag' report when the interface is in this state?

> I can work around this by bringing the
> interface down, removing the module, re-modprobe'ing, and then bringing
> the interface up -- *EXCEPT* that the NFS file system which triggered
> the problem (/music) is now completely inaccessible.

This is an NFS bug.

> May 10 21:27:10 jekyll kernel: eth0: Too much work during an interrupt, csr5=0xfc69c0d0.
> May 10 21:27:10 jekyll kernel: eth0: Restarted Rx at 705859 / 705859.

This is "normal", but the driver detected a work overload and shut down
briefly to reduce the system workload.  This is usually caused by some
other device driver consuming too much CPU in an interrupt handler or
running with interrupts blocked.  It can also be caused by the kernel
running out of usable memory and spending a long time trying to locate
free pages.

In the latter case you can tune the kernel's free memory reserve to
avoid the problem.
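
As a rough illustration of that tuning: on 2.2-series kernels the reserve
is understood to be exposed as three page counts (min, low, high) in
/proc/sys/vm/freepages.  The Python sketch below assumes that layout; the
helper names and the doubling factor are hypothetical and only show the
shape of the change, not recommended values.

```python
# Sketch only: assumes a 2.2-style /proc/sys/vm/freepages holding three
# integers (min, low, high). Newer kernels do not provide this file.
from pathlib import Path

FREEPAGES = Path("/proc/sys/vm/freepages")  # assumed location


def parse_freepages(text):
    """Return (min, low, high) page counts parsed from the file contents."""
    values = [int(field) for field in text.split()]
    if len(values) != 3:
        raise ValueError("expected three integers, got %r" % text)
    return tuple(values)


def scaled(values, factor):
    """Multiply all three thresholds by the same factor, keeping their order."""
    return tuple(v * factor for v in values)


def format_freepages(values):
    """Render the thresholds in the space-separated form the file uses."""
    return " ".join(str(v) for v in values)


if __name__ == "__main__":
    if FREEPAGES.exists():
        current = parse_freepages(FREEPAGES.read_text())
        print("current reserve:", current)
        # Hypothetical factor of 2; writing the value back requires root.
        print("doubled reserve:", format_freepages(scaled(current, 2)))
    else:
        print("no", FREEPAGES, "on this kernel")
```

The script only prints a candidate setting; applying it means writing that
line back to the same file as root.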

Presumably everything continued normally at this point.
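
The "Too much work" message comes from a work limit in the interrupt
handler.  The idea can be sketched in Python as below.  This is a
simplified model, not the driver's code: the function name, the event
list, and the budget of 25 are illustrative assumptions.

```python
# Simplified model of a bounded interrupt handler. Not taken from tulip.c.
MAX_INTERRUPT_WORK = 25  # illustrative budget; the driver's limit may differ


def handle_interrupt(pending_events, budget=MAX_INTERRUPT_WORK):
    """Service events until none remain or the budget is spent.

    Returns (handled, overloaded). overloaded is True when events were
    still pending after the budget ran out, which is the point where a
    driver would log a message and back off briefly.
    """
    handled = 0
    while pending_events:
        if handled >= budget:
            return handled, True
        pending_events.pop(0)  # stand-in for servicing one Rx/Tx event
        handled += 1
    return handled, False
```

With 10 pending events the handler drains them all; with 100 it stops
after 25 and reports the overload, leaving the rest for a later pass.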

> May 11 16:47:24 jekyll kernel: tulip.c:v0.93 11/7/2001  Written by Donald Becker <becker@scyld.com>
> May 11 16:47:24 jekyll kernel: http://www.scyld.com/network/tulip.html
> May 11 16:47:24 jekyll kernel: eth0: ADMtek Centaur-P rev 17 at 0xd48c9000, 00:20:78:1E:E9:BF, IRQ 9.
> May 11 16:47:24 jekyll kernel: eth0: Transceiver selection forced to MII 100baseTx.

I'm guessing that this is what you intended...

> May 11 16:47:27 jekyll kernel: nfs: server dwarf OK
> May 11 16:47:28 jekyll kernel: nfs: task 1367720 can't get a request slot
> May 11 16:47:28 jekyll kernel: nfs: task 1367721 can't get a request slot
> May 11 16:47:28 jekyll kernel: nfs: task 1367722 can't get a request slot

This is an NFS bug, likely triggered by a shortage of available free
memory.  (Remember that having a bunch of RAM doesn't mean that the
kernel has free pages ready for use.)

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993