Network problems with VIA chipset and Athlon XP

Wed Oct 23 02:39:26 PDT 2002

Hi all

  We have a set of systems that become unreachable over the network, printing 
the following messages in dmesg output, repeated at approx. 5-second intervals:

NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34067 vs. 34084.
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34089 vs. 34106.
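
For anyone who wants to tally these messages across several nodes, here is a 
minimal sketch in Python. It is not part of the original report: it assumes 
the exact message wording quoted above, and the helper name 
summarize_tulip_log is made up for illustration.

```python
import re
from collections import Counter

# Patterns assume the exact wording of the driver messages quoted above.
TIMEOUT_RE = re.compile(
    r"^(?P<dev>\w+): Transmit timed out, status (?P<status>[0-9a-fA-F]{8}), "
    r"CSR12 (?P<csr12>[0-9a-fA-F]{8}), resetting"
)
DIRTY_RE = re.compile(
    r"^(?P<dev>\w+): Out-of-sync dirty pointer, (?P<dirty>\d+) vs\. (?P<cur>\d+)\."
)


def summarize_tulip_log(text):
    """Hypothetical helper: count timeouts per status value and
    collect the gap between the two numbers in each dirty-pointer message."""
    statuses = Counter()
    gaps = []
    for line in text.splitlines():
        line = line.strip()
        m = TIMEOUT_RE.match(line)
        if m:
            statuses[m.group("status").lower()] += 1
            continue
        m = DIRTY_RE.match(line)
        if m:
            gaps.append(int(m.group("cur")) - int(m.group("dirty")))
    return statuses, gaps
```

Fed the excerpt above, this reports two timeouts with status fc664010, three 
with status fc684010, and a gap of 17 in both dirty-pointer messages.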

 The system setup is as follows: A beowulf cluster composed of 16 nodes and 
one master machine connected to a 3Com (3C17203) 24 Port 100Mbit ethernet 
switch. The nodes are all identical and use an Asus A7V266-EX motherboard 
(VIA KT266), Athlon XP 1800+ CPU, 1.5GB of PC2100 DDR RAM, a 40GB Seagate IDE 
disk, Accton EN-1216 10/100 NIC (Tulip) and ATI Rage XL AGP graphics card. 
Each machine runs the Debian Linux Testing distribution with a custom-compiled 
vanilla 2.4.18 kernel with HighMem support and Athlon optimizations. Each 
node mounts its /home directory from the master machine via NFSv3.

 The problem happens when the nodes are executing a parallel computation job 
that involves high CPU usage and periodic but heavy TCP/IP network traffic 
between the nodes and/or the master machine. The test computation job is the 
XHPL benchmark available at http://www.netlib.org/benchmark/hpl/ but we've 
been able to reproduce the problem with other codes using the MPI libraries. 
Strangely, network-intensive tasks on their own, such as large file transfers, 
do not trigger the errors. They only seem to show up with tasks that are both 
network and CPU intensive.

 Each time, the errors happen on one of the 16 machines at random, normally 
5 to 15 minutes after the job was started. After that, the affected machine 
becomes totally unresponsive over the network and prints the above errors to 
the console endlessly. Logging in as root on a VT is possible. However, 
bringing the interface down and configuring it again does not help. 
Attempting to reboot results in a hang just after the init.d shutdown scripts 
have run.

 Interestingly the ifconfig and mii-tool commands return nothing abnormal in 
their output after the errors occur:

root@bnode06# mii-tool -v eth0
eth0: negotiated 100baseTx-FD, link ok
  product info: vendor 00:08:95, model 1 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD

--- 
 I have to stress that the 100Mbit Full Duplex mode is not forced by any 
initialization script, but is the result of autonegotiation, as you can see 
from the kernel messages attached at the end of this e-mail.

 I googled for a while and found some mailing list messages from people 
reporting errors that resembled this one, and some replies by Donald Becker. 
But these people had apparently tried to set the link modes manually, to a 
setting not supported by their switches/hubs, or something like that...

 We have tried to troubleshoot the situation with a number of actions, all of 
them unsuccessful:

 * Forcing the NICs to use 100baseTx-HD setting.

 * Replacing all the Tulip NICs with 3Com 3C905B cards. This gave a very 
similar set of errors (although the error codes were different) and only made 
the situation worse: when the machine was rebooted, the ext3 disk partitions 
became unmountable (kernel panic mounting root fs) and fsck would not repair 
them.

 * Switching to a 2.4.18 kernel compiled for i686 without HIGHMEM support.

 * Switching to a 2.4.20pre8 kernel with HIGHMEM and Athlon support.

 * Switching to a 2.2.22 kernel without HIGHMEM or Athlon support.

 * Switching to a 2.5.42 kernel with HIGHMEM and Athlon support.

 * Exchanging the 3Com switch for an older Cabletron fast ethernet switch.

 So this does not appear to be a problem with a specific NIC or driver. It is 
also much more difficult (if possible at all) to trigger the problem with only 
a few nodes. Normally it only shows up easily with about 6 or more nodes doing 
computation and communicating with each other.

 We have a smaller cluster with Intel 440BX chipsets and PII 300MHz CPUs, 
running the 2.4.18 kernel and exactly the same programs and tools, and it 
works flawlessly no matter how hard we pound it. So we are beginning to 
suspect some kind of bad interaction between Linux and those VIA chipsets, or 
maybe even a chipset bug.

  I've also changed TULIP_DEBUG in linux/drivers/net/tulip/tulip.h to 6 and, 
interestingly, with this value we cannot trigger the problem, or at least not 
as easily. Maybe the extra printks help because they generate extra 
interrupts? Could this be related to interrupt service routine problems?
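
One way to look into the interrupt question would be to compare the eth0 
interrupt rate on a healthy node and on a hung one. The following is only a 
sketch in Python, not something from the original setup: it assumes the usual 
/proc/interrupts layout (IRQ number, one count per CPU, controller, device 
names), and the function names irq_count and irq_rate are made up for 
illustration.

```python
import time


def irq_count(text, device):
    """Sum the per-CPU interrupt counts of every line that names `device`.

    `text` is the content of /proc/interrupts (layout assumed, see above).
    """
    total = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip the CPU header and anything that does not start with "N:".
        if not fields or not fields[0].endswith(":"):
            continue
        i = 1
        counts = []
        # The numeric columns right after the IRQ label are per-CPU counts.
        while i < len(fields) and fields[i].isdigit():
            counts.append(int(fields[i]))
            i += 1
        # Remaining tokens: controller type and comma-separated device names.
        names = [f.rstrip(",") for f in fields[i:]]
        if device in names:
            total += sum(counts)
    return total


def irq_rate(device="eth0", interval=1.0, path="/proc/interrupts"):
    """Return interrupts per second for `device`, sampled over `interval`."""
    with open(path) as f:
        before = irq_count(f.read(), device)
    time.sleep(interval)
    with open(path) as f:
        after = irq_count(f.read(), device)
    return (after - before) / interval
```

Calling irq_rate("eth0", 5.0) on a node before and after the errors start 
would show whether the NIC is still raising interrupts at all once the 
transmit timeouts begin.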

 So, to sum up in a few words: I'm open to any questions or suggestions about 
the problem. I'm ready to run any debugging sessions you might want me to do, 
to apply any patches you might want me to test, and to dig into the kernel 
source in order to try to solve this problem (I really have to get this 
cluster working ;-).

  I posted this to the Linux Kernel Mailing List a while ago but got only one 
reply, from a person who had had similar problems in the past, but with AMD K6 
hardware.

  I have put the lspci and dmesg output and the kernel .config file used (they 
seemed too big to get through the list) at the following URL:

          http://mega.ist.utl.pt/~ctpm/messages.txt

Thanks in advance for your help. Best regards

Claudio
LASEF - Lisbon