Network problems with VIA chipset and Athlon XP
Claudio Martins
ctpm at mega.ist.utl.pt
Wed Oct 23 02:39:26 PDT 2002
Hi all
We have a set of systems that become unreachable over the network, presenting
the following messages in dmesg output, repeated at approx. 5-second intervals:
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34067 vs. 34084.
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34089 vs. 34106.
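For anyone comparing logs from several nodes, the excerpt above can be tallied mechanically. The following is only a rough sketch in Python (not something used on the cluster): the names `summarize` and `differing_bits` are made up for illustration, and the regular expressions assume the exact message wording shown above. It counts each distinct status word, computes the gap between the two "dirty pointer" values, and lists which bits differ between two status words. What those bits mean depends on the NIC chip's datasheet and is not interpreted here.

```python
import re
from collections import Counter

# Log excerpt copied from the dmesg output quoted above.
LOG = """\
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34067 vs. 34084.
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34089 vs. 34106.
"""

# Patterns assume the message wording seen in the excerpt above.
TIMEOUT_RE = re.compile(
    r"^\w+: Transmit timed out, status ([0-9a-f]{8}), CSR12 [0-9a-f]{8}"
)
DIRTY_RE = re.compile(r"^\w+: Out-of-sync dirty pointer, (\d+) vs\. (\d+)\.")


def summarize(log_text):
    """Return (Counter of status words, list of dirty-pointer gaps)."""
    statuses = Counter()
    gaps = []
    for line in log_text.splitlines():
        line = line.strip()
        match = TIMEOUT_RE.match(line)
        if match:
            statuses[match.group(1)] += 1
            continue
        match = DIRTY_RE.match(line)
        if match:
            gaps.append(int(match.group(2)) - int(match.group(1)))
    return statuses, gaps


def differing_bits(hex_a, hex_b):
    """List the bit positions (0 = least significant) where two words differ."""
    diff = int(hex_a, 16) ^ int(hex_b, 16)
    return [bit for bit in range(32) if (diff >> bit) & 1]


if __name__ == "__main__":
    statuses, gaps = summarize(LOG)
    print("status counts:", dict(statuses))
    print("dirty pointer gaps:", gaps)
    print("differing bits:", differing_bits("fc664010", "fc684010"))
```

Run on the excerpt, this shows that only two status words occur, that they differ only in bits 17 to 19, and that the dirty pointer gap is 17 both times.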
The system setup is as follows: a Beowulf cluster composed of 16 nodes and
one master machine connected to a 3Com (3C17203) 24-port 100 Mbit Ethernet
switch. The nodes are all identical and use an Asus A7V266-EX motherboard
(VIA KT266), Athlon XP 1800+ CPU, 1.5GB of PC2100 DDR RAM, a 40GB Seagate IDE
disk, Accton EN-1216 10/100 NIC (Tulip) and ATI Rage XL AGP graphics card.
Each machine runs the Debian Linux Testing distribution with a custom-compiled
vanilla 2.4.18 kernel with HighMem support and Athlon optimizations. Each
node mounts its /home directory from the master machine via NFSv3.
The problem happens when the nodes are executing a parallel computation job
that involves high CPU usage and periodic but heavy TCP/IP network traffic
between the nodes and/or the master machine. The test computation job is the
XHPL benchmark available at http://www.netlib.org/benchmark/hpl/ but we've
been able to reproduce the problem with other codes using the MPI libraries.
Strangely, network-intensive tasks alone, like large file transfers, do not
trigger the errors. They only seem to show up with tasks that are both network
and CPU intensive.
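To separate the MPI libraries from the load pattern itself, one could try to imitate "compute, then burst of traffic" with plain sockets. The following is only a hypothetical sketch in Python, not a tested reproducer for this problem: all function names, the chunk size and the iteration counts are assumptions chosen for illustration. The demo at the end uses a local socket pair, so it exercises no NIC at all; on a real cluster the sender would instead use a TCP connection to a `sink` running on another node.

```python
import socket
import threading
import time


def sink(conn):
    """Read and discard data until the peer closes; return the byte count."""
    total = 0
    while True:
        data = conn.recv(65536)
        if not data:
            return total
        total += len(data)


def burn_cpu(iterations):
    """Keep the CPU busy with floating-point work, standing in for a compute phase."""
    acc = 0.0
    for i in range(iterations):
        acc += (i % 7) * 1.000001
    return acc


def compute_then_send(conn, seconds, chunk_bytes=1 << 20, burn_iters=200_000):
    """Alternate a compute phase with a burst of traffic until time runs out."""
    payload = b"x" * chunk_bytes
    sent = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        burn_cpu(burn_iters)
        conn.sendall(payload)
        sent += len(payload)
    return sent


def run_local_demo(seconds=0.5):
    """Local-only demo over a socket pair; returns (bytes sent, bytes received)."""
    a, b = socket.socketpair()
    result = {}
    reader = threading.Thread(target=lambda: result.update(received=sink(b)))
    reader.start()
    sent = compute_then_send(a, seconds, chunk_bytes=64 * 1024, burn_iters=20_000)
    a.close()  # signals end-of-stream to the reader
    reader.join()
    b.close()
    return sent, result["received"]


if __name__ == "__main__":
    sent, received = run_local_demo()
    print("sent", sent, "bytes, received", received, "bytes")
```

Whether such a synthetic load would trigger the timeouts is unknown; it would only show whether the pattern, rather than MPI, is what matters.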
Each time, the errors happen on one of the 16 machines at random, normally
5 to 15 minutes after the job was started. After that, the affected machine
becomes totally unresponsive on the network and starts printing the above
errors to the console endlessly. Logging in as root on a VT is possible.
However, unconfiguring the interface and configuring it again does not help.
Attempting to reboot results in a hang just after running the init.d shutdown
scripts.
Interestingly, the ifconfig and mii-tool commands return nothing abnormal in
their output after the errors occur:
root@bnode06# mii-tool -v eth0
eth0: negotiated 100baseTx-FD, link ok
product info: vendor 00:08:95, model 1 rev 0
basic mode: autonegotiation enabled
basic status: autonegotiation complete, link ok
capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
advertising: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
---
I have to stress that the 100 Mbit full-duplex mode is not forced by any
initialization script, but is the result of autonegotiation, as you can see
from the kernel messages attached at the end of this e-mail.
I googled for a while and found some mailing list messages from people
reporting errors that resembled this one, and some replies by Donald Becker.
But those people had apparently tried to set the link modes manually to a
setting not supported by their switches/hubs, or something like that...
We have tried to troubleshoot the situation with a number of actions, all of
them unsuccessful:
* Forcing the NICs to use the 100baseTx-HD setting.
* Replacing all the Tulip NICs with 3Com 3C905B cards. This gave a very similar
set of errors (although the error codes were different) and only made the
situation worse: when the machine was rebooted, the ext3 disk partitions
became unmountable (kernel panic mounting root fs) and fsck would not repair
them.
* Switching to a 2.4.18 kernel compiled for i686 without HIGHMEM support.
* Switching to a 2.4.20pre8 kernel with HIGHMEM and Athlon support.
* Switching to a 2.2.22 kernel without HIGHMEM or Athlon support.
* Switching to a 2.5.42 kernel with HIGHMEM and Athlon support.
* Exchanging the 3Com switch for an older Cabletron Fast Ethernet switch.
So this does not appear to be a problem with a specific NIC or driver. It is
also much more difficult (if even possible) to trigger the problem with few
nodes. Normally it only shows up easily with about 6 or more nodes doing
computation and communicating with each other.
Since we have a smaller cluster with Intel 440BX chipsets and PII 300MHz CPUs,
running exactly the same programs and tools on the 2.4.18 kernel and working
flawlessly no matter how hard we pound it, we are beginning to suspect some
kind of bad interaction between Linux and those VIA chipsets, or maybe even a
chipset bug.
I've also changed TULIP_DEBUG in linux/drivers/net/tulip/tulip.h to 6 and,
interestingly, with this value we cannot trigger the problem, or at least not
so easily. Maybe the extra printks help because they generate extra interrupts?
Could this be related to interrupt service routine problems?
To sum up, I'm open to any questions or suggestions about the problem. I'm
ready to run any debugging sessions you might want me to do, to apply any
patches you might want me to test, and to dig into the kernel source in order
to try to solve this problem (I really have to get this cluster working ;-).
I posted this to the Linux Kernel Mailing List some time ago but got only
one reply, from a person who had had similar problems in the past, but with
AMD K6 hardware.
I have put the lspci and dmesg output and the kernel .config file used (they
seemed too big to get through the list) at the following URL:
http://mega.ist.utl.pt/~ctpm/messages.txt
Thanks in advance for your help. Best regards
Claudio
LASEF - Lisbon