Network problems with VIA chipset and Athlon XP
Claudio Martins ctpm at mega.ist.utl.pt
Wed Oct 23 02:39:26 PDT 2002
Hi all,

We have a set of systems that become unreachable over the network, with the following messages repeated in the dmesg output at approximately 5-second intervals:

  NETDEV WATCHDOG: eth0: transmit timed out
  eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
  NETDEV WATCHDOG: eth0: transmit timed out
  eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
  eth0: Out-of-sync dirty pointer, 34067 vs. 34084.
  NETDEV WATCHDOG: eth0: transmit timed out
  eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
  NETDEV WATCHDOG: eth0: transmit timed out
  eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
  NETDEV WATCHDOG: eth0: transmit timed out
  eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
  eth0: Out-of-sync dirty pointer, 34089 vs. 34106.

The system setup is as follows: a Beowulf cluster composed of 16 nodes and one master machine, connected to a 3Com (3C17203) 24-port 100Mbit Ethernet switch. The nodes are all identical and use an Asus A7V266-EX motherboard (VIA KT266), an Athlon XP 1800+ CPU, 1.5GB of PC2100 DDR RAM, a 40GB Seagate IDE disk, an Accton EN-1216 10/100 NIC (Tulip) and an ATI Rage XL AGP graphics card. Each machine runs the Debian Linux Testing distribution with a custom-compiled vanilla 2.4.18 kernel with HighMem support and Athlon optimizations. Each node mounts its /home directory from the master machine via NFSv3.

The problem happens when the nodes are executing a parallel computation job that involves high CPU usage and periodic but heavy TCP/IP network traffic between the nodes and/or the master machine. The test computation job is the XHPL benchmark available at http://www.netlib.org/benchmark/hpl/ but we've been able to reproduce the problem with other codes using the MPI libraries. Strangely, network-intensive tasks alone, like big file transfers, do not trigger the errors. They only seem to show up with tasks that are both network and CPU intensive.

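To compare the watchdog messages across nodes, the status words and the dirty-pointer gaps can be tallied with a short script. The following is only a sketch: the regular expressions assume the exact message wording quoted above, and the names `summarize`, `TIMEOUT_RE`, `DIRTY_RE` and `SAMPLE` are hypothetical, not part of any driver or tool.

```python
import re
from collections import Counter

# Patterns assume the message formats quoted in the post (an assumption;
# other driver versions may word these messages differently).
TIMEOUT_RE = re.compile(
    r"(?P<dev>\w+): Transmit timed out, status (?P<status>[0-9a-fA-F]{8}), "
    r"CSR12 (?P<csr12>[0-9a-fA-F]{8})"
)
DIRTY_RE = re.compile(
    r"(?P<dev>\w+): Out-of-sync dirty pointer, (?P<dirty>\d+) vs\. (?P<cur>\d+)"
)


def summarize(log_text):
    """Return (Counter of status words, list of dirty-pointer gaps)."""
    statuses = Counter()
    gaps = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            statuses[m.group("status")] += 1
            continue
        m = DIRTY_RE.search(line)
        if m:
            # Gap between the two ring indices reported by the driver.
            gaps.append(int(m.group("cur")) - int(m.group("dirty")))
    return statuses, gaps


# A few lines copied from the dmesg output quoted in the post.
SAMPLE = """\
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc664010, CSR12 00000000, resetting...
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status fc684010, CSR12 00000000, resetting...
eth0: Out-of-sync dirty pointer, 34067 vs. 34084.
"""

if __name__ == "__main__":
    statuses, gaps = summarize(SAMPLE)
    for status, count in sorted(statuses.items()):
        print(f"status {status}: {count} timeout(s)")
    print("dirty-pointer gaps:", gaps)
```

Passing the saved dmesg text of each node to `summarize` would show whether the same status words and gaps recur on every affected machine.
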
Each time, the errors happen on one of the 16 machines at random, normally 5 to 15 minutes after the job was started. After that, the affected machine becomes totally unresponsive to the network and prints the above errors to the console endlessly. Logging in as root on a VT is possible. However, unconfiguring the interface and configuring it again does not help. Attempting to reboot results in a hang just after running the init.d shutdown scripts.

Interestingly, the ifconfig and mii-tool commands return nothing abnormal in their output after the errors occur:

  root@bnode06# mii-tool -v eth0
  eth0: negotiated 100baseTx-FD, link ok
    product info: vendor 00:08:95, model 1 rev 0
    basic mode:   autonegotiation enabled
    basic status: autonegotiation complete, link ok
    capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
    advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
    link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  ---

I have to stress that the 100Mbit full-duplex mode is not forced by any initialization script, but is the result of autonegotiation, as you can see from the kernel messages attached at the end of this e-mail.

I googled for a while and found some mailing list messages from people reporting errors that resembled this one, and some replies by Donald Becker. But these people had apparently tried to set the link modes manually, to a setting not supported by their switches/hubs, or something like that...

We have tried to troubleshoot the situation with a number of actions, all of them unsuccessful:

* Forcing the NICs to use the 100baseTx-HD setting.
* Replacing all the tulip NICs with 3Com 3C905B cards. This gave a very similar set of errors (although the error codes were different) and only made the situation worse: when the machine was rebooted, the ext3 disk partitions became unmountable (kernel panic mounting root fs) and fsck would not repair them.
* Switching to a 2.4.18 kernel compiled for i686 without HIGHMEM support.
* Switching to a 2.4.20pre8 kernel with HIGHMEM and Athlon support.
* Switching to a 2.2.22 kernel without HIGHMEM or Athlon support.
* Switching to a 2.5.42 kernel with HIGHMEM and Athlon support.
* Exchanging the 3Com switch for an older Cabletron Fast Ethernet switch.

So this does not appear to be a problem with a specific NIC or driver. It is also much more difficult (if even possible) to trigger the problem with few nodes. Normally it only shows up easily with about 6 or more nodes doing computation and communicating with each other.

We have a smaller cluster with Intel 440BX chipsets and PII 300MHz CPUs, running the 2.4.18 kernel and exactly the same programs and tools, and it works flawlessly no matter how hard we pound it. We therefore begin to suspect some kind of bad interaction between Linux and those VIA chipsets, or maybe even a chipset bug.

I've also changed TULIP_DEBUG in linux/drivers/net/tulip/tulip.h to 6 and, interestingly, with this value we cannot trigger the problem, or at least not as easily. Maybe the extra printks help because they generate extra interrupts? Could this be related to interrupt service routine problems?

To sum up, I'm open to any questions or suggestions about the problem. I'm ready to run any debugging sessions you might want me to do, to apply any patches you might want me to test, and to dig into the kernel source, in order to try to solve this problem (I really have to get this cluster working ;-).

I posted this to the Linux Kernel Mailing List some time ago but got only one reply, from a person who had had similar problems, but with AMD K6 hardware.

I have put the lspci and dmesg output and the kernel .config file used (it seemed too big to get through the list) at the following URL:

http://mega.ist.utl.pt/~ctpm/messages.txt

Thanks in advance for your help.

Best regards

Claudio
LASEF - Lisbon
More information about the Beowulf mailing list
