[eepro100] wait_for_cmd_done timeout

Wilson, John John.Wilson@savvis.net
Tue Mar 5 14:11:01 2002


eepro100 group:
I am seeing a problem with wait_for_cmd_done that is very similar to the
timeout issue that I found on GeoCrawler.

I hope my findings will help resolve this problem.  Basically, what I have
found is that I run into a similar timeout problem when running Samba.  If
Samba is not enabled, I don't see this error and the ATM and eepro100
interfaces are fine.  Below is some fairly detailed output that I thought
might help track down the problem.

In summary: It appears to me that the network is being flooded with ICMP
traffic (and possibly other traffic) and that the eepro100 may not be
handling the errors/traffic. (I'm new to Linux device drivers, so please
bear with me here). The ICMP error messages arrive only every few minutes,
which by itself is not much traffic, so I assume there is more to this
problem; for example, the two device drivers may be sharing the same tx
buffer and/or memory.  Regardless of whether Samba is running (it is
probably raising the amount of traffic enough to make the eepro100 problem
surface), I would like to fix the eepro100 driver if there is a patch
available for it.

I'm running:
	RH 7.2
	Kernel 2.4.9-13 modified to support the ATM device drivers (eni and FORE (Marconi))
	ATM on Linux support software: linux-atm-2.4.0
	Samba

The system runs for some period, usually less than 24 hours, and then the
interfaces die with this error (from /var/log/messages):

Mar  5 08:36:46 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 08:39:51 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 08:41:46 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 08:46:46 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 08:51:47 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 08:56:46 sla2 last message repeated 2 times
Mar  5 09:01:46 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 09:03:51 sla2 kernel: 10.12.136.1 sent an invalid ICMP error to a broadcast.
Mar  5 09:05:14 sla2 kernel: eepro100: wait_for_cmd_done timeout!
Mar  5 09:05:46 sla2 last message repeated 24 times
Mar  5 09:05:48 sla2 last message repeated 3 times
Mar  5 09:05:49 sla2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar  5 09:05:49 sla2 kernel: eth0: Transmit timed out: status 0050  0c80 at 48699/48728 command 00030000.
Mar  5 09:05:49 sla2 kernel: eepro100: wait_for_cmd_done timeout!
Mar  5 09:06:21 sla2 last message repeated 22 times
Mar  5 09:06:22 sla2 kernel: eni(itf 0): TX DMA full
Mar  5 09:06:23 sla2 last message repeated 7 times
Mar  5 09:06:23 sla2 kernel: eepro100: wait_for_cmd_done timeout!
Mar  5 09:06:24 sla2 kernel: eni(itf 0): TX DMA full


At this point both the eth0 interface and the atm0 interface stop working.
Note that the eepro100 times out first, and then the eni driver also dies
with a TX DMA full error.
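
To see how many timeouts the excerpt above actually represents, the syslog
"last message repeated N times" lines have to be folded back into the message
they follow. The following is a minimal sketch of that bookkeeping, not part
of the original report. The function name count_messages and the SAMPLE
constant are hypothetical, and SAMPLE is the tail of the log above with the
mail-wrapped lines rejoined.

```python
import re

# Syslog line: "Mon DD HH:MM:SS host message..."
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")
# syslogd collapses consecutive duplicates into this marker line.
REPEAT_RE = re.compile(r"^last message repeated (\d+) times?$")


def count_messages(lines):
    """Return {message: count}, crediting repeat markers to the previous message."""
    counts = {}
    last = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank or malformed lines
        body = match.group(3)
        repeat = REPEAT_RE.match(body)
        if repeat:
            if last is not None:
                counts[last] += int(repeat.group(1))
            continue
        last = body
        counts[body] = counts.get(body, 0) + 1
    return counts


# Tail of the excerpt above (assumption: wrapped lines rejoined by hand).
SAMPLE = """\
Mar  5 09:05:14 sla2 kernel: eepro100: wait_for_cmd_done timeout!
Mar  5 09:05:46 sla2 last message repeated 24 times
Mar  5 09:05:48 sla2 last message repeated 3 times
Mar  5 09:05:49 sla2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar  5 09:05:49 sla2 kernel: eth0: Transmit timed out: status 0050  0c80 at 48699/48728 command 00030000.
Mar  5 09:05:49 sla2 kernel: eepro100: wait_for_cmd_done timeout!
Mar  5 09:06:21 sla2 last message repeated 22 times
Mar  5 09:06:22 sla2 kernel: eni(itf 0): TX DMA full
Mar  5 09:06:23 sla2 last message repeated 7 times
Mar  5 09:06:23 sla2 kernel: eepro100: wait_for_cmd_done timeout!
Mar  5 09:06:24 sla2 kernel: eni(itf 0): TX DMA full
""".splitlines()

COUNTS = count_messages(SAMPLE)
for message, count in sorted(COUNTS.items(), key=lambda item: -item[1]):
    print(count, message)
```

On this excerpt the sketch attributes 52 occurrences to the
wait_for_cmd_done timeout and 9 to the eni TX DMA full message, all within
roughly 70 seconds.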

ifconfig shows:
[root@sla2 root]# more ifconfig.txt 
atm0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:10.6.160.254  Mask:255.255.255.252
          UP RUNNING  MTU:1500  Metric:1
          RX packets:4860 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4860 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:408240 (398.6 Kb)  TX bytes:447120 (436.6 Kb)

eth0      Link encap:Ethernet  HWaddr 00:50:8B:D3:92:7C  
          inet addr:216.90.89.xx  Bcast:216.90.89.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:504622 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47444 errors:289 dropped:0 overruns:0 carrier:0
          collisions:1416 txqueuelen:100 
          RX bytes:50644863 (48.2 Mb)  TX bytes:10479503 (9.9 Mb)
          Interrupt:10 Base address:0x2000
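
To put the eth0 counters in proportion, the TX error and collision counts can
be expressed relative to the number of transmitted packets. This is only a
rough sketch for reading the output above. The function name tx_stats and the
IFCONFIG_ETH0 constant are hypothetical, and the regular expressions assume
the older net-tools ifconfig layout shown here.

```python
import re

# The eth0 lines that matter, copied from the ifconfig output above.
IFCONFIG_ETH0 = """\
eth0      Link encap:Ethernet  HWaddr 00:50:8B:D3:92:7C
          RX packets:504622 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47444 errors:289 dropped:0 overruns:0 carrier:0
          collisions:1416 txqueuelen:100
"""


def tx_stats(text):
    """Extract TX packets, TX errors and collisions, plus percentages of TX packets."""
    tx = re.search(r"TX packets:(\d+) errors:(\d+)", text)
    col = re.search(r"collisions:(\d+)", text)
    if tx is None or col is None:
        raise ValueError("unexpected ifconfig format")
    packets = int(tx.group(1))
    errors = int(tx.group(2))
    collisions = int(col.group(1))
    return {
        "packets": packets,
        "errors": errors,
        "collisions": collisions,
        "error_pct": 100.0 * errors / packets,
        "collision_pct": 100.0 * collisions / packets,
    }


STATS = tx_stats(IFCONFIG_ETH0)
print("TX errors: %.2f%% of TX packets" % STATS["error_pct"])
print("collisions: %.2f%% of TX packets" % STATS["collision_pct"])
```

For the counters shown, that is about 0.61% TX errors and about 2.98%
collisions relative to transmitted packets, while the RX side reports no
errors at all.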

Note the collisions are on eth0.  I know that the ICMP error above is caused
by our network configuration: Samba broadcasts an NBNS message on
216.90.89.255 and 10.12.136.1 replies with the above error.  Ethereal shows
the error as Type 3 (Destination Unreachable), Code 3 (Port Unreachable).
I have not yet been able to determine how to stop Nortel Shasta from sending
this message.  What I wanted to point out is that the eepro100 is timing out
and is affecting the ATM device driver too.
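
For reference, the type and code that Ethereal reports are the first two
bytes of the ICMP header as defined in RFC 792. The sketch below decodes them.
The function name describe_icmp and the SAMPLE bytes are hypothetical: SAMPLE
is a hand-built header with a zeroed checksum, not a capture from this
network, and the lookup tables list only a few of the RFC 792 values.

```python
import struct

# A few ICMP type names from RFC 792 (not exhaustive).
ICMP_TYPES = {
    0: "Echo Reply",
    3: "Destination Unreachable",
    8: "Echo Request",
    11: "Time Exceeded",
}

# Codes 0-3 for type 3 (Destination Unreachable) from RFC 792.
UNREACH_CODES = {
    0: "Net Unreachable",
    1: "Host Unreachable",
    2: "Protocol Unreachable",
    3: "Port Unreachable",
}


def describe_icmp(packet):
    """Decode the type and code fields at the start of an ICMP header."""
    if len(packet) < 4:
        raise ValueError("ICMP header needs at least 4 bytes")
    # Network byte order: 1-byte type, 1-byte code, 2-byte checksum.
    icmp_type, code, _checksum = struct.unpack("!BBH", packet[:4])
    type_name = ICMP_TYPES.get(icmp_type, "type %d" % icmp_type)
    if icmp_type == 3:
        code_name = UNREACH_CODES.get(code, "code %d" % code)
    else:
        code_name = "code %d" % code
    return "%s / %s" % (type_name, code_name)


# Hypothetical header: type 3, code 3, checksum and unused field left at zero.
SAMPLE = bytes([3, 3, 0, 0, 0, 0, 0, 0])
print(describe_icmp(SAMPLE))
```

For the sample bytes this prints "Destination Unreachable / Port
Unreachable", matching what Ethereal shows for the replies from 10.12.136.1.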

The eepro100 version is:
"eepro100.c:v1.09j-t 9/29/99 Donald Becker
http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html\n"
"eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin
<saw@saw.sw.com.sg> and others\n";

I know there is a lot of info here, but after reading the thread on the
wait_for_cmd_done, I thought this might shed some light on the problem and
that it may not be confined to the newer/experimental kernels.

Any help would be much appreciated.
regards,
jd wilson
Software Engineer
Savvis Communications