[eepro100] wait_for_cmd_done timeout

Wilson, John John.Wilson@savvis.net
Wed Mar 6 15:02:00 2002


Donald,
Thanks for the reply.  Agreed, the driver is reporting the problem, and
apparently the ATM driver is seeing the same problem and reporting it in a
different way.

I've built the eepro100-diag and mii-diag programs.  Thanks for the simple
compile instructions - that helps!  I built the utilities against the
libmii.o lib.

I've restarted the server in question running smbd and nmbd (Samba), as this
is how I can re-create the error.  As I mentioned before, I'm new to this
device driver but am somewhat familiar with programming at this level on
other OSes.

Do you have any recommendations on whether this driver is better off running
as a module or built directly into the kernel?  It is currently built in,
mostly due to the previous owner's configuration.
thanks 
john
__________________________________________
The output from mii-diag and eepro100-diag:
[root@sla2 bin]# mii-diag   
Using the default interface 'eth0'.
Basic registers of MII PHY #1:  3000 782d 02a8 0154 05e1 0021 0000 0000.
 Basic mode control register 0x3000: Auto-negotiation enabled.
 You have link beat, and everything is working OK.
 Your link partner is generating 10baseT link beat  (no autonegotiation).
   End of basic transceiver information.

[root@sla2 bin]# eepro100-diag -mm -f
eepro100-diag.c:v2.07 12/28/2001 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Intel i82557/8/9 EtherExpressPro100 adapter at 0x4000.
 MII PHY #1 transceiver registers:
  3000 782d 02a8 0154 05e1 0021 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000
  0400 0000 0001 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000.
 MII PHY #1 transceiver registers:
   3000 782d 02a8 0154 05e1 0021 0000 0000
   0000 0000 0000 0000 0000 0000 0000 0000
   0400 0000 0001 0000 0000 0000 0000 0000
   0000 0000 0000 0000 0000 0000 0000 0000.
 Basic mode control register 0x3000: Auto-negotiation enabled.
 Basic mode status register 0x782d ... 782d.
   Link status: established.
   Capable of  100baseTx-FD 100baseTx 10baseT-FD 10baseT.
   Able to perform Auto-negotiation, negotiation complete.
 Vendor ID is 00:aa:00:--:--:--, model 21 rev. 4.
   No specific information is known about this transceiver type.
 I'm advertising 05e1: Flow-control 100baseTx-FD 100baseTx 10baseT-FD 10baseT
   Advertising no additional info pages.
   IEEE 802.3 CSMA/CD protocol.
 Link partner capability is 0021: 10baseT.
   Negotiation did not complete.
Monitoring the MII transceiver status.
13:28:23.414  Baseline value of MII BMSR (basic mode status register) is 782d.

I'm assuming that the above status register might change when the error
occurs.
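
To make that comparison easier, the register words printed above can be decoded by hand. The following is a minimal sketch, not part of mii-diag or eepro100-diag: the bit masks are assumed from the standard IEEE 802.3 MII register layout, and the helper names (decode, bmsr_changes) are made up for illustration. It decodes the baseline BMSR (782d), the advertised abilities (05e1) and the link partner word (0021), and lists which status bits differ between the baseline and a later reading.

```python
# Sketch only: masks assumed from the standard MII register layout
# (IEEE 802.3 clause 22); helper names are hypothetical.

# Basic mode status register (BMSR, MII register 1)
BMSR_BITS = {
    0x4000: "100baseTx-FD capable",
    0x2000: "100baseTx capable",
    0x1000: "10baseT-FD capable",
    0x0800: "10baseT capable",
    0x0020: "autonegotiation complete",
    0x0010: "remote fault",
    0x0008: "autonegotiation capable",
    0x0004: "link established",
    0x0002: "jabber detected",
}

# Advertisement (register 4) and link partner ability (register 5)
ABILITY_BITS = {
    0x0400: "Flow-control",
    0x0100: "100baseTx-FD",
    0x0080: "100baseTx",
    0x0040: "10baseT-FD",
    0x0020: "10baseT",
}


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table.items() if value & mask]


def bmsr_changes(baseline, current):
    """Return (name, new_state) for each known BMSR bit that differs."""
    changed = baseline ^ current
    return [(name, bool(current & mask))
            for mask, name in BMSR_BITS.items() if changed & mask]


baseline = 0x782D  # baseline BMSR reported by eepro100-diag above
print("BMSR:        ", decode(baseline, BMSR_BITS))
print("Advertising: ", decode(0x05E1, ABILITY_BITS))
print("Link partner:", decode(0x0021, ABILITY_BITS))

# Hypothetical later reading with the link bit cleared, for illustration only.
print("Changes:     ", bmsr_changes(baseline, 0x7829))
```

The later reading 7829 is an invented value used only to show the comparison; it is not taken from the failing system.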

I'll let you know if I find out any more info.
-----Original Message-----
From: Donald Becker [...]
Sent: Tuesday, March 05, 2002 14:01
To: Wilson, John
Cc: 'eepro100@scyld.com'
Subject: Re: [eepro100] wait_for_cmd_done timeout


On Tue, 5 Mar 2002, Wilson, John wrote:

> I am seeing a problem with wait_for_cmd_done that is very similar to the
> timeout issue that I found on GeoCrawler.
...
> In summary: It appears to me that the network is being flooded with ICMP
> traffic (and possibly other traffic) and that the eepro100 may not be
> handling the errors/traffic. (I'm new to Linux device drivers, so please
> bear with me here).

There are a bunch of errors reported here.  The device driver does not
cause the errors -- it only reports them.

> I'm running:
> 	RH 7.2
> 	Kernel 2.4.9-13 modified to support the ATM device drivers (eni and FORE (Marconi))
> 	ATM on Linux support software: linux-atm-2.4.0
> 	Samba
...
> Mar  5 09:05:14 sla2 kernel: eepro100: wait_for_cmd_done timeout!
> Mar  5 09:05:46 sla2 last message repeated 24 times
> Mar  5 09:05:48 sla2 last message repeated 3 times
> Mar  5 09:05:49 sla2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Mar  5 09:05:49 sla2 kernel: eth0: Transmit timed out: status 0050  0c80 at 48699/48728 command 00030000.

You should run eepro100-diag to see more chip status information.
Nothing is obviously wrong from this report.

> Mar  5 09:06:22 sla2 kernel: eni(itf 0): TX DMA full
> Mar  5 09:06:23 sla2 last message repeated 7 times
> Mar  5 09:06:24 sla2 kernel: eni(itf 0): TX DMA full
>
> At this point both the eth0 interface and atm0 interface stop working.
> Note that the eepro100 times out first and then the eni driver also dies
> with TX DMA full error.

Yup.  That indicates that there is a system problem that affects both
devices.

> ifconfig shows:

> eth0      Link encap:Ethernet  HWaddr 00:50:8B:D3:92:7C  
>           RX packets:504622 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:47444 errors:289 dropped:0 overruns:0 carrier:0
>           collisions:1416 txqueuelen:100 
>           RX bytes:50644863 (48.2 Mb)  TX bytes:10479503 (9.9 Mb)

> Note the collisions are on eth0.

What type of link partner?  What does 'mii-diag' or 'eepro100-diag -m'
report?
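
For a rough sense of scale, the eth0 counters quoted above can be turned into rates. This is only arithmetic on the figures in the ifconfig output; the helper name is made up for illustration, and nothing in this thread states what rate should be considered abnormal.

```python
# Sketch only: counters copied from the quoted ifconfig output for eth0.
tx_packets = 47444
tx_errors = 289
collisions = 1416


def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


print("collisions per TX packet: %.2f%%" % percent(collisions, tx_packets))
print("TX errors per TX packet:  %.2f%%" % percent(tx_errors, tx_packets))
```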

> I wanted to point out that the eepro100 is timing out and is affecting the
> ATM device driver too.

That's not likely what is happening.  While the eepro100 driver is
encountering a problem that causes a timeout, the system workload is
reduced.  Even so, the ATM device driver is reporting a problem.  It
seems more likely that both problems are caused by a third source.

> The eepro100 version is:
> "eepro100.c:v1.09j-t 9/29/99 Donald Becker
> http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html\n"

Grrr, they still refuse to update the URL.

> "eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin
> <saw@saw.sw.com.sg> and others\n";
> 
> I know there is a lot of info here, but after reading the thread on the
> wait_for_cmd_done, I thought this might shed some light on the problem and
> that it may not be confined to the newer/experimental kernels.
> 
> Any help would be much appreciated.

Have you tried the driver from
   http://www.scyld.com/network/eepro100.html
      ftp://www.scyld.com/pub/network/eepro100.c

It might not solve the system problem, but it is more likely to report
useful diagnostic information.

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993