[eepro100] Transmitter Timeout -- addendum

Kallol Biswas kallol@bugula.fpk.hp.com
Sun, 30 Jul 2000 10:41:29 EDT


I don't know about the latest eepro100 driver, but the version
I saw had a fundamental design problem. I will try to explain again:
   The 82559 prefetches the next command from the command ring.
Suppose the command unit (CU) is executing the i-th command and has
already prefetched the next one, the (i+1)-th. The driver then sets up
the (i+1)-th command, sets the S bit, and sends RESUME. What happens
next depends on the CU state:
   * If the CU is in the Suspended state, it goes to the Active state.
It does not re-read the next link pointer (the address of the (i+1)-th
command); it re-reads only the S bit of the i-th command. If the S bit
of the i-th command is cleared, it executes the (i+1)-th command;
otherwise it goes back to the Suspended state.
   * If the CU is Active, it checks the validity of the S bits of the
next ((i+1)-th) and present (i-th) commands. (PCI command 0x6, Memory
Read, is used to re-read the S bit of a TxCB; I saw it on the analyzer.)
Please note that it does not say that it re-analyzes the next
((i+1)-th) command, only the S bit.

So if the (i+1)-th command was previously executed as, say, a transmit
command, and the driver now sets it up as, say, a multicast command,
then the card executes the (i+1)-th command with invalid parameters and
the card stalls.

Our initial version of the 82559 driver would hang on an Itanium processor
based system because of this problem, but adding a NOP after a
cmd has solved the problem. Now our stress tests run for days without
any problem on 82559. 
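
A minimal sketch, in C, of one way a "NOP after each command" ring could
be arranged. This is not the actual driver code, and the message above
does not describe the exact layout that was used. The structure, the
names (cmd_ring, queue_cmd), the ring size, and the opcode and S-bit
values are all assumptions made for illustration. The idea shown is that
commands go only in even slots and every odd slot permanently holds a
NOP, so the block the CU may already have prefetched behind a suspended
command is always a NOP whose contents never change.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only (assumptions); check them against the
 * 8255x manual before relying on them. */
#define CMD_NOP        0x0000u
#define CMD_MULTICAST  0x0003u
#define CMD_TX         0x0004u
#define CMD_SUSPEND    0x4000u   /* the S bit */

#define RING_SIZE 8              /* must be even: commands in even slots, NOPs in odd */

struct cmd_block {               /* hypothetical, simplified command block */
    uint16_t status;
    uint16_t command;            /* opcode plus flag bits */
    uint32_t link;               /* next slot index; a bus address on real hardware */
};

struct cmd_ring {
    struct cmd_block cb[RING_SIZE];
    unsigned int next;           /* next even slot to fill */
    int last_cmd;                /* slot of the last queued command, -1 if none */
};

static void ring_init(struct cmd_ring *r)
{
    for (unsigned int i = 0; i < RING_SIZE; i++) {
        r->cb[i].status = 0;
        r->cb[i].command = CMD_NOP;
        r->cb[i].link = (i + 1) % RING_SIZE;
    }
    r->next = 0;
    r->last_cmd = -1;
}

/* Queue one command in an even slot; the odd slot after it stays a NOP.
 * Ring-full checks and starting the CU are omitted from this sketch. */
static unsigned int queue_cmd(struct cmd_ring *r, uint16_t opcode)
{
    unsigned int slot = r->next;

    assert(slot % 2 == 0);
    r->cb[slot].status = 0;
    r->cb[slot].command = (uint16_t)(opcode | CMD_SUSPEND);
    r->cb[slot + 1].status = 0;          /* trailing NOP: opcode never changes */
    r->cb[slot + 1].command = CMD_NOP;

    /* Release the previous command last. A real driver would need a
     * write barrier before this store and a CU RESUME after it. */
    if (r->last_cmd >= 0)
        r->cb[r->last_cmd].command &= (uint16_t)~CMD_SUSPEND;

    r->last_cmd = (int)slot;
    r->next = (slot + 2) % RING_SIZE;
    return slot;
}
```

With this layout, a slot that was a transmit command is never reused as
a NOP or the other way around, so a stale prefetch of the slot after a
suspended command cannot carry parameters for a different command type.
The cost is that half of the ring is spent on NOPs.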

I hope this makes it clear. If you have any questions, please feel
free to call me at 973-443-7469/973-442-0164.
I will try to explain as much as I can.

Regards,
Kallol


> 
> 
> A quick re-cap of my hardware:
> 
> * i82557 quad 64-bit PCI (33 MHz) Ethernet card
> * DEC PC164 Motherboard with 21164 EV56 processor.
> 
> I've been messing with eepro100 drivers for about 32 hours straight now
> (with a few hours off for pizza), and as an addendum to my last e-mail,
> this is what I have tried and found thus far:
> 
> * The TX-timeout is not dependent on what the card is connected to
> after all.  Regardless of whether it is connected to a 3c905, Bay 350T,
> UB 100-tx hub, or tulip card, the "TX-timeout" still happens.  The
> timeout just happens a little quicker when connected via X-over to a
> 905b. . .
> * All cabling is tried and true on other network cards.
> * The TX-timeout occurs on just about all heavy traffic. . . the initial
> (initial meaning the first timeout since boot) timeout takes a little
> while to happen, but afterwards the successive timeouts come
> quicker.  Here is a quick table of the occurrence of the timeouts with
> regard to the different driver versions:
> 
> Traffic                   Driver  Kernel      Initial       Successive     Recovery
>                           Version Version     Timeout (sec) Timeouts (sec) Time (sec)
> heavy NFS read/writes     1.06    2.2.14      25-30         8-10           1-2
> mpeg streaming via SAMBA  1.06    2.2.14      35-40         12-15          1-2
> HEAVY FTP                 1.06    2.2.14      IMMEDIATE     1-2            4-5
> telnet/ssh/http           1.06    2.2.14      NONE          -              -
> heavy NFS read/writes     1.09    2.2.14      30-45         10-12          8-10
> mpeg streaming via SAMBA  1.09    2.2.14      115-140       15-20          8-10
> HEAVY FTP                 1.09    2.2.14      IMMEDIATE     <1             1-2
> telnet/ssh/http           1.09    2.2.14      NONE          -              -
> heavy NFS read/writes     1.09    2.2.16      30-45         10-12          8-10
> mpeg streaming via SAMBA  1.09    2.2.16      115-140       15-20          8-10
> HEAVY FTP                 1.09    2.2.16      IMMEDIATE     <1             1-2
> telnet/ssh/http           1.09    2.2.16      30 minutes    ???            a long time
> ALL                       1.09    2.4.0-test5 N/A*
> *=OS locks IMMEDIATELY after reaching the eepro100 code when compiled into
> the kernel, or upon insmod when running as a module, with NO ERROR
> MESSAGES.
> 
> MESSAGES:
> 
> On v1.06 of the driver, this is what /var/log/messages says:
> Jul 25 09:59:12 fosters kernel: eth0: Transmit timed out: status 0050  0000 at 322796/322810 command 000c0000.
> Jul 25 09:59:12 fosters kernel: eth0: Trying to restart the transmitter...
> 
> On v1.09 of the driver this is what /var/log/messages says:
> Jul 30 03:25:26 fosters kernel: eth0: Transmit timed out: status 0050  0c00 at 107640/107670 command 200c0000.
> 
> BOOT MESSAGE:
> 
> Jul 29 22:39:31 fosters kernel: eth0: OEM i82557/i82558 10/100 Ethernet at 0x9000, 00:08:C7:91:08:72, IRQ 17.
> Jul 29 22:39:31 fosters kernel:   Board assembly 009542-001, Physical connectors present: RJ45
> Jul 29 22:39:31 fosters kernel:   Primary interface chip i82555 PHY #1.
> Jul 29 22:39:31 fosters kernel:   General self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Serial sub-system self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Internal registers self-test: passed.
> Jul 29 22:39:31 fosters kernel:   ROM checksum self-test: passed (0x24c9f043).
> Jul 29 22:39:31 fosters kernel:   Receiver lock-up workaround activated.
> Jul 29 22:39:31 fosters kernel: eth1: OEM i82557/i82558 10/100 Ethernet at 0x9800, 00:08:C7:91:08:73, IRQ 24.
> Jul 29 22:39:31 fosters kernel:   Board assembly 009542-001, Physical connectors present: RJ45
> Jul 29 22:39:31 fosters kernel:   Primary interface chip i82555 PHY #1.
> Jul 29 22:39:31 fosters kernel:   General self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Serial sub-system self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Internal registers self-test: passed.
> Jul 29 22:39:31 fosters kernel:   ROM checksum self-test: passed (0x24c9f043).
> Jul 29 22:39:31 fosters kernel:   Receiver lock-up workaround activated.
> Jul 29 22:39:31 fosters kernel: eth2: OEM i82557/i82558 10/100 Ethernet at 0xa000, 00:08:C7:66:80:F7, IRQ 28.
> Jul 29 22:39:31 fosters kernel:   Board assembly 009545-001, Physical connectors present: RJ45
> Jul 29 22:39:31 fosters kernel:   Primary interface chip i82555 PHY #1.
> Jul 29 22:39:31 fosters kernel:   General self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Serial sub-system self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Internal registers self-test: passed.
> Jul 29 22:39:31 fosters kernel:   ROM checksum self-test: passed (0x24c9f043).
> Jul 29 22:39:31 fosters kernel:   Receiver lock-up workaround activated.
> Jul 29 22:39:31 fosters kernel: eth3: OEM i82557/i82558 10/100 Ethernet at 0xa800, 00:08:C7:66:80:0F, IRQ 32.
> Jul 29 22:39:31 fosters kernel:   Board assembly 009545-001, Physical connectors present: RJ45
> Jul 29 22:39:31 fosters kernel:   Primary interface chip i82555 PHY #1.
> Jul 29 22:39:31 fosters kernel:   General self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Serial sub-system self-test: passed.
> Jul 29 22:39:31 fosters kernel:   Internal registers self-test: passed.
> Jul 29 22:39:31 fosters kernel:   ROM checksum self-test: passed (0x24c9f043).
> Jul 29 22:39:31 fosters kernel:   Receiver lock-up workaround activated.
> 
> PCI:
> 
> There don't seem to be any PCI conflicts, and I tried both enabling and
> disabling "PCI quirks" in the kernel to no avail. . .
> 
> Here is a cat of my /proc/pci:
> 
> PCI devices found:
>   Bus  0, device   7, function  0:
>     PCI bridge: DEC DC21154 (rev 2).
>       Medium devsel.  Fast back-to-back capable.  Master Capable.  Latency=32.  Min Gnt=4.
>   Bus  0, device   8, function  0:
>     Non-VGA device: Intel 82378IB (rev 67).
>       Medium devsel.  Master Capable.  No bursts.
>   Bus  0, device   9, function  0:
>     VGA compatible controller: Matrox Millennium (rev 1).
>       Medium devsel.  Fast back-to-back capable.  IRQ 19.
>       Non-prefetchable 32 bit memory at 0x9000000 [0x9000000].
>       Non-prefetchable 32 bit memory at 0x9800000 [0x9800000].
>   Bus  0, device  11, function  0:
>     IDE interface: CMD 646 (rev 1).
>       Medium devsel.  Fast back-to-back capable.  IRQ 21.  Master Capable.  Latency=64.  Min Gnt=2.Max Lat=4.
>       I/O at 0x8000 [0x8001].
>   Bus  1, device   4, function  0:
>     Ethernet controller: Intel 82557 (rev 5).
>       Medium devsel.  Fast back-to-back capable.  IRQ 17.  Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
>       Non-prefetchable 32 bit memory at 0xa000000 [0xa000000].
>       I/O at 0x9000 [0x9001].
>       Non-prefetchable 32 bit memory at 0xa100000 [0xa100000].
>   Bus  1, device   5, function  0:
>     Ethernet controller: Intel 82557 (rev 5).
>       Medium devsel.  Fast back-to-back capable.  IRQ 24.  Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
>       Non-prefetchable 32 bit memory at 0xa200000 [0xa200000].
>       I/O at 0x9800 [0x9801].
>       Non-prefetchable 32 bit memory at 0xa300000 [0xa300000].
>   Bus  1, device   6, function  0:
>     Ethernet controller: Intel 82557 (rev 5).
>       Medium devsel.  Fast back-to-back capable.  IRQ 28.  Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
>       Non-prefetchable 32 bit memory at 0xa400000 [0xa400000].
>       I/O at 0xa000 [0xa001].
>       Non-prefetchable 32 bit memory at 0xa500000 [0xa500000].
>   Bus  1, device   7, function  0:
>     Ethernet controller: Intel 82557 (rev 5).
>       Medium devsel.  Fast back-to-back capable.  IRQ 32.  Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
>       Non-prefetchable 32 bit memory at 0xa600000 [0xa600000].
>       I/O at 0xa800 [0xa801].
>       Non-prefetchable 32 bit memory at 0xa700000 [0xa700000].
> 
> 
> And there don't seem to be any I/O issues.  Here is a cat of /proc/ioports:
> 
> 0060-006f : keyboard
> 0070-007f : timer
> 0170-0177 : ide1
> 01f0-01f7 : ide0
> 02f8-02ff : serial(auto)
> 0376-0376 : ide1
> 03c0-03df : vga+
> 03e8-03ef : serial(auto)
> 03f6-03f6 : ide0
> 03f8-03ff : serial(auto)
> 8000-8007 : ide0
> 8008-800f : ide1
> a000000-a00001f : Intel Speedo3 Ethernet
> a200000-a20001f : Intel Speedo3 Ethernet
> a400000-a40001f : Intel Speedo3 Ethernet
> a600000-a60001f : Intel Speedo3 Ethernet
> 
> TRIAL-N-ERROR:
> 
> Forcing different interface speeds via mii-diag does not fix anything:
> 100baseTX-FD -- timeout still occurs
> 100baseTX-HD -- timeout still occurs
> 10baseT-FD   -- timeout still occurs
> 10baseT-HD   -- timeout still occurs
> 
> eepro-diag:
> 
> eepro100-diag.c:v2.02 7/19/2000 Donald Becker (becker@scyld.com)
>  http://www.scyld.com/diag/index.html
> Index #1: Found a Intel i82557 (or i82558) EtherExpressPro100B adapter at 0x9000.
> A potential i82557 chip has been found, but it appears to be active.
> Either shutdown the network, or use the '-f' flag.
> Index #2: Found a Intel i82557 (or i82558) EtherExpressPro100B adapter at 0x9800.
> A potential i82557 chip has been found, but it appears to be active.
> Either shutdown the network, or use the '-f' flag.
> Index #3: Found a Intel i82557 (or i82558) EtherExpressPro100B adapter at 0xa000.
> A potential i82557 chip has been found, but it appears to be active.
> Either shutdown the network, or use the '-f' flag.
> Index #4: Found a Intel i82557 (or i82558) EtherExpressPro100B adapter at 0xa800.
> 
> Changing MACROS:
> 
> v1.06:
> txfifo/rxfifo: changes do nothing
> TX_RING_SIZE/RX_RING_SIZE: changes do nothing
> TX_TIMEOUT:  Increasing this number decreases the frequency of the
> timeouts until the number reaches roughly double what it was originally
> set to; beyond that, the interfaces are not usable until an ifdown/ifup
> 
> v1.09:
> txfifo/rxfifo: changes do nothing
> TX_RING_SIZE/RX_RING_SIZE: changes do nothing
> TX_TIMEOUT:  Increasing this number at all makes the interfaces unusable
> until an ifdown/ifup.
> 
> Also, I ported the code from v1.09 to v1.06 for the function "static
> void speedo_tx_timeout(struct net_device *dev)" to see what happens --
> the new "hybrid" driver exhibited the characteristics of the v1.09
> timeouts.
> 
> Lastly, changing txqueuelen via ifconfig does nothing. . .
> 
> Conclusion:
> 
> v1.06 of the driver seemed to handle the TX timeouts quicker than
> v1.09, but in v1.09 they were less frequent.  I tried to compile v1.10
> and experimental v1.11, but I got all types of compile errors and did
> not have the motivation to port them to v2.2.16 of the kernel after all
> my above failures.
> 
> I have NO IDEA what is causing these TX timeouts. . . if any of the
> gurus here would be so kind as to aid me in my efforts to figure this
> out, I would greatly appreciate it!  I will grant accounts on the
> troublesome machine if that will aid in troubleshooting, and I will
> code whatever I can if anyone can give me a direction to go in. . . 
> 
> Is there anything special that I have to set in the kernel for 64-bit
> PCI, BTW?
> Could the fact that this card is a 64-bit PCI card be the issue?
> Are there any special parameters that I could try tweaking that are
> alpha-specific?
> 
> 
> Thank you for any help!!
> 
> --Chris
> 
> _______________________________________________
> eepro100 mailing list
> eepro100@scyld.com
> http://www.scyld.com/mailman/listinfo/eepro100
> 


--
Phone: 973-443-7469
Telnet: 1-443-7469
www.kallolbiswas.com
kallol_biswas@hp.com