[eepro100] eepro100 failing under load

Jon. Hallett jjh@ecs.soton.ac.uk
Wed May 8 11:40:01 2002


We are having major problems with two PCs fitted with Intel EtherExpress 
Pro 100 server adapters.

When put under load, the adapters stop working.  We can reliably make the 
adapters fail by attempting to back up the machines to a remote tape 
drive.  Any other heavy network use also makes them fail.

The two machines are both running a 2.4.9 kernel with the 1.20 eepro100 driver:

Linux roadrunner 2.4.9-31enterprise #1 SMP Tue Feb 26 06:25:36 EST 2002 i686 unknown

May  8 16:07:38 roadrunner kernel: eepro100.c:v1.20 1/28/2002 Donald Becker <becker@scyld.com>
May  8 16:07:38 roadrunner kernel:   http://www.scyld.com/network/eepro100.html
May  8 16:07:38 roadrunner kernel: eth0: OEM Intel i82559 rev 8 at 0xf89a7000, 00:30:48:11:09:37, IRQ 31.
May  8 16:07:38 roadrunner kernel:   Board assembly 000000-000, Physical connectors present: RJ45
May  8 16:07:38 roadrunner kernel:   Primary interface chip i82555 PHY #1.
May  8 16:07:38 roadrunner kernel:   General self-test: passed.
May  8 16:07:38 roadrunner kernel:   Serial sub-system self-test: passed.
May  8 16:07:38 roadrunner kernel:   Internal registers self-test: passed.
May  8 16:07:38 roadrunner kernel:   ROM checksum self-test: passed (0x04f4518b).


When the interfaces fail, lots of these messages appear in /var/log/messages:

May  8 16:04:42 roadrunner kernel: eth0: Transmit timed out: status 0000  0010 at 530/550 commands 000c0000 000c0000 000c0000.
May  8 16:04:42 roadrunner kernel: eth0: Restarting the chip...
May  8 16:04:42 roadrunner kernel: Command 0070 was not accepted after 10001 polls!
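
For anyone who wants to count these events or compare them between the two 
machines, here is a minimal Python sketch that pulls the fields out of the 
"Transmit timed out" lines. It assumes each log entry is on a single line, as 
it is in /var/log/messages before mail wrapping. The names parse_timeouts, 
status_words and position are my own labels, not anything defined by the 
driver, and I make no claim about what the two numbers after "at" mean.

```python
import re

# Matches the driver's "Transmit timed out" message once the entry is on one line.
# Group names are illustrative labels only, not driver terminology.
TIMEOUT_RE = re.compile(
    r"(?P<iface>\w+): Transmit timed out: status\s+"
    r"(?P<w1>[0-9a-fA-F]{4})\s+(?P<w2>[0-9a-fA-F]{4})"
    r" at (?P<a>\d+)/(?P<b>\d+)"
)

def parse_timeouts(lines):
    """Return one dict per 'Transmit timed out' entry found in lines."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            events.append({
                "iface": m.group("iface"),
                # The two hex words printed after "status", kept as text.
                "status_words": [m.group("w1"), m.group("w2")],
                # The two decimal numbers printed after "at".
                "position": (int(m.group("a")), int(m.group("b"))),
            })
    return events

sample = [
    "May  8 16:04:42 roadrunner kernel: eth0: Transmit timed out: status "
    "0000  0010 at 530/550 commands 000c0000 000c0000 000c0000.",
    "May  8 16:04:42 roadrunner kernel: eth0: Restarting the chip...",
]
events = parse_timeouts(sample)
print(len(events), events[0]["status_words"], events[0]["position"])
```

To run it against the real log, pass an open file instead of the sample list, 
for example parse_timeouts(open("/var/log/messages")).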


Running eepro100-diag -aee -f produces the following:


eepro100-diag.c:v2.07 12/28/2001 Donald Becker (becker@scyld.com)
  http://www.scyld.com/diag/index.html
Index #1: Found a Intel i82557/8/9 EtherExpressPro100 adapter at 0xd400.
i82557 chip registers at 0xd400:
   00100000 36a86300 00000000 00080002 182541e1 00000000
   No interrupt sources are pending.
    The transmit unit state is 'Idle'.
    The receive unit state is 'Idle'.
   This status is unusual for an activated interface.
  The Command register has an unprocessed command 0010(?!).
EEPROM contents, size 64x16:
     00: 3000 1148 3709 0d1b 0000 0201 4701 0000
   0x08: 0000 0000 48e0 100c 8086 0000 0000 0000
       ...
   0x38: 0000 0000 0000 0000 0000 0000 0000 12da
  The EEPROM checksum is correct.
Intel EtherExpress Pro 10/100 EEPROM contents:
   Station address 00:30:48:11:09:37.
   Board assembly 000000-000, Physical connectors present: RJ45
   Primary interface chip i82555 PHY #1.
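
As a sanity check on the dump above, the station address can be rebuilt from 
the first three EEPROM words (3000 1148 3709). The sketch below assumes each 
16-bit word holds two address bytes with the low byte first, which is what the 
values in this dump suggest; station_address is a hypothetical helper name, 
not part of eepro100-diag.

```python
def station_address(eeprom_words):
    """Build a MAC address string from the first three 16-bit EEPROM words.

    Assumes each word stores two address bytes, low byte first.
    """
    octets = []
    for word in eeprom_words[:3]:
        octets.append(word & 0xFF)         # low byte comes first
        octets.append((word >> 8) & 0xFF)  # then the high byte
    return ":".join(f"{octet:02x}" for octet in octets)

# First three words of the EEPROM dump in this report.
words = [0x3000, 0x1148, 0x3709]
print(station_address(words))  # 00:30:48:11:09:37
```

The result matches the station address that both the driver and the diagnostic 
report, so the EEPROM is at least being read consistently.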


Any ideas what is going wrong?  We get the same problem with the Red 
Hat-supplied versions of eepro100 and e100, and with the latest 
Intel-supplied version of e100.

Note that we can reliably reproduce the problem, and we'd be entirely happy 
to experiment with updated drivers.

Thanks,

Jon.