[eepro100] first packet always lost

Raphael Clifford clifford@genomes.rockefeller.edu
Thu Dec 6 12:50:02 2001


I have a network of over 100 Linux machines, all running an unpatched
2.2.20 kernel with the eepro100 driver (the one that comes with the
kernel source) compiled into the kernel.  I am having network problems
which I can't explain.  The simplest and most reproducible diagnostic is
this.

When I ping -s 8000 from one node to another, the first packet is lost
and eventually a "fragment reassembly timeout" error is shown.  However,
if I then ping the same machine again, everything is fine.  If I ping
another machine, I get the error again.  If I then wait for a while
(say, until the next day) and ping the original machine, I get the error
again.  I ran some tcpdumps and the problem is clear (in a way).
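
For reference, the fragment notation in the traces below (length@offset,
with a trailing + when more fragments follow) can be reproduced with a
short calculation.  This is only a sketch: it assumes a 1500-byte
Ethernet MTU, a 20-byte IP header without options, and an 8000-byte ICMP
payload, which is what the final 608@7400 fragment in the traces implies.
The name fragment_layout is made up for illustration.

```python
# Sketch: how an ICMP echo of a given size is split into IP fragments.
# Assumptions (not stated in the post): 1500-byte MTU, 20-byte IP header
# with no options, 8-byte ICMP header.

def fragment_layout(icmp_data_len, mtu=1500, ip_header=20, icmp_header=8):
    """Return a list of (length, offset, more_fragments) tuples."""
    payload = icmp_data_len + icmp_header   # bytes carried inside IP
    per_frag = (mtu - ip_header) // 8 * 8   # fragment data is a multiple of 8
    frags = []
    offset = 0
    while offset < payload:
        length = min(per_frag, payload - offset)
        more = offset + length < payload    # True for all but the last fragment
        frags.append((length, offset, more))
        offset += length
    return frags


if __name__ == "__main__":
    for length, offset, more in fragment_layout(8000):
        # Same notation as tcpdump: length@offset, '+' = more fragments follow
        print("%d@%d%s" % (length, offset, "+" if more else ""))
```

With these assumptions a complete echo request consists of six fragments,
five of 1480 bytes and a final one of 608 bytes at offset 7400.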


From the pinging machine...

12:51:36.054782 eth0 > arp who-has 172.18.1.23 tell 172.18.1.16 (0:d0:b7:be:5d:35)
12:51:36.055020 eth0 < arp reply 172.18.1.23 is-at 0:d0:b7:be:4a:31 (0:d0:b7:be:5d:35)
12:51:36.055049 eth0 > 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3140:1480@0+) (ttl 64)
12:51:36.055055 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3140:1480@1480+) (ttl 64)
12:51:36.055059 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3140:1480@2960+) (ttl 64)
12:51:37.045588 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:608@7400) (ttl 64)
12:51:37.045635 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@5920+) (ttl 64)
12:51:37.045710 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@4440+) (ttl 64)
12:51:37.045749 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@2960+) (ttl 64)
12:51:37.045794 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@1480+) (ttl 64)
12:51:37.045824 eth0 > 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3141:1480@0+) (ttl 64)
12:51:37.046892 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63822:608@7400) (ttl 255)
12:51:37.047099 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63822:1480@5920+) (ttl 255)
12:51:37.047220 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63822:1480@4440+) (ttl 255)
12:51:37.047343 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63822:1480@2960+) (ttl 255)
12:51:37.047467 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63822:1480@1480+) (ttl 255)
12:51:37.047590 eth0 < 172.18.1.23 > 172.18.1.16: icmp: echo reply (frag 63822:1480@0+) (ttl 255)
12:51:38.045597 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3143:608@7400) (ttl 64)
12:51:38.045643 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3143:1480@5920+) (ttl 64)
12:51:38.045678 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3143:1480@4440+) (ttl 64)
12:51:38.045705 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3143:1480@2960+) (ttl 64)
12:51:38.045764 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3143:1480@1480+) (ttl 64)
12:51:38.045791 eth0 > 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3143:1480@0+) (ttl 64)
12:51:38.046871 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63829:608@7400) (ttl 255)
12:51:38.047075 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63829:1480@5920+) (ttl 255)
12:51:38.047198 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63829:1480@4440+) (ttl 255)
12:51:38.047321 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63829:1480@2960+) (ttl 255)
12:51:38.047443 eth0 < 172.18.1.23 > 172.18.1.16: (frag 63829:1480@1480+) (ttl 255)
12:51:38.047566 eth0 < 172.18.1.23 > 172.18.1.16: icmp: echo reply (frag 63829:1480@0+) (ttl 255)
[...]


From the machine that was pinged...



12:42:49.536832 eth0 B arp who-has 172.18.1.23 tell 172.18.1.16
12:42:49.537338 eth0 < 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3140:1480@0+) (ttl 64)
12:42:49.537464 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3140:1480@1480+) (ttl 64)
12:42:49.537583 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3140:1480@2960+) (ttl 64)
12:42:50.527590 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3141:608@7400) (ttl 64)
12:42:50.527800 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3141:1480@5920+) (ttl 64)
12:42:50.527916 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3141:1480@4440+) (ttl 64)
12:42:50.528039 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3141:1480@2960+) (ttl 64)
12:42:50.528161 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3141:1480@1480+) (ttl 64)
12:42:50.528283 eth0 < 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3141:1480@0+) (ttl 64)
12:42:50.528545 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63822:608@7400) (ttl 255)
12:42:50.528566 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63822:1480@5920+) (ttl 255)
12:42:50.528582 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63822:1480@4440+) (ttl 255)
12:42:50.528593 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63822:1480@2960+) (ttl 255)
12:42:50.528637 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63822:1480@1480+) (ttl 255)
12:42:50.528653 eth0 > 172.18.1.23 > 172.18.1.16: icmp: echo reply (frag 63822:1480@0+) (ttl 255)
12:42:51.527465 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3143:608@7400) (ttl 64)
12:42:51.527679 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3143:1480@5920+) (ttl 64)
12:42:51.527789 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3143:1480@4440+) (ttl 64)
12:42:51.527911 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3143:1480@2960+) (ttl 64)
12:42:51.528033 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3143:1480@1480+) (ttl 64)
12:42:51.528159 eth0 < 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3143:1480@0+) (ttl 64)
12:42:51.528385 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63829:608@7400) (ttl 255)
12:42:51.528409 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63829:1480@5920+) (ttl 255)
12:42:51.528422 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63829:1480@4440+) (ttl 255)
12:42:51.528433 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63829:1480@2960+) (ttl 255)
12:42:51.528450 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63829:1480@1480+) (ttl 255)
12:42:51.528476 eth0 > 172.18.1.23 > 172.18.1.16: icmp: echo reply (frag 63829:1480@0+) (ttl 255)
12:42:52.527330 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3145:608@7400) (ttl 64)
12:42:52.527537 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3145:1480@5920+) (ttl 64)
12:42:52.527654 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3145:1480@4440+) (ttl 64)
12:42:52.527773 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3145:1480@2960+) (ttl 64)
12:42:52.527896 eth0 < 172.18.1.16 > 172.18.1.23: (frag 3145:1480@1480+) (ttl 64)
12:42:52.528018 eth0 < 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3145:1480@0+) (ttl 64)
12:42:52.528230 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63842:608@7400) (ttl 255)
12:42:52.528245 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63842:1480@5920+) (ttl 255)
12:42:52.528254 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63842:1480@4440+) (ttl 255)
12:42:52.528265 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63842:1480@2960+) (ttl 255)
12:42:52.528276 eth0 > 172.18.1.23 > 172.18.1.16: (frag 63842:1480@1480+) (ttl 255)
[...]


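The machine that is pinging doesn't seem to send the first packet
correctly: only three of its six fragments (ID 3140, offsets 0, 1480
and 2960) show up in either trace, while every later request is sent
and received in full.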
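
A check like this can be done mechanically rather than by eye.  The
sketch below groups the "(frag ID:LENGTH@OFFSET)" entries of a tcpdump
text capture by IP ID and lists the datagrams that could not be
reassembled.  The helper names (is_complete, incomplete_ids) are
hypothetical, and the code assumes that IP IDs are not reused within the
capture and that fragments are not duplicated.

```python
import re
from collections import defaultdict

# tcpdump fragment notation: "(frag ID:LENGTH@OFFSET)", with a trailing '+'
# when more fragments follow.  \s+ also tolerates mail-wrapped lines.
FRAG_RE = re.compile(r"\(frag\s+(\d+):(\d+)@(\d+)(\+?)\)")


def is_complete(parts):
    """parts: list of (offset, length, more_fragments) for one IP ID."""
    expected = 0
    ordered = sorted(parts)
    for offset, length, _more in ordered:
        if offset != expected:          # hole before this fragment
            return False
        expected = offset + length
    return not ordered[-1][2]           # last fragment must not carry '+'


def incomplete_ids(dump_text):
    """Return the sorted IP IDs whose fragments do not form a whole datagram."""
    by_id = defaultdict(list)
    for ip_id, length, offset, more in FRAG_RE.findall(dump_text):
        by_id[int(ip_id)].append((int(offset), int(length), more == "+"))
    return sorted(i for i, parts in by_id.items() if not is_complete(parts))


# Lines taken from the trace of the pinging machine (first two requests).
SAMPLE = """\
12:51:36.055049 eth0 > 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3140:1480@0+) (ttl 64)
12:51:36.055055 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3140:1480@1480+) (ttl 64)
12:51:36.055059 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3140:1480@2960+) (ttl 64)
12:51:37.045588 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:608@7400) (ttl 64)
12:51:37.045635 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@5920+) (ttl 64)
12:51:37.045710 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@4440+) (ttl 64)
12:51:37.045749 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@2960+) (ttl 64)
12:51:37.045794 eth0 > 172.18.1.16 > 172.18.1.23: (frag 3141:1480@1480+) (ttl 64)
12:51:37.045824 eth0 > 172.18.1.16 > 172.18.1.23: icmp: echo request (frag 3141:1480@0+) (ttl 64)
"""

if __name__ == "__main__":
    print(incomplete_ids(SAMPLE))
```

On the sample lines only ID 3140 is reported, since its fragments stop
at offset 2960 and the final fragment without '+' never appears.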


Is this a driver error?


Cheers,
Raphael


P.S.  For diagnosis I am including the output of mii-diag -a


mii-diag.c:v2.02 5/21/2001 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Using the default interface 'eth0'.
 Basic mode control register 0x2100: Auto-negotiation disabled, with
 Speed fixed at 100 mbps, full-duplex.
 You have link beat, and everything is working OK.
   This transceiver is capable of  100baseTx-FD 100baseTx 10baseT-FD 10baseT.
   Able to perform Auto-negotiation, negotiation not complete.
 Your link partner is generating 100baseTx link beat  (no autonegotiation).
   End of basic transceiver information.

 MII PHY #1 transceiver registers:
   2100 780d 02a8 0154 05e1 0081 0000 0000
   0000 0000 0000 0000 0000 0000 0000 0000
   0a03 0000 0001 0000 0000 0000 0000 0000
   0000 0000 0b10 0000 0000 0000 0000 0000.
 Basic mode control register 0x2100: Auto-negotiation disabled!
   Speed fixed at 100 mbps, full-duplex.
 Basic mode status register 0x780d ... 780d.
   Link status: established.
   Capable of  100baseTx-FD 100baseTx 10baseT-FD 10baseT.
   Able to perform Auto-negotiation, negotiation not complete.
 Vendor ID is 00:aa:00:--:--:--, model 21 rev. 4.
   No specific information is known about this transceiver type.
 I'm advertising 05e1: Flow-control 100baseTx-FD 100baseTx 10baseT-FD 10baseT
   Advertising no additional info pages.
   IEEE 802.3 CSMA/CD protocol.
 Link partner capability is 0081: 100baseTx.
   Negotiation did not complete.

(the card is forced to 100 base T at startup)
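
To show how the two register values in that output map to the text
mii-diag prints, here is a small decoding sketch.  The bit positions are
the standard MII basic mode control and status register bits as far as I
know them; only a subset is listed, and the table and function names are
invented for this example, not taken from mii-diag.

```python
# Sketch: decode the MII basic mode control (BMCR) and status (BMSR)
# registers shown above.  Bit meanings follow the standard MII layout;
# this is a partial list, not the full register definition.

BMCR_BITS = [
    (0x8000, "reset"),
    (0x4000, "loopback"),
    (0x2000, "speed 100"),
    (0x1000, "autoneg enabled"),
    (0x0800, "power down"),
    (0x0400, "isolate"),
    (0x0200, "restart autoneg"),
    (0x0100, "full duplex"),
]

BMSR_BITS = [
    (0x4000, "100baseTx-FD capable"),
    (0x2000, "100baseTx capable"),
    (0x1000, "10baseT-FD capable"),
    (0x0800, "10baseT capable"),
    (0x0020, "autoneg complete"),
    (0x0010, "remote fault"),
    (0x0008, "autoneg capable"),
    (0x0004, "link up"),
    (0x0002, "jabber detected"),
    (0x0001, "extended register set"),
]


def decode(value, bit_names):
    """Return the names of all listed bits that are set in value."""
    return [name for mask, name in bit_names if value & mask]


if __name__ == "__main__":
    print("BMCR 0x2100:", ", ".join(decode(0x2100, BMCR_BITS)))
    print("BMSR 0x780d:", ", ".join(decode(0x780D, BMSR_BITS)))
```

For 0x2100 neither the "autoneg enabled" nor the "restart autoneg" bit is
set, and for 0x780d the "autoneg complete" bit is clear, which agrees
with the "Auto-negotiation disabled" and "negotiation not complete"
lines in the output.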

and

./eepro100-diag -eef

eepro100-diag.c:v2.05 6/13/2001 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Intel i82557/8/9 EtherExpressPro100 adapter at 0xef00.

EEPROM contents, size 64x16:
    00: d000 beb7 355d 0203 0000 0201 4701 0000
  0x08: 7213 8309 40a2 000c 8086 0000 0000 0000
      ...
  0x30: 0128 0000 0000 0000 0000 0000 0000 0000
  0x38: 0000 0000 0000 0000 0000 0000 0000 f429
 The EEPROM checksum is correct.
Intel EtherExpress Pro 10/100 EEPROM contents:
  Station address 00:D0:B7:BE:5D:35.
  Board assembly 721383-009, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
   Sleep mode is enabled.  This is not recommended.
   Under high load the card may not respond to
    PCI requests, and thus cause a master abort.
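
As a cross-check between the raw dump and the decoded text, the station
address can be rebuilt from the first three EEPROM words.  This sketch
assumes each 16-bit word holds two address bytes with the low byte
first, which is what the values above suggest; station_address is a
made-up name for this example.

```python
# Sketch: rebuild the station (MAC) address from the first three 16-bit
# EEPROM words.  Assumption: each word stores two address bytes, low byte
# first, as the dump and the decoded address above suggest.

def station_address(words):
    """Return the MAC address encoded in the first three EEPROM words."""
    octets = []
    for word in words[:3]:
        octets.append(word & 0xFF)   # low byte comes first
        octets.append(word >> 8)     # then the high byte
    return ":".join("%02X" % octet for octet in octets)


if __name__ == "__main__":
    # First three words of the EEPROM dump above: d000 beb7 355d
    print(station_address([0xD000, 0xBEB7, 0x355D]))
```

The result matches the "Station address" line printed by eepro100-diag
and the hardware address seen in the ARP request.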

and

./eepro100-diag -aaf

eepro100-diag.c:v2.05 6/13/2001 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Intel i82557/8/9 EtherExpressPro100 adapter at 0xef00.

i82557 chip registers at 0xef00:
  0c000050 01634000 00000000 00080002 18250081 00000600
  No interrupt sources are pending.
   The transmit unit state is 'Suspended'.
   The receive unit state is 'Ready'.
  This status is normal for an activated but idle interface.
 The Command register has an unprocessed command 0c00(?!).


I have no idea what that last message (the unprocessed command 0c00)
refers to.