[vortex] 3C905C Hangs

Karyl F. Stein kstein@xenos.net
Wed, 17 Oct 2001 15:26:55 -0400


I've done a little more analysis on this.  I'm leaning away from it being a
kernel or driver issue as I have not been able to reproduce the behavior on
two other machines running the same kernel version and network card (I
physically moved the card into each of them).  I have also swapped the
cable, the switch port, and the PCI slot of the card without
seeing any change.  I guess I should send this to another forum (any ideas
where?), but I thought I'd pass a few more details to this list to see if
anyone might have insight on what's wrong with this system.

When I do a large (>200M) continuous transfer over the LAN, I run into
what basically amounts to a network hang.  During this "hang", both inbound
and outbound packets get through only in bursts.  Each burst lets a little
data through, with 15-20 seconds of no traffic between bursts.  If you run a
ping during this hang, the output looks something like this:

64 bytes from 192.168.2.146: icmp_seq=10 ttl=255 time=1.637 msec
64 bytes from 192.168.2.146: icmp_seq=0 ttl=255 time=9.994 sec
64 bytes from 192.168.2.146: icmp_seq=1 ttl=255 time=9.002 sec
64 bytes from 192.168.2.146: icmp_seq=2 ttl=255 time=8.002 sec
64 bytes from 192.168.2.146: icmp_seq=3 ttl=255 time=7.002 sec
64 bytes from 192.168.2.146: icmp_seq=4 ttl=255 time=6.002 sec
64 bytes from 192.168.2.146: icmp_seq=5 ttl=255 time=5.002 sec
64 bytes from 192.168.2.146: icmp_seq=6 ttl=255 time=4.003 sec
64 bytes from 192.168.2.146: icmp_seq=7 ttl=255 time=3.003 sec
64 bytes from 192.168.2.146: icmp_seq=8 ttl=255 time=2.003 sec
64 bytes from 192.168.2.146: icmp_seq=9 ttl=255 time=1.003 sec
64 bytes from 192.168.2.146: icmp_seq=27 ttl=255 time=1.212 msec
64 bytes from 192.168.2.146: icmp_seq=11 ttl=255 time=16.001 sec
64 bytes from 192.168.2.146: icmp_seq=12 ttl=255 time=15.001 sec
...

The traffic is not lost, but there are long stretches where no traffic
gets in or out.  If an ARP entry needs to be refreshed during that dead
time, it shows up as incomplete until the next traffic burst occurs.  At
that point, a slew of ARP requests and replies show up on the wire and the
ARP table is populated correctly again.  The arpwatch program does not show
any ARP changes on any of the hosts.
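
One way to see the burst pattern in the ping output above is to work out
when each reply actually arrived.  This is only a minimal sketch: it assumes
ping sent one probe per second with probe 0 at t=0 (the usual default, not
something confirmed above), and the function names are made up for
illustration.

```python
import re

# Matches lines such as "... icmp_seq=3 ttl=255 time=7.002 sec"
PING_LINE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) (msec|sec)")

def arrival_times(lines, interval=1.0):
    """Return (seq, arrival_seconds) per reply.

    Assumption: probe N was sent N * interval seconds after probe 0.
    """
    arrivals = []
    for line in lines:
        m = PING_LINE.search(line)
        if not m:
            continue
        seq = int(m.group(1))
        rtt = float(m.group(2))
        if m.group(3) == "msec":
            rtt /= 1000.0
        arrivals.append((seq, seq * interval + rtt))
    return arrivals

def group_bursts(arrivals, gap=0.5):
    """Group replies that arrive within `gap` seconds of the previous one."""
    bursts = []
    for seq, t in sorted(arrivals, key=lambda a: a[1]):
        if bursts and t - bursts[-1][-1][1] <= gap:
            bursts[-1].append((seq, t))
        else:
            bursts.append([(seq, t)])
    return bursts

sample = [
    "64 bytes from 192.168.2.146: icmp_seq=10 ttl=255 time=1.637 msec",
    "64 bytes from 192.168.2.146: icmp_seq=0 ttl=255 time=9.994 sec",
    "64 bytes from 192.168.2.146: icmp_seq=1 ttl=255 time=9.002 sec",
    "64 bytes from 192.168.2.146: icmp_seq=2 ttl=255 time=8.002 sec",
    "64 bytes from 192.168.2.146: icmp_seq=3 ttl=255 time=7.002 sec",
    "64 bytes from 192.168.2.146: icmp_seq=4 ttl=255 time=6.002 sec",
    "64 bytes from 192.168.2.146: icmp_seq=5 ttl=255 time=5.002 sec",
    "64 bytes from 192.168.2.146: icmp_seq=6 ttl=255 time=4.003 sec",
    "64 bytes from 192.168.2.146: icmp_seq=7 ttl=255 time=3.003 sec",
    "64 bytes from 192.168.2.146: icmp_seq=8 ttl=255 time=2.003 sec",
    "64 bytes from 192.168.2.146: icmp_seq=9 ttl=255 time=1.003 sec",
    "64 bytes from 192.168.2.146: icmp_seq=27 ttl=255 time=1.212 msec",
    "64 bytes from 192.168.2.146: icmp_seq=11 ttl=255 time=16.001 sec",
    "64 bytes from 192.168.2.146: icmp_seq=12 ttl=255 time=15.001 sec",
]

for burst in group_bursts(arrival_times(sample)):
    seqs = sorted(seq for seq, _ in burst)
    print(f"{len(burst)} replies near t={burst[0][1]:.1f}s, seqs {seqs}")
```

Under that assumption, the replies for seq 0 through 10 all land around the
10 second mark and those for seq 11, 12 and 27 around 27 seconds, roughly
17 seconds apart.  That fits packets being held and then released together
rather than dropped.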

To get the "hang" to occur, I run "ssh www-1 tar cf - /export/redhat >
/dev/null" on the backup machine.  Usually, after 5 minutes or so, the link
basically dies.  I have tried the same test on some slower machines with the
same kernel and network card without any issues.

The machine is a "Top Gun" (TX Pro chipset) with an Intel 200 MMX CPU and
64M RAM.  There's an NCR PCI SCSI card and a Riva128 PCI video card in it
along with the 3c905c.

> > > Try to update to a more recent kernel like 2.4.9.
> > I'd rather save that one as a last resort.
>
> As others already said, older 2.4 kernels are rather flaky. I'm still
> using 2.2.19 whenever possible. :)

Ok, I'll look into it.

> > Address          HWtype  HWaddress           Flags Mask  Iface
> > 192.168.2.146            (incomplete)                    eth0
> > 192.168.2.145    ether   00:02:44:0C:61:84   C           eth
> > 192.168.2.144    ether   00:02:44:0C:61:85   C           eth0
> > Entries: 3	Skipped: 0	Found: 3
>
> You have not only lost the hw-address of your www-1 box, but also lost
> connectivity to a working nameserver, hence the raw addresses.

I lose connectivity to everything.  The nameserver is the 192.168.2.144
machine.  The HW address comes back, though.  The bursts of traffic that
occur every 15-20 seconds bring in the ARP replies, so the table gets
populated again.  When I ran the above command, though, the entry had
expired.  I could see about 5 ARP requests go out by sniffing the line with
tcpdump.  About 10 seconds later, a burst of ARP replies was allowed in and
the table was back to normal.
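
To catch the dead time as it happens, the ARP table could be polled for
unresolved entries.  A minimal sketch, assuming net-tools style "arp -n"
output like the table quoted above, where an unresolved entry shows
"(incomplete)" in place of the hardware address; the function name is
hypothetical.

```python
def incomplete_arp_entries(arp_output):
    """Return the IP addresses whose ARP entry is still unresolved.

    Assumption: each row starts with the IP address and an unresolved
    row contains the literal token "(incomplete)".
    """
    pending = []
    for line in arp_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "(incomplete)" in fields:
            pending.append(fields[0])
    return pending

sample = """\
Address          HWtype  HWaddress           Flags Mask  Iface
192.168.2.146            (incomplete)                    eth0
192.168.2.144    ether   00:02:44:0C:61:85   C           eth0
"""

print(incomplete_arp_entries(sample))
```

Running this against fresh "arp -n" output every second or so during a
transfer would show how long an entry stays unresolved before the next
burst fills it in.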

> > Kernel Interface table
> > Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR
> > eth0   1500   0 512915      0      0      0  69151      0      0      0
> > lo    16436   0     21      0      0      0     21      0      0      0
>
> On layer 3 everything still looks nice.
>
> > Destination     Gateway         Genmask         Flags Metric Ref Use Iface
> > 192.168.2.0     0.0.0.0         255.255.255.0   U     0      0     0 eth0
> > 127.0.0.0       0.0.0.0         255.0.0.0       U     0      0     0 lo
> > 0.0.0.0         192.168.2.144   0.0.0.0         UG    0      0     0 eth0
>
> and your default gateway is gone.

Isn't my default gateway 0.0.0.0 -> 192.168.2.144?
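
As a sanity check on that reading of the routing table, here is a minimal
sketch that picks the default route out of "route -n" style rows.  It
assumes the usual column order (destination, gateway, genmask, flags,
metric, ref, use, iface); the function name is hypothetical.

```python
def default_gateway(route_output):
    """Return (gateway, iface) for the default route, or None.

    The default route is taken to be the row with destination 0.0.0.0,
    genmask 0.0.0.0 and the G (gateway) flag set.
    """
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed rows
        dest, gateway, genmask, flags = fields[:4]
        if dest == "0.0.0.0" and genmask == "0.0.0.0" and "G" in flags:
            return gateway, fields[7]
    return None

sample = """\
192.168.2.0     0.0.0.0         255.255.255.0   U     0      0     0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0     0 lo
0.0.0.0         192.168.2.144   0.0.0.0         UG    0      0     0 eth0
"""

print(default_gateway(sample))
```

For the three rows quoted above this returns 192.168.2.144 on eth0, so the
default route is still present in the table during the hang.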

> Are you using a routing protocol like RIP2?

Nope.

> I'm thinking along two lines. One is that for some reason you lose the
> *route* to the other machine. The other is that somehow the other machine
> becomes disconnected, which would point to cabling or damaged connectors.
>
> Does this *only* happen during large transfers?

Yes.

> Have you played with any kernel-parameters like /proc/sys/net/ipv4/*?

Nope.

> Could you install arpwatch on 192.168.2.144 or 145 and tell us what, if
> anything, is happening?

No changes reported.

> Probably unrelated: your nameserver seems to be failing certain queries:
>
> $ host -l -a xenos.net ns-1.xenos.net
> [snip]
> mail.xenos.net	43200 IN	A	65.104.130.145
> mail.xenos.net	43200 IN	HINFO	Intel Pentium 133 RedHat Linux 7.1
> mymai\000	0 0	0	???
> ListHosts: error receiving zone transfer:
>   result: NOERROR, answers = 168, authority = 49152, additional = 705

Hm, it works for me when doing a remote AXFR.

Thanks,
Karyl