[vortex] Possible FAQ: 3c905C driver

Tue, 5 Sep 2000 15:43:46 +0200 (CEST)

On Tue, 5 Sep 2000, Sam Wilson wrote:

> Well, after it "drops dead" (as I said, all it normally takes is to
> restart named), sometimes all one needs to do is ifdown, ifup (this is
> rarely the case though). However, on bad cases it seems to kill syslogd
> (syslogd is dumped to the terminal), can't login and the only solution
> is to press the reset button! The worst host has (running RedHat 6.1 +
> RH updates) a GigaByte GA bx2000 motherboard and CPU is a PII 233.

This looks to me very much like an out-of-memory situation. Maybe restarting
named needs a lot of memory at once. Then the network driver cannot
allocate buffers for receiving packets; if the userland applications need
to receive data before they release the memory, you get a deadlock. Recent
drivers from Andrew have a workaround for this problem: they retry
the memory allocation for the receive part while processing the
transmit part, but this works only if there _is_ some free memory.

Before pressing "reset" you might want to try using the Magic SysRq key
(RedHat disables it by default for security reasons, so you have to
enable it first); this lets you get some info about the cause of the death,
or at least sync your disks (to avoid an fsck at the next boot).
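
To check whether SysRq is enabled before you need it, something like the
following could be used. It is only a sketch: it assumes the kernel exposes
the setting in /proc/sys/kernel/sysrq, and the function names
(describe_sysrq, read_sysrq) are made up for illustration.

```python
from pathlib import Path

# Assumed procfs location of the Magic SysRq switch.
SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")


def describe_sysrq(raw: str) -> str:
    """Interpret the text read from the sysrq control file."""
    value = int(raw.strip())
    if value == 0:
        return "disabled"
    if value == 1:
        return "fully enabled"
    # Newer kernels accept a bitmask that enables only some functions.
    return f"partially enabled (bitmask {value})"


def read_sysrq(path: Path = SYSRQ_PATH) -> str:
    """Return a description of the current setting, or 'unknown'."""
    try:
        return describe_sysrq(path.read_text())
    except (OSError, ValueError):
        return "unknown (file missing or unreadable)"


if __name__ == "__main__":
    print("Magic SysRq:", read_sysrq())
```

If it reports "disabled", writing 1 to that file as root should enable it
until the next reboot. Alt+SysRq+M then prints memory information on the
console and Alt+SysRq+S syncs the disks.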

> Once it's been restarted, problem doesn't appear again for about 2-4
> weeks. But again, every time it is triggered by restarting named!

So the network is working well and does not stop by itself; the condition
is only triggered by a named restart. Am I right?
Why do you need to restart named in the first place? Does the problem
appear only when named is restarted after 2-4 weeks, or does any named
restart (say, after 5 minutes of running) trigger it?
What does top or vmstat show (running on the console) when this
happens?
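
If nobody is at the console when it happens, a small logger left running
can record the memory state leading up to the hang. The following is a
minimal sketch, assuming /proc/meminfo provides "Name: value kB" lines; the
function names and the chosen fields are illustrative, not taken from any
existing tool.

```python
import time

MEMINFO_PATH = "/proc/meminfo"  # assumed location of the memory statistics


def parse_meminfo(text):
    """Map each 'Name: <number> ...' line to its integer value."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines and anything without a numeric first field.
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def format_snapshot(text, keys=("MemFree", "Buffers", "Cached", "SwapFree")):
    """Render the selected fields on one line, skipping missing ones."""
    info = parse_meminfo(text)
    return " ".join(f"{key}={info[key]}kB" for key in keys if key in info)


def log_memory(samples, interval=5.0, path=MEMINFO_PATH):
    """Print a timestamped snapshot every `interval` seconds."""
    for _ in range(samples):
        with open(path) as handle:
            line = format_snapshot(handle.read())
        # Flush each line so it is not lost in a buffer if the box hangs.
        print(time.strftime("%H:%M:%S"), line, flush=True)
        time.sleep(interval)
```

Calling log_memory(720, 5.0) with the output redirected to a file would
cover about an hour around a planned named restart. If MemFree and SwapFree
both drop to almost nothing just before the hang, that would support the
out-of-memory theory.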

> ;-) Ok. The one host that seems to "drop dead" the most has
> http/ftp/smtp/pop3 and primary dns (with about 30 domains). Servicing
> these 30 domains plus 1500 ISP customers. Do you want some more
> quantitative load measures? 

I was more interested in packet counts... But with all these services, I
assume that they are quite high. What is the mean CPU load, and what is the
RAM size (and swap size, if you use any)?
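
For the packet counts, the per-interface counters in /proc/net/dev are
enough. Below is a rough sketch of how they could be read and turned into
rates; it assumes the layout with two header lines followed by 16 counters
per interface (8 receive, 8 transmit), and the helper names are
hypothetical.

```python
def parse_net_dev(text):
    """Extract packet, error and drop counters for each interface."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        nums = [int(field) for field in fields[:16]]
        # Receive counters occupy columns 0-7, transmit counters 8-15.
        stats[name.strip()] = {
            "rx_packets": nums[1], "rx_errs": nums[2], "rx_drop": nums[3],
            "tx_packets": nums[9], "tx_errs": nums[10], "tx_drop": nums[11],
        }
    return stats


def packet_rates(before, after, seconds):
    """Packets per second between two snapshots taken `seconds` apart."""
    rates = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        rates[name] = {
            key: (new[key] - old[key]) / seconds
            for key in ("rx_packets", "tx_packets")
        }
    return rates


if __name__ == "__main__":
    with open("/proc/net/dev") as handle:
        for name, counters in parse_net_dev(handle.read()).items():
            print(name, counters)
```

Two snapshots taken a known number of seconds apart give the rate through
packet_rates. The counters may wrap around on a busy host, so a negative
rate should be discarded rather than trusted.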

> Um, the other thing is that we had two epic100 NICs in this host,
> similar things happened, but I was able to find a posting to
> linux-kernel that pointed to the exact problem. Perhaps it's got nothing
> to do with the 3com or epic100 driver?

Can you give more details here? What was the exact problem (or maybe a
link to the l-k post)? If it has to do with an OOM situation, it is very
likely that any network driver would behave the same.

> I've put the 16Aug00 driver on one host that has little/no load. 

OK, but what is the point? You want to use it on the servers, right? For
testing, you should create a similar environment, running the same
daemons and faking the load...

Sincerely,

Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De