interface dies under network load on SMP machines

Frank Koenen koenfr@lidp.com
Thu Aug 27 17:08:28 1998


Hey! ... I've been seeing this problem with the Adaptec quad card
(which uses the tulip.c driver) too! ...

I'm running ipfwadm and masquerade, the whole bit... just implemented
it and things are working great, except I get these lockups from time
to time....

For your work-around... how do you go about detecting the problem for
resetting? I'd like to try that myself.
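
A minimal sketch of what such a check could look like. This is not the
program mentioned in the quoted message (that one wasn't posted), just a
guess at the approach: ping a known-good peer and bounce the interface
after several consecutive failures. The peer address, interface name,
threshold, and all function names here are assumptions.

```python
import subprocess
import time


def link_alive(peer, run=subprocess.call):
    """Return True if one ping to `peer` gets a reply (exit status 0)."""
    # -c 1: send a single probe; -W 2: wait up to 2 seconds for the reply.
    return run(["ping", "-c", "1", "-W", "2", peer]) == 0


def bounce(iface, run=subprocess.call):
    """Apply the 'ifconfig down/up' workaround to `iface`."""
    run(["ifconfig", iface, "down"])
    run(["ifconfig", iface, "up"])


def watchdog_step(iface, peer, failures, threshold=3, run=subprocess.call):
    """Do one check; return the new consecutive-failure count.

    The interface is only bounced after `threshold` failed pings in a
    row, so a single dropped packet doesn't trigger a reset.
    """
    if link_alive(peer, run):
        return 0
    failures += 1
    if failures >= threshold:
        bounce(iface, run)
        return 0
    return failures


def run_forever(iface, peer, interval=30):
    """Poll every `interval` seconds (hypothetical default)."""
    failures = 0
    while True:
        failures = watchdog_step(iface, peer, failures)
        time.sleep(interval)
```

Something like run_forever("eth0", "192.168.1.1") run as root would do
the polling; both arguments are placeholders. Depending on the kernel,
routes through the interface may need to be re-added after the bounce.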

Please let me know if there is any testing I can do for you... and let
me know if there are patches available.

In previous e-mail, Mike Simons said:
> 
> >   Here we go with a new problem with SMP 2.0.35+ (+ meaning also
> > with Alan's pre36 kernels) and tulip.c (up to v0.89K - SMPCHECK
> > compiled in).
> > 
> >    The interface goes dead from time to time without leaving any log
> > messages - a simple ifconfig down/up brings it back to life. To
> > stabilise the systems I've written a small program that checks the
> > network and restarts the interface if required. So everything is 
> > nearly perfectly fine...
> > 
> > Here are the tulip-diag outputs for the Kingston KNE 10/100 cards:
> 
> Wolfgang and all...
> 
>     Wonder if this might be a problem outside the tulip driver?
> 
> 
>   I've been seeing the same "ifconfig eth* down/up" problem happening to 
> non-SMP machines, all at 10bT, mostly with 3c509 cards... 
>   This problem seems to take time (more than 14 days) to appear... so 
> it only affects the old (10bT) servers running newer kernels.
> - tcpdumps from affected machines show only local arps going out...
> - tcpdumps on other machines show nothing on the wire from these...
> A simple ifconfig up/down fixes the problem.
> 
> kernels:
>   2.0.33 with 3com509
>   2.0.34 with 3com509
>   2.0.35 with 3com509 and DE500
>   2.1.106 with 3com509
> 
>   the last few times it has happened I noted this message on the console
> (with a 3com card w/ 2.0.33) "eth0: Infinite loop in interrupt, status ffff."
>   We have one SMC/tulip machine running 2.0.30... nfs server with 330 days
> of uptime... never happened to that server.  A 2.0.33 server that hit
> the problem today had gone only 7 days since the last time...
> but was up five months straight without a problem under 2.0.30.
> 
> 
>   Most of the machines in the building are using SMC Tulip cards... and
> they get rebooted a few times a week for a few minutes of Windows (Office),
> so I don't know if the same thing would happen to those cards with new
> kernels.
> 
> --
>     Thanks,
>       Mike Simons
>       Science Applications International Corporation
> 
> 
> 
> don't be too worried about the DE500's on the list above:
> 
>   The DE500's we got recently and I'm still not happy with...
> they don't auto-neg 100Tx... 
>   - the times they have disappeared I
>   - they sometimes take time to get working at 10bT on boot :).
>   - I _think_ they may have just spontaneously decided to 
>     auto-negotiate 100Tx with a 10bT hub when they dropped off.  
> ... need to tinker more before I'm sure there is any problem.
> 
> 


-- 
Frank Koenen (koenfr@lidp.com) [630-960-0133 x634]   ,__o   "If you're not the 
LIDP Systems Wrangler, whipping computers into     _-\_<,    leader, the view
submission for more than 13 years.                (*)/'(*)   never changes"