[realtek] Bug in rtl8129_rx() & other problems

Stephan Brauss sbrauss@optronic.ch
Wed Apr 17 08:57:00 2002


Hello!

>> I think I have found a bug in rtl8129_rx(): It is possible that dev_alloc_skb()
>> is called with a negative argument, which causes my machine to crash.

> What driver version are you using?
> What is the detection message?

Sorry for not giving enough information.
I tried with clean 1.13 and 1.17, with or without memory-mapped operations
(USE_MEM_OPS flag). The Linux kernel is 2.2.19.

>> My system runs a heavy rtlinux task, that uses about 90% CPU time.
>> Therefore, network interrupts are no more handled so quickly.

> I suspect that the problem you are seeing is related to this.
Mainly yes. But not in all cases. Again, I didn't give enough information. Sorry...
My system is rather slow, a 486 compatible running at 80MHz. The following messages turn
up with a clean 2.2.19 Linux kernel (without the rtlinux patch) and a clean 1.13/1.17 driver:
eth0: RTL8139 Interrupt line blocked, status 1/4/5.
eth0: Transmit timeout, status 0d 0000 media 00.
The reason I began debugging was the "transmit timeout" message I get from time to time.
When it occurs, network communication hangs for some seconds. Because I was of the opinion
that my problems could be related to the rather slow CPU, I wrote a simple rttask that
uses up a certain amount of CPU time to make it even slower. With this setup,
the system crashed.

>> Anyway, after many hours of debugging, I have changed the code like follows:

>> +                       if(pkt_size<0)
>> +                       {
>> +                               printk(KERN_ERR"%s: Impossible packet length.\n",dev->name);

> Do you see this message? What is the Rx status when this occurs?
Yes, I get it from time to time when the rttask eats up 90% of CPU time.
I agree that the crash can occur because a receive buffer overrun
occurs. In the problematic case, rx_size was always zero. In your driver, pkt_size is rx_size-4,
so dev_alloc_skb() is called with a negative value, which crashes my system.
Well, I think it is a philosophical question whether the kernel should rely on fast enough
handling and on a "correctly" working network device, one that handles buffer overruns in a way that already
received messages are not overwritten (which is maybe the reason for the problem?).
Anyway, since I detect a pkt_size less than 0, the problem has gone, and I think your driver would
be more stable if it guaranteed that dev_alloc_skb() is never called with a
negative value.
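
To show the kind of check I mean, here is a minimal userspace sketch. The names
rx_pkt_size(), RX_CRC_LEN and MAX_ETH_FRAME are made up for illustration and are
not taken from the driver; only the rx_size-4 arithmetic comes from it, and the
upper bound is an assumption of mine.

```c
#include <stdint.h>

#define RX_CRC_LEN     4     /* the driver computes pkt_size = rx_size - 4 */
#define MAX_ETH_FRAME  1518  /* assumed upper bound incl. CRC, illustration only */

/*
 * Return the payload length for a received frame, or -1 if the length
 * word read from the Rx ring cannot be valid and must not be passed on
 * to an allocator such as dev_alloc_skb().
 */
static int rx_pkt_size(uint16_t rx_size)
{
	if (rx_size < RX_CRC_LEN || rx_size > MAX_ETH_FRAME)
		return -1;
	return (int)rx_size - RX_CRC_LEN;
}
```

With rx_size == 0, as in my crashes, this returns -1 instead of producing a
negative allocation size, so the caller can log the event and skip the frame.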

>> eth0: RTL8139 Interrupt line blocked, status 4.
>> eth0: RTL8139 Interrupt line blocked, status 5.

> The R-T patches are obviously doing Bad Things.
No, it also turns up with a clean 2.2.19. I retested this morning.

>> eth0: Transmit timeout, status 0d 0000 media 00.

> More badness.
Yes, I know... Unfortunately, it is the reason why I started debugging. I still don't
know how to get rid of it. It occurs rarely, but it does occur.

> relies on having adequate average CPU to handle all pending task.
Oh, yes. I would be happy if our embedded system had a faster CPU.
But we have this system, and the (average) CPU load is not reaching 50% at the moment.
I tried to trace the problem because I want the system to be "rock solid".
I want to keep the system from dying if, for example, an rttask eats up CPU time for a short time.

>> The "Interrupt line blocked" is strange... Could you please explain me
>> the meaning of the "Check for bogusness" comment/code part?

> It's intended to detect the case where the interrupt mapping is bogus,
> or becomes bogus due to an old APIC bug.  SMP implies APIC, so that's
> the SMP tie-in.
The "Interrupt line blocked" message is generated
if the TxOK or RxOK bit is set in IntrStatus when the "Check for bogusness" code
part is executed, yes? So the "problem" is that the rtl8129_timer() routine
expects these bits not to be set? Why is it not allowed for these bits to be set?
Why is this check no longer necessary for "newer" kernels (>= 20300)?
Maybe I am asking too many (stupid) questions; if you have no time to answer, don't.
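
To make the question concrete, this is how I read the check, written as a
userspace sketch. The bit values for RxOK and TxOK (0x01 and 0x04) are what I
believe the driver uses and should be verified against the source;
intr_line_looks_blocked() is my own name, not a driver function.

```c
#include <stdint.h>

/* Assumed IntrStatus bit values; verify against the driver source. */
enum { RX_OK = 0x01, TX_OK = 0x04 };

/*
 * Nonzero if RxOK or TxOK is still pending when the timer looks at
 * IntrStatus, i.e. work the interrupt handler has not acknowledged.
 */
static int intr_line_looks_blocked(uint16_t intr_status)
{
	return (intr_status & (RX_OK | TX_OK)) != 0;
}
```

If these values are right, the status codes 1, 4 and 5 from my log would be
RxOK, TxOK and both together.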

This morning I found another cause for "Abnormal interrupt" messages with 1.17:
If I unplug the cable, I get:
	Abnormal interrupt, status 00000020.
And if I plug it in again:
	Abnormal interrupt, status 00002020.

And I made some more tests with a clean 2.2.19 kernel and 1.17 and got:
	Abnormal interrupt, status 00000021.
After that, the network did not work anymore... After "ifconfig eth0 down; ifconfig eth0 up"
it worked again... (The same problem that John Horton reported some time ago?)
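
To read these status words more easily I use a small decoder like the sketch
below. The bit names are my assumption, taken from my reading of the driver's
interrupt status bits and the Realtek datasheet, so please check them before
relying on them; decode_intr_status() is just a helper of mine.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed RTL8139 IntrStatus bit names; verify against the datasheet. */
static const struct { uint16_t bit; const char *name; } intr_bits[] = {
	{ 0x0001, "RxOK" },       { 0x0002, "RxErr" },
	{ 0x0004, "TxOK" },       { 0x0008, "TxErr" },
	{ 0x0010, "RxOverflow" }, { 0x0020, "RxUnderrun/LinkChg" },
	{ 0x0040, "RxFIFOOver" }, { 0x2000, "CableLenChg" },
	{ 0x4000, "PCSTimeout" }, { 0x8000, "PCIErr" },
};

/* Write the names of all set bits, space separated, into buf; returns buf. */
static char *decode_intr_status(uint16_t status, char *buf, size_t len)
{
	size_t i, used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < sizeof(intr_bits) / sizeof(intr_bits[0]); i++) {
		int n;

		if (!(status & intr_bits[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", intr_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)n;
	}
	return buf;
}
```

If the names are right, 0x0020 on unplugging would be the link change bit alone,
0x2020 on plugging in adds the cable length change bit, and 0x0021 is a link
change together with RxOK.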

Thank you & best regards
Stephan


BTW: Maybe interesting for you: Realtek has a new datasheet on the web. Have you
     seen it? And do you have the rather old programming guide? I downloaded
     it some time ago.