[eepro100] Quad port Compaq NC3134/35 i82559 = IRQ 23 is physically blocked

Wed Feb 20 18:41:01 2002

On Wed, 20 Feb 2002, Claude LeFrancois (LMC) wrote:

> I try to install/configure a quad port Compaq NC3134 equipped with the
> NC3135 module into a server system. The NC3134 is a dual board, NC3135
> is a module installed on top of the NC3134 which provides 2 extra 10/100
> ports for a total of 4 ports. The board is a PCI 64 bit card. All the
> four ports are i82559 chipsets (eepro100).

If I'm thinking of the same board, the primary board contains a 21152 bus
bridge.  The daughterboard has only the two '559 chips on the PCI bus.

> ... This
> system is also equipped with dual on-board i82559. It makes a total of 6
> i82559. The server runs RedHat 6.2 over a 2.2.17 kernel.
> 
> The problem resides in the fact that 2 NICs are not working well. I got
> this message:
>
>     eth0: IRQ 23 is physically blocked! Failing back to low-rate polling.
> 
> It looks like an IRQ/IOAPIC problem. The faulty ports (both module ports
> on NC3135) are sharing IRQs with their parents (main ports on NC3134):

As you guessed, this indicates an IRQ mapping problem.  And the APIC
table is usually to blame.

The quick work-around -- and the one that Scyld always ships by default
for 2.2 kernels -- is to boot with the "noapic" kernel option.  This
results in unbalanced interrupts (they are no longer spread across the
CPUs), but that can actually be good in some SMP environments.

It is possible that the IRQ isn't really blocked, and that the driver is
just seeing a race where the other CPU is currently handling the
interrupt.  You can check this by starting up only eth0 and watching its
interrupt count.  But I'm guessing from the low interrupt counts that we
really do have a problem here.

>  22:          4          3   IO-APIC-level  eth1, eth3
>  23:          4          4   IO-APIC-level  eth0, eth2
...
>  28:        277        516   IO-APIC-level  eth5
>  31:        279         92   IO-APIC-level  eth4

Yup, not many interrupts are getting through.  Does the count ever go
up?
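
One way to answer that question is to snapshot /proc/interrupts twice,
with some traffic in between, and compare the totals.  The following is
only a sketch: the function names are made up for illustration, and the
parser assumes the column layout shown in the quoted lines above.

```python
# Sketch: detect IRQ lines whose count is not increasing.
# Function names are illustrative, not part of any existing tool.

SAMPLE = """\
>  22:          4          3   IO-APIC-level  eth1, eth3
>  23:          4          4   IO-APIC-level  eth0, eth2
>  28:        277        516   IO-APIC-level  eth5
>  31:        279         92   IO-APIC-level  eth4
"""

def parse_interrupts(text):
    """Map IRQ number -> (count summed over all CPUs, device names)."""
    table = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()      # tolerate mail quoting
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue                          # CPU header, NMI/ERR rows, "..."
        fields = rest.split()
        counts = []
        while fields and fields[0].isdigit():
            counts.append(int(fields.pop(0)))  # one column per CPU
        # fields[0] is now the controller type, e.g. "IO-APIC-level"
        table[int(head)] = (sum(counts), " ".join(fields[1:]))
    return table

def read_snapshot(path="/proc/interrupts"):
    """Parse the live table (Linux only)."""
    with open(path) as f:
        return parse_interrupts(f.read())

def stalled_irqs(before, after):
    """IRQs present in both snapshots whose total did not increase."""
    return sorted(irq for irq in before
                  if irq in after and after[irq][0] <= before[irq][0])

for irq, (total, devices) in sorted(parse_interrupts(SAMPLE).items()):
    print(irq, total, devices)
```

In practice you would call read_snapshot(), run a ping over the suspect
interface, call read_snapshot() again, and pass both results to
stalled_irqs().  An IRQ that stays in that list while packets are
flowing is not being delivered.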

It is curious that there are two IRQs assigned (I'm guessing the INTA
and INTB pins) rather than one or four.

>  The board finally works but give a really slow rate:
> 
>     [root@lmcx2 /root]# ping 192.166.0.1
>     PING 192.166.0.1 (192.166.0.1) from 192.166.50.1 : 56(84) bytes of data.
>     eth0: IRQ 23 is physically blocked! Failing back to low-rate polling.
>     64 bytes from 192.166.0.1: icmp_seq=0 ttl=255 time=13.367 sec
>     64 bytes from 192.166.0.1: icmp_seq=1 ttl=255 time=12.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=2 ttl=255 time=11.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=3 ttl=255 time=10.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=4 ttl=255 time=9.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=5 ttl=255 time=8.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=6 ttl=255 time=7.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=7 ttl=255 time=6.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=8 ttl=255 time=5.370 sec
>     64 bytes from 192.166.0.1: icmp_seq=9 ttl=255 time=4.370 sec

This is exactly what is expected when the interrupt isn't getting
through.  The driver eventually decides to give up and processes all of
the packets in the Rx ring.
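
The one-second steps in the ping times fit that picture.  As a rough
model (an assumption for illustration, not a trace of the driver), take
pings sent one second apart whose replies all sit in the Rx ring until
the driver drains it at a single moment.  The drain time of 13.37 s is
read off the first reply in the quoted output.

```python
# Sketch: round-trip times when every queued reply is delivered at once.
# The "drain everything at one instant" model is a simplification.

def batched_rtts(send_times, drain_time):
    """RTT seen by each ping if all replies are processed at drain_time."""
    return [round(drain_time - t, 3) for t in send_times]

# ping sends one request per second: t = 0, 1, ..., 9
rtts = batched_rtts(range(10), drain_time=13.37)

for seq, rtt in enumerate(rtts):
    print(f"icmp_seq={seq} time={rtt:.2f} sec")
```

Each reply appears to take exactly one second less than the one before
it, which is the pattern in the quoted output, because the later
requests simply waited for less time before the ring was drained.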

The low-rate polling isn't intended to work well.  Instead it's a
fall-back so that you can ssh to the server and figure out what is
broken.  To do high-throughput polling the driver would need many more
Rx buffers and a way to poll at 1000+ Hz rather than from the kernel's
standard 100 Hz timer ticks.
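
A back-of-the-envelope calculation shows why.  The sketch below counts
how many frames can arrive on a saturated 100 Mb/s link between two
polls.  The Ethernet framing overhead is standard; the ring size of 32
is an assumption used only for comparison and is not taken from this
message.

```python
# Sketch: frames arriving per poll interval on saturated Fast Ethernet.

LINK_BPS = 100_000_000   # 100 Mb/s
OVERHEAD_BYTES = 8 + 12  # preamble + SFD, plus the inter-frame gap
ASSUMED_RX_RING = 32     # assumption, for comparison only

def frames_per_poll(frame_bytes, poll_hz, link_bps=LINK_BPS):
    """Worst-case number of frames received between two polls."""
    wire_bits = (frame_bytes + OVERHEAD_BYTES) * 8
    frames_per_second = link_bps / wire_bits
    return frames_per_second / poll_hz

for frame_bytes in (64, 1518):          # minimum and maximum frame sizes
    for poll_hz in (100, 1000):
        n = frames_per_poll(frame_bytes, poll_hz)
        verdict = "fits" if n <= ASSUMED_RX_RING else "overflows"
        print(f"{frame_bytes:5d}-byte frames at {poll_hz:4d} Hz: "
              f"{n:7.1f} per poll ({verdict} a ring of {ASSUMED_RX_RING})")
```

Under these assumptions a 100 Hz poll overflows the ring even with
full-size frames, and a 1000 Hz poll keeps up only when the frames are
large.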

> IO APIC #5......
...
> IRQ22 -> 6
> IRQ23 -> 7
> IRQ26 -> 10
> IRQ27 -> 11
> IRQ28 -> 12
> IRQ30 -> 14
> IRQ31 -> 15
...
> PCI->APIC IRQ transform: (B0,I4,P0) -> 28
> PCI->APIC IRQ transform: (B0,I5,P0) -> 26
> PCI->APIC IRQ transform: (B0,I5,P1) -> 27
> PCI->APIC IRQ transform: (B0,I6,P0) -> 31
> PCI->APIC IRQ transform: (B0,I15,P0) -> 10
> PCI->APIC IRQ transform: (B1,I0,P0) -> 30
> PCI->APIC IRQ transform: (B3,I4,P0) -> 22
> PCI->APIC IRQ transform: (B3,I5,P0) -> 23
> PCI->APIC IRQ transform: (B3,I6,P0) -> 22
> PCI->APIC IRQ transform: (B3,I7,P0) -> 23
...
> eepro100.c:v1.19 12/19/2001 Donald Becker <becker@scyld.com>
>   http://www.scyld.com/network/eepro100.html
> eth0: OEM Intel i82559 rev 8 at 0xe0843000, 00:02:A5:DA:80:75, IRQ 23.
> eth1: OEM Intel i82559 rev 8 at 0xe0845000, 00:02:A5:DA:80:74, IRQ 22.

These are the problem interfaces on the daughtercard, correct?

(I expected the daughtercard interfaces to be eth2 & 3.)

> eth2: OEM Intel i82559 rev 8 at 0xe0847000, 00:02:A5:D6:4A:C3, IRQ 23.
> eth3: OEM Intel i82559 rev 8 at 0xe0849000, 00:02:A5:D6:4A:C2, IRQ 22.

And these are on the base PCI card and work fine.

> eth4: OEM Intel i82559 rev 8 at 0xe084b000, 00:30:48:11:FE:68, IRQ 31.
> eth5: OEM Intel i82559 rev 8 at 0xe084d000, 00:30:48:11:F7:62, IRQ 28.

And these are on the motherboard.  (On-motherboard devices are always
last.  That is by design, so that a plug-in card overrides a potentially
broken on-board device.)
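
The IRQ sharing is easier to see when the "PCI->APIC IRQ transform"
lines quoted above are grouped by IRQ.  This is a small sketch for that;
the regular expression assumes the exact line format shown in the quoted
boot log, and the function name is made up for illustration.

```python
# Sketch: group PCI (bus, device, pin) entries by their assigned IRQ.
import re
from collections import defaultdict

TRANSFORMS = """\
PCI->APIC IRQ transform: (B0,I4,P0) -> 28
PCI->APIC IRQ transform: (B0,I5,P0) -> 26
PCI->APIC IRQ transform: (B0,I5,P1) -> 27
PCI->APIC IRQ transform: (B0,I6,P0) -> 31
PCI->APIC IRQ transform: (B0,I15,P0) -> 10
PCI->APIC IRQ transform: (B1,I0,P0) -> 30
PCI->APIC IRQ transform: (B3,I4,P0) -> 22
PCI->APIC IRQ transform: (B3,I5,P0) -> 23
PCI->APIC IRQ transform: (B3,I6,P0) -> 22
PCI->APIC IRQ transform: (B3,I7,P0) -> 23
"""

PATTERN = re.compile(r"\(B(\d+),I(\d+),P(\d+)\)\s*->\s*(\d+)")

def shared_irqs(text):
    """Return {irq: [(bus, device, pin), ...]} for IRQs used more than once."""
    by_irq = defaultdict(list)
    for bus, dev, pin, irq in PATTERN.findall(text):
        by_irq[int(irq)].append((int(bus), int(dev), int(pin)))
    return {irq: devs for irq, devs in by_irq.items() if len(devs) > 1}

for irq, devs in sorted(shared_irqs(TRANSFORMS).items()):
    print(irq, devs)
```

Only the four devices on bus 3 share interrupts: devices 4 and 6 map to
IRQ 22, and devices 5 and 7 map to IRQ 23.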

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993