[3c509] 3c905B hang after "too much work in interrupt"

Thu Jul 11 00:55:01 2002

I'm seeing the following difficult-to-reproduce behavior on a 3Com 3c905B:

Under heavy loads I will see an occasional message:

  Jul  9 15:57:47 farm-1 kernel: eth0: Too much work in interrupt, status e401.

which seems mostly benign on its own.  However, occasionally (more than
3 but fewer than 10 times in the last month), we'll come in in the
morning to find the machine unreachable over its ethernet with "Too
much work" as the last message in the syslog.  An ifdown/ifup will
bring connectivity back.
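
For anyone decoding the status word in that message, here is a small
sketch.  The bit names are as I read them from enum vortex_status in
3c59x.c, and I'm assuming the top three bits hold the current register
window; please check both against your own copy of the driver.  The
helper names (status_window, status_events, print_status) are mine, not
the driver's.

```c
#include <stdio.h>

/* Status bits as I read them from enum vortex_status in 3c59x.c;
 * verify against your own driver source. */
enum {
    IntLatch = 0x0001, HostError = 0x0002, TxComplete = 0x0004,
    TxAvailable = 0x0008, RxComplete = 0x0010, RxEarly = 0x0020,
    IntReq = 0x0040, StatsFull = 0x0080, DMADone = 0x0100,
    DownComplete = 0x0200, UpComplete = 0x0400,
    DMAInProgress = 0x0800, CmdInProgress = 0x1000
};

/* Hypothetical helper: register window, assumed to be bits 15..13. */
static unsigned status_window(unsigned status)
{
    return (status >> 13) & 7;
}

/* Hypothetical helper: the event bits with the window field removed. */
static unsigned status_events(unsigned status)
{
    return status & 0x1fff;
}

/* Hypothetical helper: print the window and the name of each set bit. */
static void print_status(unsigned status)
{
    static const char *const names[] = {
        "IntLatch", "HostError", "TxComplete", "TxAvailable",
        "RxComplete", "RxEarly", "IntReq", "StatsFull",
        "DMADone", "DownComplete", "UpComplete",
        "DMAInProgress", "CmdInProgress"
    };
    unsigned bit;

    printf("status %04x: window %u,", status, status_window(status));
    for (bit = 0; bit < 13; bit++)
        if (status & (1u << bit))
            printf(" %s", names[bit]);
    printf("\n");
}
```

Under those assumptions, e401 is window 7 with IntLatch and UpComplete
set, i.e. a receive (upload) completion was pending when the loop gave
up.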

My big question is: is there a fix for this?

I have a few subquestions:

1.  How do status notifications get turned back on after a 'too much
work'?

When "too much work" happens, the driver turns off status
notification ("indications" in 3com's terminology) for all the
currently-asserted interrupt sources.  The comments say "The timer
will reenable interrupts".  When I look at vortex_timer(), though, I
can't figure out how indications get turned back on. I see the
FakeIntr request, but no obvious SetStatusEnb/SetIndicationEnable.
How does this happen?  Is there some undocumented feature of the
chipset that restores this register after a FakeIntr/RequestInterrupt
call?
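
To make the worry concrete, here is a toy model of the behavior as I
understand it.  This is not driver code: the struct, the function
names, and the reenable_on_fake switch are all hypothetical, and the
switch stands in for exactly the thing I can't find in the source.

```c
#include <stdint.h>

/* Toy model only; none of these names come from the driver. */
struct nic_model {
    uint16_t status_enable;  /* sources allowed to appear in the status word */
    uint16_t pending;        /* sources the hardware currently wants to signal */
    uint16_t saved_enable;   /* mask as it was before "too much work" */
    int reenable_on_fake;    /* assumption under test: does anything restore
                                the mask when the timer's FakeIntr fires? */
};

/* What the interrupt handler would actually see. */
static uint16_t visible_status(const struct nic_model *n)
{
    return n->pending & n->status_enable;
}

/* "Too much work": mask off every source that is asserted right now. */
static void too_much_work(struct nic_model *n)
{
    uint16_t asserted = visible_status(n);

    n->saved_enable = n->status_enable;
    n->status_enable = (uint16_t)(n->status_enable & ~asserted);
}

/* Timer tick issuing the fake interrupt.  The mask only comes back if
 * something restores it, which is the open question. */
static void timer_fake_intr(struct nic_model *n)
{
    if (n->reenable_on_fake)
        n->status_enable = n->saved_enable;
}
```

In this model, with reenable_on_fake left at 0, a source that was
masked by too_much_work() stays invisible forever, which would look a
lot like the hang I'm seeing.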

2.  Will raising max_interrupt_work help?

If we assume that the "too much work" is leading to the interface
failure, then a partial solution would be to reduce the frequency of
these events.

To that end, I ran some experiments with an instrumented version of
the driver that printed out the number of iterations through the loop
in boomerang_interrupt (when the number of iterations exceeded a
threshold).  I was a little surprised to see that this number was very
rarely large; even under loads that brought the machine to its knees,
I rarely saw iteration counts as high as 10, let alone the 32
necessary to hit the max_interrupt_work threshold and trigger a "too
much work".  I tried to generate extra interrupts by catting files to
/dev/null during the tests, and this didn't seem to increase the
number of loop iterations.
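
The instrumentation was roughly of the following shape.  This is a
user-space sketch rather than my actual patch: service_interrupt() and
REPORT_THRESHOLD are hypothetical names, and the `events` argument
stands in for however many passes the hardware would keep the real loop
busy.

```c
#include <stdio.h>

#define MAX_INTERRUPT_WORK 32  /* the work budget discussed above */
#define REPORT_THRESHOLD    8  /* hypothetical: only report counts above this */

/* Hypothetical stand-in for the interrupt loop.  Returns the number of
 * iterations performed, or -1 if the work budget ran out first. */
static int service_interrupt(int events)
{
    int work_left = MAX_INTERRUPT_WORK;
    int iterations = 0;

    while (events > 0) {
        if (--work_left < 0) {
            fprintf(stderr, "too much work after %d iterations\n",
                    iterations);
            return -1;
        }
        events--;           /* pretend one pass handled one event */
        iterations++;
    }

    /* The instrumentation proper: stay quiet unless the count is unusual. */
    if (iterations > REPORT_THRESHOLD)
        fprintf(stderr, "interrupt loop ran %d iterations\n", iterations);

    return iterations;
}
```

Reporting only above a threshold keeps the log usable under load, at
the cost of hiding the (apparently very common) short runs.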

If my experiments are right, they unfortunately skewer my simplistic
model of what causes "too much work" conditions.  If it's not simple
network load, then I don't really understand how to set
max_interrupt_work.  Any suggestions for a better model or values for
max_interrupt_work?

3.  Could this be caused by the FakeIntr from vortex_timer getting
dropped somehow?

                        Many thanks for your time,
                        david rochberg

Details follow:

Kernel 2.4.2 with Red Hat's patches (that is, the stock RH7.1 kernel),
which contains 3c59x driver "LK1.1.13 27 Jan 2001".  Diffs with more
recent 2.4 kernels show no relevant-seeming changes in the "too much
work" or vortex_timer code.

eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0xdc00,  00:50:04:10:4d:3b, IRQ 3
  product code 5450 rev 00.9 date 12-28-98

I'm running on a testbed of wimpy Celeron 400s (HP Vectras) for the
time being.  They're hooked up to a Cisco 3548XL.

When I say "heavy load" I mean routing two sorts of traffic through
the machine:

  several streams of small UDP packets running at a few thousand packets/sec 
  a few TCP streams fast enough to saturate the remaining bandwidth

On some machines, this traffic is being IPIP-encapsulated (that is,
it comes in as plain IP and leaves over an IPIP tunnel to another
machine on the same switch).  On others, the traffic enters and exits
over IPIP tunnels.

On some machines, CPU-intensive userland processes were also running.

I mention the IPIP encapsulation because my casual reading of the
source (I've not yet instrumented to make sure) suggests that every IP
packet that gets IPIP-encapsulated must be copied to make additional
headroom in the skbuff, and this would further increase kernel CPU
usage.
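
The copy I have in mind looks roughly like this.  It is a user-space
toy, not kernel code: struct pkt_model and push_outer_header() are
hypothetical, and the real skbuff handling is certainly more involved.
The only hard number is the 20 bytes an outer IPv4 header without
options takes.

```c
#include <stdlib.h>
#include <string.h>

/* Toy packet buffer; a hypothetical stand-in for an skbuff. */
struct pkt_model {
    unsigned char *head;   /* start of the allocation */
    size_t headroom;       /* free bytes in front of the packet data */
    size_t len;            /* bytes of packet data */
};

/* Prepend hdr_len bytes of outer header.  Returns 1 if the packet had
 * to be copied to a bigger buffer, 0 if the header fit in place, and
 * -1 if the allocation failed. */
static int push_outer_header(struct pkt_model *p, size_t hdr_len)
{
    int copied = 0;

    if (p->headroom < hdr_len) {
        /* Not enough room in front: allocate and copy the whole packet. */
        unsigned char *buf = malloc(hdr_len + p->len);

        if (buf == NULL)
            return -1;
        memcpy(buf + hdr_len, p->head + p->headroom, p->len);
        free(p->head);
        p->head = buf;
        p->headroom = hdr_len;
        copied = 1;
    }

    /* Claim the space; a real implementation would fill in the header. */
    p->headroom -= hdr_len;
    p->len += hdr_len;
    return copied;
}
```

If received packets arrive with less than 20 bytes of headroom, every
encapsulated packet takes the copying branch, which is the extra CPU
cost I'm guessing at.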