AppleTalk crashes EEPro100 on a bonded etherchannel

Osma Ahvenlampi oa@spray.fi
Tue Nov 24 04:15:31 1998


[This post is partially off-topic for both lists; sorry about
that. I'm still posting it to both because it's also partially
on-topic for both, and will probably be of interest to readers
on both lists. Please adjust reply addresses accordingly.]

Hardware: 350MHz P-II with two Intel EEPro100s, each connected to its
own port on a Cisco Catalyst 2924XL.

Software: Linux 2.0.36 with Beowulf channel bonding patch, eepro100
driver versions 0.99B and 1.04 tested. Netatalk 1.4b2+asun2.1.0.

Problem:

With only one interface enabled and the switch in a standard 
configuration, the system works normally (after the
multicast_filter_limit=3 workaround for the eepro100 bug). 

Enable both interfaces by running 'ifenslave eth0 eth1' and
configuring the switch to use both ports connected to the machine as a 
port group. Tests with ftp indicate channel bonding is successful
(transfer speeds to an NT machine on the same switch jump from 4.8-5.0 
MB/s to 7.5MB/s, and concurrent transfers to two NT machines stay in
the 5 MB/s range).

Start Netatalk. nbprgstr fails to register an AppleTalk address/name
for the system until I add the line "eth0 -phase 2" to
/etc/atalk/atalkd.conf. After adding it, Netatalk starts
normally. This indicates that atalkd doesn't understand that eth1 is a
slave channel. This might have been acceptable, since with modern
MacOS, EtherTalk isn't used for more than address lookups (Chooser)
anyway; afpd over TCP/IP probably wouldn't have been a problem on a
bonded channel. Still, it is a shortcoming in Netatalk.

However, I soon notice that the machine's entire ethernet layer starts
crashing. The error is familiar from the multicast debugging: "kernel:
eth0: Transmit timed out: status 0050 command 0000." All network
traffic stops for 10-20 seconds, Linux notices something is wrong,
resets the ethernet layer, and things work again for a minute or two
until the same thing happens again.

Last time this problem turned out to be a (still unresolved?) race
condition in the eepro100 driver, triggered when the hardware multicast
filter was set up for more than 3 addresses (requiring some kind of
continuation packet in the configuration). If any (multicast?) packets
were received during the multicast configuration, the driver would
crash.

I've patched my driver at the source level (in case conf.modules
configuration options don't propagate correctly to multiple driver
instances) to work around the problem by limiting the use of the
hardware multicast filter to 3 addresses. With a bonded channel, this
isn't enough, though.
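
For readers unfamiliar with the workaround, its logic can be sketched
roughly as below. This is a standalone illustration, not code from
eepro100.c: the enum, the function name choose_rx_mode and its
parameters are made up for the example, and the fallback to "accept
all multicast" is an assumption about how such a limit is usually
applied. Only the limit of 3 comes from the setup described above.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not taken from eepro100.c. */
enum rx_mode {
    RX_HW_FILTER,      /* program the card's multicast address list */
    RX_ALL_MULTICAST,  /* accept every multicast frame, filter in software */
    RX_PROMISCUOUS     /* accept everything */
};

/* Pick a receive mode from the number of multicast addresses requested
 * and the configured limit (3 in the setup described above). */
static enum rx_mode choose_rx_mode(int mc_count, int filter_limit,
                                   bool promiscuous)
{
    if (promiscuous)
        return RX_PROMISCUOUS;
    if (mc_count > filter_limit)
        return RX_ALL_MULTICAST;  /* avoid the multi-address setup path */
    return RX_HW_FILTER;
}
```

With a limit of 3, a request for 3 addresses still uses the hardware
filter, while a request for 4 falls back to accepting all multicast
frames, so the suspect configuration path is never taken.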

Still, it sure sounds like a race condition. Perhaps some static
variable that should be protected by a semaphore or local to a driver
instance is getting clobbered when one card is being configured and
the other receives traffic?
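
To make the suspicion concrete, here is a minimal sketch of the kind of
bug meant, assuming a driver that keeps a scratch buffer in a
file-scope static instead of in each card's private data. All names
(struct nic, shared_frame, write_shared and so on) are hypothetical and
chosen for the example; nothing here is taken from the real driver, and
whether eepro100 has such a variable is exactly the open question.

```c
#include <string.h>

#define FRAME_LEN 8

/* Hypothetical per-card state. */
struct nic {
    int id;
    unsigned char frame[FRAME_LEN];  /* scratch buffer local to this card */
};

/* Suspect pattern: one buffer shared by every card the driver handles. */
static unsigned char shared_frame[FRAME_LEN];

/* Card n fills the shared buffer with its own data. */
static void write_shared(const struct nic *n)
{
    memset(shared_frame, n->id, FRAME_LEN);
}

/* Nonzero if the shared buffer still holds what card n wrote. */
static int shared_is_from(const struct nic *n)
{
    return shared_frame[0] == n->id;
}

/* Safe pattern: each card writes only its own buffer. */
static void write_private(struct nic *n)
{
    memset(n->frame, n->id, FRAME_LEN);
}

/* Nonzero if card n's own buffer still holds what it wrote. */
static int private_is_from(const struct nic *n)
{
    return n->frame[0] == n->id;
}
```

If eth0 calls write_shared and eth1 does the same before eth0 has
finished using the buffer, shared_is_from(eth0) turns false: eth0 is
now looking at eth1's data. With write_private the two cards can't step
on each other. On real hardware the interleaving would come from an
interrupt or a second CPU rather than from two calls in a row.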

-- 
Against boredom, even the gods themselves struggle in vain.
Osma Ahvenlampi <oa@spray.fi>