Curious phenomenon

Robert G. Brown
Thu Aug 20 12:41:57 1998

Dear list persons and Don,

I have a variety of dual PPro and PII systems with either tulip
(KNE-100) or eepro100 ethernet cards installed.  The systems are
connected with a Cisco Catalyst 5000 switch.  Yesterday I discovered
that if the switch ports were hard set to full-duplex, 100 Mbps,
attached systems running 2.0.33/tulip_0.87P or 2.0.35/eepro100_0.99C
would show near-wire-speed (~93 Mbps) consistently for:

netperf -t UDP_STREAM -H target -- -s 65535 -m 1472

while

netperf -t TCP_STREAM -H target

would show RELIABLE data transmission rates of only around 5 Mbps (as
low as 1.5, as high as 7)!  One such system was an NFS server and its
clients were complaining about lost NFS connectivity fairly
regularly.  Also, scp tests showed absurdly slow remote file copy
rates.  I therefore >>believe<< the latter number.  Although raw packet
rates on the interface were high, the actual number of packets that
were RELIABLY being delivered by the NIC was clearly very low.

When the switch port interface was set to automatic on both speed and
duplex, this problem disappeared.  Both raw UDP and reliable TCP rates
came in at near-wire-speed, and the NFS server problems went away.
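
For reference, the "near-wire-speed" figure can be checked against the
theoretical ceiling for the UDP test above.  The following is only a
rough sketch: it assumes standard Ethernet framing (8-byte preamble,
14-byte header, 4-byte FCS, 12-byte interframe gap) and IPv4 with no
options, and the function name udp_goodput_mbps is mine, not part of
netperf.

```python
# Per-frame overhead in bytes, assuming standard Ethernet and IPv4
# without options (these constants are assumptions, not measurements).
PREAMBLE_SFD = 8
ETH_HEADER = 14
FCS = 4
INTERFRAME_GAP = 12
IP_HEADER = 20
UDP_HEADER = 8


def udp_goodput_mbps(payload_bytes, link_mbps=100.0):
    """Best-case UDP payload rate for back-to-back frames on the link."""
    on_wire = (payload_bytes + UDP_HEADER + IP_HEADER + ETH_HEADER
               + FCS + PREAMBLE_SFD + INTERFRAME_GAP)
    return link_mbps * payload_bytes / on_wire


# 1472 is the message size passed to netperf with -m in the UDP test.
ceiling = udp_goodput_mbps(1472)
print(f"ceiling for 1472-byte UDP payloads: {ceiling:.1f} Mbps")
```

With 1472-byte messages each frame occupies 1538 byte times on the
wire, so the ceiling is about 95.7 Mbps, and ~93 Mbps is roughly 97%
of it.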

My question is:  What gives?  One of the (KNE-100/tulip) systems
involved has been running on a switch port hard set at 100/full for
over two years and was extensively tested on many of the previous
kernel and tulip drivers and this problem simply didn't exist until
fairly recently (just when it appeared I cannot say for sure).  I have
substantial documentation of near-wire-speed connectivity in both TCP
and UDP channels, in addition to NFS benchmarks that clearly indicated
at that time that I was getting reliable 100 Mbps service with the
port hard set to full/100; indeed, list participants will recall that
I published a number of those figures, including rates between tulip
and eepro100 interfaces similarly set, on these lists.  Now the problem
seems to exist for both tulip and eepro100 NICs independently -- I was
able to document it yesterday on tulip -> eepro100, eepro100 ->
eepro100, tulip -> tulip, and eepro100 -> tulip.  If at least one
interface was hard set to full/100, reliable transmission disappeared;
when the interfaces were reset to automatic, the problem disappeared
between all the affected pairs.

I should note for the benefit of all list participants that this
probably explains some of the remarkably poor NFS >>client<<
performance on one of the affected systems (which is also my personal
desktop :-( ).  If you are similarly seeing absurdly low reliable data
transmission rates on a fast switched interface you should check to
make sure the interface itself is auto/auto (if you can).
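
On the host side, one quick check is to look at the per-interface
error and collision counters the kernel exports in /proc/net/dev.
The sketch below assumes the 16-column layout used by current Linux
kernels (older kernels, including the 2.0.x series, used a different
layout); the function name and the sample numbers are made up for
illustration.  Collisions should not occur at all on a link that is
genuinely running full duplex, so counters that climb during a
transfer are at least a hint that something is wrong at the link
level.

```python
import os

# Made-up sample in the modern /proc/net/dev layout, for illustration.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 2 0 0 1 0 0 2000 20 3 0 0 4 5 0
"""


def parse_proc_net_dev(text):
    """Return {interface: counters} for the link-health columns."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:                # not the layout assumed here
            continue
        v = [int(x) for x in fields]
        stats[name.strip()] = {
            "rx_packets": v[1], "rx_errs": v[2], "rx_frame": v[5],
            "tx_packets": v[9], "tx_errs": v[10],
            "colls": v[13], "carrier": v[14],
        }
    return stats


print(parse_proc_net_dev(SAMPLE))

# On a Linux host, report the live counters as well.
if os.path.exists("/proc/net/dev"):
    with open("/proc/net/dev") as f:
        for iface, counters in parse_proc_net_dev(f.read()).items():
            print(iface, counters)
```

Running it before and after a netperf TCP_STREAM test and comparing
the two snapshots shows whether errors accumulate during the transfer.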

I suspect that some sort of framing error or timing error or
negotiation error has been introduced into the drivers somewhere
between tulip 0.7x (I cannot remember exactly which one at this point),
where it worked fine with the port set to full/100, and tulip 0.8x
(where it now fails miserably).  Note that the port setting and switch
hardware itself were untouched in all that time -- the only significant
changes I know of are the kernel/driver revisions.

Fortunately, I discovered the fix at the same time I discovered the
problem (the solution in the short run has obviously been to set the
switch interfaces to auto/auto, which is easy enough), but I'd really
like to know if this is a bug that has been introduced or a "feature".
I rather doubt the latter, since regardless of how the card configures
itself or the switch is set (half or full, 100 or 10) it should NOT
lose/corrupt 95% of the transmitted packets in a datastream.  I would
consider this sort of behavior to be "broken".  Since the net device
drivers share significant structural overlap, I suspect that the
problem exists for other NIC drivers as well, is a real bug, and is
probably in the common code.


Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525