[eepro100] Re: system hangs @ boot-time when bringing up eth0

Derek Glidden dglidden@illusionary.com
Fri, 08 Sep 2000 14:09:21 -0400

Andrey Savochkin wrote:
> I don't consider the issue as so serious.
> Each time you upgrade the kernel you take some risk, so you must be prepared
> to revert to the previous kernel quickly.  Or you may ask what's the problem
> and be advised how to work-around it in a most convenient way or which driver
> versions to pick up.
> Certainly, I regret that the driver has the defect and the inconvenience it
> caused, but the issue doesn't worth more than a short complain :-)

Wow, boy, I'm just not sure how to respond to that.  I don't want to
come off sounding like a jerk (because I'm not, really, I swear), but
... "you gotta be kidding me?!"

On any given box that uses an eepro card with recent kernels, we now
have to be ON-SITE to reboot because the chances of us encountering that
"no resources" bug are pretty damn near 100% on any given day with the
number of machines we have, especially with the boxes that have multiple
NICs. That sounds like a serious issue to me!

Yes, there is a risk whenever we upgrade a kernel, but when we're
dealing with the "stable" kernel branch, we don't expect to have to
choose between a kernel that's potentially exploitable/DoS-able and one
that totally flakes out on eepro initialization 25% of the time.  If we
wanted to use a kernel that has flaky drivers, we'd be using 2.4 and
dealing with all sorts of other problems as well.  That's why we stick
with 2.2 kernels: because they're not supposed to have serious issues
like this.  

Things are even worse than that because now we're stuck with the options
of a) using an older kernel, b) trying to back-port all the patches made
against the more recent kernels into an older kernel or c) trying to
forward-port the older eepro driver into recent kernels.  a) sucks
because the whole point of kernel upgrades is to fix stuff that might
be broken, and in our case we use patches for stuff like IPVS where the
recent versions only apply to recent kernels.  b) and c) suck because
we don't get paid to hack kernel patches around so our systems work
reliably.

And believe me, if you want to hear more than a "short complain", you
should listen to my boss when he asks me about the problems we're having
and we have to send someone out on an hour-long drive to flip the switch
on a remotely-located box...

> I believe (however, not absolutely sure without the documentation) that the
> bug has existed in the driver for a very long time, a few years.
> The sporadic faults started to appear only recently because the operation
> timings were changed by innocent and unrelated changes.

Perhaps it's existed, but whatever those changes are, they have
certainly aggravated the problem to the extent that it's much more
visible.  I've *never* seen this bug pop up with older eepro drivers.

Whether or not the bug might have existed before, whatever has been
changed has turned it from something that might occur in very very rare
situations into something that is common, which, logically I think,
would turn it from nearly a non-issue into a major bug.  Again, not
being a kernel hacker, I can't say that I am intimately familiar with
kernel development, but it would seem to me that a situation like that
would require that those changes be backed out, at least from the
"stable" kernel series, until a more concrete reason why this happens
was worked out and a "real" fix was developed.  

Also I must take issue with the term "sporadic faults."  I see it on a
daily basis nowadays.  It's really annoying.

With Microsoft products, failure is not           Derek Glidden
an option - it's a standard component.      http://3dlinux.org/
Choose your life.  Choose your            http://www.tbcpc.org/
future.  Choose Linux.              http://www.illusionary.com/