[eepro100] Re: system hangs @ boot-time when bringing up eth0

Andrey Savochkin saw@saw.sw.com.sg
Sat, 9 Sep 2000 13:35:41 +0800


On Fri, Sep 08, 2000 at 10:05:07AM -0400, Donald Becker wrote:
> On Fri, 8 Sep 2000, Andrey Savochkin wrote:
[snip]
> 
> > Each time you upgrade the kernel you take some risk, so you must be prepared
> > to revert to the previous kernel quickly.  Or you may ask what's the problem
> > and be advised how to work around it in the most convenient way or which driver
> > versions to pick up.
> 
> There is always some risk, but we shouldn't carelessly make the risk higher.

I made fixes for serious problems.
And, in spite of the complaints on the mailing lists, the "no resource"
problem is really not that common.  I had no chance to learn about it
before the major kernel release, which was installed on a lot of systems.

> Consider the kernel as a hundred subsystems, each of which is "working" or
> "flaky".  Just one "flaky" makes the whole kernel unusable for serious
> work.
>   With "flaky" probability 0.05, we are unlikely to ever get a good kernel.
>   With "flaky" probability 0.01 we have a chance to stabilize
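
For reference, the arithmetic behind those two figures can be sketched as
below.  It assumes the hundred subsystems are "flaky" independently of each
other, which the quoted text does not state; the function name is made up
for the illustration.

```c
#include <assert.h>

/* Probability that none of `subsystems` independent subsystems is flaky,
 * when each one is flaky with probability `p_flaky`.
 * Independence is an assumption of this sketch. */
double all_working_probability(double p_flaky, int subsystems)
{
    assert(p_flaky >= 0.0 && p_flaky <= 1.0 && subsystems >= 0);

    double p = 1.0;
    for (int i = 0; i < subsystems; i++)
        p *= 1.0 - p_flaky;     /* one more subsystem has to be working */
    return p;
}
```

Under that assumption, a hundred subsystems give roughly 0.006 for a
"flaky" probability of 0.05 and roughly 0.37 for 0.01.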

Well, in my personal opinion, there are no bad or good kernels.
There are kernels that behave reasonably well in the area you're interested
in and on the hardware you have, and there are kernels that don't fit you.
I don't know about any single kernel which is suitable for all people...

> The way to avoid this is to test the driver, and proposed driver fixes,
> outside the main kernel development.  That means having a support web page,
> setting up mailing lists, and having almost-released, beta, alpha and
> targeted test versions.  Which is what I've been doing since 1995..
> 
> > Certainly, I regret that the driver has the defect and the inconvenience it
> > caused, but the issue isn't worth more than a short complaint :-)
> 
> That's part of what you signed up for by splitting off your branch of driver
> development.  I give you credit for not doing the usual "patch and run".
> But avoiding introducing new bugs like this was why I had the multi-tiered
> development structure for the individual drivers.  It was much more work for
> me, but in the long run it's the only rational way to do driver development.
> 
> The problem was that from the end-user viewpoint (Linus) it appeared that
> new versions came out only rarely.  He never saw, and never had to see, the
> large numbers of test versions sent out to see if individual problems were
> fixed, or the beta test versions to verify that the fixes worked for most
> people.

I don't see any reason to treat network card drivers as something special.
All kernel subsystems change.  There are development kernels where the
changes are usually included first (and so it was in this case).
There are pre-releases (and these changes were not included at the last
moment; they had been in the pre-releases too).
I used the same means to ensure that the changes were right as any developer
of a SCSI driver or anything else.
In this case the changes turned out not to be good.  I regret it, and it is
my fault to some degree.  But I honestly don't think now that I could have
done anything else to prevent it.  The only thing I could have done is to
come up with the permanent fix earlier.  But there is only a limited number
of weekends... :-)

> > > more curious whatever the driver *was* doing still can't be done that
> > > way, since it seemed to work.  
> 
> > I believe (however, not absolutely sure without the documentation) that the
> > bug has existed in the driver for a very long time, a few years.
> > The sporadic faults started to appear only recently because the operation
> > timings were changed by innocent and unrelated changes.
> 
> If it's a timing problem, and it didn't show up with the previous timing,
> was it a bug before?

A good question :-)
I've managed to get a system which has this "no resource" problem.
The debugging prints showed that all parts of the initialization (RU,
statistics, CU/TX, and the configure command) run in parallel, i.e. the card
doesn't raise the RU ready status bit between RU initialization and CUStart.
I've checked Intel's and the BSD drivers.  They have special code (CU polling)
to ensure a different and strictly serialized order of the initialization
steps.
The fact that people haven't reported this problem earlier may be explained
by dozens of different reasons.  The question is how the initialization
should be done properly.
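
To illustrate what "strictly serialized" means here, below is a minimal
sketch in C against a simulated card.  It is not code from the eepro100,
Intel or BSD drivers: the fake_card structure, the command values, the
order of the steps and all function names are made up for the example.
The only idea taken from the discussion above is that each command is
issued only after polling shows that the card has accepted the previous one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the card.  The command register reads
 * non-zero until the card has accepted the last command written. */
struct fake_card {
    int cmd_reg;     /* pending command, 0 once accepted */
    int busy_polls;  /* reads for which the card stays busy after a write */
    int polls_left;  /* countdown for the pending command */
    int overlapped;  /* commands written while another was still pending */
    int accepted;    /* commands the card has accepted */
};

/* Illustrative command codes, not real register values. */
enum { CMD_RU_INIT = 1, CMD_STATS_INIT = 2, CMD_CU_START = 3 };

/* Read the command register; each read lets the simulated card progress. */
int read_cmd(struct fake_card *c)
{
    assert(c != NULL);
    if (c->cmd_reg != 0) {
        if (c->polls_left > 0) {
            c->polls_left--;
        } else {
            c->cmd_reg = 0;
            c->accepted++;
        }
    }
    return c->cmd_reg;
}

/* Write a command; record an overlap if the previous one is still pending. */
void write_cmd(struct fake_card *c, int cmd)
{
    assert(c != NULL);
    if (c->cmd_reg != 0)
        c->overlapped++;
    c->cmd_reg = cmd;
    c->polls_left = c->busy_polls;
}

/* Poll until the card accepts the command.  Returns 0, or -1 on timeout. */
int wait_cmd_accepted(struct fake_card *c, int max_polls)
{
    while (max_polls-- > 0)
        if (read_cmd(c) == 0)
            return 0;
    return -1;
}

/* Serialized initialization: write one command, poll, then the next. */
int init_serialized(struct fake_card *c, int max_polls)
{
    static const int steps[] = { CMD_RU_INIT, CMD_STATS_INIT, CMD_CU_START };

    for (size_t i = 0; i < sizeof steps / sizeof steps[0]; i++) {
        write_cmd(c, steps[i]);
        if (wait_cmd_accepted(c, max_polls) != 0)
            return -1;   /* card never accepted the command */
    }
    return 0;
}

/* For contrast: all commands written back to back, without polling. */
void init_unserialized(struct fake_card *c)
{
    write_cmd(c, CMD_RU_INIT);
    write_cmd(c, CMD_STATS_INIT);
    write_cmd(c, CMD_CU_START);
}
```

In this model the back-to-back variant overwrites commands that are still
pending, while the polling variant never does and reports a timeout instead
of silently continuing.  Whether the real card behaves like this model is
exactly the open question.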

Donald, could you elaborate on the differences in the initialization code
of the different drivers, and on the initialization policy in yours?

Best regards
					Andrey V.
					Savochkin