Curious failure...

Robert G. Brown
Fri Dec 3 11:25:39 1999

This is just to report an interesting problem I encountered trying to
install RH 6.0 or 6.1 on a SuperMicro dual 300MHz PII with the older
440LX chipset.  Basically the install would go perfectly but when I
tried to boot the new system, the boot would absolutely hang at the
point where the aic7xxx driver was initializing.  This was quite
maddening; given access to the system via a suitable boot floppy I tried
all sorts of things including the installation of a monolithic kernel
and so forth.

Finally I noted that I could boot the system with the linux-up kernel
(uniprocessor) after a clean install.  I discovered further that the
kernel was only identifying the system as having 64 MB of memory instead
of the 384 MB it actually has.  No problem, I added the usual append
line to lilo.conf and voila!  The system would boot EITHER UP or SMP.  It
is a bit disappointing to learn a) that even a driver-free SMP 2.2 kernel
+ aic7xxx driver won't boot, on at least this system, with only 64 MB to
boot in -- I have no idea where the failure lives, but it is most annoying
and occurs consistently with 2.2.5 (RH 6.0), 2.2.10 (homemade), 2.2.12
(RH 6.1) and 2.2.13 (homemade); and b) that 2.2 kernels still don't
autodetect memory on at least some systems that aren't THAT old.
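
For anyone who hasn't run into it, the append line in question looks
roughly like the stanza below.  This is a sketch of a typical lilo.conf
entry, not a copy of mine: the image path, label and root device are
placeholders, and only the mem= value reflects the 384 MB in this box.

```
# Placeholder image path, label and root device; substitute your own.
image=/boot/vmlinuz-smp
    label=linux
    root=/dev/sda1
    read-only
    # Tell the kernel how much RAM is really installed.
    append="mem=384M"
```

Rerun /sbin/lilo after editing the file so that the new boot map is
actually written.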

However, the tulip module now won't load, but only in the SMP kernel.  Or
rather, it loads, but simply doesn't function.  Except sometimes when it
does.  This system has a genuine tulip 21141 KNE-100 that has worked
perfectly under 2.0.x for years, except that it has on occasion been
smitten with the PCI/ioport bug (where the card module loads but
identifies the wrong ioport for the card except sometimes when it gets
it right).  Historically, this bug was identifiable/fixable by
unloading/reloading the module -- the driver would nearly always get the
ioport "right" the second time -- or by using a patch for the tulip
driver that I posted some years ago that uses a routine given by Rubini
in "Linux Device Drivers" (O'Reilly Bronco) for sequentially
testing PCI bus ioports returned during the PCI configuration cycle for
actual read/writability instead of assuming that the first one returned
is actually writeable.

The tulip driver has changed quite a bit and I'm going to have to
actually rewrite the pci code instead of just patching it to test
whether or not this loop/test solves the problem.  I >>can<< report that
just loading/unloading/reloading the module does NOT make the problem go
away (if this is the problem) any longer.  The kernel doesn't seem to
find an interrupt for the card, which may indicate a newer and deeper
problem, but one that "appears" to possibly differentiate between SMP
and UP kernels (no way to be sure without rebooting twenty or thirty
times from powerup situations so that most of the cold-configuration
possibilities and timing situations are statistically explored).

I'm not looking for help (yet) although if anybody wants to offer some
that is fine;-), but I thought it might be useful to get a report of
these problems into the relevant lists so that others might be able to
recognize them and try such solutions as I've developed in the event
that this is happening to others.  I'm hoping to tackle the process of
getting an SMP kernel to work properly next week, so any suggestions I
get before then will be most welcome.


Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525