[eepro100] Dell 4400 instability with eepro100 driver...

Thu Feb 21 16:30:01 2002

Thank you for your reply... sorry for not responding sooner (I was out of
town). I will get the motherboard changed on the Dell server. At least I
will try... it is always an incredibly frustrating experience calling Dell.

The big difference (that I have experienced) between Dell and Sun is that
with Sun I can call up and say - "Hey I think I need a new motherboard" and
they will send me a replacement with no questions asked. With Dell I have to
prove to them the motherboard is bad. "Yeah - take down your production
system and run our diagnostics program for several days..." It is really
annoying.

Sincerely,

     - Henrik

----- Original Message -----
From: "Donald Becker" <becker@scyld.com>
To: "Henrik Schmiediche" <henrik@stat.tamu.edu>
Cc: <eepro100@scyld.com>
Sent: Monday, February 18, 2002 2:26 PM
Subject: Re: [eepro100] Dell 4400 instability with eepro100 driver...

> On Sat, 16 Feb 2002, Henrik Schmiediche wrote:
>
> > I have a single processor Dell 4400 server with 4GB of RAM that I cannot get
> > to run stable under high network loads (NFS, remote backups).
>
> This sounds like a hardware problem, and not a common type of problem.
>
> > I am about ready to trash this system and go back to a Sun.
>
> Don't imagine that Suns don't have obscure hardware problems as well.
>
> [[ Various failures deleted.  ]]
>
> > ...with no success. When I installed the
> > latest eepro100 drivers I get this NMI message which may be related to the
> > lockups, but I am not sure... I have tried changing RAM with no success.
>
> Yup, it's almost certainly related to the failures.
>
> > Feb 16 07:58:38 s0 kernel: eepro100.c:v1.20 1/28/2002 Donald Becker <becker@scyld.com>
> > Feb 16 07:58:38 s0 kernel:   http://www.scyld.com/network/eepro100.html
> > Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
> > Feb 16 07:58:38 s0 kernel: You probably have a hardware problem with your RAM chips
>
> This message might be a little misleading.  Given that you get the NMI
> message just when the driver is accessing the NIC over the PCI bus, my
> guess is that you are seeing PCI bus problems.  These are signaled over the
> NMI interrupt, similar to other detected data transfer errors.
>
> The most common reason for an NMI is memory parity errors.
>
> > Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
> > Feb 16 07:58:38 s0 kernel: You probably have a hardware problem with your RAM chips
> > Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received for unknown reason 25.
>
> That number '25' is the key to understanding how your machine is broken.
> My guess is that you are getting PCI bus address or data parity errors.
>
> > Feb 16 07:58:38 s0 kernel: Dazed and confused, but trying to continue
> > Feb 16 07:58:38 s0 kernel: Do you have a strange power saving mode enabled?
>
> Here is where the kernel gives up reporting further errors to avoid
> filling the log.
>
> > Feb 16 07:58:38 s0 kernel: eth0: Intel i82559 rev 8 at 0xf899f000, 00:B0:D0:20:87:60, IRQ 14.
> > Feb 16 07:58:38 s0 kernel:   Board assembly 07195d-000, Physical connectors present: RJ45
> > Feb 16 07:58:38 s0 kernel:   Primary interface chip i82555 PHY #1.
> > Feb 16 07:58:38 s0 kernel:   General self-test: passed.
> > Feb 16 07:58:38 s0 kernel:   Serial sub-system self-test: passed.
> > Feb 16 07:58:38 s0 kernel:   Internal registers self-test: passed.
> > Feb 16 07:58:38 s0 kernel:   ROM checksum self-test: passed (0x04f4518b).
>
> All tests passed.  This hints that the errors are occurring when the NIC
> is a PCI target, not a PCI master.
>
> > The error message I get (a whole lot of them):
> >
> > Feb 15 23:35:22 s0 kernel: Command 0080 was not immediately accepted, 10001 ticks!
>
> ...but I could be wrong about that.
>
> >    - The eepro100 card shares an interrupt with the SCSI controller. Is
> > there a way to reassign the IRQ of the eepro100 card?
>
> Perhaps, in the BIOS or by physically moving the card.  But that's unlikely
> to be the problem.
>
> >    - The system is even more unstable when I install a second CPU.
>
> Yup.  Could be errors on the memory coherency traffic.
>
> >  Any ideas on what to try? Bad motherboard?
>
> Yes, likely a bad motherboard.
>
> > NMI:          3
>
> Hmmm, I expect that this count increases over time.  I would track down
> the exact access that triggers the NMI.  But then again, I can pretend
> that I'm doing that to write better, more informative error messages and
> diagnostics.  (In reality I just like making things work, even when it
> doesn't make economic sense.)
>
> You should just replace the hardware.
>
>
> > [root@s0:/var/log]# mii-diag
> > Using the default interface 'eth0'.
> > Basic registers of MII PHY #1:  3000 782d 02a8 0154 05e1 41e1 0003 0000.
>
> Thanks for remembering the driver detection message and diagnostic
> information.  This wasn't needed here, but it is for most problems.
>
> Donald Becker becker@scyld.com
> Scyld Computing Corporation http://www.scyld.com
> 410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
> Annapolis MD 21403 410-990-9993
>
>
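
The "unknown reason 25" line in the quoted log can be picked apart bit by bit. What follows is only a rough sketch, and it rests on two assumptions that are not stated in the thread: that the kernel prints the reason as the hexadecimal contents of the legacy PC/AT system control port (I/O port 0x61), and that the chipset follows the conventional bit layout for that port. The names PORT_61_BITS and decode_nmi_reason are made up for illustration.

```python
# Hypothetical helper: decode the "reason" byte from a Linux
# "NMI received for unknown reason XX" log line.
# ASSUMPTION: XX is the hex value read from I/O port 0x61 and the
# chipset uses the conventional PC/AT bit assignments listed here.

PORT_61_BITS = {
    0x80: "SERR# or memory parity error reported",
    0x40: "IOCHK# (I/O channel check) error reported",
    0x20: "timer 2 output high",
    0x10: "refresh toggle",
    0x08: "IOCHK# NMI reporting disabled",
    0x04: "SERR#/parity NMI reporting disabled",
    0x02: "speaker data enabled",
    0x01: "timer 2 gate enabled",
}


def decode_nmi_reason(reason_hex):
    """Return descriptions of the bits set in a hex reason string."""
    value = int(reason_hex, 16)
    if not 0 <= value <= 0xFF:
        raise ValueError("reason must fit in one byte: %r" % reason_hex)
    # Dicts keep insertion order, so results run from bit 7 down to bit 0.
    return [name for mask, name in PORT_61_BITS.items() if value & mask]


if __name__ == "__main__":
    for line in decode_nmi_reason("25"):
        print(line)
```

Under those assumptions, 0x25 has neither of the two error status bits (0x80, 0x40) set, which would be consistent with the kernel calling the reason "unknown". It does not by itself identify the failing component.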