[eepro100] mdio_read() timed out

Thu Dec 5 08:39:00 2002

On Thu, 5 Dec 2002, Kevin Hansard wrote:

> My machine is a Dell PowerEdge 2600 with two Intel Pentium/4 Xeon
> 1.8GHz processors, though they appear as four due to Hyperthreading. 
> 
> Relevant part of lspci output is:
> 02:02.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03)
> 03:04.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 05)

Hmmm, rev. 5. That's not a new card.
(Older NICs are a good thing for debugging -- a new chip might have new bugs.)

> I have modified the code as suggested and here is a sample of the output:
> Dec  5 08:44:16 buzz kernel:  mdio_read() took 25 ticks.
> Dec  5 08:44:16 buzz kernel:  mdio_read() took 1 ticks.
> Dec  5 08:44:16 buzz last message repeated 4 times

Good -- these are about the expected values.  They are consistent with
the observed behavior a few years ago.

> Dec  5 08:44:16 buzz kernel:  mdio_read() timed out with val = 08210000.
> Dec  5 08:44:16 buzz kernel:  mdio_read() took 6401 ticks.

And here is a perfect example of why drivers must always have loop
checks, although many drivers skip them.  Yes, failure is impossible,
but it happens.  Without a loop check the kernel would be in an infinite
loop rather than just printing error messages.
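
As a rough illustration of such a loop check, here is a user-space C
sketch, not the driver code itself.  The names poll_mdi_ready, fake_phy
and fake_mdi_read are hypothetical, and the ready bit (0x10000000) is an
assumption about the MDI control register.  The read callback stands in
for the driver's port read.

```c
#include <stdint.h>

/* Assumed "ready" bit of the MDI control register. */
#define MDI_READY 0x10000000u

/* Hypothetical stand-in for the register read done by the driver. */
typedef uint32_t (*mdi_read_fn)(void *ctx);

/* Poll until the ready bit is set or the poll budget is used up.
 * Returns 0 on success and -1 on timeout.  *val receives the last
 * value read, *polls the number of register reads performed. */
static int poll_mdi_ready(mdi_read_fn rd, void *ctx, int boguscnt,
                          uint32_t *val, int *polls)
{
    *polls = 0;
    for (;;) {
        *val = rd(ctx);
        (*polls)++;
        if (*val & MDI_READY)
            return 0;           /* transceiver answered */
        if (--boguscnt < 0)
            return -1;          /* give up instead of spinning forever */
    }
}

/* Simulated transceiver: becomes ready on the Nth read, or never
 * if reads_until_ready is 0. */
struct fake_phy {
    int reads_until_ready;
};

static uint32_t fake_mdi_read(void *ctx)
{
    struct fake_phy *p = ctx;

    if (p->reads_until_ready > 0 && --p->reads_until_ready == 0)
        return 0x08210000u | MDI_READY | 0x782du; /* made-up data word */
    return 0x08210000u;                           /* command still pending */
}
```

With a budget of 6400, a fake_phy that never becomes ready makes the
function return -1 after 6401 reads, while one that becomes ready on the
25th read returns 0 after 25 reads.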

> As you can see the timeout still occurs. Just to make sure I also
> tried the driver with boguscnt = 64*1000 and got: 
> 
> Dec  5 09:45:08 buzz kernel:  mdio_read() timed out with val = 08210000.
> Dec  5 09:45:08 buzz kernel:  mdio_read() took 64001 ticks.

This pretty much confirms that the transceiver is not responding.
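
The stuck value can also be read field by field.  The C sketch below
assumes the MDI control register layout implied by the driver's
mdio_read(): ready in bit 28, opcode in bits 27:26, PHY address in bits
25:21, register number in bits 20:16 and data in bits 15:0.  Treat that
layout as an assumption to check against the driver source or the chip
documentation.  The names mdi_fields and decode_mdi are hypothetical.

```c
#include <stdint.h>

/* Fields of the MDI control register, under the assumed layout. */
struct mdi_fields {
    int ready;          /* bit 28: set when the operation has completed */
    int opcode;         /* bits 27:26: 2 = read, 1 = write */
    int phy_id;         /* bits 25:21: transceiver address */
    int reg;            /* bits 20:16: MII register number */
    unsigned int data;  /* bits 15:0: data word */
};

static struct mdi_fields decode_mdi(uint32_t val)
{
    struct mdi_fields f;

    f.ready  = (val >> 28) & 0x1;
    f.opcode = (val >> 26) & 0x3;
    f.phy_id = (val >> 21) & 0x1f;
    f.reg    = (val >> 16) & 0x1f;
    f.data   = val & 0xffff;
    return f;
}
```

Under that layout, 0x08210000 decodes to a read of register 1 on PHY 1
with the ready bit clear and no data: the command exactly as it was
written, never completed by the transceiver.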

> Could the problem be a faulty network card?

There is some sort of hardware problem.
The transceiver part of the chip operates off of different power pins
than the bus interface logic.  You might first check that the power 
supply is putting out the correct voltage and that the system is not
overheating.  If it is a power problem, swapping the NIC for one with
slightly different tolerances might remove the symptom but still leave
you at risk of other failures.

> I swapped the card and the problem was fixed. I hasten to add that
> this is a brand new machine with new cards all supplied by DELL.

And we know how long Dell burns in their machines... 

> Is there anyway I can determine if the mdio_read time outs are
> associated with a particular card or interface (without shutting down
> interfaces, this is a live machine). 

Yes.  This is one of the few error messages that doesn't conform to my
standards: the driver should always print dev->name":" before messages
e.g.
   eth0: Error message with register values.

In this case I never expected the error to trigger, and the function only
gets the I/O address, not the structure with the interface name.
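
For illustration only, this is what the conforming message would look
like if the interface name were available.  It is a user-space C sketch
that uses snprintf() in place of printk(); format_mdio_timeout is a
hypothetical helper, not a function in the driver.

```c
#include <stdio.h>

/* Build the timeout message with the interface name as prefix.
 * Returns the snprintf() result: the length of the full message. */
static int format_mdio_timeout(char *buf, size_t len,
                               const char *ifname, unsigned int val)
{
    return snprintf(buf, len,
                    "%s: mdio_read() timed out with val = %8.8x.",
                    ifname, val);
}
```

Called with "eth0" and 0x08210000, this yields
"eth0: mdio_read() timed out with val = 08210000.".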

I've updated the driver so that the next version will print the interface
name before every message.  But that's about thirty lines of changes.
I don't have the few hours that it will take to qualify a new driver
release, especially for what appears to be a one-off hardware problem.
So just do the sleazy one-line change to 

	printk(KERN_ERR "%8.8x: mdio_read() timed out with val = %8.8x.\n",
				   ioaddr, val);

and figure out which interface from the ioaddr.

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993