[eepro100] eepro100 and Intel STL2

Tim Cutts tim.cutts@incyte.com
Wed, 9 May 2001 10:40:57 +0100


On Wed, May 09, 2001 at 04:02:42AM -0400, Donald Becker wrote:
> On Tue, 8 May 2001, Will Francis wrote:
> 
> > > On Wed, 2 May 2001 wfrancis@incyte.com wrote:
> > 
> > > Our 27Bz-8 Scyld Beowulf version includes the corrections, and STL2
> > > systems are now in our testing lab.
> > 
> > I can not locate the 27Bz-8 distribution on your
> > FTP server.
> 
> Right now it's available only to our partners.
> 
> > Has this driver been released somewhere else? If
> > not, any idea when it might be publicly available?
> 
> In a week or two.

I've been seeing a problem similar to Will's, also on STL2
motherboards.  I run a much smaller farm of machines at one of Incyte's
other locations, here in the UK.

The symptoms for me are that jobs doing a lot of NFS reads wedge in a
non-interruptible wait on disk.

The network interface is still alive, but the process remains hung.

The wedge is associated with a kernel log message:

kernel: eepro100: cmd_wait for(0xffffff80) timedout with(0xffffff80)!

and then huge numbers of:

kernel: nfs: task 291659 can't get a request slot
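
For anyone wanting to check their own logs for the same pair of
messages, here is a rough Python sketch.  The regular expressions are
modelled only on the two lines quoted above, and the function name
count_wedge_messages is made up for this example; other driver versions
may word the messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the two kernel messages quoted in this post.
# Other eepro100 driver versions may word them differently (assumption).
PATTERNS = {
    "eepro100_timeout": re.compile(r"eepro100: cmd_wait for\(0x[0-9a-f]+\) timedout"),
    "nfs_no_slot": re.compile(r"nfs: task \d+ can't get a request slot"),
}


def count_wedge_messages(lines):
    """Count how many log lines match each of the two patterns."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample input built from the messages quoted above.
sample = [
    "kernel: eepro100: cmd_wait for(0xffffff80) timedout with(0xffffff80)!",
    "kernel: nfs: task 291659 can't get a request slot",
    "kernel: nfs: task 291659 can't get a request slot",
]

counts = count_wedge_messages(sample)
print(counts["eepro100_timeout"], counts["nfs_no_slot"])
```

To run it against a real log, pass an open file object instead of the
sample list; where the kernel log lives depends on the syslog setup.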

Machines based on the Lancewood motherboard do not have the same
problem.  All machines are identically configured, using kernel 2.2.19.
All machines are dual-processor.

This seems to be related to the discussions on this list back in
February, regarding the detection of the receiver lock-up bug.

The older machines enable the work-around:

eth0: Intel PCI EtherExpress Pro100 82557, 00:90:27:F6:2C:37, IRQ 21.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 000000-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
  Receiver lock-up workaround activated.

The newer machines do not:

eth0: OEM i82557/i82558 10/100 Ethernet, 00:D0:B7:B7:17:A1, IRQ 18.
  Board assembly 000000-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
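
On a larger farm it may be quicker to check this mechanically.  The
following Python sketch reads driver boot banners in the format shown
above and reports, per interface, whether the "workaround activated"
line appeared.  The function name workaround_status and the assumption
that continuation lines are indented are mine, based only on the two
banners in this post.

```python
import re

# First line of a banner, e.g.
# "eth0: OEM i82557/i82558 10/100 Ethernet, 00:D0:B7:B7:17:A1, IRQ 18."
BANNER_START = re.compile(r"^(eth\d+): (.*?), ([0-9A-Fa-f:]{17}), IRQ (\d+)\.")


def workaround_status(boot_text):
    """Map interface name -> model string and whether the work-around was activated."""
    status = {}
    current = None
    for line in boot_text.splitlines():
        match = BANNER_START.match(line)
        if match:
            current = match.group(1)
            status[current] = {"model": match.group(2), "workaround": False}
        elif current and line.startswith(" "):
            # Indented lines are assumed to belong to the current banner.
            if "Receiver lock-up workaround activated" in line:
                status[current]["workaround"] = True
        else:
            current = None
    return status


OLD_BANNER = """\
eth0: Intel PCI EtherExpress Pro100 82557, 00:90:27:F6:2C:37, IRQ 21.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 000000-000, Physical connectors present: RJ45
  Receiver lock-up workaround activated.
"""

NEW_BANNER = """\
eth0: OEM i82557/i82558 10/100 Ethernet, 00:D0:B7:B7:17:A1, IRQ 18.
  Board assembly 000000-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
"""

old_status = workaround_status(OLD_BANNER)
new_status = workaround_status(NEW_BANNER)
print(old_status["eth0"]["workaround"], new_status["eth0"]["workaround"])
```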

I'm interested to note that the newer machines' eepro100 seems to be
detected as a much more generic card than the older machines.  Is this
correct?

It's interesting that lspci produces quite a lot of "Unknown device"
messages on the STL2 board, but not on the Lancewood board.  For example,
compare the EEPro entries under lspci -vv on the above two machines:

old:

00:0e.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
        Subsystem: Intel Corporation EtherExpress PRO/100+ Management Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 8 min, 56 max, 64 set, cache line size 08
        Interrupt: pin A routed to IRQ 21
        Region 0: Memory at f4102000 (32-bit, non-prefetchable)
        Region 1: I/O ports at 2800
        Region 2: Memory at f4000000 (32-bit, non-prefetchable)
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- AuxPwr- DSI+ D1+ D2+ PME+
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-

new:

00:03.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
        Subsystem: Intel Corporation: Unknown device 1229
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 8 min, 56 max, 66 set, cache line size 08
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at fb101000 (32-bit, non-prefetchable)
        Region 1: I/O ports at 5400
        Region 2: Memory at fb000000 (32-bit, non-prefetchable)
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- AuxPwr- DSI+ D1+ D2+ PME+
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-
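
The two stanzas are easier to compare field by field than by eye.  A
minimal Python sketch of that comparison follows; it assumes each
detail line has the form "Key: value", and the function names are
invented for this example.  The sample stanzas are shortened copies of
the listings above.

```python
def parse_stanza(text):
    """Turn one lspci -vv stanza into a dict of 'Key' -> 'value'.

    The first line (bus address and device name) is skipped.  Repeated
    keys such as the nested 'Status' line overwrite earlier ones, which
    is good enough for a quick comparison.
    """
    fields = {}
    for line in text.splitlines()[1:]:
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    return fields


def differing_fields(old_text, new_text):
    """Return {key: (old_value, new_value)} for every field that differs."""
    old, new = parse_stanza(old_text), parse_stanza(new_text)
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


OLD = """\
00:0e.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
        Subsystem: Intel Corporation EtherExpress PRO/100+ Management Adapter
        Latency: 8 min, 56 max, 64 set, cache line size 08
        Interrupt: pin A routed to IRQ 21
        Capabilities: [dc] Power Management version 2
"""

NEW = """\
00:03.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
        Subsystem: Intel Corporation: Unknown device 1229
        Latency: 8 min, 56 max, 66 set, cache line size 08
        Interrupt: pin A routed to IRQ 18
        Capabilities: [dc] Power Management version 2
"""

diff = differing_fields(OLD, NEW)
for key, (old_value, new_value) in diff.items():
    print(f"{key}:\n  old: {old_value}\n  new: {new_value}")
```

On the full listings above, the fields that differ are Subsystem,
Latency, Interrupt, and the three Region lines; Control, Status and the
power-management capability lines are identical.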

Is this symptomatic of a more generic problem regarding PCI detection on
these motherboards?

Tim.

-- 
Tim Cutts PhD                    Tel: +44 1223 454918
Incyte Genomics
Botanic House, 100 Hills Road, Cambridge, CB2 1FF, UK