weird network DoS LX164, Tulip, RedHat

Thu Dec 10 06:39:01 1998

Hello everybody,

We have a couple of Alphas (LX164) running RedHat Linux 4.2. The
hardware configuration is rather simple:

* LX164 motherboard, 128 MB of RAM, 128 MB of swap space

* SRM console

* headless machine (no kbd/mouse/videocard) - serial console

* SCSI controller: Intraserver NCR 53c8xx based. Detected as:

kernel: ncr53c8xx: at PCI bus 0, device 6, function 0
kernel: ncr53c8xx: PCI_LATENCY_TIMER=0, bursting should'nt be allowed.
kernel: ncr53c8xx: PCI_CACHE_LINE_SIZE not set, features based on CACHE LINE SIZE not used.
kernel: ncr53c8xx: 53c875 detected
kernel: ncr53c875-0: rev=0x04, base=0x9001000, io_port=0x9000, irq=16
kernel: ncr53c875-0: NCR clock is 40218KHz, 40218KHz
kernel: ncr53c875-0: ID 7, Fast-20, Parity Checking
kernel: ncr53c875-0: on-chip RAM at 0x9002000
kernel: ncr53c875-0: restart (scsi reset).
kernel: ncr53c875-0: copying script fragments into the on-chip RAM ...
kernel: scsi0 : ncr53c8xx - revision 2.6n

* SCSI hard drive:

  Vendor: SEAGATE   Model: ST19101W          Rev: 0014
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
scsi : detected 1 SCSI disk total.
ncr53c875-0-<0,0>: FAST-5 WIDE SCSI 10.0 MB/s (200 ns, offset 15)
SCSI device sda: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
	...
ncr53c875-0-<0,0>: FAST-20 WIDE SCSI 40.0 MB/s (50 ns, offset 15)

(We force Fast-20 transfers by doing
'echo "setsync 0 12" > /proc/scsi/ncr53c8xx/0'.)
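
For reference, the two rates in the log lines above follow directly
from the synchronous period and the bus width. A minimal sketch of that
arithmetic, assuming a 16-bit (WIDE) bus; the function name is made up
for illustration:

```python
def sync_rate_mb_s(period_ns: float, bus_width_bits: int = 16) -> float:
    """Peak synchronous SCSI transfer rate in MB/s.

    period_ns is the synchronous transfer period reported by the driver;
    bus_width_bits is 16 for a WIDE bus (assumed here), 8 for narrow.
    """
    transfers_per_us = 1000.0 / period_ns      # transfers per microsecond (MHz)
    return transfers_per_us * (bus_width_bits // 8)


print(sync_rate_mb_s(200))  # period before setsync: 10.0
print(sync_rate_mb_s(50))   # period after setsync:  40.0
```

These match the "10.0 MB/s (200 ns ...)" and "40.0 MB/s (50 ns ...)"
figures the driver prints.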

* DEC DS21140 Tulip network card running in 100Mbps full duplex:

tulip.c:v0.90 10/20/98 becker@cesdis.gsfc.nasa.gov
eth0: Digital DS21140 Tulip at 0x8800, 00 c0 f0 31 ab 02, IRQ 19.
eth0:  EEPROM default media type Autosense.
eth0:  Index #0 - Media MII (#11) described by a 21140 MII PHY (1) block.
eth0:  MII transceiver #1 config 3000 status 7829 advertising 01e1.
  PCI latency timer (CFLT) is unreasonably low at 0.  Setting to 64 clocks.
	...
eth0:  Advertising 01e1 on PHY 0 (1).
	...
eth0: The transmitter stopped!  CSR5 is fc678006, CSR6 320e2202.
eth0: Setting full-duplex based on MII Xcvr #1 parter capability of 41e1.
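
To make the autonegotiation words above easier to read, here is a small
sketch that decodes them. It assumes the standard MII advertisement
register layout (bits 5 to 9 carry the technology abilities) and the
usual 802.3 priority order; the table and function names are mine, not
taken from the driver:

```python
# Technology ability bits of the MII advertisement / link partner words.
MII_ABILITY_BITS = [
    (0x0020, "10baseT-HD"),
    (0x0040, "10baseT-FD"),
    (0x0080, "100baseTx-HD"),
    (0x0100, "100baseTx-FD"),
    (0x0200, "100baseT4"),
]

# Highest-priority mode first, per the 802.3 autonegotiation ordering.
PRIORITY = ["100baseTx-FD", "100baseT4", "100baseTx-HD",
            "10baseT-FD", "10baseT-HD"]


def decode_abilities(word):
    """Return the names of the ability bits set in a 16-bit MII word."""
    return [name for bit, name in MII_ABILITY_BITS if word & bit]


def best_common_mode(local, partner):
    """Pick the highest-priority mode that both ends advertise."""
    common = set(decode_abilities(local & partner))
    for mode in PRIORITY:
        if mode in common:
            return mode
    return None


print(decode_abilities(0x01E1))
print(best_common_mode(0x01E1, 0x41E1))  # 100baseTx-FD
```

With our values (advertising 01e1, partner capability 41e1) the common
best mode is 100 Mbps full duplex, which agrees with the driver's
"Setting full-duplex" message.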

The machines run kernel 2.0.30 with patches from RedHat as well as
a serial console patch.

The problem is random network outages on these machines: every now and
then one of them simply ceases all network activity (including
responding to pings). The machine otherwise remains functional: it
still allows login from the console, and the outage can be remedied by
bouncing the network interface.

One indication that often (but not always) shows up is a burst of the
following messages:

kernel: Couldn't get a free page.....
kernel: eth0: Memory squeeze, deferring packet.
last message repeated 13 times
kernel: eth0: Too much work at interrupt, csr5=0xfc6980c0.
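
In case it helps whoever reads the csr5 values, below is a rough decoder
for the low status bits. The bit names and positions are my assumption,
based on the status-bit names I recall from the Linux tulip driver, and
the upper bits (receive/transmit process state, error bits) are
deliberately left undecoded:

```python
# Assumed CSR5 status bits (names as recalled from the Linux tulip driver).
CSR5_BITS = [
    (0x00001, "TxIntr"),
    (0x00002, "TxDied"),
    (0x00004, "TxNoBuf"),
    (0x00008, "TxJabber"),
    (0x00020, "TxFIFOUnderflow"),
    (0x00040, "RxIntr"),
    (0x00080, "RxNoBuf"),
    (0x00100, "RxDied"),
    (0x00200, "RxJabber"),
    (0x00800, "TimerInt"),
    (0x02000, "SystemError"),
    (0x08000, "AbnormalIntr"),
    (0x10000, "NormalIntr"),
]


def decode_csr5(value):
    """Return the names of the known status bits set in a CSR5 value."""
    return [name for bit, name in CSR5_BITS if value & bit]


print(decode_csr5(0xFC6980C0))  # "Too much work at interrupt"
print(decode_csr5(0xFC678006))  # "The transmitter stopped!"
```

Under these assumptions, 0xfc6980c0 has the receive-interrupt and
receive-no-buffer bits set, while 0xfc678006 has the transmit-stopped
and transmit-no-buffer bits set.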

We tried to get around the problem by moving to a newer kernel, namely
axp_linux-2.0.34 (a patched 2.0.34) from ftp.digital.com. That kernel
had a broken Tulip driver, so I upgraded the driver to tulip 0.90.
This did not help: we had the very same outage the very next day,
although this time none of the messages above were logged. What the
kernel did show was a series of alignment traps:

Couldn't get a free page.....
kernel: unaligned trap at fffffc0000364d54: fffffc00078b0046 28 2
kernel: unaligned trap at fffffc0000364e10: fffffc00078b0046 28 1
kernel: unaligned trap at fffffc0000364eb0: fffffc00078b0056 28 2
kernel: unaligned trap at fffffc0000364eb8: fffffc00078b0056 28 3
kernel: unaligned trap at fffffc000037d154: fffffc0000200056 28 16
kernel: unaligned trap at fffffc000037d154: fffffc00078b2056 28 16
kernel: unaligned trap at fffffc000037d154: fffffc0007fd583e 28 16

According to the system map, the last of these occurs somewhere in
'ip_rcv..ng', whatever this routine may be...
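
For anyone who wants to repeat the lookup, this is roughly how a trap
PC can be mapped to a symbol. It assumes the usual three-column
System.map layout ("address type name"); the sample entries are
placeholders I made up, not symbols from our kernel:

```python
import bisect


def load_symbols(lines):
    """Parse 'address type name' lines into a sorted list of (addr, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                      # skip blank or malformed lines
        addr, _sym_type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def lookup(symbols, pc):
    """Return (name, offset) of the last symbol at or below pc, or None."""
    addrs = [addr for addr, _name in symbols]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None
    addr, name = symbols[i]
    return name, pc - addr


# Hypothetical entries, for illustration only.
SAMPLE_MAP = """\
fffffc0000364000 T hypothetical_func_a
fffffc0000364c00 T hypothetical_func_b
fffffc0000365000 T hypothetical_func_c
"""

symbols = load_symbols(SAMPLE_MAP.splitlines())
print(lookup(symbols, 0xFFFFFC0000364D54))
```

With a real map, load_symbols(open(path)) works the same way, where
path points at the System.map matching the running kernel.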

If anybody has had similar problems, please help.

Thanks in advance,

-- 
Alexander L. Belikoff
Bloomberg L.P. / BFM Financial Research Ltd.
abel@vallinor4.com, abel@bfr.co.il