2.2.12 network problems (SMP, tulip)

Josip Loncaric josip@icase.edu
Wed Sep 15 10:24:55 1999


Our new dual PIII/500 machines running 2.2.12 have been losing network
connectivity under heavy network load.  By contrast, our old single
PII/400 machines running 2.0.36 have been rock solid.  

I did a number of network stress tests using tulip.c:v0.91j++ (plus
recent mods), and our new SMP nodes are still not reliable.  
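
For anyone who wants to generate a comparable load, here is a minimal
sketch of a connection stress test. It is not the test used for the
results below. The function name stress_test and its parameters are
hypothetical, and both ends run on one host over loopback, so the
receiver half would have to be moved to the second node to exercise a
real network card.

```python
import socket
import threading


def stress_test(connections=50, payload_size=64 * 1024, host="127.0.0.1"):
    """Open many short TCP connections in sequence and count failures.

    Hypothetical helper for illustration; returns (failed, attempted).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(16)
    port = server.getsockname()[1]

    def sink():
        # Receiver: accept each connection and drain it until the sender closes.
        for _ in range(connections):
            try:
                conn, _addr = server.accept()
            except OSError:
                return
            with conn:
                try:
                    while conn.recv(65536):
                        pass
                except OSError:
                    pass

    receiver = threading.Thread(target=sink, daemon=True)
    receiver.start()

    payload = b"x" * payload_size
    failed = 0
    for _ in range(connections):
        try:
            with socket.create_connection((host, port), timeout=5) as s:
                s.sendall(payload)
        except OSError:
            failed += 1  # refused, reset, or timed-out connection

    receiver.join(timeout=10)
    server.close()
    return failed, connections
```

The ratio failed/attempted corresponds to the "failed/dropped
connections" percentage quoted in the results; a node whose network
dies outright would show up as every remaining connection timing out.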

I considered three variables: single PII/400 vs. dual PIII/500, Linux
kernel 2.0.36 vs. 2.2.12, and NetGear (PNIC) vs. Kingston (21143) cards. 
The network card driver is tulip.c (v0.91h in 2.0.36, v0.91j++ in
2.2.12).  Both kernels have the same tweaks to optimize MPI performance.

The top culprit appears to be the dual PIII/500 configuration, although 
there are problems with 2.2.12 even in the single PII/400 configuration:

Working configurations:
-----------------------
single/2.0/NetGear  (rock solid)
single/2.0/Kingston (rock solid)
(*)dual/2.0/NetGear (dual PII/400, not PIII/500; mixed test: single PII/400
                     <-> dual PII/400 works, but we have only one such node)

Working but not robust:
-----------------------
single/2.2/NetGear  (receiver's network eventually dies on write error)
single/2.2/Kingston (2-3% failed/dropped connections, then network dies)

Serious problems:
-----------------
dual/2.0/NetGear    (receiver's network dies, no warning)
dual/2.0/Kingston   (receiver's network quickly dies on write error)
dual/2.2/Kingston   (network dies with Tx hung message)
dual/2.2/NetGear    (network dies without warning)

Sometimes the receiver's network dies without any messages, but I've
seen "Tx hung..." and even an "IRQ DEADLOCK DETECTED BY CPU0" message on
the receiver's end.  This "IRQ..." warning was produced by the receiver
(dual PIII/500 using 2.0.36/0.91h kernel and Kingston cards) while the
sender was our very reliable dual PII/400 (2.0.36/0.91h, NetGear cards).

These problems persist even over a crossover cable.

It would seem that our ASUS P2B-D motherboard with BIOS 1010 and dual
Pentium III 500 MHz processors is the main factor in this mess. 
Although single nodes are rock solid under 2.0, they are not robust
under 2.2.  The NetGear/Kingston differences are minor by comparison.

Suggestions?

Sincerely,
Josip


-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip@icase.edu
ICASE, Mail Stop 132C                       http://www.icase.edu/~josip/
NASA Langley Research Center             mailto:j.loncaric@larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134