Fast Ethernet Stalls with Netgear FA310TX Rev. D1
Tom Crockett
tom@icase.edu
Thu Apr 1 09:41:26 1999
Several users of the ICASE Coral cluster have noticed communication
stalls at random intervals. A job that is running normally will
suddenly hang, sometimes for a minute or more, then resume operation
with no apparent side effects. The problem is
most pronounced with communication-intensive applications using large
numbers of processors, but it also occurs infrequently with jobs using
as few as two processors.
After extensive investigation and testing, we have determined that the
problem is related to the Netgear FA310TX Rev. D1 cards that we use
in most nodes of the Coral cluster. As far as we can determine,
the problem does not occur with older Netgear FA310TX Rev. C1 cards.
The C1 cards use a DEC 21140 chip, while the D1 cards use a Lite-On
clone (a.k.a. PNIC).
In our normal configuration, we run Linux 2.0.36 with version
0.90Q of Don Becker's "tulip.c" Fast Ethernet driver. We have also
observed the problem under Linux 2.2.2. We have not been able to
determine whether the problem is due to hardware, firmware, or the
driver software. For more information about the Coral hardware
and software configuration, see http://www.icase.edu/CoralProject.html.
The two plots in http://www.icase.edu/~tom/Coral/DEC_vs_LiteOn.pdf
illustrate the problem. For both tests, we ran the same parallel
rendering benchmark on two processors, generating more than 32,000
frames of animation. The benchmark code uses LAM 6.2b MPI over TCP
for interprocessor communication, although the problem is also
observed with other communication packages.
Each frame of the animation contains similar imagery, so we expect
rendering times to be tightly bounded. The two tests were run
simultaneously for more than 16 hours, using different pairs of
processors, on an otherwise idle system. For each frame, we plot the
elapsed (wallclock) execution time.
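
The per-frame measurement described above can be sketched roughly as
follows. This is only an illustration in Python, not the benchmark
code itself: the names time_frames and render_frame are hypothetical,
and the clock is passed in as a parameter so the loop can be checked
without real rendering.

```python
import time


def time_frames(render_frame, num_frames, clock=time.perf_counter):
    """Return elapsed wall-clock seconds for each frame, in frame order.

    render_frame is a hypothetical callable standing in for whatever
    renders (and communicates) one frame; it is not from the benchmark.
    """
    elapsed = []
    for frame in range(num_frames):
        start = clock()            # wall-clock time before the frame
        render_frame(frame)        # any stall inside this call is included
        elapsed.append(clock() - start)
    return elapsed
```

Because the timer brackets the whole frame, time spent blocked in
communication shows up in the result just as computation does, which
is why a network stall appears as a spike in the plot.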
The first plot shows performance using Rev. C1 (DEC) cards. As
expected, rendering times are tightly bounded. The one exception
appears at frames 28,524 through 28,527, and is attributable to the
nightly Linux cron job, which steals cycles from the rendering
application.
The second plot shows performance using Rev. D1 (Lite-On or PNIC)
cards. The results show numerous stalls, ranging in duration from 1
to 53 seconds. The nightly cron run is also apparent at frames 28,170
through 28,172, resulting in delays of up to three seconds.
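
One way to pick such events out of a list of per-frame times is
sketched below. It is a hypothetical post-processing step, not the
method used to produce the plots: the function name find_stalls and
the one-second threshold above the median are assumptions chosen for
illustration.

```python
from statistics import median


def find_stalls(frame_times, min_excess=1.0):
    """Group consecutive slow frames into (first, last, worst_time) runs.

    A frame counts as slow when it exceeds the median frame time by more
    than min_excess seconds. The threshold is an assumed value.
    """
    limit = median(frame_times) + min_excess
    stalls = []
    run_start = None
    for i, t in enumerate(frame_times):
        if t > limit:
            if run_start is None:
                run_start = i          # first slow frame of a new run
        elif run_start is not None:
            # Run ended at the previous frame; record its worst time.
            stalls.append((run_start, i - 1, max(frame_times[run_start:i])))
            run_start = None
    if run_start is not None:          # run extends to the final frame
        stalls.append((run_start, len(frame_times) - 1,
                       max(frame_times[run_start:])))
    return stalls
```

A filter like this only flags slow frames. It cannot tell a network
stall from other delays such as the nightly cron run, so each event
still has to be attributed by hand.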
Although these stall events are relatively rare compared to the number
of packets transmitted, the impact on parallel performance can be
severe, particularly when interactive or real-time performance is
required. The stalls also make it very difficult to obtain accurate,
repeatable performance measurements of parallel applications.
Tom Crockett
Josip Loncaric
ICASE
March 31, 1999