[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

Jon Forrest jlforrest at berkeley.edu
Wed Mar 28 16:52:01 PDT 2007


I've been pulling out what little hair I have left while
trying to figure out a bizarre problem with a Linux
cluster I'm running.  Here's a short description of the
problem.

I'm managing a 29-node cluster. All the nodes use
the same hardware and boot the same kernel image
(Scientific Linux 4.4, linux 2.6.9). The owner of this
cluster runs a multi-node MPI job over and over with
different input data. We've been seeing strange performance
numbers depending on which nodes the job uses. These
variations are not due to the input data.
In some combinations the performance
is an order of magnitude slower than in others.
Fooling around with replacing the gigabit ethernet switch,
replacing two of the nodes, and running memtest all
day long didn't result in anything interesting.

However, today I took a look at the network statistics
as shown on the ethernet switch (a Netgear GS748T).
What I saw was 13 of the 29 switch ports had very large
numbers of FCS (Frame Check Sequence) errors. In fact,
some had more FCS errors than valid frames, and I'm talking
about frame counts in the billions. All the other ports
showed 0 FCS errors. So, something is clearly wrong.
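
For anyone not familiar with what the switch is counting: the FCS
is a 32-bit CRC that the sender appends to each frame, and the
receiver recomputes it and flags the frame if the two values differ.
Here is a rough Python sketch of that check. It assumes zlib.crc32
(which uses the same CRC-32 polynomial as Ethernet) and a trailer
stored least-significant byte first. The helper names append_fcs,
fcs_ok and flip_bit are made up for illustration, and a real NIC
does all of this in hardware.

```python
import struct
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Return the frame with a 4-byte CRC-32 trailer appended (sender side)."""
    fcs = zlib.crc32(frame) & 0xFFFFFFFF
    # Assumption: trailer is packed least-significant byte first.
    return frame + struct.pack("<I", fcs)


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC over the frame body and compare it to the trailer."""
    if len(frame_with_fcs) < 4:
        return False
    body, trailer = frame_with_fcs[:-4], frame_with_fcs[-4:]
    expected = struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)
    return expected == trailer


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption on the wire by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    frame = append_fcs(b"hypothetical MPI payload")
    print("clean frame passes:", fcs_ok(frame))
    print("corrupted frame passes:", fcs_ok(flip_bit(frame, 13)))
```

The point is that an FCS error only says the bits that arrived do
not match the checksum that arrived with them. It does not say
whether the damage happened in the sending controller, the cable,
or the receiving port.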

What I'm wondering is what's causing these FCS errors.
The cables are short and the equipment is new.
All the nodes use new SuperMicro H8DCR-3 motherboards
with onboard ethernet controllers so I'm having
trouble believing that this problem is caused by a
faulty ethernet controller because this would
mean that 13 out of 29 controllers are bad.
Running "ifconfig eth0" on the nodes shows no errors
but I'm not sure if this kind of error is detectable
by the sender, and I'm guessing that packets with FCS
errors are dropped by the switch. Could the switch be making
a mistake while under heavy load when computing
the FCS values?
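
As far as I understand it, a switch port's FCS counter covers frames
the switch received on that port, so the node's own receive counters
would only cover the opposite direction. To look at the node side in
more detail than ifconfig gives, something like the following sketch
could be run on each node. It assumes the kernel exposes per-interface
counters under /sys/class/net/<iface>/statistics (I have not confirmed
this on 2.6.9), and the function name read_counters and the choice of
counters are mine. Anything missing is reported as None.

```python
import os

# Counters of interest; names follow the Linux sysfs statistics files.
COUNTERS = (
    "rx_packets",
    "rx_errors",
    "rx_crc_errors",
    "rx_frame_errors",
    "tx_packets",
    "tx_errors",
)


def read_counters(iface="eth0", base="/sys/class/net"):
    """Return a dict of counter name -> int, or None if unavailable."""
    stats_dir = os.path.join(base, iface, "statistics")
    result = {}
    for name in COUNTERS:
        path = os.path.join(stats_dir, name)
        try:
            with open(path) as fh:
                result[name] = int(fh.read().strip())
        except (OSError, ValueError):
            # File absent or unreadable on this kernel/driver.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_counters().items():
        print("%-16s %s" % (name, value))
```

Comparing this output between a node on a "bad" switch port and a
node on a "good" one, before and after a job, would at least show
whether the nodes see anything in the receive direction.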

I'd like to find the definitive cause of the problem
before I ask the vendor to replace massive amounts
of hardware. How would you isolate the cause
of this problem?

Cordially,
-- 
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


