[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

Fri Mar 30 20:59:11 PDT 2007

Jon Forrest wrote:
> I've been pulling out what little hair I have left while
> trying to figure out a bizarre problem with a Linux
> cluster I'm running.  Here's a short description of the
> problem.
> 
> I'm managing a 29-node cluster. All the nodes use
> the same hardware and boot the same kernel image
> (Scientific Linux 4.4, linux 2.6.9). The owner of this
> cluster runs a multi-node MPI job over and over with
> different input data. We've been seeing strange performance
> numbers depending on which nodes the job uses. These
> variations are not due to the input data.
> In some combinations the performance
> is an order of magnitude slower than in others.
> Fooling around with replacing the gigabit ethernet switch,
> replacing two of the nodes, and running memtest all
> day long didn't result in anything interesting.

Interesting.  Are all the BIOS settings the same?  In particular,
the APIC settings.

If you swap cables between a good node and a bad node, does
the problem move?

What about swapping a good GigE port and a bad GigE port?

Are some ports autonegotiating to 100 Mbit?  (Check the switch
management interface and/or dmesg.)
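
A rough sketch of how the node side of that check could be
scripted, assuming you have already collected the text output of
"ethtool eth0" from each node (for example over ssh).  The function
names and the dict-of-strings input are my own choices for
illustration, and the parsing only looks for the "Speed:" and
"Duplex:" lines that ethtool normally prints.

```python
import re

def parse_link(ethtool_output):
    """Return (speed_mbit, duplex) from `ethtool <iface>` text.

    Either field is None if its line is missing or unparseable
    (for example "Speed: Unknown!" on a link that is down).
    """
    speed = re.search(r"Speed:\s*(\d+)Mb/s", ethtool_output)
    duplex = re.search(r"Duplex:\s*(\w+)", ethtool_output)
    return (int(speed.group(1)) if speed else None,
            duplex.group(1).lower() if duplex else None)

def slow_links(outputs, expected_mbit=1000):
    """outputs: dict of node name -> ethtool text (hypothetical input).

    Returns {node: (speed, duplex)} for every node whose link is not
    running at the expected speed with full duplex.
    """
    bad = {}
    for node, text in sorted(outputs.items()):
        speed, duplex = parse_link(text)
        if speed != expected_mbit or duplex != "full":
            bad[node] = (speed, duplex)
    return bad
```

Any node that shows up in the result is worth comparing against
the list of switch ports with FCS errors.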

> However, today I took a look at the network statistics
> as shown on the ethernet switch (a Netgear GS748T).
> What I saw was 13 of the 29 switch ports had very large
> numbers of FCS (Frame Check Sequence) errors. In fact,
> some had more FCS errors than valid frames, and I'm talking
> about frame counts in the billions. All the other ports
> showed 0 FCS errors. So, something is clearly wrong.
> 
> What I'm wondering is what's causing these FCS errors.
> The cables are short and the equipment is new.

Neither of those means they aren't defective.

> All the nodes use new SuperMicro H8DCR-3 motherboards
> with onboard ethernet controllers so I'm having
> trouble believing that this problem is caused by a
> faulty ethernet controller because this would
> mean that 13 out of 29 controllers are bad.

That's not uncommon.  If they all came from the same production
run, they could all have the same problem.  What if you switch GigE
ports (assuming the motherboard has 2)?

Oh, I've seen a few cases lately of identical MAC addresses (Myrinet
and Iwill), so I'd check.  Very strange things can happen with
identical MAC addresses.
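
The duplicate check is easy to script once you have a list of
(node, MAC) pairs, however you gather them (ifconfig output, the
switch's address table, etc.).  This is only a minimal sketch; the
function name and input format are made up for illustration.

```python
from collections import defaultdict

def duplicate_macs(node_macs):
    """node_macs: iterable of (node, mac) pairs (hypothetical input).

    Returns {mac: [nodes]} for every MAC address that appears more
    than once.  Addresses are normalized to lowercase with ':'
    separators so that formatting differences do not hide a match.
    """
    seen = defaultdict(list)
    for node, mac in node_macs:
        key = mac.strip().lower().replace("-", ":")
        seen[key].append(node)
    return {mac: sorted(nodes) for mac, nodes in seen.items()
            if len(nodes) > 1}
```

An empty result means every address collected was unique.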

> Running "ifconfig eth0" on the nodes show no errors
> but I'm not sure if this kind of error is detectable
> by the sender, and I'm guessing that packets with FCS
> errors are dropped by the switch. Could the switch be making
> a mistake while under heavy load when computing
> the FCS values?

I've seen it before.  I have a code that keeps all ports busy with
small packets (measuring latency) and then keeps all ports busy with
large transfers (measuring bandwidth).  I could send it if you want.

> I'd like to find the definitive cause of the problem
> before I ask the vendor to replace massive amounts
> of hardware. How would you isolate the cause
> of this problem?

It's a pain in the ass, but the only troubleshooting that is
going to be definitive is to establish a positive correlation:
show that the problem follows the motherboard port, GigE port,
cable, BIOS setting, etc.  They are all running the exact same
kernel, right?  Is it possible that half got patched with a 4.4
update and the other half didn't?
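
One way to keep track of that correlation, sketched under the
assumption that you have recorded one value per node for each
suspect (the "uname -r" string, a BIOS setting, a cable batch,
whatever) plus the set of nodes whose switch ports show FCS errors.
The function names and data layout are hypothetical.

```python
from collections import defaultdict

def group_by_value(node_values):
    """node_values: dict of node -> recorded value, e.g. `uname -r` text.

    Returns {value: set of nodes that reported it}.
    """
    groups = defaultdict(set)
    for node, value in node_values.items():
        groups[value.strip()].add(node)
    return dict(groups)

def bad_fraction(groups, bad_nodes):
    """Return {value: (bad node count, total node count)} per group."""
    return {value: (len(nodes & bad_nodes), len(nodes))
            for value, nodes in groups.items()}
```

If one value accounts for all of the bad nodes and none of the good
ones, that suspect is worth chasing first.  If the bad nodes are
spread evenly across the values, it probably isn't the cause.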

> 
> Cordially,