[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

Sat Mar 31 10:38:50 PDT 2007

Jon

A few things,

1) You may find MPI Link Checker from Microway helpful
in this situation. There was a free beta version floating around
at one point. Also, ethtool can be helpful for checking how
the nodes are connecting to the switch.
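
As a rough illustration of the ethtool check, here is a minimal Python
sketch that reads the text printed by "ethtool eth0" and flags a link that
did not come up at the expected speed or at full duplex. A speed or duplex
mismatch is one commonly cited source of frame errors, though nothing in
this thread confirms it is the cause here. The function names and the
"1000Mb/s" default are assumptions made for the example; only the field
labels (Speed, Duplex, Auto-negotiation, Link detected) come from ethtool's
usual output.

```python
def parse_ethtool(output: str) -> dict:
    """Pull a few link fields out of the text printed by `ethtool <iface>`."""
    wanted = ("Speed", "Duplex", "Auto-negotiation", "Link detected")
    fields = {}
    for line in output.splitlines():
        # Lines look like "\tSpeed: 1000Mb/s"; split on the first colon only.
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            fields[key] = value.strip()
    return fields


def link_problems(fields: dict, expected_speed: str = "1000Mb/s") -> list:
    """Return human-readable complaints about a parsed link (hypothetical helper)."""
    problems = []
    if fields.get("Link detected") != "yes":
        problems.append("no link detected")
    if fields.get("Speed") != expected_speed:
        problems.append(
            "speed is %s, expected %s" % (fields.get("Speed"), expected_speed)
        )
    if fields.get("Duplex") != "Full":
        problems.append("duplex is %s, expected Full" % fields.get("Duplex"))
    return problems
```

On a node, the input text could come from something like
subprocess.run(["ethtool", "eth0"], capture_output=True, text=True).stdout,
and running the same check on every node makes it easy to see whether the
ports with errors negotiated differently from the clean ones.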

2) Also, I have never considered NetGear switches
high performance. I have seen big improvements in applications
from replacing a cheap switch with a slightly more
expensive, better-performing one. Note that this may
or may not be your problem, so don't run out
and buy a new switch, but not all switches
hold up under the loads HPC applications
throw at them.

<Soapbox>
I am constantly amazed at how many people buy the
latest and greatest node hardware and then connect
them with a sub-optimal switch (or cheap cables), thus reducing
the effective performance of the nodes (for parallel
applications). Kind of "penny wise and pound foolish," as they say.
</Soapbox>

 --
 Doug

> I've been pulling out what little hair I have left while
> trying to figure out a bizarre problem with a Linux
> cluster I'm running.  Here's a short description of the
> problem.
>
> I'm managing a 29-node cluster. All the nodes use
> the same hardware and boot the same kernel image
> (Scientific Linux 4.4, linux 2.6.9). The owner of this
> cluster runs a multi-node MPI job over and over with
> different input data. We've been seeing strange performance
> numbers depending on which nodes the job uses. These
> variations are not due to the input data.
> In some combinations the performance
> is an order of magnitude slower than in others.
> Fooling around with replacing the gigabit ethernet switch,
> replacing two of the nodes, and running memtest all
> day long didn't result in anything interesting.
>
> However, today I took a look at the network statistics
> as shown on the ethernet switch (a Netgear GS748T).
> What I saw was 13 of the 29 switch ports had very large
> numbers of FCS (Frame Check Sequence) errors. In fact,
> some had more FCS errors than valid frames, and I'm talking
> about frame counts in the billions. All the other ports
> showed 0 FCS errors. So, something is clearly wrong.
>
> What I'm wondering is what's causing these FCS errors.
> The cables are short and the equipment is new.
> All the nodes use new SuperMicro H8DCR-3 motherboards
> with onboard ethernet controllers so I'm having
> trouble believing that this problem is caused by a
> faulty ethernet controller because this would
> mean that 13 out of 29 controllers are bad.
> Running "ifconfig eth0" on the nodes shows no errors
> but I'm not sure if this kind of error is detectable
> by the sender, and I'm guessing that packets with FCS
> errors are dropped by the switch. Could the switch be making
> a mistake while under heavy load when computing
> the FCS values?
>
> I'd like to find the definitive cause of the problem
> before I ask the vendor to replace massive amounts
> of hardware. How would you isolate the cause
> of this problem?
>
> Cordially,
> --
> Jon Forrest
> Unix Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
