[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

Robert G. Brown rgb at phy.duke.edu
Mon Apr 2 08:39:36 PDT 2007


To jump in late, let me reiterate that before jumping to any conclusions
concerning the proximate cause of your networking difficulty, it is best
to work out systematically what it might be.

  a) List points of possible failure.  They are in very general terms:
ethernet switch(es), cables, NICs, kernel/drivers, software.

  b) On the switches one can have two or three general classes of
failure.  The simplest and most pernicious is probably "bad ports".  A
port can be "bad" because its internal electronics is screwed, maybe by
spot heating (resulting in intermittent problems or port death) or by
the little bitty wires in the RJ45 socket getting bent or deforming over
time so that solid contact is no longer made with an inserted cable.
This happens -- especially if the wires aren't properly supported going
into the port socket, so that they exert a torque that presses on the
contact wires in a warm system over a long time.  Dust or corrosion can
also contribute to spotty electrical connectivity inside the port
connection itself.

"Bad ports" can usually be identified by the fact that any system
plugged into the port has a high chance of having problems where that
same system, plugged into a different port on a different switch, works
flawlessly.  The solution to bad ports is throw out-of-service-contract
switches away, buy new ones with a four year contract on them (tier 1 or
tier 2 if possible) and move on with life.  Switches are cheap -- human
time is expensive.

  c) Cables can always be bad, especially if they are dangling, homemade,
or have been moved around a lot.  Again, the little contact pins deform
under certain heat/pressure circumstances.  Also, the PLASTIC THEY ARE
MADE OF can deform under warm pressure so that they no longer seat
properly in a port socket, or the little "snap" on the back wears down
so that they can wiggle to where connectivity is mediocre.  Note that in
wiring situations with patch panels, the possibilities for bad cable/port
connections extend recursively to every junction between system and port.

"Bad cables" can be detected one of several ways.  The best way is to
invest in a halfway decent cable tester.  The cheap one I own can be
snapped into any socket with a short and RELIABLE patch cable or accept
any cable being tested.  It then transmits voltage on each wire pair one
at a time that is reflected at the far end and lights up little LEDs
that tell you if any pairs are faulting.  This is good for cable breaks
and bad contacts, but can "pass" marginal cables that are making contact
but arcing a bit.

To do better, you have to use a higher quality tester that actually puts
a data signal on the line, or use e.g. a laptop and a secondary reliable
system to test the suspect cabling route compared to a "known good"
point-to-point (direct) cable hookup.  This approach can, with care, help
detect bad ports on switches as well, although it can be difficult to
distinguish problems with the switch itself from problems with individual
ports.

  d) Your problem is very unlikely to be bad NICs, but if it is, it will
show up when you use a known-good crossover cable to directly connect
your laptop to the suspect NIC and observe significant problems with
connectivity (bad ping rates, especially on ping floods).  Ping is
actually a fairly powerful tool, as is traceroute.  netpipe is pretty
good for testing interfaces as well.
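
If it helps, here's the sort of quick point-to-point check I mean,
sketched in Python around the ordinary ping command.  The hostname and
packet count below are placeholders for whatever you're testing, and a
true flood (ping -f) would need root:

    #!/usr/bin/env python3
    # Rough point-to-point packet loss check: run a burst of pings at the
    # suspect host and parse the summary line.  Host and count are
    # placeholders -- adjust for your own network; "ping -f" needs root.
    import re
    import subprocess
    import sys

    def ping_loss(host, count=500):
        # Run pings quietly and return the reported packet loss (percent).
        out = subprocess.run(
            ["ping", "-q", "-c", str(count), host],
            capture_output=True, text=True
        ).stdout
        m = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
        return float(m.group(1)) if m else None

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "node01"
        print("%s: %s%% packet loss" % (host, ping_loss(host)))

Run it once through the suspect route and once over a direct crossover
cable; a big difference in loss points at the cabling/switch, not the NIC.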

  e) Integrated kernel/driver/card problems are far from unknown,
especially for certain cards.  At one point in time, for example, RTL
8139 cards were cheap and nearly ubiquitous -- and sucked unimaginably.
They effectively didn't buffer incoming traffic, and one could overwhelm
the kernel/driver's ability to process asynchronously arriving packets
with ease.  Consequently they'd "work" until one tried to send a high
speed stream of small packets to one, at which point they'd basically
fail to receive 9 out of 10 of them.  I actually managed to get a
netperf result of something like 1 Mbps out of a 100 Mbps card -- the
other 99% of the packets were basically dropped and either lost (UDP) or
had to be retransmitted (TCP).

This is a very difficult problem to diagnose.  General symptoms that
should make you suspect it include: having debugged the switch and
physical connection and found that they work fine for "known good" NICs
on both ends; getting decent connectivity for certain "low stress"
applications or low loads, but losing more and more packets as you
increase the network load on the NIC; obvious kernel-based network error
messages in /var/log/messages.  And the kicker -- buy a known-good
ethernet card, one you are certain works perfectly with the kernel (the
listvolken can probably give you half a dozen recommendations if you
don't already favor 3coms or intels or the like).  Swap it into a system
that is having problems and, leaving everything else the same (wire,
port, etc.), repeat the test.  If the problem goes away, it's a bad sign
for the original NIC.  The solution is probably to just put known-good
NICs in the systems and stop using the (usually onboard and sucky) NIC.
NICs are cheap, time is dear.
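
One cheap way to watch for this sort of thing from userspace is to keep
an eye on the per-interface error and drop counters the kernel exports
in /proc/net/dev (ifconfig and ethtool -S show much the same numbers).
A minimal sketch, assuming a Linux /proc layout and a placeholder
interface name:

    #!/usr/bin/env python3
    # Minimal sketch: read per-interface RX/TX error and drop counters
    # from /proc/net/dev.  Counters that climb rapidly under load are the
    # sort of symptom that points at a driver/NIC problem rather than at
    # your application.
    def read_counters(iface="eth0"):          # placeholder interface name
        with open("/proc/net/dev") as f:
            for line in f:
                if ":" not in line:
                    continue                  # skip the two header lines
                name, data = line.split(":", 1)
                if name.strip() == iface:
                    fields = [int(x) for x in data.split()]
                    # field order: rx bytes, packets, errs, drop, fifo,
                    # frame, compressed, multicast, then the tx equivalents
                    return {"rx_errs": fields[2], "rx_drop": fields[3],
                            "tx_errs": fields[10], "tx_drop": fields[11]}
        return None

    if __name__ == "__main__":
        print(read_counters("eth0"))

Sample it before and after a heavy netperf/netpipe run; a clean link
should show essentially zero growth in errs/drop.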

Note well that you've already had excellent advice concerning
autonegotiation of duplex and speed.  This SHOULD NOT be a problem with
pretty much any modern (unmanaged) switch and card, but old-timers
remember well that it once was and might be again.  Or if the switch is
a managed switch, then god only knows what state it is in and debugging
this becomes a major PITA.  My own solution is still to dump the switch
and get a new one.  You can get really lovely 48 port dual power supply
gigabit switches for a kilobuck or so from Dell, or you can spend a lot
less and get perfectly reasonable unmanaged switches from any of a
half-dozen vendors.  One good way to debug things is just to get a new
switch and see if it solves all your problems -- how many hours do you
have to waste before a few hundred dollars worth of new hardware is
cheap?
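
For the duplex/speed question specifically, ethtool will tell you what a
Linux interface actually negotiated.  The sketch below just scrapes its
output (the interface name is a placeholder, and you generally need root
to run ethtool):

    #!/usr/bin/env python3
    # Sketch: scrape "ethtool <iface>" for the negotiated speed, duplex,
    # autonegotiation and link state.  A half-duplex link on a port you
    # believe is full duplex is a classic source of mysterious errors.
    import subprocess

    def link_state(iface="eth0"):             # placeholder interface name
        out = subprocess.run(["ethtool", iface],
                             capture_output=True, text=True).stdout
        state = {}
        for line in out.splitlines():
            if ":" in line:
                key, val = line.split(":", 1)
                state[key.strip()] = val.strip()
        return {k: state.get(k) for k in
                ("Speed", "Duplex", "Auto-negotiation", "Link detected")}

    if __name__ == "__main__":
        print(link_state("eth0"))

If the two ends of a link report different speed/duplex settings, fix
that (or replace the switch) before chasing anything subtler.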

   f) Software (e.g. MPI flavor, PVM, userspace socket code).  This too
is a bitch to debug, as there is a near infinity of ways to write buggy
code, sometimes buggy code where the failure only occurs under certain
rarely-accessed modes of operation of the software involved.  The
diagnosis here is one of exclusion.  If ping, traceroute, netperf or
netpipe, hardware network testers, card swapping, switch swapping all
yield no problems but the "problem" persists for your application
throughout, you should suspect that your application is somehow buggy.
This is by no means impossible, depending on just what is being done
with the sockets.  If you conclude that it is likely to BE your
application, you have only two or three possible routes.

     * fix it yourself
     * get somebody to fix it for you
     * throw up your hands in disgust and use a different tool to
accomplish the same task.

Which one works depends on whether the tool is open source or
commercial, your coding skills, availability of service or support
forums, contact with the developers, etc.
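
One way to rule the network stack in or out before blaming your
application is a dumb userspace socket streamer of your own: if a
trivial sender/receiver pair gets clean, full-rate transfers over the
same route, the problem is much more likely in your code than in the
hardware.  A rough sketch, with arbitrary placeholder port and message
sizes:

    #!/usr/bin/env python3
    # Dumb TCP stream test: run "recv" on one node, "send <host>" on the
    # other, and compare the achieved rate with what netperf/netpipe say.
    # Port and message size are arbitrary placeholders.
    import socket, sys, time

    PORT, MSG, COUNT = 5001, b"x" * 1024, 100000

    def recv():
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT)); srv.listen(1)
        conn, _ = srv.accept()
        total = 0
        while True:
            data = conn.recv(65536)
            if not data:
                break
            total += len(data)
        print("received %d bytes" % total)

    def send(host):
        s = socket.create_connection((host, PORT))
        t0 = time.time()
        for _ in range(COUNT):
            s.sendall(MSG)
        s.close()
        dt = time.time() - t0
        print("%.1f Mbps" % (COUNT * len(MSG) * 8 / dt / 1e6))

    if __name__ == "__main__":
        send(sys.argv[2]) if sys.argv[1] == "send" else recv()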

The idea of all of the above is to set up a diagnosis tree and run down
it >>systematically<< until you figure out what is wrong.  This method
has never failed me through numerous diagnoses of every possible mode of
failure over twenty years.  I've seen NICs with intermittent
heat-sensitive malfunctions (and figured it out), NICs with visibly
toasted components (afio), bad cables and ports galore (afio), bad
switches (afio), bad drivers (afio), and bad software (afio, without
saying that I was always able to fix the latter).  If you proceed very
systematically you can eventually end up where one answer, however
unlikely it might appear, is the truth.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




