Tale of 3 Intel 510T switches and the network that wouldn't work without CRC and alignment errors.

Leonardo Magallon leo.magallon at grantgeo.com
Fri Mar 2 23:33:43 PST 2001


Greetings to all beowulfers,


   My last week hasn't been fun at all.  While preparing to add 29 new boxes (58 new processors) to our cluster,
I had to move our present cluster to make way for shelves (thanks a lot to all of those who gave me their comments
about the shelves) and for the relocation of our Oracle systems.  After the move our network started generating
CRC and alignment errors that took a long while to figure out.
   First of all, we are using 3 Intel 510T switches connected together in a stack.  We have 8 VA Linux
dual 440BX-based computers and 16 white boxes (all 550 MHz) that I put together with Gigabyte motherboards and
Intel EtherExpress cards.
   The problem came up when I moved the current group of computers to the opposite wall of the computer room.
For some reason or another, all of the white boxes started hitting errors whenever I tried to move any file across
NFS.  So I tried FTP.  It didn't fix it.  I then switched Cat5 cables: no luck.  I then moved the cables to
another switch: still no luck.  I then said, well, it's not the cable and it's not the protocol, so it's the card.
I took my Intel card out and placed a 3Com card in the box.  It didn't work either.
   After that I said, well, then maybe my driver needs to be updated.  We were running kernel 2.2.14-6.0.1smp.  I
upgraded to 2.2.18 and it still didn't work.  So I went to scyld.com and downloaded the latest drivers
for the Intel card.  (By the way, I did put that card back in the box because I knew it wasn't the card.)
I generated the rpm and installed it.  After a reboot the same errors were there.  Hmmmmm.  So I said, hey, I'm
going to force this.  I downloaded the .c drivers, compiled them and inserted them as a module.  After another
reboot, still no change.
   So I then knew:
   1) It is not the kernel.
   2) It is not the driver.
   3) It is not the cable.
   4) The problem is with the white boxes (the VA Linux boxes were in a rack and did not have any errors).
   5) It is not the NIC.
   6) It is not the switch, because if I move one of the cables from one of the VA Linux computers to a port that
      I know is taking errors, it works just fine.
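
   For anyone who wants to repeat this kind of comparison between a box that takes errors and one that doesn't,
here is a minimal Python sketch that pulls the receive error counters out of the text of /proc/net/dev.  It
assumes the usual layout (two header lines, then one "iface:" line per interface with the receive columns in the
order bytes, packets, errs, drop, fifo, frame).  The function name parse_rx_errors is made up for illustration,
and which hardware errors a given driver counts under "errs" and "frame" varies, so treat the numbers as a
relative indicator only.

```python
import os


def parse_rx_errors(proc_net_dev_text):
    """Return {interface: {"packets": int, "errs": int, "frame": int}}.

    Assumes the receive columns come first on each interface line, in the
    order: bytes, packets, errs, drop, fifo, frame, ...
    """
    counters = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines carry no "iface:" prefix
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 6:
            continue  # not a layout this sketch understands
        counters[name.strip()] = {
            "packets": int(fields[1]),
            "errs": int(fields[2]),
            "frame": int(fields[5]),
        }
    return counters


if __name__ == "__main__" and os.path.exists("/proc/net/dev"):
    with open("/proc/net/dev") as f:
        for iface, c in sorted(parse_rx_errors(f.read()).items()):
            print(iface, c)
```

   Running it on a good box and a bad box before and after copying the same file shows whether the error
counters are climbing only on the bad one.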

   To make this LONG story short, I then thought: "What do these white boxes have in common that is not common
among the VA Linux computers?"
   I looked and the only thing I could find was that they shared the same power supply.  It was a Toshiba 1400
SE Series giving 2.3 kVA.  Now all 16 clones were connected to this single power supply.  So I replaced it with
one of the new Toshiba 3000Net units I had purchased for the new computers.
    And guess what?   It worked!
  Apparently that power supply was generating noise on the power line that was propagating to the NICs on the
computers  and/or continuing on to the switch.

   Not to mention that before this I had moved the rack to another place, thinking that some kind of RF was
interfering with the proper functioning of the switches.  We even brought in an RF meter, which told us that all
power supplies generate a big non-oscillating RF field that spans between 6 and 12 MHz and is strongest at 8
MHz.  Go figure.  I had placed calls to Intel on three occasions, and finally with Jeff from Intel support (I
actually didn't get his last name, but if someone asks I can call him back and get it from him; he gave me his
direct line) we were able to deduce that the common thing among all the clones (or drones as we call them; The
Borg ring a bell?) was that power supply.

   Sorry about the long email, but I thought that this might help anyone who runs into the same kind of problems
that I faced this week.


   All is good now,


Regards,

Leo Magallon
Grant Geophysical Inc.
Houston, Texas.







