Problems with Tulip ethernet on diskless cluster

Steve Wickert wickert at proteinpathways.com
Thu Sep 7 10:49:19 PDT 2000


We also recently had problems similar to what you're reporting.  We were using
DE-500 and Kingston KNE-110 NICs with Tulip drivers 0.89, 0.91 (I think), and
0.92 (the most recent, I believe).

The media detection in the older Tulip drivers was pretty simple, and they
would lock up if the network connection was lost even momentarily (we noticed
this on a small test configuration that had a slightly flaky switch).  If I
remember correctly, this affected both the DEC and Kingston cards.

Updating to Tulip 0.92 (http://www.scyld.com/network/tulip.html) fixed this
problem for both NICs.  However, we were still seeing what looked like duplex
problems, as you describe.  In all cases the *hardware* (NICs, switches, etc.)
believed it was running full duplex at 100 Mbit/s, and the media state reported
by the cards and by the tulip-diag utility agreed.
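
For what it's worth, here is roughly how we spot-checked that on each machine.
This is only a convenience sketch: it assumes Donald Becker's mii-diag is
installed and on the PATH, and the interface names are examples, not anything
from our actual setup.  It just runs mii-diag per interface so the negotiated
speed/duplex can be compared with what the switch claims:

#!/usr/bin/env python
# Convenience sketch: run mii-diag on a list of interfaces and print whatever
# the transceiver reports, so the negotiated speed/duplex can be compared
# against the switch's idea of the link.  Assumes mii-diag is installed;
# the interface names below are examples only.
import subprocess

INTERFACES = ["eth0", "eth1"]   # adjust for your machines

for iface in INTERFACES:
    print("=== %s ===" % iface)
    try:
        result = subprocess.run(["mii-diag", iface],
                                capture_output=True, text=True, check=False)
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except OSError as exc:
        print("could not run mii-diag: %s" % exc)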

However, with the DE-500s we were seeing netperf TCP_STREAM numbers that were
one to three orders of magnitude lower than they should have been, usually in
one direction only.  The slowdown factor was NOT the same in all cases.  UDP
tests showed the full bandwidth in both directions (if I recall correctly), but
most of the packets appeared corrupted.  I originally suspected a wiring
problem, but our wiring checked out fine.
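
If you want to reproduce the one-way slowdown without netperf, a crude
sender/receiver pair like the sketch below is enough to show the asymmetry.
The port number and transfer size are arbitrary choices of mine, not anything
from our tests.  Run it in receive mode on one node and send mode on the other,
then swap the roles and compare:

#!/usr/bin/env python
# Crude one-way TCP throughput check (a stand-in for netperf TCP_STREAM).
# Usage:  tcp_check.py recv            on node A
#         tcp_check.py send <nodeA>    on node B
# Swap the roles to measure the other direction.  Port and byte count are
# arbitrary values for this sketch.
import socket
import sys
import time

PORT = 5555
TOTAL = 64 * 1024 * 1024        # send 64 MB per run
CHUNK = 64 * 1024

def recv():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _peer = srv.accept()
    got = 0
    start = time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        got += len(data)
    secs = time.time() - start
    print("received %d bytes in %.2f s -> %.1f Mbit/s"
          % (got, secs, got * 8 / secs / 1e6))

def send(host):
    sock = socket.create_connection((host, PORT))
    buf = b"x" * CHUNK
    sent = 0
    start = time.time()
    while sent < TOTAL:
        sock.sendall(buf)
        sent += len(buf)
    sock.close()
    secs = time.time() - start
    print("sent %d bytes in %.2f s -> %.1f Mbit/s"
          % (sent, secs, sent * 8 / secs / 1e6))

if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1] == "recv":
        recv()
    elif len(sys.argv) == 3 and sys.argv[1] == "send":
        send(sys.argv[2])
    else:
        print("usage: tcp_check.py recv | send <host>")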

We also have a few NT machines, and the DE-500s were showing the same behavior
under NT (although there is an updated driver that seems to fix it).  Since we
had had other problems with the DE-500s, I just replaced most of the ones we
had with Kingston NICs, and we haven't had any more (major) problems.

I can try to dig up more details on the netperf tests if anyone would like, and
I'd be interested to hear if anyone else is having problems with the new Tulip
driver.
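
By the way, regarding the /proc/net/dev output you posted below: if I'm reading
the 2.2-series layout right (16 counters per interface, with Tx packets in the
10th column and Tx carrier errors in the 15th), the transmit carrier-error rate
is what stands out.  Here is a small sketch that computes that ratio per
interface so nobody has to count columns by hand; the column positions are my
assumption about the 2.2 kernel format:

#!/usr/bin/env python
# Sketch: report the Tx carrier-error rate per interface from /proc/net/dev.
# Assumes the 16-counter layout of the 2.2-series kernels:
#   rx: bytes packets errs drop fifo frame compressed multicast
#   tx: bytes packets errs drop fifo colls carrier compressed

def carrier_rates(path="/proc/net/dev"):
    rates = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue            # skip the two header lines
            name, counters = line.split(":", 1)
            fields = counters.split()
            if len(fields) < 16:
                continue            # not the layout we expect
            tx_packets = int(fields[9])
            tx_carrier = int(fields[14])
            rate = tx_carrier / float(tx_packets) if tx_packets else 0.0
            rates[name.strip()] = (tx_packets, tx_carrier, rate)
    return rates

if __name__ == "__main__":
    for name, (pkts, carr, rate) in sorted(carrier_rates().items()):
        print("%-6s tx_packets=%-10d tx_carrier=%-10d (%.1f%% carrier errors)"
              % (name, pkts, carr, 100.0 * rate))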

Steve Wickert
Protein Pathways

On Thu, 07 Sep 2000, you wrote:
> Hi all, 
> 
>   we have a 16-node diskless cluster that has been running since Dec 1999.
> While trying FFTW (a library for FFTs), I noticed some strange behaviour:
> running its benchmark under MPI, I get lower performance than running it on a
> single machine, even when using 8 nodes.
>  The configuration is:
> 
>    1 server with a D-Link 4-port fast ethernet card (4 Tulip chips), 2 UW
> SCSI-2 IBM hard drives in software RAID 1, PIII 500, 128 MB RAM
> 
>   16 diskless nodes with 3Com fast ethernet cards, PIII 500, 128 MB RAM
> 
>    1 3Com SuperStack II 3300 XM switch.
> 
>  The ethernet drivers are all updated to the latest versions; we're using
> Red Hat 6.1 as the Linux distribution and LAM/MPI 6.3 for parallel
> communication (but we tried MPICH with the same results).
> 
>  The only strange thing I noticed is the output of "cat /proc/net/dev" on
> the server:
> 
> eth0:199467368 1848969 0 0 0 0 0 0 268603237 550016 405696 0 0 0 405696 0
> eth1:1015790525 10879129 0 0 0 0 0 0 4262257137 9926748 419750 0 0 0 
> 419750 0
> eth2:56208013 216130 0 0 0 0 0 0 73798 312 317023 0 0 0 317023 0
> eth3:28227811 118504 0 0 0 0 0 0 99299 1036 254667 0 0 0 254667 0
> 
>  that is, on Rx we have no problems at all, while on Tx almost every packet
> gets a carrier error. Note: the switch management software doesn't report any
> errors on the ports connected to eth1, eth2 and eth3.
> 
>  All ports are configured as 100baseTx-FD (as reported by mii-diag), and so
> are the switch ports.
> 
>  I have no clue what is happening, especially considering that the network
> apparently is working correctly otherwise.
> 
>  Any ideas?
> 
>  Thank you all in advance,  
> 
> Franz.
> 
> 
> ---------------------------------------------
> Franz Marini
> Sys Admin and Software Analyst,
> Dept. of Physics, University of Milan, Italy.
> email : marini at pcmenelao.mi.infn.it
> --------------------------------------------- 
> 
> 
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
