Problems with Tulip ethernet on diskless cluster

Thu Sep 7 07:37:54 PDT 2000

Hi all, 

  we have had a cluster of 16 diskless nodes running since Dec 1999. While
trying FFTW (a library for FFTs) I noticed strange behaviour: running the
benchmark under MPI, I get lower performance than running it on a single
machine, even when using 8 nodes.
 The configuration is:

   1 server w/ D-Link 4-port Fast Ethernet card (4 Tulip chips), 2 UW
SCSI-2 IBM hard drives in software RAID 1, Pentium III 500, 128 MB

  16 diskless nodes w/ 3Com Fast Ethernet card, Pentium III 500, 128 MB

   1 3Com SuperStack II 3300 XM switch.

 The Ethernet drivers are all updated to the latest version. We're using
RedHat 6.1 as the Linux distro and LAM/MPI 6.3 for parallel comms (we also
tried MPICH, with the same results).

 The only strange thing I noticed is the output of "cat /proc/net/dev"
on the server:

eth0:199467368 1848969 0 0 0 0 0 0 268603237 550016 405696 0 0 0 405696 0
eth1:1015790525 10879129 0 0 0 0 0 0 4262257137 9926748 419750 0 0 0 419750 0
eth2:56208013 216130 0 0 0 0 0 0 73798 312 317023 0 0 0 317023 0
eth3:28227811 118504 0 0 0 0 0 0 99299 1036 254667 0 0 0 254667 0

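 That is, on Rx we have no problems at all, while on Tx every error is
counted as a carrier error, and on eth0, eth2 and eth3 the error count is a
large fraction of (or even more than) the number of packets sent. Note: the
switch management software doesn't report any errors on the ports connected
to eth1, eth2 and eth3.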
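
 For anyone who wants to read the counters by name rather than by position,
here is a minimal Python sketch that splits the lines above into fields. It
assumes the usual 2.2-kernel /proc/net/dev layout (8 receive counters followed
by 8 transmit counters); the names parse_dev_line and summarize are made up
for illustration and are not part of any tool.

```python
# Assumed field order of /proc/net/dev on 2.2-series kernels:
# 8 receive counters, then 8 transmit counters.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]

# The four counter lines from the server, copied from the post.
SAMPLE = """\
eth0:199467368 1848969 0 0 0 0 0 0 268603237 550016 405696 0 0 0 405696 0
eth1:1015790525 10879129 0 0 0 0 0 0 4262257137 9926748 419750 0 0 0 419750 0
eth2:56208013 216130 0 0 0 0 0 0 73798 312 317023 0 0 0 317023 0
eth3:28227811 118504 0 0 0 0 0 0 99299 1036 254667 0 0 0 254667 0
"""


def parse_dev_line(line):
    """Split one 'ethN: ...' line into (name, rx_counters, tx_counters)."""
    name, _, rest = line.partition(":")
    values = [int(v) for v in rest.split()]
    if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
        raise ValueError("unexpected number of counters: %d" % len(values))
    rx = dict(zip(RX_FIELDS, values[:len(RX_FIELDS)]))
    tx = dict(zip(TX_FIELDS, values[len(RX_FIELDS):]))
    return name.strip(), rx, tx


def summarize(text):
    """Return (name, rx_errs, tx_packets, tx_errs, tx_carrier) per interface."""
    rows = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip header or blank lines
        name, rx, tx = parse_dev_line(line)
        rows.append((name, rx["errs"], tx["packets"], tx["errs"], tx["carrier"]))
    return rows


if __name__ == "__main__":
    for name, rx_errs, tx_pkts, tx_errs, carrier in summarize(SAMPLE):
        print("%s: rx_errs=%d tx_packets=%d tx_errs=%d carrier=%d"
              % (name, rx_errs, tx_pkts, tx_errs, carrier))
```

 To check a live machine, the same summarize() could be fed the contents of
/proc/net/dev instead of SAMPLE, provided the kernel uses the same column
order.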

 All ports are configured as 100baseTx-FD (as reported by mii-diag), and
so is the switch.

 I have no clue what is happening, especially considering that the
network is apparently working correctly.

 Any ideas?

 Thank you all in advance,  

Franz.

---------------------------------------------
Franz Marini
Sys Admin and Software Analyst,
Dept. of Physics, University of Milan, Italy.
email : marini at pcmenelao.mi.infn.it
---------------------------------------------