3C905B problem

Martin Siegert siegert@sfu.ca
Tue Apr 18 17:15:53 2000


Hi there,

I am running a small beowulf cluster (8 dual processor PIII-500MHz)
using RedHat 6.1, but with the kernel upgraded to 2.2.14 (SMP).
The master node has three Ethernet cards, all 3Com 3c905Bs
(eth0 for the outside world, eth1 to the switch that connects to the
other nodes, and eth2 to the backup net). Since yesterday, eth1
has been failing "out of the blue" (after running uninterrupted and
without problems for 74 days). The symptoms are: ssh and rsh (TCP) stop
working, ping (ICMP) stops, but ruptime (UDP) still works.
Running "ifconfig eth1 down; ifconfig eth1 up" brings the interface back.
This happened twice yesterday and already twice today. Nothing in the
logfiles indicates a problem. Furthermore, the
"ifconfig eth1 down; ifconfig eth1 up" sequence randomly causes some of
the nodes to hang (in that case ssh/rsh stop working and ruptime stops as
well, but ping still works; however, I can't even log in from the console,
so the only choice is to press the reset button on those nodes).
When a node hangs, the syslog shows the message
b05 kernel: nfs: server b01 not responding, timed out
just before the hang. I am using 3Com's 3c90x.o module from
http://support.3com.com/infodeli/tools/nic/linuxdownloading.htm
on all nodes.
Has anybody experienced similar failures?
Any suggestions on what I might try?
(I'm kind of desperate right now).
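
One way to pin down when eth1 stops passing TCP traffic would be a small
probe run from the master node that records timestamps, which could then
be compared against the syslog. The following is only a minimal sketch in
Python and is not part of the setup described above: the helper names
tcp_reachable and probe_once, the 3-second timeout, and the choice of
port 22 (ssh) are all assumptions made for illustration.

```python
import socket
import time


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def probe_once(hosts, port=22, timeout=3.0):
    """Probe each host once; return a list of (timestamp, host, reachable)."""
    results = []
    for host in hosts:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        results.append((stamp, host, tcp_reachable(host, port, timeout)))
    return results
```

Calling probe_once(["b01", "b05"]) once a minute (the two node names are
taken from the syslog message above; the interval is arbitrary) and
printing the tuples would give a timeline of when TCP connections through
eth1 start to fail. It only exercises TCP, so it says nothing about the
ICMP or UDP behaviour.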

Thanks for the help.

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert@sfu.ca
Canada  V5A 1S6
========================================================================