RPC timeout problems

Thu Jan 18 08:38:51 PST 2001

Hi,

We are currently trying to set up a Beowulf system based on Red Hat 6.2 plus
Red Hat's 6.2 bugfixes plus the Scyld Beowulf RPMs.

The cluster is quite ordinary (see the system information below), except that
the master has a Gigabit connection to the switch while all the nodes are on
100 Mb/s.

When a node boots it often gets stuck trying to mount /home with the
following error (from /var/log/beowulf/node.x):

 setup_fs: Checking 10.0.0.1:/home (type=nfs)...
 setup_fs: Mounting 10.0.0.1:/home on /rootfs//home... (type=nfs; options=defaults,rsize=8192,wsize=8192)
 node_modprobe: installing kernel module: sunrpc
 node_modprobe: installing kernel module: lockd
 node_modprobe: installing kernel module: nfs
 mount: RPC: Timed out
 Failed to mount 10.0.0.1:/home on /home.

Experimenting with the timeo mount option does not seem to help. If a node is
rebooted often enough, it eventually gets past the RPC timeout and comes up.
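
A minimal sketch of how such a timeo experiment could be scripted by hand,
rather than through setup_fs. The function names nfs_opts and try_mount, the
retry count and the 5 second pause are hypothetical, and the timeo/retrans
values in the usage line are placeholders, not the settings actually tried
here. Only the server, export and rsize/wsize come from the log above.

```shell
#!/bin/sh
# Sketch only: build an NFS option string and retry the mount a few times.

nfs_opts() {
    # $1 = rsize and wsize in bytes
    # $2 = timeo in tenths of a second
    # $3 = retrans (retransmissions before a major timeout)
    printf 'rsize=%s,wsize=%s,timeo=%s,retrans=%s\n' "$1" "$1" "$2" "$3"
}

try_mount() {
    # $1 = server:/export, $2 = mount point, $3 = option string, $4 = attempts
    attempt=1
    while [ "$attempt" -le "$4" ]; do
        if mount -t nfs -o "$3" "$1" "$2"; then
            return 0
        fi
        echo "mount attempt $attempt of $4 failed" >&2
        attempt=$((attempt + 1))
        sleep 5   # assumed pause between attempts
    done
    return 1
}
```

It could be called as, for example,
try_mount 10.0.0.1:/home /home "$(nfs_opts 8192 30 5)" 3
which makes it easy to see whether a given timeo value changes how many
attempts are needed.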

Additionally, a node that is already up can suddenly go missing (from
/var/log/messages):

 Jan 18 15:57:26 gigawulf bpmaster: ping timeout on slave 0

One networking problem we had was that the NICs on the nodes would
autonegotiate half duplex with the switch. This resulted in the following
error message:

 eth0: Transmit error, Tx status register 82.

as well as very poor network throughput, especially from the nodes to the
master. The solution was to force the NICs on the nodes into full-duplex mode.
The odd thing is that the master's external connection to our network goes to
a switch of the same make, through a 100 Mb NIC identical to those in the
nodes, and it autonegotiates full duplex just fine ...
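
A small sketch of how the negotiated duplex could be checked from a status
line, assuming mii-tool is available on the node and reports the link in its
usual form (for example "eth0: negotiated 100baseTx-FD, link ok"). The helper
name duplex_of is hypothetical, and the output format is an assumption rather
than something taken from this cluster.

```shell
#!/bin/sh
# Sketch only: classify a mii-tool style status line as full or half duplex.

duplex_of() {
    # $1 = one status line, e.g. the output of: mii-tool eth0
    case "$1" in
        *-FD*) echo full ;;     # media names ending in -FD are full duplex
        *-HD*) echo half ;;     # media names ending in -HD are half duplex
        *)     echo unknown ;;  # no recognizable media name in the line
    esac
}
```

On a node this could be used as duplex_of "$(mii-tool eth0)" after each boot,
to confirm that the forced setting is still in effect.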

On the master we still use MTU=1500 for the Gigabit link but have increased
the maximum socket buffer sizes:

 echo 262144 > /proc/sys/net/core/rmem_max
 echo 262144 > /proc/sys/net/core/wmem_max
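
A short sketch for reading the two limits back to confirm that they took
effect. The function name show_bufs and its optional directory argument are
hypothetical; by default it reads the same /proc files written above.

```shell
#!/bin/sh
# Sketch only: print the current socket buffer limits.

show_bufs() {
    # $1 = directory holding the limit files (defaults to the real /proc path)
    dir=${1:-/proc/sys/net/core}
    for name in rmem_max wmem_max; do
        printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
    done
}
```

Running show_bufs with no argument on the master prints one name=value line
per limit, which is useful after a reboot since values written to /proc do not
persist.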

System Information:
===================
Motherboard: ASUS-K7V
Processor  : AMD Athlon, 950 MHz

Networking - Master: 3C985B-SX (acenic driver)
Networking - Nodes : 3C905C Tornado 100baseTx (3c59x driver)
Networking - Switch: 3Com SuperStack 3 Switch 3300 with Gigabit expansion

Does anybody have an idea what could cause the RPC timeouts, or how to find
out more about this problem?

Thanks

Robert