node problems
Kim Branson
Kim.Branson at csiro.au
Thu Apr 4 07:38:50 PST 2002
Hi all
I have a 64-node Athlon cluster. At the moment about 19 nodes are
flaky: they stay up for a bit and then fall over. One can still ping
them but not telnet or ftp to them. I'm trying to keep as many up as
possible (more nodes means I can get the final calculations done for my
PhD thesis faster....)
This may be an unrelated problem, but I see errors in the logs about
telnet:
node01 telnetd[16941]: ttloop: peer died: EOF
xinetd[17099]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:21 node01 rlogind[17099]: Can't get peer name of remote host: Transport endpoint is not connected
Apr 5 00:32:21 node01 rshd[17098]: getpeername: Transport endpoint is not connected
Apr 5 00:32:21 node01 ftpd[17097]: getpeername (in.ftpd): Transport endpoint is not connected
Apr 5 00:32:31 node01 rlogind[17100]: Can't get peer name of remote host: Transport endpoint is not connected
Apr 5 00:32:31 node01 xinetd[17101]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:31 node01 xinetd[17102]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:31 node01 xinetd[17103]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:31 node01 ftpd[17101]: getpeername (in.ftpd): Transport endpoint is not connected
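
The entries above arrive in bursts: several daemons on node01 complain within the same second. A minimal sketch for making that pattern visible is to group syslog entries by host and timestamp. It assumes each entry sits on a single line (mail-wrapped lines would need rejoining first) and uses the classic "Mon D HH:MM:SS host prog[pid]: msg" layout; the names `LINE_RE` and `bursts` are made up for this illustration.

```python
import re
from collections import defaultdict

# Classic syslog layout: "Apr 5 00:32:21 node01 rshd[17098]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<prog>[^\s\[]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)

def bursts(lines):
    """Map (host, timestamp) to the sorted list of daemons that logged then."""
    by_second = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # lines without a timestamp or pid are skipped
            by_second[(m.group("host"), m.group("ts"))].add(m.group("prog"))
    return {key: sorted(progs) for key, progs in by_second.items()}
```

Several unrelated services failing with "Connection reset by peer" in the same second would be consistent with a client that connects and drops the connection immediately, but that reading is an assumption, not something the logs prove.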
I am using enfuzion to do job dispatch and collection. Looking at
the packets, I see the enfuzion director on the head node attempt to
send a UDP packet to the node. All UDP ports on the nodes are blocked;
I checked this by scanning a node with nmap. Older installs of Red Hat
(i.e. my workstation) seem to have UDP ports enabled.
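
For a quick cross-check of a single UDP port without a full scan, something like the sketch below could be used. It sends one datagram and classifies the result in the same rough way a UDP scan does: a reply means open, an ICMP port-unreachable (seen as a refused connection) means closed, and silence means open or filtered. The function name `probe_udp`, the one-byte payload, and the timeout are assumptions for illustration; a service such as the enfuzion node daemon may well ignore an arbitrary payload, so silence proves little.

```python
import socket

def probe_udp(host, port, timeout=2.0, payload=b"\x00"):
    """Send one datagram and report 'open', 'closed', or 'open|filtered'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            # connect() on UDP only fixes the peer address, which lets the
            # kernel report an ICMP port-unreachable as ConnectionRefusedError.
            s.connect((host, port))
            s.send(payload)
            s.recv(1024)
            return "open"           # something answered
        except ConnectionRefusedError:
            return "closed"         # ICMP port unreachable came back
        except socket.timeout:
            return "open|filtered"  # no reply and no ICMP error
```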
Regardless of the ttloop error, the machine appears to work for a while,
i.e. enfuzion logs in, jobs run, etc., until suddenly everything stops.
The machines remain up and can be pinged, but no other services (rsh,
ssh, etc.) respond. If I connect a monitor and keyboard to the node, it
is also unresponsive.
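
Since the failed nodes still answer ping, a ping sweep will not find them. A rough sketch of a check that does: connect to a TCP service and wait for its greeting. It assumes the chosen service sends something first on connect (sshd and ftpd do); the node names and the helper names `service_responds` and `unresponsive_nodes` are hypothetical.

```python
import socket

def service_responds(host, port, timeout=3.0):
    """True if host:port accepts a connection AND sends at least one byte.

    Waiting for the greeting matters: a wedged node may still complete the
    TCP handshake in the kernel while no user process is able to run.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return len(conn.recv(1)) > 0
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False

def unresponsive_nodes(hosts, port=22, timeout=3.0):
    """Return the hosts whose service on `port` did not send a greeting."""
    return [h for h in hosts if not service_responds(h, port, timeout)]

# Example (hypothetical node names):
#   unresponsive_nodes(["node%02d" % i for i in range(1, 65)])
```

Run from the head node at intervals, this would at least give the time at which each node stopped responding, which can then be lined up against the job logs.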
This is a problem across many nodes.
Has anyone who uses enfuzion seen this error on nodes with an RH 7.1
install?
On one node I have seen, on two occasions:
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d40040000000017a at 540040000000017a
Decoding this using a utility I found on the net gives:
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(2): f60020000000017a @ 760020000000017a
External tag parity error
Correctable ECC error
MISC register information valid
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O
Can anyone tell me what "Restart IP invalid" means? Is this a dead CPU
or a memory problem causing an MCE?
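
As a starting point for reading these numbers by hand, the sketch below decodes the flag bits of the two status values. It assumes the number after "Machine Check Exception:" is the global MCG_STATUS register and the "Bank 2" value is that bank's MCi_STATUS register, with bit positions as documented for the x86 machine-check architecture; the function names are made up here. In that layout "Restart IP invalid" corresponds to RIPV being clear, meaning the saved instruction pointer cannot be relied on to resume the interrupted program. On its own that flag does not say whether the CPU or the memory is at fault.

```python
# Bit positions per the x86 machine-check architecture.
MCG_BITS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}
BANK_BITS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting enabled for this error
    "MISCV": 59,  # MISC register valid
    "ADDRV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
}

def _flags(value, bits):
    return {name: bool((value >> pos) & 1) for name, pos in bits.items()}

def decode_mcg_status(value):
    """Decode the global machine-check status flags."""
    return _flags(value, MCG_BITS)

def decode_bank_status(value):
    """Decode a bank status word; the low 16 bits are the MCA error code."""
    flags = _flags(value, BANK_BITS)
    flags["mca_error_code"] = value & 0xFFFF
    return flags
```

For the values logged above, `decode_mcg_status(0x4)` reports MCIP set with RIPV and EIPV clear, and `decode_bank_status(0xd40040000000017a)` reports VAL, OVER, EN and ADDRV set, UC, MISCV and PCC clear, and an error code of 0x017a.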
cheers
Kim
--
______________________________________________________________________
Kim Branson
Phd Student
Structural Biology
CSIRO Health Sciences and Nutrition
Walter and Eliza Hall Institute
Royal Parade, Parkville, Melbourne, Victoria
Ph 61 03 9662 7136
Email kbranson at wehi.edu.au
______________________________________________________________________