node problems

Thu Apr 4 07:38:50 PST 2002

Hi all

I have a 64-node Athlon cluster. At the moment about 19 nodes are
flaky: they stay up for a while and then fall over. They still answer
ping, but not telnet or ftp. I'm trying to keep as many up as possible
(more nodes means I can get the final calculations done for my PhD
thesis faster...).

This may be an unrelated problem, but I see errors in the logs about
telnet:

node01 telnetd[16941]: ttloop: peer died: EOF 
xinetd[17099]: warning: can't get client address: Connection reset by
peer
Apr  5 00:32:21 node01 rlogind[17099]: Can't get peer name of remote
host: Transport endpoint is not connected
Apr  5 00:32:21 node01 rshd[17098]: getpeername: Transport endpoint is
not connected
Apr  5 00:32:21 node01 ftpd[17097]: getpeername (in.ftpd): Transport
endpoint is not connected
Apr  5 00:32:31 node01 rlogind[17100]: Can't get peer name of remote
host: Transport endpoint is not connected
Apr  5 00:32:31 node01 xinetd[17101]: warning: can't get client address:
Connection reset by peer
Apr  5 00:32:31 node01 xinetd[17102]: warning: can't get client address:
Connection reset by peer
Apr  5 00:32:31 node01 xinetd[17103]: warning: can't get client address:
Connection reset by peer
Apr  5 00:32:31 node01 ftpd[17101]: getpeername (in.ftpd): Transport
endpoint is not connected

I am using EnFuzion to do job dispatch and collection. Looking at
the packets, I see the EnFuzion director on the head node attempt to
send a UDP packet to the node. All UDP ports on the nodes appear to be
blocked; I checked this by scanning a node with nmap. Older installs of
Red Hat (e.g. my workstation) seem to have UDP ports enabled.
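
For anyone who wants to repeat the UDP check without nmap, here is a
minimal sketch of the same idea in Python. It assumes Linux behaviour,
where an ICMP "port unreachable" reply surfaces as a refused connection
on a connected UDP socket. The function name, the payload, and the
host/port in the example are placeholders of mine, not anything EnFuzion
defines. Note that a silent timeout cannot distinguish an open port that
simply does not answer from a firewall that drops the packet.

```python
import socket


def probe_udp(host, port, timeout=1.0, payload=b"\x00"):
    """Send one UDP datagram and classify the port by what comes back.

    Returns "open" if any reply arrives, "closed" if the host answers
    with ICMP port unreachable (seen as ConnectionRefusedError), and
    "open|filtered" if nothing comes back before the timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # connect() on a UDP socket sends nothing; it only lets the
        # kernel report ICMP errors for this destination back to us.
        sock.connect((host, port))
        try:
            sock.send(payload)
            sock.recv(1024)
            return "open"
        except ConnectionRefusedError:
            return "closed"
        except socket.timeout:
            return "open|filtered"


if __name__ == "__main__":
    # Hypothetical node name and port, for illustration only.
    print(probe_udp("node01", 10101))
```

If every port on a node comes back "open|filtered", that points to
packets being dropped rather than rejected, which would be consistent
with a firewall rule rather than with nothing listening.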

Regardless of the ttloop error, the machine appears to work for a while,
i.e. EnFuzion logs in, jobs run, etc., until suddenly everything stops.
The machines remain up and can be pinged, but no other services (rsh,
ssh, etc.) respond. If I connect a monitor and keyboard to the node, it
is also unresponsive.
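
Since ping alone does not show whether a node is really alive, a rough
sketch of a sweep that asks each service for its greeting may help sort
the hung nodes from the healthy ones. It assumes the nodes are named
node01 to node64 (only node01 appears in my logs, the rest is a guess)
and that ftp, ssh and telnet each send some bytes right after the
connection is accepted. A plain TCP connect is not enough here, because
the kernel can complete the handshake even when userland is wedged.

```python
import socket


def read_banner(host, port, timeout=3.0):
    """Connect to host:port and return the first bytes the service sends.

    Returns None if the connection fails, is closed without data, or
    nothing arrives before the timeout (the "pings but dead" case).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(128) or None
    except OSError:  # includes refused connections and timeouts
        return None


def sweep(hosts, ports, timeout=3.0):
    """Map each host to the list of ports that produced a greeting."""
    return {
        host: [p for p in ports if read_banner(host, p, timeout) is not None]
        for host in hosts
    }


if __name__ == "__main__":
    # Hypothetical naming scheme; adjust to the real node names.
    nodes = [f"node{i:02d}" for i in range(1, 65)]
    for host, alive in sweep(nodes, [21, 22, 23]).items():
        print(host, "responding on", alive if alive else "nothing")
```

Run from the head node at intervals, this would at least give a
timestamped record of when each node stops answering.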

This is a problem across many nodes.
Has anyone who uses EnFuzion seen this error on nodes running an RH 7.1
install?

On one node I have seen the following on two occasions:

CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d40040000000017a at 540040000000017a

Decoding this using a utility I found on the net gives:

Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(2): f60020000000017a @ 760020000000017a
        External tag parity error
        Correctable ECC error
        MISC register information valid
        Memory heirarchy error
        Request: Generic error
        Transaction type : Generic
        Memory/IO : I/O
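
The flag bits can also be picked apart by hand. The sketch below assumes
that the first number in the kernel message is the global MCG_STATUS
register and the second is the bank's status register, and that both
follow the usual x86 machine-check layout (RIPV/EIPV/MCIP in bits 0 to 2
of MCG_STATUS; VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to
57 of the bank status). I have not checked that layout against the
Athlon documentation, so treat it as an assumption. The function and
table names are my own.

```python
# Assumed layout: architectural x86 machine-check status registers.
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
BANK_STATUS_BITS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # reporting enabled for this error
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def set_flags(value, bit_names):
    """Return the names of the bits set in value, highest bit first."""
    return [
        name
        for bit, name in sorted(bit_names.items(), reverse=True)
        if (value >> bit) & 1
    ]


def decode_mce(mcg_status, bank_status):
    """Split the two 64-bit values from an MCE report into fields."""
    return {
        "mcg_flags": set_flags(mcg_status, MCG_STATUS_BITS),
        "bank_flags": set_flags(bank_status, BANK_STATUS_BITS),
        "mca_error_code": bank_status & 0xFFFF,
        "model_specific_code": (bank_status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Values copied from the kernel message on the failing node.
    fields = decode_mce(0x0000000000000004, 0xD40040000000017A)
    print("MCG_STATUS flags: ", fields["mcg_flags"])
    print("Bank status flags:", fields["bank_flags"])
    print("MCA error code:    0x%04x" % fields["mca_error_code"])
```

With the values above this reports only MCIP set in the global status,
and VAL, OVER, EN and ADDRV set in the bank status, with an error code
of 0x017a. If the assumed layout holds, "Restart IP invalid" would
correspond to RIPV (bit 0) being clear, though that by itself does not
say which component is at fault.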

Can anyone tell me what "Restart IP invalid" means? Is this a dead CPU,
or a memory problem causing an MCE?

cheers

Kim
-- 
______________________________________________________________________ 

Kim Branson
Phd Student
Structural Biology
CSIRO Health Sciences and Nutrition
Walter and Eliza Hall Institute
Royal Parade, Parkville, Melbourne, Victoria
Ph 61 03 9662 7136
Email kbranson at wehi.edu.au

______________________________________________________________________