Processes get SIGKILL after about 15 secs (Scyld)

Thomas Clausen tclausen at wesleyan.edu
Wed Mar 26 14:23:42 PST 2003


Hi,

I have a problem I can't solve:

We are running Scyld with kernel

Linux version 2.4.17-0.18.18_Scyldsmp (support at builder.scyld.com) (gcc
version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 SMP Thu Jul 11
18:54:56 EDT 2002

on 70 nodes. Twenty of them are newly acquired dual Athlons on Tyan 2466
boards. When I start a process on any of these nodes (e.g. bpsh 64 sleep 500),
it gets a SIGKILL after about 15 seconds:

[pid  5685] nanosleep({500, 0}, 0)      = -1 EINTR (Interrupted system call)
[pid  5684] <... select resumed> )      = 3 (in [4 5 6], left {286, 370000})
[pid  5685] +++ killed by SIGKILL +++
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
close(4)                                = 0
read(5, "", 4096)                       = 0
close(5)                                = 0
read(6, "", 4096)                       = 0
close(6)                                = 0
wait4(-1, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], 0, NULL) = 5685
write(2, "bpsh: Child process exited abnor"..., 39bpsh: Child process exited abnormally.
) = 39
wait4(-1, 0xbffff548, 0, NULL)          = -1 ECHILD (No child processes)
_exit(255)                              = ?
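
For readers following the trace: the wait4() line is where the parent learns
that the child was killed by a signal rather than exiting on its own. Below is
a minimal sketch of that same check, not anything taken from bpsh itself. It
assumes a Linux/POSIX host with Python 3; the helper names run_and_kill and
describe_exit are made up for illustration, and a Python one-liner stands in
for "sleep 500". Here the parent sends the SIGKILL itself, so it only shows
how the status is decoded, not where the signal on the cluster comes from.

```python
import signal
import subprocess
import sys
import time


def describe_exit(returncode):
    """Turn a Popen return code into a short human-readable description."""
    if returncode < 0:
        # subprocess reports "terminated by signal N" as the return code -N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_kill(delay=0.2):
    """Start a long sleep, SIGKILL it after `delay` seconds, return its return code."""
    # Stand-in for "sleep 500": a child that would otherwise sleep for 500 s.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(500)"]
    )
    time.sleep(delay)
    child.send_signal(signal.SIGKILL)  # SIGKILL cannot be caught or ignored
    return child.wait()


if __name__ == "__main__":
    print(describe_exit(run_and_kill()))
```

The negative return code here carries the same information as the
WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL status in the trace above.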

I have tried to find out where the signal comes from, but without success.
I can run memtest86 (booted from floppy) on the machines, and the hardware
seems to be running fine. I'm at a loss...

Thomas

-- 
   .^.    Thomas Clausen, graduate student
   /V\    Physics Department, Wesleyan University, CT
  // \\   Tel 860-685-2018, fax 860-685-2031
 /(   )\  
  ^^-^^   Use Linux


