> I am using mpich2 on linux cluster, I kept having errors like the following
>
> rank 14 in job 2 cn128_57798 caused collective abort of all ranks
> exit status of rank 14: killed by signal 9

Signal 9 is SIGKILL (not SEGV or ABRT, etc.), and I'd be a bit surprised if this happened other than by someone killing the process.
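To show why "killed by signal 9" points at something outside the program, here is a small sketch. It is plain Python on Linux/POSIX, not MPICH2 code, and the function names are made up for illustration. A parent starts a child that never crashes, kills it with SIGKILL, and reads back the termination signal the way a launcher could. It also shows that SIGKILL can't be caught, so the victim never gets a chance to print an error or backtrace.

```python
import signal
import subprocess
import sys
from typing import Tuple


def exit_signal_of_killed_child() -> Tuple[int, str]:
    """Start a child that would just sleep, SIGKILL it from outside,
    and report the signal number and name the parent observes."""
    # The child does nothing wrong on its own; it only sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    # Some *other* process (here, the parent) sends signal 9.
    child.send_signal(signal.SIGKILL)
    returncode = child.wait()
    # On POSIX, subprocess reports "terminated by signal N" as returncode -N.
    signum = -returncode
    return signum, signal.Signals(signum).name


def sigkill_is_catchable() -> bool:
    """Try to install a handler for SIGKILL; the OS refuses, because
    SIGKILL cannot be caught, blocked, or ignored."""
    try:
        signal.signal(signal.SIGKILL, lambda signum, frame: None)
    except (OSError, ValueError):
        return False
    return True


if __name__ == "__main__":
    signum, name = exit_signal_of_killed_child()
    print(f"child killed by signal {signum} ({name})")
    print("SIGKILL catchable:", sigkill_is_catchable())
```

A genuine crash would normally show up as SIGSEGV or SIGABRT instead. A bare signal 9 means some other process decided to kill the rank.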