Hi,<br>We have a 100-node cluster and are seeing a strange problem with Torque 2.4.8.<br>A job submitted interactively for 256 cores produces the following errors in the PBS server log:<br><br>PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in<br>
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in<br><br>The master MOM also reports:<br>pbs_mom: LOG_ERROR::node_bailout, 2004.nodesvr.clust1.in join_job failed from node07.clust1.in 17 - recovery attempted)<br>
pbs_mom: LOG_ERROR::sister could not communicate (15059) in 2004.nodesvr.clust1.in job_start_error from node node0.clust1.in in job_start_error<br>
Jan 7 08:49:54 node07 pbs_mom: LOG_ERROR::exec_bail, exec_bail: sent 16 ABORT requests, should be 20<br>node_bailout, node_bailout: received KILL/ABORT request for job 2004.nodesvr.clust1.in from node node07.clust1.in<br>
<br>The node07 log says:<br>pbs_mom;Job;2004.nodesvr.clust1.in;JOIN JOB as node 15<br>pbs_mom;Svr;pbs_mom;LOG_ERROR::Transport endpoint is not connected (107) in im_request, rpp_flush<br>
<br>The job could not get a shell for 40 minutes, after which we got a shell on the master MOM node.<br><br>We have not been able to find the exact cause. Any help will be appreciated.<br><br>--<br>Akshar B.<br>