[Beowulf] error starting job : stray job; master mom log says : can not compose message to sister
akshar bhosale
akshar.bhosale at gmail.com
Fri Jan 7 21:01:32 PST 2011
Hi,
We have a 100-node cluster running Torque 2.4.8, and we are seeing a strange
problem on it.
A job submitted interactively for 256 cores produces the following errors in
the PBS server log:
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in
The master MOM log also says:
pbs_mom: LOG_ERROR::node_bailout, 2004.nodesvr.clust1.in join_job failed from node07.clust1.in 17 - recovery attempted)
pbs_mom: LOG_ERROR::sister could not communicate (15059) in 2004.nodesvr.clust1.in job_start_error from node node0.clust1.in in job_start_error
Jan 7 08:49:54 node07 pbs_mom: LOG_ERROR::exec_bail, exec_bail: sent 16 ABORT requests, should be 20
node_bailout, node_bailout: received KILL/ABORT request for job 2004.nodesvr.clust1.in from node node07.clust1.in
The node07 log says:
pbs_mom;Job;2004.nodesvr.clust1.in;JOIN JOB as node 15
pbs_mom;Svr;pbs_mom;LOG_ERROR::Transport endpoint is not connected (107) in im_request, rpp_flush
The job could not get a shell for 40 minutes; after that we got a shell on the
master MOM node.
We have not been able to identify the exact cause. Any help will be appreciated.
--
Akshar B.