[Beowulf] error starting job: stray job; master mom log says: can not compose message to sister

akshar bhosale akshar.bhosale at gmail.com
Fri Jan 7 21:01:32 PST 2011

We have a 100-node cluster running Torque and are seeing a strange problem.
A job submitted interactively for 256 cores produces the following errors in
the PBS server log:

PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in
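
To see at a glance which nodes report a given job as stray, the server log
entries above could be tallied with a short script. This is only a rough
sketch: it assumes each entry occupies a single line in the log file and
matches the wording shown above, and the name stray_jobs_by_node is made up
for illustration (it is not part of Torque).

```python
import re
from collections import defaultdict

# Pattern assumed from the two server log entries quoted above.
STRAY_RE = re.compile(r"sync_node_jobs, stray job (\S+) found on (\S+)")


def stray_jobs_by_node(lines):
    """Map each stray job ID to the sorted list of nodes reporting it."""
    found = defaultdict(set)
    for line in lines:
        match = STRAY_RE.search(line)
        if match:
            job_id, node = match.groups()
            found[job_id].add(node)
    return {job: sorted(nodes) for job, nodes in found.items()}


# The two entries from this report, one per line.
sample = [
    "PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in",
    "PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in",
]

for job, nodes in stray_jobs_by_node(sample).items():
    print(job, "->", ", ".join(nodes))
```

Run against the full server log, this would show whether the stray reports
are limited to node05 and node07 or spread across more of the job's nodes.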

The master mom log also says:
pbs_mom: LOG_ERROR::node_bailout, 2004.nodesvr.clust1.in join_job failed from node07.clust1.in 17 - recovery attempted)
pbs_mom: LOG_ERROR::sister could not communicate (15059) in 2004.nodesvr.clust1.in job_start_error from node node0.clust1.in   in jo
Jan  7 08:49:54  node07 pbs_mom: LOG_ERROR::exec_bail, exec_bail: sent 16 ABORT requests, should be 20
node_bailout, node_bailout: received KILL/ABORT request for job 2004.nodesvr.clust1.in from node node07.clust1.in

The node07 log says:
pbs_mom;Job;2004.nodesvr.clust1.in;JOIN JOB as node 15
pbs_mom;Svr;pbs_mom;LOG_ERROR::Transport endpoint is not connected (107) in im_request, rpp_flush

The job could not get a shell for 40 minutes; after that we got a shell on
the master mom node.

We have not been able to pin down the exact issue. Any help will be appreciated.

Akshar B.