[Beowulf] error starting job : stray job; master mom log says : can not compose message to sister

John Hearns hearnsj at googlemail.com
Sat Jan 8 00:08:40 PST 2011


On 8 January 2011 05:01, akshar bhosale <akshar.bhosale at gmail.com> wrote:
> hi,
> we have 100 nodes cluster. we have strange problem on cluster with torque
> 2.4.8
> a job submitted for 256 cores interactively gives following error in pbs
> server :
>
> PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
> on node07.clust1.in
> PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
> on node05.clust1.in

Mark both nodes, node05 and node07, offline in your scheduler so that
no new jobs land on them, then resubmit your job.
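
One way to do that, as a rough sketch: if you have nothing scheduler-specific to hand, marking the nodes offline in Torque itself with pbsnodes should keep the scheduler away from them. The DRY variable and the two function names are my own illustration, not anything Torque provides. By default the script only prints the commands; rerun it with DRY= (empty) as a Torque manager or operator to actually apply them.

```shell
#!/bin/sh
# Hypothetical helper. DRY defaults to "echo", so commands are printed, not run.
DRY=${DRY-echo}
NODES="node05.clust1.in node07.clust1.in"

offline_nodes() {
    # pbsnodes -o adds the OFFLINE state to the listed nodes
    $DRY pbsnodes -o $NODES
}

online_nodes() {
    # pbsnodes -c clears OFFLINE again once the nodes are healthy
    $DRY pbsnodes -c $NODES
}

offline_nodes
```

Once the nodes have been checked, online_nodes puts them back into service. pbsnodes -l lists the nodes currently marked down or offline.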

When you have time, log into those nodes and look at the system logs
from around the time the failed job started, and at the pbs_mom log.
Are the nodes mounting the user's home directory? Are they
authenticating properly, i.e. are they contacting their NIS or LDAP
server?
Run   ps -eaf --forest   on the nodes. Do you see any processes
belonging to job 2004?
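
The checks above could be bundled roughly as follows. This is a sketch under assumptions: /var/spool/torque is only the usual default spool directory (set PBS_HOME if yours differs), and check_node and mom_log_path are names I made up for illustration. Source it on node05 and node07 and run check_node with the job owner's username.

```shell
#!/bin/sh
# Sketch only: source this on each suspect node, then run: check_node <user>
# JOBID is taken from the server log; the spool path is an assumed default.
JOBID=2004.nodesvr.clust1.in

# pbs_mom writes one log per day, named YYYYMMDD, under $PBS_HOME/mom_logs.
mom_log_path() {
    echo "${PBS_HOME:-/var/spool/torque}/mom_logs/${1:-$(date +%Y%m%d)}"
}

check_node() {
    user=$1
    # Does the account resolve (NIS/LDAP), and is its home directory present?
    home=$(getent passwd "$user" | cut -d: -f6)
    if [ -n "$home" ] && [ -d "$home" ]; then
        echo "account ok, home present: $home"
    else
        echo "PROBLEM: $user unresolved or home directory missing"
    fi
    # Processes still owned by the job's user, shown as a tree.
    ps -f --forest -u "$user"
    # What the mom logged about the job today.
    grep "$JOBID" "$(mom_log_path)" || echo "no mom log entries for $JOBID"
}
```

To look at a different day's mom log, pass the date, e.g. mom_log_path 20110108 gives the log file for 8 January 2011.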


