[Beowulf] SGE forgetting its queues at restart following node rename

mathog mathog at caltech.edu
Wed Sep 21 12:35:15 PDT 2011


FYI  (Just to have it posted, in case anybody else ever runs into this.)

A little while back I moved some names around in the cluster.  To do
so, a bunch of queues and some hosts were removed from SGE and then
added back.  There was much trial and error in doing so - I make no
claim that the right commands were issued in the proper order.  However,
in the end all the queues were as desired and they all stayed up and
running.  Until the node was rebooted, at which point SGE came back up
with only two queues present.  After much poking around the problem was
finally located:  some of the old host names and old queues were still
present in files under:

   $SGEROOT/default/spool/qmaster/qinstances

and as soon as SGE hit one of those during startup, it would stop
creating any further queues.  The error message that resulted when that
happened was of this form:

   09/21/2011 12:22:56|qmaster|safserver|E|cannot recreate queue all.q from disk because of unknown host mendel

and appeared in:

   $SGEROOT/default/spool/qmaster/messages
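
For anybody who wants to check for this condition without reading the
log by eye, here is a rough Python sketch.  It is not an SGE tool, just
an illustration: the regular expression is modeled only on the single
error line quoted above (other SGE versions may word it differently),
and the function names find_stale_entries and find_files_mentioning are
made up for this example.  The second function does a plain substring
search of file names and file contents under a directory such as the
qinstances one above; it makes no assumption about how SGE lays out
those files, and it may over-report (e.g. "mendel" also matches
"mendel2").

```python
import os
import re

# Modeled on the one error line quoted in this post (assumption: the
# wording is the same in your SGE version).
STALE_RE = re.compile(
    r"cannot recreate queue (\S+)\s+from disk because of unknown host (\S+)"
)


def find_stale_entries(log_text):
    """Return sorted, unique (queue, host) pairs named in such errors."""
    pairs = set()
    for match in STALE_RE.finditer(log_text):
        pairs.add((match.group(1), match.group(2)))
    return sorted(pairs)


def find_files_mentioning(root, host):
    """Return sorted paths under root whose name or contents contain host.

    Plain substring match, so it can report false positives.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found = host in name
            if not found:
                try:
                    with open(path, errors="ignore") as fh:
                        found = host in fh.read()
                except OSError:
                    # Unreadable file: skip it rather than abort the scan.
                    continue
            if found:
                hits.append(path)
    return sorted(hits)


# Example using the error line from this post.
sample = (
    "09/21/2011 12:22:56|qmaster|safserver|E|cannot recreate queue all.q "
    "from disk because of unknown host mendel\n"
)
print(find_stale_entries(sample))  # prints [('all.q', 'mendel')]
```

To use it, feed the contents of the messages file to
find_stale_entries, then call find_files_mentioning on the qinstances
directory for each host it reports.  It only lists candidates; what to
do with those files is a separate decision, best made with qmaster
stopped and a backup of the spool directory in hand.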

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


