[Beowulf] SGE forgetting its queues at restart following node rename
mathog
mathog at caltech.edu
Wed Sep 21 12:35:15 PDT 2011
FYI (just to have it posted, in case anybody else ever runs into this).
A little while back I moved some names around in the cluster. To do so,
a bunch of queues and some hosts were removed from SGE and then added
back. There was much trial and error in doing so; I make no claim that
the right commands were issued in the proper order. However, in the end
all the queues were as desired, and they all stayed up and running. That
lasted until the node was rebooted, at which point SGE came back up with
only two queues present. After much poking around the problem was
finally located: some of the old host names and old queues were still
present in files under:
$SGEROOT/default/spool/qmaster/qinstances
and as soon as SGE hit one of those during startup, it would stop
creating all further queues.
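For anyone who wants to check a spool directory for this kind of leftover, here is a minimal Python sketch. It is not part of SGE and not what was run here; the function name find_stale_spool_files is made up for illustration. It assumes the spool is kept as plain files on disk, and that an old host name shows up either as a file or directory name under qinstances or as a whole word inside a file. It only reports matches and changes nothing.

```python
import os
import re


def find_stale_spool_files(qinstances_dir, old_names):
    """Return sorted paths under qinstances_dir that mention any old name.

    A file counts as a match if one of its path components (relative to
    qinstances_dir) equals an old name, or if its text contains the name
    as a whole word. Hypothetical helper: it only reads, never deletes.
    """
    patterns = [re.compile(r"\b" + re.escape(n) + r"\b") for n in old_names]
    hits = []
    for root, _dirs, files in os.walk(qinstances_dir):
        for fname in files:
            path = os.path.join(root, fname)
            parts = os.path.relpath(path, qinstances_dir).split(os.sep)
            try:
                with open(path, "r", errors="replace") as fh:
                    text = fh.read()
            except OSError:
                text = ""  # unreadable file: fall back to the name check only
            name_hit = any(n in parts for n in old_names)
            text_hit = any(p.search(text) for p in patterns)
            if name_hit or text_hit:
                hits.append(path)
    return sorted(hits)
```

The directory is passed in by the caller, for example the qinstances path given above with the variable expanded, together with the list of host names that were retired.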
The error message that resulted when that happened was of this form:
09/21/2011 12:22:56|qmaster|safserver|E|cannot recreate queue all.q from disk because of unknown host mendel
and appeared in:
$SGEROOT/default/spool/qmaster/messages
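A quick way to see which queues and hosts are affected is to pull these errors out of the messages file. The Python sketch below assumes each error is a single line of the form quoted above; the regular expression is derived from that one example only, and the function name unknown_host_errors is hypothetical.

```python
import re

# Derived from the single example line in this post; other SGE versions
# may word the message differently (assumption).
PATTERN = re.compile(
    r"\|E\|cannot recreate queue (\S+) from disk because of unknown host (\S+)"
)


def unknown_host_errors(lines):
    """Return (queue, host) pairs for each matching error line."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


sample = [
    "09/21/2011 12:22:56|qmaster|safserver|E|cannot recreate queue all.q "
    "from disk because of unknown host mendel",
]
print(unknown_host_errors(sample))  # [('all.q', 'mendel')]
```

To run it against the real log, open the messages file listed above and pass the file object to unknown_host_errors; it iterates line by line.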
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech