[Beowulf] after update sgeexecd not starting correctly on reboot
David Mathog
mathog at caltech.edu
Tue Nov 25 14:40:38 PST 2008
This is an odd one, and I hope one of you has seen it and fixed it,
because the only way I have been able to trigger the bug is through a
reboot.
I updated one node from Mandriva 2007.1 to 2008.1. Both use 2.6.x
kernels and are, as you might guess, about a year apart. Both use
the exact same SGE distribution, which is NFS mounted on /usr/SGE6.
On a reboot of the newer system, /etc/rc.d/init.d/sgeexecd fails. It is
the last thing to start in runlevel 3 (except for S99local, which does
nothing except "touch /var/lock/subsys/local"). First it spews a bunch
of lines which look like a script did "set", which as a side effect
pushes all the other text lines off the console, and then
it emits
can't determine path to Grid Engine binaries
without starting sge_execd. On the older system the exact same script
starts up with none of this drama, leaving sge_execd running.
However, once I log on as root at the console on the newer system,
sge_execd happily starts up with:
/etc/rc.d/init.d/sgeexecd start
There are no SGE variables defined in .bashrc etc. The init script
has these prerequisites, as on the older system:
# Provides: sgeexecd
# Required-Start: $network $remote_fs
Ring any bells?
I think maybe the NFS mounting is different, so that the remote_fs
prerequisite isn't really satisfied, even though the associated script
has run. The sgeexecd script does include a test:
while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do
count=`expr $count + 1`
sleep 1
done
but since SGE_ROOT is the mount point, the directory exists whether or
not the NFS mount has completed, so the loop exits at once without
waiting. Maybe I'll change the test to look for $SGE_ROOT/bin and see
if it helps.
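A minimal sketch of that change, assuming that bin exists only on the
NFS-mounted filesystem and not under the bare mount point. The helper
name wait_for_dir and its 120-retry default are placeholders of mine,
not anything in the stock sgeexecd script:

```shell
#!/bin/sh
# Hypothetical helper: wait until a directory that lives only on the
# mounted filesystem appears, retrying once per second up to a limit.
wait_for_dir() {
    dir=$1            # e.g. "$SGE_ROOT/bin", absent until the mount completes
    limit=${2:-120}   # maximum number of one-second retries (assumed default)
    count=0
    while [ ! -d "$dir" ] && [ "$count" -lt "$limit" ]; do
        count=$((count + 1))
        sleep 1
    done
    # Exit status reports the outcome: 0 if the directory is there, 1 on timeout.
    [ -d "$dir" ]
}

# Possible use in the init script, before it looks for the binaries:
# wait_for_dir "$SGE_ROOT/bin" 120 || echo "timed out waiting for $SGE_ROOT/bin" >&2
```

Returning a status lets the init script tell a timeout apart from a
successful wait, which the original loop does not do.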
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech