[Beowulf] after update sgeexecd not starting correctly on reboot

David Mathog mathog at caltech.edu
Tue Nov 25 14:40:38 PST 2008


This is an odd one, and I hope one of you has seen it and fixed it,
because the only way I have been able to trigger the bug is through a
reboot.  

I updated one node from Mandriva 2007.1 to 2008.1.  Both run 2.6.x
kernels and, as you might guess, were released about a year apart.  Both use
the exact same SGE distribution, which is NFS mounted on /usr/SGE6.
On a reboot of the newer system, /etc/rc.d/init.d/sgeexecd, which is the
last thing to start in runlevel 3 (except for S99local, which doesn't do
anything except "touch  /var/lock/subsys/local") fails.  First it
spews a bunch of lines which look like a script did "set", and as a side
effect, this pushes all the other text lines off the console, and then
it emits

  can't determine path to Grid Engine binaries

without starting sge_execd.  On the older system the exact same script
starts up with none of this drama, leaving sge_execd running.

However, once I log on as root at the console on the newer system, it
happily starts up with:

/etc/rc.d/init.d/sgeexecd start

There are no SGE variables defined in .bashrc etc. The init script
has these prerequisites, as on the older system:

# Provides:       sgeexecd 
# Required-Start: $network $remote_fs

Ring any bells?  

I think maybe the NFS mounting is different, so that the remote_fs
prerequisite isn't really satisfied, even though the associated script
has run.  The sgeexecd script does include a test:

while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do
   count=`expr $count + 1`
   sleep 1
done

but since SGE_ROOT is the mount point, that directory exists whether or
not the NFS mount has completed, so the loop exits immediately either
way.  Maybe I'll change the test to $SGE_ROOT/bin and see if it helps.
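
For concreteness, here is a rough sketch of what that changed wait might
look like.  It is only an illustration, not the code shipped in the SGE
init script: the function name wait_for_dir and its timeout argument are
made up for this example, and it assumes that $SGE_ROOT/bin exists only
on the NFS export and not underneath the local mount point.

```shell
#!/bin/sh
# wait_for_dir DIR [TIMEOUT]
# Hypothetical helper (not part of the SGE scripts): poll once a second
# until DIR exists or TIMEOUT seconds (default 120) have passed.
# Returns 0 if DIR exists at the end, non-zero otherwise.
wait_for_dir() {
    dir=$1
    timeout=${2:-120}
    count=0
    while [ ! -d "$dir" ] && [ "$count" -lt "$timeout" ]; do
        count=$((count + 1))
        sleep 1
    done
    # Final check so the caller can tell success from timeout.
    [ -d "$dir" ]
}

# Intended use in the init script (assumption: bin/ lives on the export):
#   wait_for_dir "$SGE_ROOT/bin" 120 || echo "SGE_ROOT not mounted" >&2
```

Testing a directory below the mount point, rather than the mount point
itself, means the loop only ends early once the export is really there.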


Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


