Jobs hanging after a period of time
Jeff Layton
jeffrey.b.layton at lmco.com
Wed Oct 30 06:29:42 PST 2002
Good morning,
We've got a Linux cluster with 64 nodes of dual PIII/850's connected
with Fast Ethernet. We've also installed OpenPBS 2.3pl2-1 throughout
the cluster.
We have one node (penguin46) that only has the exechost RPM
installed. Jobs that include penguin46 will work for a day or two, and
then all of a sudden any job using penguin46 will hang. The actual
code starts running, but it stops making progress after a period of time.
The PBS logs show nothing except job start and job end (or job kill if we
do a qdel). Also, I see nothing strange at all in /var/log/messages.
I know this is a rather nebulous error description, but this is the best
we have been able to find out in over two weeks of checking. We have
reinstalled and reconfigured PBS on penguin46 several times with no change in
behavior. We have checked the NIC by flooding it with ping packets
(I know it's not an extensive test, but it's a start :) and we are starting to
look at the switch port as well.
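
One way to go beyond the ping flood might be to compare the interface error
counters on penguin46 before a job starts and again after it hangs. The
following is only a minimal sketch, assuming the usual Linux /proc/net/dev
layout with 16 counters per interface; the function names (parse_net_dev,
error_deltas, read_net_dev) are hypothetical helpers for illustration and are
not part of PBS.

```python
# Sketch: diff NIC error counters between two /proc/net/dev snapshots.
# Assumes 8 receive counters followed by 8 transmit counters per interface.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]
# Counters whose growth may hint at a NIC, cable, or switch port problem.
WATCHED = ["rx_errs", "rx_drop", "rx_frame",
           "tx_errs", "tx_drop", "tx_colls", "tx_carrier"]


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        # Partition on ':' because the name and first counter can be fused.
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        values = [int(v) for v in fields[:16]]
        counters = {}
        for key, value in zip(RX_FIELDS, values[:8]):
            counters["rx_" + key] = value
        for key, value in zip(TX_FIELDS, values[8:]):
            counters["tx_" + key] = value
        stats[name.strip()] = counters
    return stats


def error_deltas(before, after):
    """Return only the watched counters that grew between two snapshots."""
    deltas = {}
    for iface, counters in after.items():
        old = before.get(iface, {})
        grown = {key: counters[key] - old.get(key, 0)
                 for key in WATCHED
                 if counters[key] > old.get(key, 0)}
        if grown:
            deltas[iface] = grown
    return deltas


def read_net_dev(path="/proc/net/dev"):
    """Read the raw counter table from the running kernel."""
    with open(path) as handle:
        return handle.read()
```

The idea would be to save read_net_dev() output when a job starts, save it
again once the job hangs, and pass both parsed snapshots to error_deltas().
An empty result does not clear the NIC or the switch port; it only means
these particular counters did not move. Some collisions are normal on a
half-duplex link.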
Thanks!
Jeff
--
Jeff Layton
Senior Engineer
Lockheed-Martin Aeronautical Company - Marietta
Aerodynamics & CFD
"Is it possible to overclock a cattle prod?" - Irv Mullins
This email may contain confidential information. If you have received this
email in error, please delete it immediately, and inform me of the mistake by
return email. Any form of reproduction, or further dissemination of this
email is strictly prohibited. Also, please note that opinions expressed in
this email are those of the author, and are not necessarily those of the
Lockheed-Martin Corporation.