[Beowulf] disabling bad nodes
reuti at staff.uni-marburg.de
Mon Mar 27 12:19:34 PST 2006
On 26.03.2006 at 21:07, James Rustad wrote:
> This is a strange question, but is there any way to disable a bad node
> in PBS without being the system administrator?
> I am lining up about 50 jobs in the queue, and they fail sequentially
> when they hit the bad node. This often seems to happen on weekends,
> when nobody is around to reboot the node.
> Can I specify within PBS "don't use node015" or something like that?
> Jim Rustad
> I may be using TORQUE rather than PBS, by the way
I can't answer your question directly, but what is causing this
black hole in the cluster? I have run into this from time to time when
/tmp filled up on some nodes. As we are using SGE, I use its load-sensor
facility to check the free space there and put the node into alarm state
when space runs low, which disables the queues on that node. Maybe
something similar could also be implemented with Torque, to get some
self-healing at weekends. - Reuti
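
To illustrate the load-sensor idea, here is a minimal sketch in Python; it is not the script actually used on the cluster described above. It follows the usual SGE load-sensor convention: the execution daemon writes a line to the sensor's stdin at each load interval, the sensor answers with a block of the form begin / host:name:value / end, and the word "quit" asks it to exit. The complex name "tmpfree", the choice of reporting whole megabytes, and the assumption that socket.gethostname() returns the name SGE knows the host by are all illustrative assumptions. The administrator would still have to define the complex and set a matching load threshold on the queue, which is not shown here.

```python
#!/usr/bin/env python3
"""Sketch of an SGE-style load sensor that reports free space in /tmp."""
import shutil
import socket
import sys

COMPLEX_NAME = "tmpfree"  # hypothetical complex name, must be defined by the admin
CHECK_PATH = "/tmp"       # filesystem to watch


def free_megabytes(path):
    """Return the free space on the filesystem holding `path`, in whole MB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def format_report(host, name, value):
    """Build one load report block: begin, host:name:value, end."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def main():
    # Assumption: this matches the host name the scheduler uses for the node.
    host = socket.gethostname()
    while True:
        line = sys.stdin.readline()
        # Empty string means stdin was closed; "quit" is the shutdown request.
        if not line or line.strip() == "quit":
            break
        report = format_report(host, COMPLEX_NAME, free_megabytes(CHECK_PATH))
        sys.stdout.write(report)
        sys.stdout.flush()  # the daemon reads the report from a pipe


if __name__ == "__main__":
    main()
```

With a threshold on the reported value, a node whose /tmp fills up would go into alarm state on its own and stop receiving jobs until the space is freed, without anybody having to be around.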