node status

Robert G. Brown rgb at phy.duke.edu
Mon Oct 8 08:24:51 PDT 2001


On Mon, 8 Oct 2001, Bob Campbell wrote:

> Okay, thanks for all the viewpoints on NIS, looks like rsync
> is the best way to go.
>
> There was mention of ways to help recognise and sync downed
> nodes. What I would like is a tool of some sort that would
> keep a live status of all the nodes, and what state they are in.
> These 'states' could be as simple as UP, DOWN, BOOTING, FAILED, etc.
> I would also like this to insert and remove the hosts from the list
> of available hosts.
>
>
> .....
>
> Ok, I know Beowulf2 already has this. Beowulf2 looks to do all this
> with bproc though, and I need to stay in userspace. For various
> reasons I need to stay with vendor supplied kernels so I can't
> compile bproc in.
>
>
> any thoughts on this, or anyone know of any software with similar
> features?

procstatd (available on http://www.phy.duke.edu/brahma) is at least one
way of doing it that is fairly low overhead.  I'm not sure about BOOTING
and FAILED -- those are distinguishable from DOWN only by inference --
but you can at least retrieve a wealth of proc-based information in a simple
ascii-packed packet including uptime, load averages, network traffic
averages, and the like.  You can easily use the daemon to feed a perl
script or website based on a central host.

I should note that when a host goes down it is not at all easy to tell
from the outside.  Did it go down or is it busy?  Did it go down or did
the network connection to the host go away for some reason?  Even
querying the host daemon over the network requires either a TCP timeout
or consistently failing UDP connections to be able to "guess" that the
host is down (presuming that you otherwise trust your network's
stability).
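
As a rough sketch of the TCP-timeout guess described above (Python, not
part of procstatd; the function names, the port argument, and the
three-strikes threshold are all assumptions made for illustration), a
monitor can keep a short history of probe results per node and only call
a node DOWN after several consecutive probes get no answer:

```python
import socket

def probe(host, port, timeout=2.0):
    """Try one TCP connect; return 'up', 'refused', or 'no-answer'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # The remote kernel answered, so the host is alive even though
        # nothing is listening on this port.
        return "refused"
    except OSError:
        # Timeout, no route, name lookup failure: cannot tell host from network.
        return "no-answer"

def guess_state(results, threshold=3):
    """Guess a state from a node's probe history (oldest result first)."""
    recent = results[-threshold:]
    if len(recent) == threshold and all(r == "no-answer" for r in recent):
        return "DOWN"
    if results and results[-1] == "up":
        return "UP"
    return "UNKNOWN"
```

A "refused" result is kept separate because it means the host itself
answered, which is different information from a timeout.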

Alternatively you can use an NFS mount.  Each host writes its state
information into e.g. /usr/share/beowulf/bX where X is the node id, and
an application that shares the same mount can open all the bX's and
compile a table and display it or take action however you like.
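
A minimal sketch of the shared-mount approach, in Python.  The file
format (a state word plus a timestamp), the function names, and the
staleness cutoff are assumptions for illustration; only the bX naming
comes from the description above.  The timestamp matters because a node
that has died cannot update its own file, so the reader has to treat an
old entry as suspect:

```python
import os
import time

def write_state(directory, node_id, state, now=None):
    """Node side: record state and a timestamp in <directory>/b<node_id>."""
    now = time.time() if now is None else now
    path = os.path.join(directory, f"b{node_id}")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(f"{state} {now:.0f}\n")
    # Rename into place so a reader does not see a half-written file.
    # (NFS client caching may still delay when other hosts see the update.)
    os.replace(tmp, path)

def read_table(directory, max_age=60.0, now=None):
    """Monitor side: map each bX file to its state, or STALE if too old."""
    now = time.time() if now is None else now
    table = {}
    for name in sorted(os.listdir(directory)):
        if not name.startswith("b") or not name[1:].isdigit():
            continue  # skip temp files and anything else in the directory
        try:
            with open(os.path.join(directory, name)) as f:
                state, stamp = f.read().split()
            age = now - float(stamp)
        except (OSError, ValueError):
            table[name] = "UNKNOWN"
            continue
        table[name] = state if age <= max_age else "STALE"
    return table
```

Each node would call write_state() periodically (from cron, say) with
the directory set to the shared mount, and the monitoring application
would call read_table() on the same directory.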

Just pinging a host gives you approximate up/down information -- at the
very least if it pings it MIGHT be up and running normally, while if it
doesn't ping there are likely problems with the host or the network.
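
A small Python sketch of a ping check, assuming the Linux iputils ping,
where -c is the packet count and -W is the reply timeout in seconds
(other ping implementations interpret -W differently).  The function
names and the injectable run parameter are illustrative choices, not
part of any existing tool:

```python
import subprocess

def ping_command(host, timeout_s=1):
    # -c 1: send a single echo request
    # -W:   seconds to wait for the reply (Linux iputils semantics, assumed)
    return ["ping", "-c", "1", "-W", str(timeout_s), host]

def pings(host, timeout_s=1, run=subprocess.run):
    """Return True if one echo request was answered, False otherwise."""
    try:
        result = run(ping_command(host, timeout_s),
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        return False  # ping binary missing or not executable
    return result.returncode == 0
```

True here only means "might be up": the host answered ICMP, which says
nothing about whether it can still run jobs.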

Finally, there are (remote) shell-based methods, but they are all going
to be moderately expensive in system resources, since starting a shell
is moderately expensive and starting a remote shell more so.

Hope some of this helps.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






