[Beowulf] wulfstat, wulflogger fix, new features

Robert G. Brown rgb at phy.duke.edu
Tue May 18 07:51:00 PDT 2004


Karl Bellve reported a bug in wulflogger that caused it to miss connecting
to the first host in the wulfhosts list until the second pass.  He also
requested a feature that would let wulflogger execute only a single time
and then exit so that it could be used in e.g. a cron script to graze
for downed hosts in a cluster easily.

I found the bug: a legacy from wulfstat, where I closed stdin before
starting curses.  That caused the first descriptor SUCCESSFULLY returned
from socket() to be 0 (the freed stdin descriptor being reused), which
actually doesn't work.  This is a bug in socket() I personally would
say, but either way, once I eliminated the close statement wulflogger
connects to the first host with no problem on the first try.
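
For anyone who wants to see the descriptor reuse for themselves, here is a
minimal sketch.  The helper name is made up for this illustration and is
not code from wulfstat or wulflogger.  It relies on the usual POSIX
behavior of handing out the lowest free descriptor number, so a socket
opened right after stdin is closed comes back as 0.  Why descriptor 0 then
fails inside wulflogger is not spelled out above; a check such as
"if (fd > 0)" somewhere in the connection code would be one plausible
cause, but that is only a guess.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper: close stdin, open a socket, report the
 * descriptor number that came back, then put stdin back in place. */
int socket_fd_after_closing_stdin(void)
{
    int saved = dup(STDIN_FILENO);   /* keep a copy so stdin can be restored */
    close(STDIN_FILENO);             /* descriptor 0 is now free */

    /* The lowest free descriptor is handed out, which is now 0. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int result = fd;

    if (fd >= 0)                     /* test with >= 0: 0 is a valid descriptor */
        close(fd);
    if (saved >= 0) {
        dup2(saved, STDIN_FILENO);   /* restore the original stdin */
        close(saved);
    }
    return result;
}
```

Calling the helper from a small main() and printing the result should show
0, which is a perfectly valid descriptor as long as the calling code tests
for failure with "< 0" rather than "<= 0".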

I implemented the request by adding a -c count flag to both wulflogger
and wulfstat.  -c 1 is the behavior requested, but somebody may have use
for the greater flexibility permitted by it being a variable.  I also
updated both Usage and the man page for both applications, in the case
of wulflogger including an example fragment that might go into a cron
job to graze for down hosts, and in both cases adding a short section
on debugging (I've written the code to be tremendously self-debugging to
make it relatively easy to maintain or augment).

Still to implement:

  a) I want to add a ping to the connection engine to precede the
xmlsysd connection attempt.  ping actually is a bit of a pain -- the
usual iputils implementation requires suid root.  nmap, however, has
three or four distinct ways of "pinging" that don't require root
privileges, and eventually I'll try stealing one although the code is a
lot more complex than I'd like for a simple task.  Anybody with a
SHORT/SIMPLE version of userspace (e.g. ack) ping in C should feel free
to let me know where to find it.

  b) I need to do something about tracking running jobs in wulflogger,
and figure out a better display for them in wulfstat.

  c) I still have fantasies of writing gwulfstat on top of gtk.  This
could be a very cool application.

  d) And wulfweb needs love as well, although that is straightforward
web programming at this point -- wulflogger is the real tool involved.
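
Regarding item a), one unprivileged possibility is a plain TCP connect
probe.  To be clear, this is NOT a true ACK ping (that needs raw sockets,
hence root) and it is not lifted from nmap; it is just a non-blocking
connect() with a timeout, sketched under those assumptions.  The function
name, the IPv4-only restriction, and the choice to count "connection
refused" as proof of life are all assumptions of this sketch.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of wulfstat/wulflogger.
 * Returns  1 if the host answered (connection accepted OR refused),
 *          0 if nothing usable came back (timeout, unreachable),
 *         -1 on a local error (bad address, no socket). */
int tcp_probe(const char *ipv4, int port, int timeout_ms)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
        return -1;                       /* not a dotted-quad address */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Non-blocking mode so that connect() cannot hang on a dead host. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    int alive = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0) {
        alive = 1;                       /* connected immediately */
    } else if (errno == ECONNREFUSED) {
        alive = 1;                       /* reset came back: host up, port closed */
    } else if (errno == EINPROGRESS) {
        struct pollfd p = { .fd = fd, .events = POLLOUT };
        if (poll(&p, 1, timeout_ms) > 0) {
            int err = 0;
            socklen_t len = sizeof err;
            /* SO_ERROR holds the final outcome of the pending connect. */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0)
                alive = (err == 0 || err == ECONNREFUSED);
        }
    }
    close(fd);
    return alive;
}
```

A caller would run something like tcp_probe(address, port, 500) before the
real connection attempt and skip hosts that return 0.  A firewall that
silently drops packets makes a live host look dead to this probe, so it is
a cheap pre-check rather than a replacement for a real ICMP ping.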

Anyway, those of you who are using it, enjoy.  Those who aren't,
consider giving xmlsysd/wulf[stat,logger,web] a try.  It is a fairly
simple way to monitor an entire cluster (tested with fewer than ~100 hosts;
I don't know how or if it scales to ~1000) in a lightweight fashion with
adjustable time granularity.

Those of you who are also LAN managers might consider using it to
monitor your LAN status as well.  The default wulfstat/wulflogger
display is something like:

#     Name       Status    Timestamp    load1  load5 load15 rx byts tx byts  si  so  pi po ctxt intr users
lilith             up   1084891476.44    0.01   0.04   0.01   9761   7171    0   0   0  22  148  170
asixteencharname   up   1084891476.44    0.01   0.04   0.01   9761   7171    0   0   0  22  148  170
lucifer            up   1084891610.24    0.00   0.02   0.00    226    709    0   0   0   9  135  104
uriel              up   1084887238.42    0.00   0.00   0.00   1030   1672    0   0   0   5   36  114
caine             down 
eve                up   1084888284.75    0.00   0.00   0.00    685   1168    0   0   0  11   21  109
serpent            up   1084877687.98    0.00   0.00   0.00   1116   1707    0   0   0   6   41  187
tyrial             up   1084891762.44    0.00   0.00   0.00   3146   3064    0   0   0   9  208  218
abel              down 
archangel          up   1084888715.71    0.00   0.00   0.00    119   1376    0   0   0  30   28  105

(used to look at my home cluster, with one machine turned off and one
machine down awaiting a reinstall.)  There is a display that only looks
at load, a display only for network traffic, one for network usage, even
one that tells you uptime and duty cycle (cpu cycles used/cpu cycles
available) from the last boot.  All GPL v2b...

  http://www.phy.duke.edu/~rgb/Beowulf/beowulf.php
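
For the "graze for downed hosts" use, the default display shown above is
easy to pick apart in a few lines of C.  The sketch below is not the
example from the man page; the function name is invented, and it assumes
only what the sample output shows: a header line starting with '#', and a
host name followed by a status word of "up" or "down" on every other line.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 and copies the host name into `name`
 * if `line` describes a down host, otherwise returns 0. */
int host_is_down(const char *line, char *name, size_t namelen)
{
    char host[64], status[16];

    if (line[0] == '#')                  /* column header line */
        return 0;
    /* The first two whitespace-separated fields are name and status. */
    if (sscanf(line, "%63s %15s", host, status) != 2)
        return 0;
    if (strcmp(status, "down") != 0)
        return 0;

    snprintf(name, namelen, "%s", host); /* always NUL-terminated */
    return 1;
}
```

A small filter built around this would read lines with fgets() from the
output of a single pass (wulflogger -c 1, plus whatever other options your
setup needs) and mail or log each name it returns.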

I suggest rebuilding the source rpm or working from tarball, although
people running RH 9 can probably install the binary rpms without
disaster.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





