[Beowulf] cli alternative to cluster top?

Donald Becker becker at scyld.com
Sun Nov 30 08:52:20 PST 2008


On Wed, 26 Nov 2008, Thomas Vixel wrote:

> I've been googling for a top-like cli tool to use on our cluster, but
> the closest thing that comes up is Rocks' "cluster top" script. That
> could be tweaked to work via the cli, but due to factors beyond my
> control (management) all functionality has to come from a pre-fab
> program rather than a software stack with local, custom modifications.
> 
> I'm sure this has come up more than once in the HPC sector as well --
> could anyone point me to any top-like apps for our cluster?

Most remote job mechanisms only think about starting remote processes, not 
about the full create-monitor-control-report functionality.

The Scyld system (currently branded "Clusterware") defaults to using a 
built-in unified process space.  That presents all of the processes 
running over the cluster in a process space on the master machine, with 
full POSIX semantics.  It neatly solves your need with... the standard 
'top' program.
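
To illustrate why an unmodified 'top' can work here: on Linux, tools 
like 'top' and 'ps' build their display from the per-process entries 
under /proc, so anything that appears in the master's process space is 
visible to them.  The following is only a minimal sketch of that kind of 
reader, assuming a Linux /proc filesystem; the function name 
cpu_ticks_by_process is made up for this example, and it is not a 
description of how Scyld implements its process space.

```python
import os


def cpu_ticks_by_process(proc_root="/proc"):
    """Return (cpu_ticks, pid, name) tuples, busiest process first.

    Assumes the Linux /proc/<pid>/stat layout described in proc(5).
    """
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name is in parentheses and may itself contain
        # spaces or ')', so split around the first '(' and last ')'.
        lparen = stat.index("(")
        rparen = stat.rindex(")")
        name = stat[lparen + 1:rparen]
        fields = stat[rparen + 2:].split()
        # fields[0] is stat field 3 (state), so utime (field 14) and
        # stime (field 15) are at indexes 11 and 12.
        ticks = int(fields[11]) + int(fields[12])
        rows.append((ticks, int(entry), name))
    rows.sort(reverse=True)
    return rows


if __name__ == "__main__":
    # Print the ten processes with the most accumulated CPU time.
    for ticks, pid, name in cpu_ticks_by_process()[:10]:
        print(f"{pid:>7} {ticks:>10} {name}")
```

The sketch ranks by accumulated CPU ticks rather than by a sampled CPU 
percentage, which keeps it to a single pass over /proc.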

Most scheduling systems also have a way to monitor processes that they 
start, but I haven't found one that takes advantage of all information 
available and reports it quickly/efficiently.

There are many advantages of the Scyld implementation:
  -- No new or modified process management tools need to be written.
     Standard utilities such as 'top' and 'ps' work unmodified,
     as do tools we didn't specifically plan for, e.g. GUI versions of 
     'pstree'.
  -- The 'killall' program works over the cluster, efficiently.
  -- All signals work as expected, including 'kill -9'.  (Most remote
     process starting mechanisms will just kill off the local endpoint,
     leaving the remote process running-but-confused.)
  -- Process groups and controlling-TTY groups work properly for job
     control and signals.
  -- Running jobs report their status and statistics accurately -- an
     updated 'rusage' structure is sent once per second, and a final
     rusage structure and exit status are sent when the process terminates.
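
For readers unfamiliar with the POSIX pieces named in the list above, 
here is a single-machine sketch of the same semantics: a signal sent to 
a process group, followed by collection of the exit status and the final 
'rusage' structure.  It uses only the Python standard library on a 
POSIX system.  The function name run_and_kill is made up for this 
example, and nothing here shows how Scyld forwards signals or status 
between nodes.

```python
import os
import signal
import subprocess
import sys


def run_and_kill():
    """Start a child in its own process group, kill the group with
    SIGKILL, and return the raw wait status plus the final rusage."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,  # child becomes leader of a new group
    )
    # A session leader's process group ID equals its PID, so this
    # signals the whole group, as 'kill -9 -<pgid>' would.
    os.killpg(child.pid, signal.SIGKILL)
    # wait4() returns the exit status and the final rusage together.
    _, status, usage = os.wait4(child.pid, 0)
    # Record the result on the Popen object, since we reaped the
    # child ourselves instead of calling child.wait().
    child.returncode = -os.WTERMSIG(status)
    return status, usage


if __name__ == "__main__":
    status, usage = run_and_kill()
    if os.WIFSIGNALED(status):
        print("killed by signal", os.WTERMSIG(status))
    print("user CPU seconds:", usage.ru_utime)
    print("max RSS:", usage.ru_maxrss)
```

On one machine the kernel provides all of this directly; the point of 
the list above is that the same calls behave the same way when the 
process is actually running on a compute node.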

The "downside" is that we explicitly use Linux features and details, 
relying on kernel-version-specific features.  That would be a problem 
for a one-off hack, but we've been using this approach continuously for 
a decade, since the Linux 2.2 kernel and across multiple 
architectures.  We've been producing supported commercial releases 
since 2000, longer than anyone else in the business.

-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA
