[Beowulf] cli alternative to cluster top?

Mon Dec 1 15:22:35 PST 2008

That does sound interesting, but more for some of my personal projects.

It wouldn't work for the situation at hand because:
1) It sounds like it introduces a single point of failure (the head node).
2) Giving our developers cluster-wide 'killall' & 'kill' functionality
makes me cringe.
    Most of them know just enough about Linux to be dangerous.
3) It would require completely reworking our current cluster solution;
    a daunting task to say the least.
4) There isn't much love for commercial & non-OSS software at our company.

On 11/30/08, Donald Becker <becker at scyld.com> wrote:
> On Wed, 26 Nov 2008, Thomas Vixel wrote:
>
>> I've been googling for a top-like cli tool to use on our cluster, but
>> the closest thing that comes up is Rocks' "cluster top" script. That
>> could be tweaked to work via the cli, but due to factors beyond my
>> control (management) all functionality has to come from a pre-fab
>> program rather than a software stack with local, custom modifications.
>>
>> I'm sure this has come up more than once in the HPC sector as well --
>> could anyone point me to any top-like apps for our cluster?
>
> Most remote job mechanisms only think about starting remote processes, not
> about the full create-monitor-control-report functionality.
>
> The Scyld system (currently branded "Clusterware") defaults to using a
> built-in unified process space.  That presents all of the processes
> running over the cluster in a process space on the master machine, with
> full POSIX semantics.  It neatly solves your need with... the standard
> 'top' program.
>
> Most scheduling systems also have a way to monitor processes that they
> start, but I haven't found one that takes advantage of all information
> available and reports it quickly/efficiently.
>
> There are many advantages of the Scyld implementation:
>   -- no new or modified process management tools need to be written.
>     Standard utilities such as 'top' and 'ps' work unmodified,
>     as well as tools we didn't specifically plan for e.g. GUI versions of
>     'pstree'.
>   -- The 'killall' program works over the cluster, efficiently.
>   -- All signals work as expected, including 'kill -9'.  (Most remote
>      process starting mechanisms will just kill off the local endpoint,
>      leaving the remote process running-but-confused.)
>   -- Process groups and controlling-TTY groups work properly for job
>      control and signals
>   -- Running jobs report their status and statistics accurately -- an
>      updated 'rusage' structure is sent once per second, and a final
>      rusage structure and exit status are sent when the process terminates.
>
> The "downside" is that we explicitly use Linux features and details,
> relying on kernel-version-specific features.  That's an issue if it's a
> one-off hack, but we've been using this approach continuously for
> a decade, since the Linux 2.2 kernel and over multiple
> architectures.  We've been producing supported commercial releases
> since 2000, longer than anyone else in the business.
>
> --
> Donald Becker				becker at scyld.com
> Penguin Computing / Scyld Software
> www.penguincomputing.com		www.scyld.com
> Annapolis MD and San Francisco CA
>
>
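
For anyone unfamiliar with the 'rusage' structure and exit status mentioned
in the quoted message: on a single Linux host, a parent gets the same pair
back from wait4() when a child terminates. The sketch below is only an
illustration of that local mechanism, assuming a POSIX system with Python 3.
The helper name run_and_report is hypothetical, and nothing here reflects how
Scyld actually ships this data between nodes.

```python
import os
import sys


def run_and_report(argv):
    """Run argv as a child process; return (exit_code, rusage) after it exits.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed.
            os._exit(127)

    # Parent: wait4() returns the wait status together with the child's
    # final resource usage (CPU time, peak RSS, page faults, ...).
    _, status, usage = os.wait4(pid, 0)
    if os.WIFSIGNALED(status):
        code = -os.WTERMSIG(status)   # killed by a signal, e.g. -9
    else:
        code = os.WEXITSTATUS(status)
    return code, usage


if __name__ == "__main__":
    code, usage = run_and_report([sys.executable, "-c", "pass"])
    print("exit code:", code)
    print("user CPU seconds:", usage.ru_utime)
    print("system CPU seconds:", usage.ru_stime)
    print("max RSS (kB on Linux):", usage.ru_maxrss)
```

The fields printed at the end are the sort of per-process statistics that
'top' and 'ps' display, which is why having them forwarded to the master
lets the stock tools work.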