[Beowulf] cli alternative to cluster top?
Thomas Vixel
tvixel at gmail.com
Mon Dec 1 15:22:35 PST 2008
That does sound interesting, but more for some of my personal projects.
It wouldn't work for the situation at hand because:
1) It sounds like it introduces a SPOF (single point of failure): the head node.
2) Giving our developers cluster-wide 'killall' & 'kill' functionality
makes me cringe.
Most of them only know just enough about Linux to be dangerous.
3) It would require completely reworking our current cluster solution;
a daunting task to say the least.
4) There isn't much love for commercial & non-OSS software at our company.
On 11/30/08, Donald Becker <becker at scyld.com> wrote:
> On Wed, 26 Nov 2008, Thomas Vixel wrote:
>
>> I've been googling for a top-like cli tool to use on our cluster, but
>> the closest thing that comes up is Rocks' "cluster top" script. That
>> could be tweaked to work via the cli, but due to factors beyond my
>> control (management) all functionality has to come from a pre-fab
>> program rather than a software stack with local, custom modifications.
>>
>> I'm sure this has come up more than once in the HPC sector as well --
>> could anyone point me to any top-like apps for our cluster?
>
> Most remote job mechanisms only think about starting remote processes, not
> about the full create-monitor-control-report functionality.
>
> The Scyld system (currently branded "Clusterware") defaults to using a
> built-in unified process space. That presents all of the processes
> running over the cluster in a process space on the master machine, with
> full POSIX semantics. It neatly solves your need with... the standard
> 'top' program.
>
> Most scheduling systems also have a way to monitor processes that they
> start, but I haven't found one that takes advantage of all information
> available and reports it quickly/efficiently.
>
> There are many advantages of the Scyld implementation:
> -- no new or modified process management tools need to be written.
> Standard utilities such as 'top' and 'ps' work unmodified,
> as well as tools we didn't specifically plan for e.g. GUI versions of
> 'pstree'.
> -- The 'killall' program works over the cluster, efficiently.
> -- All signals work as expected, including 'kill -9'. (Most remote
> process starting mechanisms will just kill off the local endpoint,
> leaving the remote process running-but-confused.)
> -- Process groups and controlling-TTY groups work properly for job
> control and signals.
> -- Running jobs report their status and statistics accurately -- an
> updated 'rusage' structure is sent once per second, and a final
> rusage structure and exit status are sent when the process terminates.
>
> The "downside" is that we explicitly use Linux features and details,
> relying on kernel-version-specific features. That's an issue if it's a
> one-off hack, but we've been using this approach continuously for
> a decade, since the Linux 2.2 kernel and over multiple
> architectures. We've been producing supported commercial releases
> since 2000, longer than anyone else in the business.
>
> --
> Donald Becker becker at scyld.com
> Penguin Computing / Scyld Software
> www.penguincomputing.com www.scyld.com
> Annapolis MD and San Francisco CA
>
>
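For anyone unfamiliar with the 'rusage' structure mentioned in the quoted message, here is a rough single-machine sketch of the "final rusage structure and exit status" idea, using the standard wait4() call from Python. It is only a local analogy and makes no claim about how Scyld ships this data across nodes; the helper name run_and_report is made up for illustration, and the once-per-second updates are not covered.

```python
import os
import subprocess
import sys


def run_and_report(argv):
    """Run argv and return (exit_code, rusage) gathered when the child exits.

    Hypothetical helper for illustration; POSIX only.
    """
    proc = subprocess.Popen(argv)
    # wait4() reaps the child and hands back its final resource usage
    # together with the raw wait status.
    _, status, usage = os.wait4(proc.pid, 0)
    if os.WIFEXITED(status):
        exit_code = os.WEXITSTATUS(status)
    else:
        # Killed by a signal: report it as a negative number, like subprocess.
        exit_code = -os.WTERMSIG(status)
    # The child is already reaped, so record that on the Popen object.
    proc.returncode = exit_code
    return exit_code, usage


if __name__ == "__main__":
    code, usage = run_and_report([sys.executable, "-c", "sum(range(10**6))"])
    print("exit status:", code)
    print("user CPU seconds:", usage.ru_utime)
    print("system CPU seconds:", usage.ru_stime)
    print("max RSS:", usage.ru_maxrss)
```

The fields printed are the same ones 'top' and 'ps' draw on for CPU time and memory, which is why a unified process space can feed the unmodified tools.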