[Beowulf] job scheduler and health monitoring system

Sat Jan 11 09:09:57 PST 2014

On 01/10/2014 12:36 PM, reza azimi wrote:
> Hello guys,
>
> I'm looking for a state-of-the-art job scheduler and health monitoring
> system for my Beowulf cluster. My research turned up so many options
> that I'm confused. Can you recommend the ones that are popular and
> widely used in industry?
> I have the lm-sensors package on my servers and want a health monitoring
> program that records temperatures as well; everything I have found
> mainly records resource utilization.
> Our workloads are mainly MPI-based benchmarks, and we want to test some
> Hadoop benchmarks in the future.

Our solution with Grid Engine is to have a cron job monitor the
contents of the IPMI SEL. If the SEL contains any messages that are not
on a whitelist, a file in /var gets generated (conversely, if no such
messages are in the SEL, the file gets removed). We have a GE load
sensor that checks for the presence of this file and places the node in
an alarm state when it sees the file, preventing new jobs from being
scheduled on the node. We then have Nagios monitoring the output of
"qstat -xml" on the scheduler nodes, so we are notified when a node goes
into an alarm state.
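
The cron job and load sensor described above could be sketched roughly
as follows. This is only an illustration under stated assumptions, not
our actual scripts: the flag path, the whitelist patterns, and the
complex name "sel_alarm" are all hypothetical, and the sketch assumes
that whitelisted SEL entries do not count toward raising the flag. The
load sensor loop follows the usual Grid Engine convention of answering
each line on stdin with a begin/host:name:value/end report and exiting
on "quit".

```python
import os
import re
import subprocess

# Hypothetical values for illustration only; adjust for your site.
FLAG_PATH = "/var/run/sel_alarm"
WHITELIST = [
    r"Log area reset/cleared",
    r"System Boot Initiated",
    r"SEL has no entries",  # in case ipmitool reports an empty log on stdout
]
COMPLEX_NAME = "sel_alarm"


def unexpected_entries(sel_lines, whitelist):
    """Return the non-empty SEL lines that match none of the whitelist regexes."""
    patterns = [re.compile(p) for p in whitelist]
    return [
        line for line in sel_lines
        if line.strip() and not any(p.search(line) for p in patterns)
    ]


def update_flag(flag_path, entries):
    """Create the flag file if unexpected entries remain, otherwise remove it."""
    if entries:
        with open(flag_path, "w") as fh:
            fh.write("\n".join(entries) + "\n")
    elif os.path.exists(flag_path):
        os.remove(flag_path)


def check_sel(flag_path=FLAG_PATH, whitelist=WHITELIST):
    """Cron entry point: read the SEL with ipmitool and update the flag file."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    )
    entries = unexpected_entries(result.stdout.splitlines(), whitelist)
    update_flag(flag_path, entries)


def run_load_sensor(stdin, stdout, host, flag_path=FLAG_PATH):
    """Answer each load request with 1 if the flag file exists, else 0."""
    for line in stdin:
        if line.strip() == "quit":
            break
        value = 1 if os.path.exists(flag_path) else 0
        stdout.write("begin\n%s:%s:%d\nend\n" % (host, COMPLEX_NAME, value))
        stdout.flush()
```

In this sketch the cron job would call check_sel(), and the load sensor
script would call run_load_sensor(sys.stdin, sys.stdout, hostname). For
the node to actually enter an alarm state, the reported value would also
need a matching complex definition and a queue load threshold that
refers to it.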

Skylar