[Beowulf] Monitoring and reporting Infiniband errors
John Hearns
hearnsj at googlemail.com
Thu Jun 19 07:10:43 PDT 2014
If anyone is interested, here is my solution, which seems good enough.
Someone will no doubt say there is a neater way!
A shell script which runs ibqueryerrors and returns 1 if anything is found:
#!/bin/bash
# Check for errors on Infiniband fabric 0.
# Another script runs for port 1.
# tail -n +2 drops the first line of the ibqueryerrors output.
errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2)
if [ -n "$errors" ] ; then
    echo "Check for errors on Infiniband Fabric 0"
    echo
    # Quoted so that the report keeps its line breaks
    echo "$errors"
    exit 1
else
    exit 0
fi
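
Since a second copy of the script is kept for port 1, one possible tidy-up is a single script that takes the port number as an argument. This is only a sketch: the function name check_ib_port and the IBQUERYERRORS override variable are my own inventions (the override is there so the logic can be tried without a fabric), and the ibqueryerrors flags are simply the ones used above with the port number substituted.

```shell
#!/bin/bash
# Sketch: one check script for any port, same exit-status convention.
# IBQUERYERRORS is a hypothetical override, useful for testing off-cluster.
IBQUERYERRORS="${IBQUERYERRORS:-/usr/sbin/ibqueryerrors}"

# check_ib_port PORT
# Returns 1 and prints a report if anything is found, otherwise returns 0.
check_ib_port() {
    local port=$1 errors
    # Same flags as the original script, with the port substituted
    errors=$("$IBQUERYERRORS" -c -s XmtWait -P"$port" | tail -n +2)
    if [ -n "$errors" ]; then
        echo "Check for errors on Infiniband Fabric $port"
        echo
        echo "$errors"
        return 1
    fi
    return 0
}

# Run the check only when a port number is given on the command line
if [ $# -gt 0 ]; then
    check_ib_port "$1"
    exit $?
fi
```

Saved under a name of your choosing, it would be called with 0 or 1 as its only argument.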
For Monit, exit 0 means the service is OK and exit 1 means there is
a problem.
So in monit:
check program ib0-errors with path "/usr/local/bin/check-ib0.sh"
    every "30 * * * *"
    if status == 1 then alert
    alert my.email at domain.com with reminder on 30 cycles

set mail-format { subject: $DESCRIPTION }
(P.S. Monit is only returning the first line of the script output; to be revised.)
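
One possible way around the first-line limitation is to have the script print everything on a single line. A minimal sketch, assuming only the first line of output reaches the alert: the helper name summarise_errors is hypothetical, and the sample text is invented for illustration, not real ibqueryerrors output.

```shell
#!/bin/bash
# Hypothetical helper: collapse multi-line text from stdin into one line.
# Blank lines are skipped; nothing is printed if the input is empty.
summarise_errors() {
    awk 'NF { n++; out = out (n > 1 ? "; " : "") $0 }
         END { if (n) printf "%d line(s): %s\n", n, out }'
}

# Invented sample text standing in for the error report
sample='first problem line
second problem line'
printf '%s\n' "$sample" | summarise_errors
```

In the check script, the final echo of the error text could then be piped through summarise_errors, so the whole report sits on the one line that gets through.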
On 19 June 2014 14:18, John Hearns <hearnsj at googlemail.com> wrote:
> Does anyone have good tips on monitoring a cluster for Infiniband errors?
>
> Specifically Mellanox/OpenFabrics on an SGI cluster.
>
> I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
> output.
>
> I have Monit set up on the cluster head node
> http://mmonit.com/monit/
>
> which I find quite good
>
> Also, individual nodes could use gmetric to report port errors as a
> Ganglia metric. I have the ganglia-alert script set up to send email if
> Ganglia values exceed set thresholds.
>
> Any ideas welcomed please.
>
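
The gmetric idea in the quoted message could be sketched roughly as follows, run on each node from cron. This is a sketch under assumptions, not a tested setup: it assumes the per-port counters are exposed under /sys/class/infiniband/<device>/ports/<port>/counters/, picks three counters as examples, and invents the metric naming scheme. IB_SYSFS, GMETRIC and report_ib_counters are hypothetical names; the two variables exist only so the script can be tried without Infiniband hardware or Ganglia installed.

```shell
#!/bin/bash
# Sketch: publish per-port Infiniband error counters as Ganglia metrics.
# IB_SYSFS and GMETRIC are hypothetical overrides for testing.
IB_SYSFS="${IB_SYSFS:-/sys/class/infiniband}"
GMETRIC="${GMETRIC:-gmetric}"

report_ib_counters() {
    local file dev port counter value
    # Three example counters; extend the list as required
    for file in "$IB_SYSFS"/*/ports/*/counters/{symbol_error,link_downed,port_rcv_errors}; do
        [ -r "$file" ] || continue          # skip unmatched globs
        counter=${file##*/}                 # e.g. symbol_error
        port=${file%/counters/*}
        port=${port##*/}                    # e.g. 1
        dev=${file#"$IB_SYSFS"/}
        dev=${dev%%/*}                      # device directory name
        value=$(cat "$file")
        # Metric name format is an invention for this sketch
        "$GMETRIC" --name "ib_${dev}_p${port}_${counter}" \
                   --value "$value" --type uint32 --units errors
    done
}

report_ib_counters
```

The ganglia-alert thresholds would then be set against the resulting metric names.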