<div dir="ltr">If anyone is interested, here is my solution, which seems good enough.<div>Someone will no doubt say there is a neater way!<br><div><br></div><div>A shell script that runs ibqueryerrors and exits with status 1 if any errors are reported:</div>
<div><br></div><div><div>#!/bin/bash</div><div># check for errors on InfiniBand fabric 0</div><div># another script runs for port 1</div><div><br></div><div># tail -n +2 drops the first line of the ibqueryerrors output</div><div>errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2)</div>
<div>if [ -n "$errors" ] ; then</div><div> echo "Check for errors on Infiniband Fabric 0"</div><div> echo</div><div> # quoted so that line breaks are kept and brackets are not glob-expanded</div><div> echo "$errors"</div><div> exit 1</div><div>else</div><div> exit 0</div><div>
fi</div></div><div><br></div><div>For Monit monitoring, exit 0 means the service is OK and exit 1 means there is a problem.</div><div><br></div><div>So in monit:</div><div><br></div><div><div>check program ib0-errors with path "/usr/local/bin/check-ib0.sh"</div>
<div> every "30 * * * *"</div><div> if status == 1 then alert</div><div> alert <a href="mailto:my.email@domain.com">my.email@domain.com</a> with reminder on 30 cycles</div><div> set mail-format { subject: $DESCRIPTION }</div>
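<div><br></div><div>Since monit seems to report only the first line of the program's output, one possible workaround is to fold the error lines into a single line before printing them. What follows is only a sketch under assumptions, not a tested fix: join_lines is a hypothetical helper name, and the "; " separator is an arbitrary choice.</div>
<pre>
```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): read lines from
# stdin and print them as one line, separated by "; ". Blank lines are skipped.
join_lines() {
    local joined="" line
    # The "|| [ -n ... ]" part keeps a final line that has no trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            continue
        fi
        if [ -z "$joined" ]; then
            joined="$line"
        else
            joined="$joined; $line"
        fi
    done
    printf '%s\n' "$joined"
}

# Possible use in the check script (assumption, untested against monit):
# errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2 | join_lines)
```
</pre>
<div>With something like this, the script could print the joined errors as its first line of output. Whether that is enough depends on how much of a long line monit keeps.</div>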
</div><div><br></div><div><br></div><div><br></div><div>(PS: monit is only returning the first line of the output; to be revised.)</div><div><br></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 19 June 2014 14:18, John Hearns <span dir="ltr">&lt;<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Does anyone have good tips on monitoring a cluster for InfiniBand errors?<div><br></div><div>Specifically Mellanox/OpenFabrics on an SGI cluster.</div>
<div><br></div><div>I am thinking of running ibcheckerrors or ibqueryerrors and parsing the output.</div>
<div><br></div><div>I have Monit set up on the cluster head node</div><div><a href="http://mmonit.com/monit/" target="_blank">http://mmonit.com/monit/</a><br></div><div><br></div><div>which I find quite good</div><div><br>
</div><div>Also, if individual nodes could use gmetric to report port errors as a Ganglia metric, I have the ganglia-alert script set up to send email if Ganglia values exceed set thresholds.</div>
<div><br></div><div>Any ideas welcomed please.</div></div>
</blockquote></div><br></div>