<div dir="ltr">Does anyone have good tips on monitoring a cluster for InfiniBand errors?<div><br></div><div>Specifically, Mellanox/OpenFabrics on an SGI cluster.</div><div><br></div><div>I am thinking of running ibcheckerrors or ibqueryerrors and parsing the output.</div>
<div><br></div><div>I have Monit set up on the cluster head node, which I find quite good:</div><div><a href="http://mmonit.com/monit/">http://mmonit.com/monit/</a><br></div><div><br></div><div>Also, if individual nodes could use gmetric to report port errors as a Ganglia metric, I already have the ganglia-alert script set up to send email when Ganglia values exceed set thresholds.</div>
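<div><br></div><div>To make the gmetric idea concrete, here is a minimal sketch of a per-node reporter. It is only one possible approach and swaps in a different data source: instead of parsing ibqueryerrors output, it assumes the node exposes per-port counters under /sys/class/infiniband (the usual Linux sysfs layout) and that gmetric is on the PATH. The metric naming scheme, the choice of counters, and the function names are all hypothetical.</div>
<pre>
```python
import glob
import os
import subprocess

# Assumption: these counter files exist under the sysfs InfiniBand tree.
COUNTERS = ("symbol_error", "link_downed", "port_rcv_errors")


def read_port_counters(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port, counter): value} for every readable counter."""
    results = {}
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "counters")
    for counter_dir in sorted(glob.glob(pattern)):
        # Path layout: root/DEVICE/ports/PORT/counters
        parts = counter_dir.split(os.sep)
        device, port = parts[-4], parts[-2]
        for name in COUNTERS:
            try:
                with open(os.path.join(counter_dir, name)) as fh:
                    results[(device, port, name)] = int(fh.read().strip())
            except (OSError, ValueError):
                # Skip counters that are missing or not plain integers.
                continue
    return results


def gmetric_command(device, port, counter, value):
    """Build the gmetric command line for one counter (hypothetical naming)."""
    metric = "ib_{}_p{}_{}".format(device, port, counter)
    return ["gmetric", "--name", metric, "--value", str(value),
            "--type", "uint32", "--units", "errors"]


def report(run=subprocess.check_call):
    """Publish every counter found on this node as a Ganglia metric."""
    for (device, port, counter), value in sorted(read_port_counters().items()):
        run(gmetric_command(device, port, counter, value))
```
</pre>
<div>Calling report() from a cron job on each node would publish metrics that ganglia-alert could then apply thresholds to. Since these counters are cumulative, alerting on the change between samples may be more useful than alerting on the absolute value.</div>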
<div><br></div><div>Any ideas are welcome.</div></div>