[Beowulf] Monitoring and reporting Infiniband errors
John Hearns
hearnsj at googlemail.com
Thu Jun 19 06:18:19 PDT 2014
Does anyone have good tips on monitoring a cluster for Infiniband errors?
Specifically Mellanox/OpenFabrics on an SGI cluster.
I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
output.
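
A rough sketch of the parsing idea, in Python. It assumes ibqueryerrors prints per-port lines of the form `GUID 0x... port N: [CounterName == value] ...`. That line shape is an assumption, so check it against the output of the installed infiniband-diags version. The names parse_ibqueryerrors and ports_over_threshold are made up for this example.

```python
import re

# Assumed line shape (verify against your ibqueryerrors version):
#   GUID 0x0002c90200423e48 port 1: [SymbolErrorCounter == 12] [PortRcvErrors == 3]
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\S+):\s*(.*)")
COUNTER = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def parse_ibqueryerrors(text):
    """Return {(guid, port): {counter_name: value}} for every port line found."""
    errors = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if not match:
            continue  # header lines, summaries, blank lines
        guid, port, rest = match.groups()
        counters = {name: int(value) for name, value in COUNTER.findall(rest)}
        if counters:
            errors[(guid, port)] = counters
    return errors


def ports_over_threshold(errors, threshold=0):
    """Keep only the counters whose value is strictly above the threshold."""
    flagged = {}
    for key, counters in errors.items():
        bad = {name: value for name, value in counters.items() if value > threshold}
        if bad:
            flagged[key] = bad
    return flagged
```

The text to parse could come from something like subprocess.run(["ibqueryerrors"], capture_output=True, text=True).stdout, and a Monit check could alert on a non-empty result.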
I have Monit (http://mmonit.com/monit/) set up on the cluster head node,
which I find quite good.
Also, if individual nodes could use gmetric to report port errors as a
Ganglia metric, I have the ganglia-alert script set up to send email if
Ganglia values exceed set thresholds.
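
One possible way to do the per-node reporting, sketched in Python. It assumes the local HCA error counters are readable under /sys/class/infiniband/<device>/ports/<port>/counters/ and that gmetric is on the PATH. The metric naming scheme (ib_<device>_p<port>_<counter>), the choice of counters, and the function names are all invented for this example.

```python
import subprocess
from pathlib import Path

# Counter file names assumed to exist under .../ports/<n>/counters/
DEFAULT_COUNTERS = ("symbol_error", "port_rcv_errors", "link_downed")


def read_port_counters(root="/sys/class/infiniband", names=DEFAULT_COUNTERS):
    """Return {metric_name: value} for each device/port/counter found under root."""
    results = {}
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        port = counters_dir.parent.name
        device = counters_dir.parent.parent.parent.name
        for name in names:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this port
            results[f"ib_{device}_p{port}_{name}"] = value
    return results


def gmetric_command(metric, value):
    """Build the gmetric argument list for one metric (hypothetical naming)."""
    return ["gmetric", "--name", metric, "--value", str(value),
            "--type", "uint32", "--units", "errors"]


def report(root="/sys/class/infiniband"):
    """Publish every counter found as a Ganglia metric via gmetric."""
    for metric, value in read_port_counters(root).items():
        subprocess.run(gmetric_command(metric, value), check=True)
```

Calling report() from cron on each compute node would give ganglia-alert values to compare against its thresholds.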
Any ideas welcomed please.