[Beowulf] ibswinfo, a tool to monitor unmanaged Infiniband switches
Kilian Cavalotti
kilian.cavalotti.work at gmail.com
Thu Apr 30 13:57:46 PDT 2020
Dear Beowulfers,
If your clusters use Infiniband, you know there are only two types of
switches: managed and unmanaged. The former come with SSH, a web
interface, SNMP and everything; the latter come with LEDs.
The only (and officially recommended) way to monitor unmanaged
switches is to go take a physical look at their PSU and fan LEDs from
time to time, which is obviously not ideal for remote administration,
monitoring, or getting an alert when something's wrong.
To solve that problem, we made a little shell script that gets
inventory data, status info, and metrics like fan speeds,
temperatures or power usage from unmanaged Infiniband switches:
https://github.com/stanford-rc/ibswinfo
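Once metrics like these are available as text, alerting can be a few
lines of shell. Here is a minimal sketch of the idea. Note that the
"name=value" input format, the check_metrics function and the
thresholds are all assumptions made up for this illustration; this is
not the actual output format or interface of ibswinfo, so the parsing
would need to be adapted to whatever the script really prints.

```shell
#!/bin/sh
# Hypothetical alert check. The "name=value" input format is an
# assumption for this sketch, NOT the real output of ibswinfo.

# check_metrics MAX_TEMP_C MIN_FAN_RPM
# Reads "name=value" lines on stdin, prints one ALERT line per
# violation, and returns non-zero if any alert was printed.
check_metrics() {
    awk -F= -v max_temp="$1" -v min_rpm="$2" '
        # Any metric whose name starts with "temp" is a temperature.
        $1 ~ /^temp/ && $2 + 0 > max_temp + 0 {
            printf "ALERT: %s is %s C (max %s)\n", $1, $2, max_temp
            bad = 1
        }
        # Any metric whose name starts with "fan" is a fan speed.
        $1 ~ /^fan/ && $2 + 0 < min_rpm + 0 {
            printf "ALERT: %s is %s rpm (min %s)\n", $1, $2, min_rpm
            bad = 1
        }
        END { exit bad }
    '
}
```

For instance, piping the two lines "temp_c=72" and "fan1_rpm=4500"
into "check_metrics 70 1000" would flag only the temperature, and the
non-zero exit status can be used to trigger a mail or a monitoring
system check from cron.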
It took a little reverse-engineering and a good amount of guessing,
but it seems to work, it fits the need, and well... it's free. So
we're happy to share it with everyone, in case it could be useful to
someone else.
Cheers,
--
Kilian