[Beowulf] Remote console management
julien.leduc at lri.fr
Fri Sep 23 04:33:22 PDT 2005
Bruce Allen wrote:
> We're getting ready to put together our next large Linux compute
> cluster. This time around, we'd like to be able to interact with the
> machines remotely. By this I mean that if a machine is locked up,
> we'd like to be able to see what's on the console, power cycle it,
> mess with BIOS settings, and so on, WITHOUT having to drive to work,
> go into the cluster room, etc.
That is the goal, but every solution I have tried still meant a
monthly trip to the cluster room to manually reboot the problematic
nodes.
> One possible solution is to buy nodes that have IPMI cards. These
> piggyback on the ethernet LAN and let you interact with the machine
> even in the absence of an OS. With the appropriate tools running on a
> remote machine, you can interact with the nodes even if they have no
> OS on them or are hung.
I would say that it depends on the problem that hung the machine... for
example, there are well-known problems with IPMI cards that can no
longer be contacted when installed in a FreeBSD system.
Moreover, before buying IPMI cards, you should be aware that there
are different hardware implementations of IPMI (have a look at
Intel's website; they have some slides explaining the difference between
cheap IPMI and a complete implementation).
> Another solution is to use the DB9 serial ports of the nodes. You
> have an 'administrative' box containing lots of high-port-count serial
> cards (eg, Cyclades 32 or 64 port cards) and then run a serial cable
> from each node to this box. By remotely logging into this admin box
> you can access the serial ports of the machines, and if the BIOS has
> the right settings/support, this lets you have keyboard/console access.
> Or one can do both IPMI + remote serial port access.
Remote serial port access should be done outside IPMI, but I would
still say that it depends on the IPMI board you are installing.
I even think that if you want to cut costs, you can avoid IPMI and
rely on ssh first, then remote serial port login, and finally controlled
power plugs to reboot the nodes if the previous solutions do not work.
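As a rough illustration of that fallback order (ssh, then the serial
console, then the power plug), here is a minimal Python sketch. The
function name, the step names and the convention that each step returns
True on success are all assumptions made for the example; the real
actions are site-specific and are only stubbed out here.

```python
from typing import Callable, List, Optional, Tuple

# Each recovery step is a (name, action) pair. The action returns True
# when the node came back. Real actions (an ssh reboot, a serial console
# login, a power-plug toggle) are site-specific and not shown here.
Step = Tuple[str, Callable[[str], bool]]


def recover(node: str, steps: List[Step]) -> Optional[str]:
    """Try each step in order; return the name of the first that works."""
    for name, action in steps:
        try:
            if action(node):
                return name
        except Exception:
            # A failing step (timeout, refused connection, ...) simply
            # means we escalate to the next, more drastic one.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical stand-ins: ssh and serial fail, the power plug works.
    demo_steps: List[Step] = [
        ("ssh", lambda node: False),
        ("serial", lambda node: False),
        ("power-plug", lambda node: True),
    ]
    print(recover("node042", demo_steps))
```

Ordering the steps from least to most drastic means the power plug is
only touched when nothing gentler has worked.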
> Could people on this list please report their experiences with these
> or other approaches? In particular, does someone have a simple and
> inexpensive solution (say < $100/node) which lets them remotely:
> - power cycle a machine
> - examine/set BIOS values
> - look at console output even for a dead/locked/unresponsive box
> - ???
A cheap solution I used last year was USB to 8 x DB9 adapters with
null-modem cables. Along with kermit, this gives you a cheap terminal
server (extended with the right number of USB hubs).
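To size such a setup, the arithmetic is simple enough to sketch in
Python. The only figure taken from the description is 8 serial ports per
USB adapter; the function names are made up for the example, and how
many hubs you need depends on your hubs and host ports, so that part is
left out.

```python
import math

PORTS_PER_ADAPTER = 8  # one USB adapter provides 8 DB9 serial ports


def adapters_needed(nodes: int) -> int:
    """Number of 8-port adapters required to give every node one port."""
    return math.ceil(nodes / PORTS_PER_ADAPTER)


def spare_ports(nodes: int) -> int:
    """Serial ports left unused once every node has been connected."""
    return adapters_needed(nodes) * PORTS_PER_ADAPTER - nodes


if __name__ == "__main__":
    for n in (16, 64, 100):
        print(n, adapters_needed(n), spare_ports(n))
```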
Something interesting we used (and have been using without any problem
since installation) is a homemade reboot solution that replaces the
front-panel switch with a controlled switch (in the final hardware
design we found some industrial-grade controlled transistors). Each box
controls 16 nodes and you can chain 256 of them, which is fine for big
clusters. The only problem is that, as a homemade solution, you have to
solder everything yourself. Replacing the front panels is not a big
deal, because it just means replacing the original pins with those of
your solution; no soldering should be required on the nodes.
The cost is about $100 for 16 nodes, if I remember correctly. After that
you can control those transistors to reboot / halt / start every node
from a single RS-232 port.
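To put numbers on this, here is a small Python sketch of the capacity
and cost figures. It assumes that "chain 256 of them" means up to 256
boxes on one chain, and that nodes are numbered from 0 and filled box by
box; both the numbering scheme and the function names are assumptions
for illustration. The actual command format on the serial line is not
described here, so it is not shown.

```python
NODES_PER_BOX = 16        # each box controls 16 nodes
MAX_BOXES = 256           # assumed reading of "you can chain 256 of them"
COST_PER_BOX_USD = 100.0  # "about 100$ for 16 nodes", approximate


def max_nodes() -> int:
    """Largest cluster one chain could drive under these assumptions."""
    return NODES_PER_BOX * MAX_BOXES


def cost_per_node() -> float:
    """Approximate hardware cost per node in USD."""
    return COST_PER_BOX_USD / NODES_PER_BOX


def box_and_channel(node_index: int) -> tuple:
    """Map a 0-based node index to (box number, channel on that box)."""
    if not 0 <= node_index < max_nodes():
        raise ValueError(f"node index {node_index} is out of range")
    return divmod(node_index, NODES_PER_BOX)


if __name__ == "__main__":
    print(max_nodes())          # 4096
    print(cost_per_node())      # 6.25
    print(box_and_channel(37))  # (2, 5)
```

At roughly $6 per node, this is well under the $100/node budget
mentioned above, though it only covers power control, not console or
BIOS access.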
I can send you more details if you are interested.