[Beowulf] Remote console management
sdm900 at gmail.com
Sat Sep 24 18:57:20 PDT 2005
Due to the cost (in Australia at the time) of adding decent
monitoring/system control, we went down this exact route. While we
are happy with the cluster, there are some real issues.
Unfortunately, lmsensors does not work with our motherboards, so we
cannot monitor the system (beyond what our modified OpenPBS was
already doing). The downside is reduced quality of service for our
users. We get reports of users' jobs randomly dying before they
finish. When we track this down, it usually correlates with a
particular node or a couple of nodes. We pull them out of the
cluster, run hardware diagnostics, and discover that a fan or
something has died and that the CPU is running hot... and has
consequently slowed down... resulting in longer run times for user
jobs... which means they go over their requested walltime and get
killed by PBS.
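Where lmsensors does work, even a crude periodic check over its
output would catch this chain of events days earlier. A minimal
sketch (the thresholds, sensor label names and the sample output are
all illustrative, not taken from any particular board):

```python
import re

# Hypothetical thresholds -- tune for your hardware.
MAX_CPU_TEMP_C = 70.0
MIN_FAN_RPM = 1000

def check_sensors(text):
    """Scan sensors-style output for dead fans or hot CPUs.

    Returns a list of warning strings; an empty list means healthy.
    """
    warnings = []
    for line in text.splitlines():
        # e.g. "fan1:     300 RPM"
        m = re.match(r"(fan\d+):\s+(\d+)\s*RPM", line)
        if m and int(m.group(2)) < MIN_FAN_RPM:
            warnings.append(f"{m.group(1)} low: {m.group(2)} RPM")
        # e.g. "CPU Temp:  +82.0 C"
        m = re.match(r"(CPU\s*\d*\s*Temp):\s+\+?([\d.]+)", line)
        if m and float(m.group(2)) > MAX_CPU_TEMP_C:
            warnings.append(f"{m.group(1)} high: {m.group(2)} C")
    return warnings

sample = """\
fan1:     300 RPM
fan2:    2400 RPM
CPU Temp:  +82.0 C
"""
print(check_sensors(sample))
```

Run something like this from cron on each node and mail any
non-empty result to the admin list: a flagged node gets pulled
before users' jobs start blowing their walltime.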
This whole process takes several weeks to diagnose a broken fan: by
the time users complain, we trawl through the logs, pull the node
out, test it and get it fixed. The people-time put in to find the
problem is far greater than the cost of the nodes.
On top of that, with cheap nodes we see about a 10% variation in
runtimes with our test suite...
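That variation can itself be used as a canary: run a fixed benchmark
job periodically and flag any node drifting well above the cluster
median. A sketch (node names, runtimes and the 10% tolerance are
made up for illustration):

```python
from statistics import median

def flag_slow_nodes(runtimes, tolerance=0.10):
    """Flag nodes whose mean runtime on a fixed test job exceeds the
    cluster median by more than `tolerance`.

    runtimes: dict mapping node name -> list of runtimes in seconds.
    """
    means = {node: sum(ts) / len(ts) for node, ts in runtimes.items()}
    baseline = median(means.values())
    return sorted(node for node, m in means.items()
                  if m > baseline * (1 + tolerance))

# Hypothetical data: node07's fan has died and its CPU has throttled.
runtimes = {
    "node05": [3600, 3620, 3590],
    "node06": [3610, 3605, 3615],
    "node07": [5200, 5350, 5100],
}
print(flag_slow_nodes(runtimes))
```

A throttled CPU shows up here long before anyone reads the syslog,
turning weeks of log trawling into a one-line report.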
With our current Itanium cluster, we get more information than we
can poke a stick at. Since the nodes are far more sophisticated and
server-grade, we know within a few minutes if some piece of hardware
has failed, and we can log a call and get it replaced. It's also
worth noting that we get <1% variation in runtimes on our test
suite.
It's quite surprising the difference quality hardware makes.
> This brings up an interesting point, and I realize this does come
> down to a design philosophy, but cluster economics sometimes create
> non-solutions. So here is another way to look at "out of band
> monitoring".
> Instead of adding layers of monitoring and control, why not take that
> cost and buy extra nodes. (but make sure you have a remote hard power
> cycle capability). If a node dies and cannot be rebooted, turn it
> off, and
> fix it later. Of course monitoring fans and temperatures is a good
> thing (tm), but if a node will not boot and you have to play with
> the BIOS, then I would consider it broken.
> Because you have "over capacity" in your cluster (you bought extra
> nodes), this does not impact the amount of work that needs to get
> done. Indeed, prior to the failure you can have the extra nodes
> working for you. You fully understand that at various times one or
> two nodes will be offline. They are taken out of the scheduler and
> there is no need to fix them right away.
> This approach also depends on what you are doing with your
> cluster and the cost of nodes, etc. In some cases out-of-band
> access is a good thing. In other cases, the "STONITH-AFIT" ("shoot
> the other node in the head and fix it tomorrow") approach is also
> reasonable.
Dr Stuart Midgley
sdm900 at gmail.com