[Beowulf] Remote console management

Sat Sep 24 18:57:20 PDT 2005

Due to the cost (in Australia at the time) of adding decent  
monitoring/system control, we went down this exact route.  While we  
are happy with the cluster, there are some real issues.

Unfortunately, lm_sensors does not work with our motherboards, so we
cannot do any monitoring of the system (other than what our modified
OpenPBS was doing).  The downside is a reduced quality of service
to our users.  We get reports of users' jobs randomly dying before
they finish.  When we track this down, it usually correlates with a
particular node or a couple of nodes.  We pull them out of the
cluster, run hardware diagnostics and discover that a fan or
something has died and that the CPU is running hot... and has
consequently slowed down... resulting in longer run times for user
jobs... which means they go over their requested walltime and get
killed by PBS.

This whole process takes several weeks just to diagnose a broken fan:
users complain, we trawl through the logs, pull the node out, test it
and get it fixed.  The people time put in to find the problem is far
greater than the cost of the nodes.
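
The correlation step described above (killed jobs clustering on one or two nodes) can be done mechanically rather than by trawling logs by hand.  What follows is only a rough sketch, not what we actually ran: it assumes you have already extracted, by whatever means, a list of jobs killed for exceeding walltime together with the nodes each one ran on.  The function name, the input layout and the `min_kills` threshold are all made up for illustration.

```python
from collections import Counter


def suspect_nodes(killed_jobs, min_kills=3):
    """Return (node, kill_count) pairs for nodes seen in many killed jobs.

    killed_jobs: iterable of (job_id, [node names]) for jobs that were
    killed for exceeding their requested walltime.  This layout is a
    hypothetical one, not a real PBS log format.
    min_kills: arbitrary threshold below which a node is not reported.
    """
    counts = Counter()
    for _job_id, nodes in killed_jobs:
        # Count each node once per job, even if it is listed twice.
        counts.update(set(nodes))
    # most_common() sorts by descending count, worst offenders first.
    return [(node, n) for node, n in counts.most_common() if n >= min_kills]


if __name__ == "__main__":
    killed = [
        ("101", ["n01", "n02"]),
        ("102", ["n02", "n03"]),
        ("103", ["n02"]),
        ("104", ["n04"]),
    ]
    for node, n in suspect_nodes(killed):
        print(node, n)
```

A node that keeps turning up here is a candidate for being pulled and tested; it proves nothing by itself, since a busy node will also appear in more jobs overall.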

On top of that, with cheap nodes we find about a 10% variation of  
runtimes with our test suite...
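
For anyone wanting to compare their own nodes, one simple way to put a number on this is sketched below.  It assumes "variation" means the spread between the fastest and slowest run relative to the mean; that definition is an assumption for the sake of the example, and the function name is hypothetical.

```python
def runtime_spread(runtimes):
    """Relative spread of runtimes: (max - min) / mean.

    runtimes: wall-clock times (any consistent unit) for the same test
    run repeatedly or on different nodes.  Assumed definition of
    "variation"; other measures (e.g. standard deviation) are possible.
    """
    if not runtimes:
        raise ValueError("need at least one runtime")
    mean = sum(runtimes) / len(runtimes)
    return (max(runtimes) - min(runtimes)) / mean


if __name__ == "__main__":
    # Made-up numbers purely to show the call.
    print(f"{runtime_spread([100.0, 105.0, 95.0]):.1%}")
```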

With our current Itanium cluster, we can get more information
than we can poke a stick at.  Since the nodes are far more
sophisticated and server grade, we know within a few minutes if some
piece of hardware has failed, and we can log a call and get it
replaced.  It's also worth noting that we get <1% variation in
runtimes on our test suite.

It's quite surprising the difference quality hardware makes.

Stu.

> This brings up an interesting point and I realize this does come down
> to a design philosophy, but cluster economics sometimes create
> non-standard solutions. So here is another way to look at "out of band
> monitoring". Instead of adding layers of monitoring and control, why
> not take that cost and buy extra nodes. (but make sure you have a
> remote hard power cycle capability). If a node dies and cannot be
> rebooted, turn it off, and fix it later. Of course monitoring fans and
> temperatures is a good thing (tm), but if a node will not boot, and you
> have to play with the BIOS, then I would consider it broken.
>
> Because you have "over capacity" in your cluster (you bought extra
> nodes) this does not impact the amount of work that needs to get done.
> Indeed, prior to the failure you can have the extra nodes working for
> you. You fully understand that at various times one or two nodes will
> be off line. They are taken out of the scheduler and there is no need
> to fix them right away.
>
> This approach also depends on what you are doing with your
> cluster and the cost of nodes etc. In some cases out-of-band access
> is a good thing. In other cases, the "STONIH-AFIT" (shoot the other
> node in the head and fix it tomorrow) approach is also reasonable.

--
Dr Stuart Midgley
sdm900 at gmail.com