[Beowulf] Configuration Management and Monitoring of a Debian Etch Beowulf Cluster

Wed Sep 12 06:16:01 PDT 2007

Hi,

> I've managed to put together a simple 2-node cluster using Debian etch ,
> OpenMPI , FAI & Cfengine.

Do you mean 2 nodes + 1 server, or 1 server and 1 node?

> I'm looking for ideas that can help me with building a better
> self-healing cluster. Right now I'm making rule files for cfengine and
> would acknowledge any input on sample files and important configurations
> that need to be made for the cluster's health. (Although it's
> site-specific but I'm sure I can get good hints out of them)

Since everything depends on your configuration and every cluster is
different, I can't give you specific hints there.

Here is what I think anyway: FAI is a good starting point. It gives you
a fully automated install on all your nodes, and you can then grow
incrementally from it with cfengine. BUT if you intend to use your
cluster for a long period of time, going back to FAI every time (so zero
data on the disk and a brand new install) is expensive: an image-based
deployment system would be more efficient.

Start with FAI -> fresh install -> incremental updates with cfengine

Then manage some snapshots of the system with an image-based deployment
system, and resynchronize your nodes from time to time with a fresh
deployment of the latest snapshot.

It costs less in terms of time and resource consumption than starting
from scratch; moreover, recovery is faster (and safer) than replaying
the full process from the start.

If you are using Debian, you have to be sure your package repository is
synchronized for all your nodes (between 2 cluster snapshots). Setting
up an apt-cacher for your cluster (or a more general HTTP proxy server)
lets you enforce package synchronization across your cluster and make
fair use of the external Debian repository. OK, this really matters
for huge clusters, but all clusters are conceived to grow.
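
To illustrate the proxy idea, here is a minimal sketch in Python that
builds the apt configuration line pointing a node at a shared HTTP
proxy. The host name "cluster-server" and the port 3142 (the usual
apt-cacher default, as far as I know) are assumptions; adapt both to
your own setup. The resulting text would typically be dropped into a
file under /etc/apt/apt.conf.d/ on every node, for example by cfengine.

```python
def apt_proxy_snippet(proxy_host, proxy_port=3142):
    """Return an apt.conf fragment routing HTTP package downloads via a proxy.

    proxy_host and proxy_port are placeholders for wherever the cache
    (apt-cacher or a general HTTP proxy) is listening on your cluster.
    """
    return 'Acquire::http::Proxy "http://%s:%d/";\n' % (proxy_host, proxy_port)


# "cluster-server" is a hypothetical host name used only for illustration.
EXAMPLE_SNIPPET = apt_proxy_snippet("cluster-server")
```

Since every node then fetches through the same cache, two installs done
a few days apart are more likely to see the same package versions.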

This is critical, since replaying package installs at different times on
different nodes can lead to many different results (and lots of
failures).
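
One way to spot this kind of drift is to compare the installed package
lists of your nodes. The sketch below assumes you have collected, for
each node, text with one "package version" pair per line (for instance
from dpkg-query -W -f='${Package} ${Version}\n'); how you collect it
(ssh, cfengine, ...) is left out. The function names and the node and
package names in the usage note are made up for the example.

```python
from typing import Dict, Optional


def parse_package_list(text: str) -> Dict[str, str]:
    """Parse 'package version' lines into a {package: version} mapping."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            packages[fields[0]] = fields[1]
    return packages


def find_drift(nodes: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the packages whose version (or presence) differs between nodes.

    The result maps each drifting package to {node: version}, with None
    for a node where the package is not installed.
    """
    all_packages = set()
    for pkgs in nodes.values():
        all_packages.update(pkgs)

    drift = {}
    for pkg in sorted(all_packages):
        versions = {node: pkgs.get(pkg) for node, pkgs in nodes.items()}
        if len(set(versions.values())) > 1:
            drift[pkg] = versions
    return drift
```

For example, with "pkg-a 1.0-1" and "pkg-b 2.0-1" on node01 but
"pkg-b 2.0-2" on node02, find_drift reports only pkg-b, together with
the version seen on each node.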

> However I'd also be glad to see if you have any monitoring system in
> mind that can cooperate with cfengine in the maintenance job. I've
> looked briefly into Ganglia and Nagios so far. It seems Ganglia is
> mostly meant for large (groups of) clusters and focuses on hw resources.
> Nagios seems to be better-suited for my job, but the gurus at cfengine
> mailing list believe that cfenvd & cfexecd can provide equal monitoring
> & recovery capability (in terms of response time).
> What's your take on either of them?

Ganglia is OK, and can be used to quickly check your cluster usage;
Nagios deals with the critical services for the cluster (NFS, DNS, DHCP
server, TFTP...). Be careful about the load added to your network by
all the monitoring tools you are setting up (using cfenvd & cfexecd
could give you better control of the additional load on your
infrastructure).
To monitor the cluster status, you should use your batch scheduler
interface (to look for free nodes, dead ones...).
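
As a rough idea of what a lightweight check of critical services can
look like, here is a sketch in Python that tests whether TCP services
answer and returns a status in the style of a Nagios plugin (0 for OK,
2 for CRITICAL). It only covers services reachable over TCP, so DHCP
and TFTP (UDP) are not handled. The function names and the host name in
the usage note are assumptions for the example, not part of any tool.

```python
import socket

# Exit codes following the usual Nagios plugin convention.
OK, CRITICAL = 0, 2


def tcp_service_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or unresolvable host
        return False


def check_services(services, timeout=3.0):
    """Check a list of (name, host, port) tuples.

    Returns (exit_code, status_line): CRITICAL if any service is
    unreachable, OK otherwise.
    """
    down = [name for name, host, port in services
            if not tcp_service_up(host, port, timeout)]
    if down:
        return CRITICAL, "CRITICAL - unreachable: " + ", ".join(down)
    return OK, "OK - all %d services reachable" % len(services)
```

A script built on this could call
check_services([("NFS", "cluster-server", 2049), ("DNS", "cluster-server", 53)]),
print the status line and exit with the returned code. One connection
attempt per service per run keeps the extra network load small, and you
control it through how often the check is scheduled.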

Hope this helped,

Julien Leduc