[Beowulf] Cluster install and admin approach (newbie question)

Fri Aug 28 08:07:32 PDT 2009

> * if the /var filesystem is shared, race conditions happen (all nodes
> want to write on the same files). I had this problem and moved to a
> local /var filesystem.

indeed, shared /var is simply a bug.  non-shared NFS /var is viable,
but generally pointless.

> * if /var is local (which it may be, because the disks do exist), the
> whole point of a central point for easy admin vanishes, because I would

eh?

> have to create all the /var structure that packages need to work, on
> each node (would be easier to do: "for $node; ssh $install_cmd; done",
> than guessing which dirs I need to create or files to copy).

but if your nodes are nfs-root, you won't be installing anything on them:
you'll be installing on the nfs-root.
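
one way to do that is to chroot into the image on the server and run the
package manager there. a minimal sketch, assuming the image lives in
/export/nfsroot and the distro uses yum (both are placeholders, as is the
package name); it only prints the command unless RUN=1 is set:

```shell
#!/bin/sh
# sketch: install packages into the nfs-root image from the server,
# instead of ssh-ing into every node.
# NFSROOT is a hypothetical export path; adjust to your layout.
NFSROOT=${NFSROOT:-/export/nfsroot}

install_in_image() {
    if [ "${RUN:-0}" = 1 ]; then
        # really run the command inside the image (needs root)
        chroot "$NFSROOT" "$@"
    else
        # dry run: just show what would be executed
        printf '%s\n' "chroot $NFSROOT $*"
    fi
}

# yum and the package name are placeholders for your distro's tools
install_in_image yum -y install openmpi
```

every node sees the new package the next time it reads the shared root, so
there is no per-node install loop.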

> * if /var is tmpfs all forensics are certainly gone after failure
> (Murphy told me this one ;).

syslog is very happy to log over the network.
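
for example, with classic syslogd or rsyslog a one-line rule forwards
everything to a central host over UDP. a small sketch that emits that rule;
"loghost" is a placeholder for your head node, and the receiving daemon has
to be told to accept remote messages (sysklogd's syslogd needs -r, for
instance):

```shell
#!/bin/sh
# sketch: print the syslog.conf rule that forwards all messages
# to a central log host. the host name is a hypothetical placeholder.

forward_rule() {
    # "*.*" selects every facility and priority;
    # a single "@" in front of the host means "send via UDP".
    printf '*.* @%s\n' "$1"
}

# print the rule; add it to the syslog config inside the nfs-root image
forward_rule loghost
```

with that in the image's syslog config, the nodes' logs end up on the head
node and survive a node crash even if /var is tmpfs.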

> Everything I read on the subject underlines the advantages of
> diskless approaches but fails to warn about this problem and/or to solve
> it. On the other side, the distributed approach tools (where every
> node is autonomous) seem to be halted (as SystemImager - which is used
> in the OSCAR project) or discontinued, or truly overblown for my
> reference scale (IBM's xCAT); so it really seems that I'm missing

there's also OneSIS.

> something.
>
> The question is: what do you do about this?

setting up your own nfs-root cluster is a simple exercise.  if you're not
very familiar with *nix booting/daemons/init scripts, it will take a few
tries to get the config right, but the end result is pretty simple and
robust.  remote syslog, preferably with console-over-net (ipmi sol,
netconsole), means that there's nothing interesting on the local /var.
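
as a rough sketch of the two pieces of config involved: the kernel
arguments a node needs to mount its root over NFS, and the netconsole
module parameter that sends kernel messages to a log host. the addresses,
export path and interface name are made-up examples, and this assumes the
node kernel (or its initramfs) supports NFS root:

```shell
#!/bin/sh
# sketch: build the boot arguments for an nfs-root node and the
# netconsole parameter. all addresses and paths are hypothetical.

nfsroot_args() {
    # $1 = NFS server address, $2 = exported root directory
    # ip=dhcp lets the kernel configure the interface before mounting
    printf 'root=/dev/nfs nfsroot=%s:%s ip=dhcp\n' "$1" "$2"
}

netconsole_arg() {
    # $1 = local interface, $2 = address of the host collecting messages
    # ports are left empty so the kernel defaults (6665 -> 6666) apply;
    # an empty target MAC means the packets are sent to broadcast
    printf 'netconsole=@/%s,@%s/\n' "$1" "$2"
}

nfsroot_args 10.0.0.1 /export/nfsroot
netconsole_arg eth0 10.0.0.1
```

the first line would go on the kernel command line in the node's PXE
config; the second can be passed to the module, e.g.
"modprobe netconsole $(netconsole_arg eth0 10.0.0.1)". for ipmi sol the
equivalent is attaching with something like
"ipmitool -I lanplus -H <bmc> -U <user> sol activate" from the head node.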