I've got 8 linux boxes, what now

Fri Dec 7 13:59:00 PST 2001

> On Thu, Dec 06, 2001 at 04:40:43PM -0800, Chris Majewski wrote:
> 
> > We're a computer science department investigating, very tentatively,
> > the possibility of installing a linux cluster as our next
> > general-purpose compute server. To date we've been using things like
> > expensive multiprocessor SUN machines.

On Fri, 7 Dec 2001, Greg Lindahl wrote:

><deleted excellent advice/suggestions>

> With any of the 3, you still need to work out a way of administering
> the system to keep them synchronized.
> 
> TurboLinux has a cluster admin system that helps you keep system disks
> synchronized.
> 
> You can use "rsync" or cfengine, which are traditional Linux sysadmin
> tools.
> 
> Scyld Beowulf doesn't really address this situation. However, it
> could, with a modest amount of work. Don, do you have any comments
> about this? Since it's many users running many jobs, including
> interactive ones, it's not really the area that "Beowulf clusters"
> traditionally address. I wish people would work on this, though, as
> I'd love to have a prepackaged solution I could sell in this area.

To add a tiny bit to Greg's reply -- one more solution is to combine
three tools: kickstart to install your compute nodes (every node is
installed from a single script, so they all come out identical), yup (or
a similar tool) to update your hosts (every host is updated from a
common package repository by a common scripted set of instructions), and
NIS/NFS to manage user accounts and regulate access.  With tools like
these correctly configured, installing (or reinstalling) a PXE-enabled
cluster node can be reduced to a five-minute task -- plug it in, turn it
on, and wait five minutes or until you hear the system reboot itself
(which might take only two to three minutes, depending on your network
bandwidth, your server load, and the size of your node installation).
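To make the "single script" idea concrete, here is a rough sketch (in
Python) of generating one kickstart file that every node installs from.
Everything specific in it is an assumption for illustration: the
function name render_kickstart, the repository URL, the partition
sizes, the timezone and the package list are all made up, and the
directive set is deliberately incomplete and varies between installer
versions, so check it against your distribution's kickstart
documentation before using anything like it.

```python
def render_kickstart(repo_url, root_password_hash, packages, post_commands):
    """Render a minimal, illustrative kickstart file as a string.

    All nodes are meant to install from the same rendered file, which is
    what makes them come out identical.  This is a sketch, not a complete
    or validated configuration.
    """
    lines = [
        # Where the installer fetches packages from (hypothetical URL).
        f"url --url {repo_url}",
        "lang en_US",
        "keyboard us",
        # Nodes get their address from DHCP, which fits a PXE-booted setup.
        "network --bootproto dhcp",
        f"rootpw --iscrypted {root_password_hash}",
        "timezone US/Eastern",
        # Wipe the disk and lay out swap plus a root that fills the rest.
        # The sizes (in MB) are placeholders.
        "clearpart --all",
        "part swap --size 512",
        "part / --size 1024 --grow",
        # Reboot when done, so the node needs no attention at the console.
        "reboot",
        "",
        "%packages",
    ]
    lines.extend(packages)
    lines.append("")
    # Site-specific setup (update tool, NIS/NFS client config) would be
    # hooked in here as shell commands.
    lines.append("%post")
    lines.extend(post_commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_kickstart(
        repo_url="http://install.example.org/redhat",   # assumption
        root_password_hash="REPLACE_WITH_CRYPTED_HASH",  # placeholder
        packages=["openssh-server", "nfs-utils", "ypbind"],
        post_commands=["echo node configured"],
    ))
```

The point of keeping it as one rendered file is that a reinstall is then
just a network boot: nothing about the node is configured by hand.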

Package management tools like RPM, together with kickstart and yup,
provide an excellent solution to the problem of keeping anything from a
cluster to a department of workstations synchronized AND version
consistent.  NIS isn't necessarily an ideal solution for all clusters,
but it should work fine for your purpose, as should NFS.  Both scale
well up to at least 100 workstations/nodes.
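As a small illustration of what "version consistent" means in practice,
here is a sketch (Python again) that compares per-node package lists and
reports anything that differs.  It assumes you have already collected
one "name version-release" line per package from each node, for example
with something like rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\n'
run over ssh; the collection step, the function names and the node
names are my assumptions, not part of any existing tool.

```python
def parse_package_list(text):
    """Parse 'name version-release' lines into a {name: version} dict.

    Note: a package installed in several versions at once (kernels, for
    instance) collapses to its last listed version in this sketch.
    """
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            name, version = fields
            packages[name] = version
    return packages


def find_version_drift(node_packages):
    """Return {package: {node: version_or_None}} for inconsistent packages.

    node_packages maps a node name to its {name: version} dict.  A
    package missing from a node shows up with version None.
    """
    all_packages = set()
    for packages in node_packages.values():
        all_packages.update(packages)

    drift = {}
    for name in sorted(all_packages):
        versions = {node: pkgs.get(name) for node, pkgs in node_packages.items()}
        # More than one distinct value means the nodes disagree.
        if len(set(versions.values())) > 1:
            drift[name] = versions
    return drift


if __name__ == "__main__":
    nodes = {
        "node01": parse_package_list("glibc 2.2.4-19\nopenssh 2.9p2-12\n"),
        "node02": parse_package_list("glibc 2.2.4-19\nopenssh 2.9p2-8\n"),
    }
    for package, versions in find_version_drift(nodes).items():
        print(package, versions)
```

An empty result is what you want to see; anything else names the
package and the nodes that have fallen out of step.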

I'd also like to second Greg's wish for a better, totally GPL/open
source top-level toolset for managing "plain old compute clusters" where
users might wish to log in to nodes OR "the cluster", run interactive
jobs, and so forth, and still get job migration, load balancing and
control.  Mosix isn't an ideal solution because it runs in kernelspace
and requires a custom kernel, in addition to being resource-inefficient
and relatively fault-intolerant.  A number of the toolsets that are out
there are also less open than I'd like source-wise -- I dislike having
to "register" at a site in order to be allowed to download source or
binary installations.

A nontrivial problem, I know, but that's what computer scientists are
FOR -- to solve nontrivial problems.  Right?

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu