[Beowulf] Which do you prefer local disk installed OS or NFS rooted?

Tony Travis ajt at rri.sari.ac.uk
Fri Jul 16 03:10:47 PDT 2004


Brent M. Clements wrote:

> Good Afternoon All:
> 
> Let me start by giving a little background.
> 
> Currently all of our clusters on campus have local disks, on which an
> OS image is installed using the SystemImager suite of tools. There are
> one or two clusters that are NFS-rooted.
> 
> I'd like to know from all of you which way everyone is leaning when it
> comes to clusters and OS distribution.

Hello, Brent.

We've got a 32-node Athlon XP 2400+ cluster running "ClusterNFS" under 
openMosix 2.4.22-openmosix-2:

	http://clusternfs.sourceforge.net/
	http://openmosix.sourceforge.net/

The root partition of the 'head' node is exported read-only to the 
'diskless' compute nodes, which have symbolic links to their own 
writable ('volatile') files under:

	/export/root/<IP ADDRESS>
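
ClusterNFS chooses which file a client sees from 'tags' in the file name, and the mknode script attached to this post relies on that: it creates names such as dev$$IP=192.168.1.5$$ (for node 5) and mtab$$CLIENT$$. The sketch below models that lookup in plain sh. It assumes the order IP tag, then CLIENT tag, then the untagged name, and ignores ClusterNFS's other tag forms; resolve_tagged is a hypothetical helper, not part of ClusterNFS, so check the ClusterNFS documentation before relying on the order.

```sh
#!/bin/sh
# Sketch: which file would a ClusterNFS-style server hand to a client?
# ASSUMPTION: lookup order is name$$IP=<addr>$$, then name$$CLIENT$$,
# then the plain name. resolve_tagged is hypothetical, not ClusterNFS code.
resolve_tagged() {
	dir=$1
	name=$2
	ip=$3
	for candidate in "$name"'$$IP='"$ip"'$$' "$name"'$$CLIENT$$' "$name"; do
		# -e follows symlinks, so also test -h: a dangling link (such as
		# the DoNotExecuteOnClients links made by mknode) still counts
		if [ -e "$dir/$candidate" ] || [ -h "$dir/$candidate" ]; then
			printf '%s\n' "$candidate"
			return 0
		fi
	done
	return 1
}
```

Under that assumption, a directory holding fstab, fstab$$CLIENT$$ and hosts would give a client at 192.168.1.5 the fstab$$CLIENT$$ file as its fstab, and the plain hosts file.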

All the 'diskless' nodes have a 40GB local disk for:

	/dev/hda1 /var/tmp	# and /tmp -> /var/tmp
	/dev/hda2 swap
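
The mknode script only warns when /etc/fstab$$CLIENT$$ is missing; the client copy has to be written by hand. A minimal sketch of one matching the local disk layout above follows. It is not the fstab used on this cluster: it assumes ext3 on /dev/hda1 and that the NFS root is mounted by the kernel at boot (so it needs no entry here). write_client_fstab is a hypothetical helper, and ETC exists only so it can be tried on a scratch directory.

```sh
#!/bin/sh
# Sketch: write a client-only fstab for the local disk layout.
# ASSUMPTIONS: ext3 on /dev/hda1, swap on /dev/hda2, NFS root mounted
# by the kernel at boot. write_client_fstab is a hypothetical helper.
write_client_fstab() {
	etc=${ETC:-/etc}
	# One entry per line: device, mount point, type, options, dump, pass
	printf '%s\n' \
		'/dev/hda1  /var/tmp  ext3  defaults  0 2' \
		'/dev/hda2  swap      swap  defaults  0 0' \
		'none       /proc     proc  defaults  0 0' \
		> "$etc"/'fstab$$CLIENT$$'
}
```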

The system works well except for one problem: the 2GB limit on file size. 
This is an inevitable consequence of using ClusterNFS, and I've not been 
able to do anything about it yet. Our system is based on 'BOBCAT':

	http://www.epcc.ed.ac.uk/bobcat/
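
One way to confirm the 2GB limit on a particular mount is to try writing a single byte just past the 2GB mark. The sketch below uses dd's seek= operand (counted in units of bs=, here 1 byte), so the test file is sparse and needs almost no real disk space; probe_large_file is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: can a file in directory $1 grow past 2GB?
# probe_large_file is hypothetical; exit status 0 means the write worked.
probe_large_file() {
	f="$1/.largefile_probe.$$"     # $$ = this shell's PID, for uniqueness
	# Write one byte at offset 2^31 = 2147483648; the file stays sparse
	if dd if=/dev/zero of="$f" bs=1 count=1 seek=2147483648 2>/dev/null; then
		rm -f "$f"
		return 0
	fi
	rm -f "$f"
	return 1
}
```

Running it once on an NFS-mounted directory and once on /var/tmp (the local disk) shows whether the failure comes from the NFS path rather than the local filesystem.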

Our 'BOBCAT' cluster architecture consists of a 'head' node with three 
NICs running a ROOTNFS fileserver, and 'diskless' nodes (strictly 
speaking, 'dataless' nodes) with two NICs on different Ethernets, one 
for PXE/DHCP/NFS and the other for openMosix IPC:

PXE/DHCP/NFS		IPC			LAN
192.168.0.0		192.168.1.0		143.234.32.0

192.168.0.1	node1	192.168.1.1	mpe1	143.234.32.11	bobcat
192.168.0.2	node2	192.168.1.2	mpe2	143.234.32.12	topcat
192.168.0.3	node3	192.168.1.3	mpe3
...
192.168.0.32	node32	192.168.1.32	mpe32
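
The addressing scheme in the table is regular enough to generate. A minimal sketch, following the table's column headings (nodeN on 192.168.0.x for PXE/DHCP/NFS, mpeN on 192.168.1.x for IPC); gen_hosts is a hypothetical name, and the two prefixes are overridable variables because the mknode script attached below puts mpeN on 192.168.0 and nodeN on 192.168.1.

```sh
#!/bin/sh
# Sketch: print hosts entries for an n-node BOBCAT-style cluster.
# gen_hosts is hypothetical. Defaults follow the table's headings;
# override BOOT_NET / IPC_NET to match your own wiring.
gen_hosts() {
	n=$1
	boot=${BOOT_NET:-192.168.0}   # PXE/DHCP/NFS network: nodeN
	ipc=${IPC_NET:-192.168.1}     # openMosix IPC network: mpeN
	i=1
	while [ "$i" -le "$n" ]; do
		printf '%s.%d\tnode%d\n' "$boot" "$i" "$i"
		printf '%s.%d\tmpe%d\n' "$ipc" "$i" "$i"
		i=$((i + 1))
	done
}
```

For a 32-node cluster, gen_hosts 32 prints 64 lines; review them before merging into /etc/hosts rather than appending blindly.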

On our system, node1 (bobcat) is the PXE/DHCP/NFS server and node2 
(topcat) is used for interactive logins. I've done quite a lot of 
network monitoring using tools like "iptraf", "ibmonitor" and "iftop".

It is something of a myth that clusters using 'diskless' compute nodes 
generate 'high' NFS traffic. It depends on what they are doing: if you're 
running computationally intensive jobs, the NFS traffic is minimal once 
the programs are in the filesystem cache on the compute node. There is, 
of course, high NFS traffic when booting the 'diskless' nodes, but we 
manage to boot 30 nodes in about four minutes using an unmanaged 3Com 
100Base-T 'private' Ethernet switch (i.e. not on the LAN).
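
To check this on a given compute node without extra tools, the kernel's per-interface byte counters in /proc/net/dev can be sampled twice. A minimal, Linux-specific sketch; iface_bytes and iface_rate are hypothetical helper names, and which interface carries NFS depends on your wiring. On many 2.4-era kernels the counters are 32-bit and can wrap, so keep the interval to a few minutes at most on a busy 100Base-T link.

```sh
#!/bin/sh
# Sketch: average traffic on one interface from /proc/net/dev (Linux).
# iface_bytes / iface_rate are hypothetical helper names.

# Print "rx_bytes tx_bytes" for interface $1; fail if it is not listed.
iface_bytes() {
	awk -v ifc="$1" '
		{
			line = $0
			sub(/^[ \t]+/, "", line)      # strip leading blanks
			n = index(line, ":")
			if (n == 0 || substr(line, 1, n - 1) != ifc) next
			split(substr(line, n + 1), f, " ")
			print f[1], f[9]              # receive bytes, transmit bytes
			found = 1
		}
		END { exit !found }' /proc/net/dev
}

# Print average "rx_bytes/s tx_bytes/s" for interface $1 over $2 seconds.
iface_rate() {
	iface=$1
	secs=${2:-10}
	before=$(iface_bytes "$iface") || return 1
	sleep "$secs"
	after=$(iface_bytes "$iface") || return 1
	set -- $before $after               # rx1 tx1 rx2 tx2
	echo "$(( ($3 - $1) / secs )) $(( ($4 - $2) / secs ))"
}
```

For example, running iface_rate on the NFS interface for 60 seconds while a long job is running gives a direct figure for the steady-state NFS load on that node.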

One advantage of the 'BOBCAT' architecture is the segregation of network 
traffic: you can still control the compute nodes no matter how much IPC 
traffic is going on between them, because the IPC traffic is on a 
separate Ethernet. Our system uses 192.168.0.0 for openMosix IPC, and 
192.168.1.0 for ssh initiation of MPI processes and sockets.

I've written scripts to add and remove 'diskless' nodes from the cluster:

	mknode	# add a node to the cluster
	rmnode	# remove a node from the cluster
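
To confirm that mknode has done its job for a node, a small companion check can be sketched. This is a hypothetical helper, not one of the scripts attached to this post; it assumes mknode's naming (NFS address 192.168.1.n, tag $$IP=<address>$$), and the optional ROOT variable lets it be tried on a scratch directory instead of /.

```sh
#!/bin/sh
# checknode n: report missing per-node ClusterNFS entries for node n.
# Hypothetical companion to mknode/rmnode, using mknode's naming.
checknode() {
	n=$1
	root=${ROOT:-/}
	nfs=192.168.1.$n
	tag='$$IP='$nfs'$$'
	missing=0
	# Node-specific symlinks in the exported root
	for f in dev tmp root var; do
		if [ ! -h "$root/$f$tag" ]; then
			echo "missing link: $f$tag"
			missing=$((missing + 1))
		fi
	done
	# Writeable per-node directories the links point into
	for d in dev root tmp var; do
		if [ ! -d "$root/export/root/$nfs/$d" ]; then
			echo "missing dir: export/root/$nfs/$d"
			missing=$((missing + 1))
		fi
	done
	[ "$missing" -eq 0 ]
}
```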

Best wishes,

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687
-------------- next part --------------
#!/bin/sh
# @(#)mknode.sh  2004-02-18  A.J.Travis

#
# Make ClusterNFS files for diskless node
#

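if [ $# -ne 1 ]; then
	echo usage: mknode n
	exit 1
fi
# the node number becomes part of IP addresses and paths: digits only
case $1 in
''|*[!0-9]*)
	echo "mknode: node number must be numeric: $1" >&2
	exit 1
	;;
esac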

NET=192.168
IPC=$NET.0.$1
NFS=$NET.1.$1
TAG='$$'IP=$NFS'$$'

#
# Check client tag files are in place
#
cd /

# Mountpoint for /proc filesystem
if [ ! -d 'proc$$CLIENT$$' ]; then
	mkdir 'proc$$CLIENT$$'
fi

#
# Client versions need to be edited manually
# inittab: initdefault 3
#
cd /etc

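# update hosts database: drop old lines whose address field is exactly
# $IPC or $NFS (fgrep -v would also drop e.g. 192.168.0.10 for node 1)
awk -v ipc="$IPC" -v nfs="$NFS" '$1 != ipc && $1 != nfs' hosts > hosts.new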
echo "$IPC	mpe$1" >> hosts.new
echo "$NFS	node$1" >> hosts.new
mv hosts hosts.bak
mv hosts.new hosts

if [ ! -f 'fstab$$CLIENT$$' ]; then
	echo warning: 'fstab$$CLIENT$$' missing
fi
if [ ! -f 'inittab$$CLIENT$$' ]; then
	echo warning: 'inittab$$CLIENT$$' missing
fi

#
# Enable only what is needed on clients
# The file DoNotExecuteOnClients should not exist.
#
cd /etc/init.d
rm -f DoNotExecuteOnClients

# Disable everything
for i in `ls | fgrep -v '$$CLIENT$$'`; do
	if [ ! -h ${i}'$$CLIENT$$' ]; then
		rm -f ${i}'$$CLIENT$$'
		ln -s DoNotExecuteOnClients ${i}'$$CLIENT$$'
	fi
done

# Enable only what is needed on clients
for i in \
	functions \
	autofs \
	halt \
	single \
	network \
	syslog \
	portmap \
	keytable \
	random \
	sshd \
	openmosix \
	ypbind
do
	rm -f ${i}'$$CLIENT$$'
done

# Prevent init 0, 1 and 6 from shutting down the network
for dir in rc0.d rc1.d rc6.d; do
	cd /etc/$dir || continue
	for file in K*network K*netfs; do
		# an unmatched pattern is left as-is by the shell; skip it
		[ -e "$file" ] || continue
		if [ ! -h "$file"'$$CLIENT$$' ]; then
			rm -f "$file"'$$CLIENT$$'
			ln -s DoNotExecuteOnClients "$file"'$$CLIENT$$'
		fi
	done
done

#
# Node-specific tag files and directories
#
cd /
if [ ! -h dev$TAG ]; then
	rm -rf dev$TAG
	ln -s /export/root/$NFS/dev dev$TAG
fi
if [ ! -h tmp$TAG ]; then
	rm -rf tmp$TAG
	ln -s /export/root/$NFS/tmp tmp$TAG
fi
if [ ! -h root$TAG ]; then
	rm -rf root$TAG
	ln -s /export/root/$NFS/root root$TAG
fi
if [ ! -h var$TAG ]; then
	rm -rf var$TAG
	ln -s /export/root/$NFS/var var$TAG
fi

# Mount table
cd /etc
if [ ! -h 'mtab$$CLIENT$$' ]; then
	rm -f 'mtab$$CLIENT$$'
	ln -s /proc/mounts 'mtab$$CLIENT$$'
fi

# Hostname and gateway
cd /etc/sysconfig
sed -e "s/HOSTNAME=.*/HOSTNAME=mpe$1/" \
    -e "s/GATEWAY=.*/GATEWAY=$IPC/" network > "network$TAG"

#
# Static IP addresses for eth0 and eth1 on clients
# Ignore eth2 on clients (LAN interface on head node)
#
cd /etc/sysconfig/network-scripts
if [ ! -h 'ifcfg-eth2$$CLIENT$$' ]; then
	rm -f 'ifcfg-eth2$$CLIENT$$'
	ln -s NoPresentOnClients 'ifcfg-eth2$$CLIENT$$'
fi

# Writeable directories for each client
mkdir -p /export/root/$NFS
cd /export/root/$NFS || exit 1
if [ ! -d dev ]; then
	mkdir -p dev/pts
	cp -a /dev/MAKEDEV dev
	dev/MAKEDEV -d dev console generic
fi
if [ ! -d root ]; then
	mkdir root
	cp -a /root/.bashrc root
	cp -a /root/.ssh root
fi
# mkdir -p creates any missing parents and is harmless if they exist
mkdir -p tmp var/tmp \
	var/empty/sshd \
	var/lib/dhcp var/lib/rpm \
	var/lock/subsys \
	var/log/news \
	var/run/netreport
-------------- next part --------------
#!/bin/sh
# @(#)rmnode.sh  2004-07-15  A.J.Travis

#
# Remove ClusterNFS files for diskless node
#

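if [ $# -ne 1 ]; then
	echo usage: rmnode n
	exit 1
fi
# the node number ends up in an rm -rf path, so accept digits only
case $1 in
''|*[!0-9]*)
	echo "rmnode: node number must be numeric: $1" >&2
	exit 1
	;;
esac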

NET=192.168
IPC=$NET.0.$1
NFS=$NET.1.$1
TAG='$$'IP=$NFS'$$'

cd /etc

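# update hosts database: drop lines whose address field is exactly
# $IPC or $NFS (fgrep -v would also drop e.g. 192.168.0.10 for node 1)
awk -v ipc="$IPC" -v nfs="$NFS" '$1 != ipc && $1 != nfs' hosts > hosts.new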
mv hosts hosts.bak
mv hosts.new hosts

#
# Node-specific tag files and directories
#
cd /
if [ -h dev$TAG ]; then
	rm -rf dev$TAG
fi
if [ -h tmp$TAG ]; then
	rm -rf tmp$TAG
fi
if [ -h root$TAG ]; then
	rm -rf root$TAG
fi
if [ -h var$TAG ]; then
	rm -rf var$TAG
fi

# Hostname and gateway
cd /etc/sysconfig
if [ -f network$TAG ]; then
	rm -f network$TAG
fi

# Writeable directories for each client
cd /export/root || exit 1
if [ -d "$NFS" ]; then
	rm -rf "$NFS"
fi

