[Beowulf] cluster storage design

Alvin Oga alvin at ns.Linux-Consulting.com
Wed Mar 23 16:36:22 PST 2005


On Wed, Mar 23, 2005 at 09:41:46AM -0600, Brian Henerey wrote:
> 
> I have a 32 node cluster with 1 master and 1 data storage server with 1.5
> TB's of storage.  The master used to have storage:/home mounted on /home via
> NFS. I moved the 1.5TB RAID array of storage so it was directly on the
> master. This decreased the time it took for our program to run by a factor
> of 4.

yes .. that is a good thing

> I read somewhere that mounting the data to the master via NFS was a
> bad idea for performance, but am not sure what the best alternative is. I
> don't want to have to move data on/off the master each time I run a job
> because this will slow it down as more people are using it. 
 
for users, you have two choices:
	/home	on one big "home server"

	or  automagically sync the users' login IDs and passwords from node to node
		( a little more work .. but not as bad as it sounds )

	if the "home server" dies ... everybody is dead

	if each node is standalone .. there are no issues with the "master" dying
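
	a minimal sketch of the second option ( the node names and the
	passwordless root ssh from the master are assumptions, adjust for
	your own setup ) -- push the account files out with rsync:

		# push the account databases from the master to each node
		# ( node names are made up; assumes root ssh works )
		for n in node01 node02 node03; do
		    rsync -a /etc/passwd /etc/shadow /etc/group ${n}:/etc/
		done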

for running jobs ....
	an automated queue is good ... users don't necessarily dictate
	which nodes the jobs run on, but a good queuer will let
	users specify preferences
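
	for example, with a batch queuer like Torque/OpenPBS the user just
	asks for resources and the scheduler picks the nodes ( the script
	and node names below are made up ):

		# ask for 4 nodes, 2 cpus each -- the scheduler picks which
		qsub -l nodes=4:ppn=2 run_job.sh

		# or pin the job to specific nodes if you really want to
		qsub -l nodes=node05+node06 run_job.sh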

for "/data"  where all nodes share a common big 100TB data  farm ..
	- you have NFS or a SAN or something else

	- getting good NICs and good switches helps a lot

	- change your NFS rsize/wsize parameters to send 16K or 32K bytes
	at a time instead of the small defaults
	( mount example after this list )

	- dual or quad channel bonding should help with throughput too
	( bonding sketch after this list )

	- a TB sized "/data" shouldn't be noticeably slow across the nodes

	- /data should be on the machine where the apps use it the most

	- since /data is probably shared across multiple nodes, it might
	be worth it ( definitely worth it ) to buy another 4 or 8 disks
	and use them as backups of /data on other nodes
		- you now have 3 "master nodes" with local /data

		- you will have to rsync and rdiff your changes from
		node to node ( sample rsync after this list )

		- 1 TB of disks is about $600 nowadays ( 4 x $150 each )

	- structuring your /data into /data/xxx and /data/yyy and /data/zzz
	will allow each node to keep the data it works on locally, so the
	disk i/o happens on local disks instead of going across the
	slow ethernet
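
	a sketch of the NFS tuning mentioned above -- on each client node,
	mount /data with bigger read/write sizes ( the server name
	"storage" is the one from your setup, adjust to taste ):

		# /etc/fstab entry on the compute nodes -- 32K rsize/wsize
		storage:/data  /data  nfs  rsize=32768,wsize=32768,hard,intr  0 0

		# or by hand, for testing
		mount -o rsize=32768,wsize=32768 storage:/data /data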
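
	and a rough sketch of bonding two NICs together ( module options
	and addresses vary with kernel and setup, this is just the
	general idea ):

		# load the bonding driver and enslave eth0+eth1 into bond0
		modprobe bonding mode=balance-rr miimon=100
		ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
		ifenslave bond0 eth0 eth1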
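
	and for keeping the extra copies of /data in sync, a nightly rsync
	from the node that owns the data to the backup nodes is enough
	( the hostname is made up ):

		# mirror /data to a second box, --delete keeps the copy identical
		rsync -a --delete /data/ node02:/data/

		# or from cron, every night at 2am ( crontab entry )
		0 2 * * * rsync -a --delete /data/ node02:/data/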


> I know there are probably many solutions but I'm curious what the people on
> this list do. It seems to me that SAN's are very expensive compared to just
> building servers with 4 x 500GB hard drives. I've considered just launching
> my lam-mpi jobs from whatever storage server has the appropriate data on it,
> but this doesn't seem ideal. 

for me ... lots of redundant IDE disks are way way better/faster than a SAN/NAS

> How does performance compare from having the data local on the master via
> running it off a PVFS? 

c ya
alvin



