[Beowulf] cluster storage design

Wed Mar 23 14:02:51 PST 2005

Joe Landman <landman at scalableinformatics.com> wrote:
> 
> Brian Henerey wrote:
> 
> > Hello all,
> >
> > I have a 32 node cluster with 1 master and 1 data storage server with 
> > 1.5 TB’s of storage. The master used to have storage:/home mounted on 
> > /home via NFS. I moved the 1.5TB RAID array of storage so it was 
> > directly on the master. This decreased the time it took for our 
> > program to run by a factor of 4. I read somewhere that mounting the 
> > data to the master via NFS was a bad idea for performance, but am not 
> > sure what the best alternative is. I don’t want to have to move data 
> > on/off the master each time I run a job because this will slow it down 
> > as more people are using it.
> >
> 
> If your problems are I/O bound, and you have enough local storage on 
> each compute node, and you can move the data in a reasonable amount of 
> time, the local I/O will likely be the fastest solution. You have 
> already discovered this when you moved to a local attached RAID. If you 
> have multiple parallel reads/writes to the data from each compute node, 
> you will want some sort of distributed system. If the master thread is 
> the only one doing IO, then you want the fast storage where it is. 

Also keep in mind that if the data used on the nodes fits
into memory _and_ you tend to run the same software over and
over, then typically that data will only need to be read off disk once
on each node and will subsequently be accessed from the file system
cache.  That mode of data access is many times faster
than physically reading from a disk.  So don't toss out the idea
of local data storage if the cluster happens to have slowish disks
on the compute nodes.  Data read over NFS is cached the same way, but
the first read may take a very, very long time if all nodes pull it
from the server at once.
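
A rough way to see the cache effect on one node is to time the same
read twice.  This is only a sketch: read_ms is a made-up helper name,
it assumes bash and GNU date (for %N), and the first pass is only
"cold" if the file has not been read or written recently.

```shell
#!/bin/bash
# read_ms FILE
# Read FILE once, discard the data, and print the elapsed wall-clock
# time in milliseconds.  Assumes GNU date, which supports %N (nanoseconds).
read_ms() {
    local start end
    start=$(date +%s%N)
    cat "$1" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}
```

Usage, with /data/db.fasta standing in for whatever input the jobs
really read:

  echo "first pass:  $(read_ms /data/db.fasta) ms"
  echo "second pass: $(read_ms /data/db.fasta) ms"

If the file fits in memory, the second number should be much smaller
than the first.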

Depending on your cluster topology, interconnect, and budget
you might also consider multiple file servers.  That will
speed things up at the cost of a bit more hardware and more
complexity (which node mounts which file server).  Also, for that to
work well data should be mostly reads, since writes to a common
file need to go to M file servers instead of just one.
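
One simple way to settle "which node mounts which file server" is to
spread nodes round robin by node number.  The sketch below assumes
bash; pick_server and the server names in the usage line are
hypothetical, not something from an existing tool.

```shell
#!/bin/bash
# pick_server NODE_INDEX SERVER...
# Print the server assigned to node NODE_INDEX, round robin over the
# list of servers.  NODE_INDEX is a non-negative decimal integer;
# leading zeros (e.g. 08) are accepted.
pick_server() {
    local idx=$1
    shift
    local n=$#
    [ "$n" -gt 0 ] || return 1
    # 10# forces base 10 so that "08" is not parsed as invalid octal.
    shift $(( 10#$idx % n ))
    echo "$1"
}
```

Usage on node 5 with two servers (fs1 and fs2 are placeholders):

  mount -t nfs "$(pick_server 5 fs1 fs2):/export/data" /data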

Finally (and this effect can be surprisingly large), be careful about
writes of results back to a single file server.  When N nodes naively
direct stdout back to a single NFS server the line by line writes can
drive that server into the ground.  Conversely, if the nodes
write to /tmp and then when done copy that file to the NFS server
in one fell swoop it may work better, especially if the processes
finish asynchronously. If they all finish at the same time think
twice before having them all do:

  cp /tmp/${HOSTNAME}_output.txt /nfsmntpoint/accum_dir/

simultaneously.
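
If the jobs do tend to finish together, one option is to have each
node wait a random interval before copying.  A minimal sketch,
assuming bash (for $RANDOM); stage_out is a made-up name and the
default of 60 seconds is arbitrary:

```shell
#!/bin/bash
# stage_out SRC DEST_DIR [MAX_DELAY]
# Copy SRC into DEST_DIR after sleeping a random 0..MAX_DELAY-1 seconds,
# so nodes that finish at the same moment do not all hit the file
# server at once.  A MAX_DELAY of 0 copies immediately.
stage_out() {
    local src=$1 dest=$2 max_delay=${3:-60}
    if [ "$max_delay" -gt 0 ]; then
        sleep $(( RANDOM % max_delay ))
    fi
    cp "$src" "$dest/"
}
```

Usage in a job script (my_program is a placeholder):

  out=/tmp/${HOSTNAME}_output.txt
  ./my_program > "$out"     # line by line writes stay on local disk
  stage_out "$out" /nfsmntpoint/accum_dir 120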

> NFS  provides effectively a single point of data flow, and hence is a 
> limiting factor (generally).

Also double check that NFS is using hard mounts.  Else you may
fall prey to the dreaded "big block of nulls" problem.
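
A quick check on a Linux node is to look for "soft" among the mount
options in /proc/mounts.  This is a sketch only; list_soft_nfs is a
made-up helper, and it assumes the usual six-field mounts format.

```shell
#!/bin/bash
# list_soft_nfs [MOUNTS_FILE]
# Print the mount point of every NFS mount that has the "soft" option.
# MOUNTS_FILE defaults to /proc/mounts (Linux).
list_soft_nfs() {
    local mounts=${1:-/proc/mounts}
    # Fields: device mountpoint fstype options dump pass.
    # Wrapping the option list in commas lets ",soft," match the whole
    # option only, not a longer option that merely contains "soft".
    awk '($3 == "nfs" || $3 == "nfs4") && ("," $4 ",") ~ /,soft,/ { print $2 }' "$mounts"
}
```

Running list_soft_nfs with no arguments on each node should print
nothing if every NFS mount is hard.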

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech