[Beowulf] cluster storage design
apseyed at bu.edu
Wed Mar 23 15:11:41 PST 2005
I concur with David; when necessary, running jobs against local compute node disk takes an immense load off of a storage node / NFS file server. Here is some brief documentation and a template on our website for using this method (/scratch can be /tmp):
http://www.bu.edu/dbin/sph/departments/biostatistics/linga_documentation.php#scratch
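The pattern boils down to something like the sketch below (just an illustrative job script; the program name, input file, and paths are placeholders, not taken from our documentation):

    #!/bin/sh
    # Stage input from the NFS-mounted home directory onto fast local
    # disk, run the job against the local copy, then copy results back
    # to NFS once at the end.
    SCRATCH=/scratch/$USER/job.$$          # /scratch can be /tmp
    mkdir -p "$SCRATCH"
    cp /home/$USER/input.dat "$SCRATCH/"
    cd "$SCRATCH"
    /home/$USER/bin/my_program input.dat > output.dat   # all job I/O hits local disk
    cp output.dat /home/$USER/results/
    cd /
    rm -rf "$SCRATCH"                      # free the local disk when done
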
Cheers,
Patrice
>
> Joe Landman <landman at scalableinformatics.com> wrote:
> >
> > Brian Henerey wrote:
> >
> > > Hello all,
> > >
> > > I have a 32 node cluster with 1 master and 1 data storage server with
> > > 1.5 TBs of storage. The master used to have storage:/home mounted on
> > > /home via NFS. I moved the 1.5TB RAID array of storage so it was
> > > directly on the master. This decreased the time it took for our
> > > program to run by a factor of 4. I read somewhere that mounting the
> > > data to the master via NFS was a bad idea for performance, but am not
> > > sure what the best alternative is. I don't want to have to move data
> > > on/off the master each time I run a job because this will slow it down
> > > as more people are using it.
> > >
> >
> > If your problems are I/O bound, and you have enough local storage on
> > each compute node, and you can move the data in a reasonable amount of
> > time, the local I/O will likely be the fastest solution. You have
> > already discovered this when you moved to a local attached RAID. If you
> > have multiple parallel reads/writes to the data from each compute node,
> > you will want some sort of distributed system. If the master thread is
> > the only one doing IO, then you want the fast storage where it is.
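> >
> > A quick, rough way to tell whether the jobs really are I/O bound is to
> > watch a compute node while one runs (standard Linux tools; the sampling
> > interval is arbitrary):
> >
> >     vmstat 5        # high "wa" (I/O wait) with an idle CPU suggests I/O bound
> >     iostat -x 5     # per-disk utilization and throughput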
>
> Also keep in mind that if the data used on the nodes fits
> into memory _and_ you tend to run the same software over and
> over, then typically that data will only need to be read off disk once
> on each node and will subsequently be accessed from the file system
> cache. That mode of data access is many times faster
> than physically reading from a disk. So don't toss out the idea
> of local data storage if the cluster happens to have slowish disks
> on the compute nodes. The cache also works for data read over NFS, but
> it may take a very, very long time for all nodes to read it at once.
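>
> You can see the cache effect directly on a node by reading the same file
> twice (the path here is only an example):
>
>     time cat /data/bigfile > /dev/null   # first read: comes off the disk
>     time cat /data/bigfile > /dev/null   # second read: served from the cache,
>                                          # usually many times faster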
>
> Depending on your cluster topology, interconnect, and budget
> you might also consider multiple file servers. That will
> speed things up at the cost of a bit more hardware and more
> complexity (which node mounts which file server). Also, for that to
> work well, data access should be mostly reads, since writes to a common
> file need to go to M file servers instead of just one.
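>
> As a sketch, with the read-mostly data replicated onto two servers, half
> the nodes could mount one and half the other; the hostnames and mount
> options below are only illustrative:
>
>     # /etc/fstab on nodes 1-16
>     fs1:/export/data  /data  nfs  hard,intr,rsize=8192,wsize=8192  0 0
>     # /etc/fstab on nodes 17-32
>     fs2:/export/data  /data  nfs  hard,intr,rsize=8192,wsize=8192  0 0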
>
> Finally, and this effect can be surprisingly large - be careful about
> writes of results back to a single file server. When N nodes naively
> direct stdout back to a single NFS server, the line-by-line writes can
> drive that server into the ground. Conversely, if the nodes
> write to /tmp and then, when done, copy that file to the NFS server
> in one fell swoop, it may work better, especially if the processes
> finish asynchronously. If they all finish at the same time, think
> twice before having them all do:
>
> cp /tmp/${HOSTNAME}_output.txt /nfsmntpoint/accum_dir/
>
> simultaneously.
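>
> One crude way to avoid that collision is to stagger the copies with a
> random delay (bash-specific $RANDOM and $HOSTNAME; the 60 second spread
> is arbitrary):
>
>     ./my_program > /tmp/${HOSTNAME}_output.txt
>     sleep $((RANDOM % 60))
>     cp /tmp/${HOSTNAME}_output.txt /nfsmntpoint/accum_dir/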
>
>
> > NFS provides effectively a single point of data flow, and hence is a
> > limiting factor (generally).
>
> Also, double-check that NFS is using hard mounts. Otherwise you may
> fall prey to the dreaded "big block of nulls" problem.
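>
> On a Linux client you can check the options on the existing NFS mounts
> with something like:
>
>     grep nfs /proc/mounts    # if "soft" appears, remount with "hard"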
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>