[Beowulf] large scratch space on cluster

Craig Tierney Craig.Tierney at noaa.gov
Tue Sep 29 11:37:50 PDT 2009


Jörg Saßmannshausen wrote:
> Dear all,
> 
> I was wondering if somebody could help me here a bit.
> Some of the calculations we are running on our cluster need a significant 
> amount of disc space. The last calculation crashed because the ~700 GB I had 
> made available were not enough, so I want to set up a RAID0 with two 1.5 TB 
> discs on one 8-core node. So far, so good.
> 
> However, I was wondering whether it makes any sense to somehow 'export' that 
> scratch space to the other nodes (which have 4 cores only). The idea is that 
> whenever I need a vast amount of scratch space, I could use the space on the 
> 8-core node mentioned above. I could do that with NFS, but I have the feeling 
> it would be too slow. Also, I only have Gigabit Ethernet at hand, so I cannot 
> use a faster interconnect here. Is there a good way of doing this? Terms like 
> iSCSI and cluster filesystems come to mind, but to be honest, I have never 
> really worked with them.
> 
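
(For reference, the single-node setup you describe would look roughly like the
sketch below; the device names /dev/sdb and /dev/sdc, the /scratch mount point
and the 192.168.1.0/24 subnet are only placeholders for your actual hardware
and network.)

    # Stripe the two 1.5 TB discs into one ~3 TB scratch device (RAID0, no redundancy)
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext3 /dev/md0
    mount /dev/md0 /scratch

    # Export it to the other nodes over NFS (add to /etc/exports, then re-export)
    echo '/scratch 192.168.1.0/24(rw,async,no_root_squash)' >> /etc/exports
    exportfs -ra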

You could do something crazy like dynamically creating a distributed filesystem
with GlusterFS (or another open-source filesystem) out of the local storage of
each node the job is using.  That way the filesystem is dedicated to your job,
shared across the job's nodes, and does not impact other jobs.  Each node needs
a disk, but that isn't too expensive.  You can also skip the RAID part (unless
it is there for performance), because if a disk dies it only affects that one
node.
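
A rough sketch of how such a per-job volume could be stitched together, assuming
a local brick directory like /local/brick on every node and the gluster command
line of the newer GlusterFS releases (older versions are configured through
volume files instead); the node names, the volume name and $JOBID are
placeholders, and in practice you would drive this from the scheduler's
prologue and epilogue scripts:

    # From the job's first node: pool the bricks of the other nodes in the job
    gluster peer probe node02
    gluster peer probe node03
    gluster peer probe node04

    # One distributed volume spanning the job's nodes, torn down when the job ends
    gluster volume create scratch_$JOBID transport tcp \
        node01:/local/brick node02:/local/brick \
        node03:/local/brick node04:/local/brick
    gluster volume start scratch_$JOBID

    # Mount it on every node in the job
    mount -t glusterfs node01:/scratch_$JOBID /scratch

    # In the epilogue: unmount on every node, then remove the volume
    gluster volume stop scratch_$JOBID
    gluster volume delete scratch_$JOBID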

We tried this for a while.  It worked OK (with GlusterFS), but then we got
a good Lustre setup and the performance of the dynamic version no longer
justified the effort and maintenance.  However, on a smaller system where I
don't have that many resources, I might try this again.

Craig



> Any ideas?
> 
> All the best
> 
> Jörg 
> 


-- 
Craig Tierney (craig.tierney at noaa.gov)
