[Beowulf] Putting /home on Lustre or GPFS

Wed Dec 24 07:54:12 PST 2014

Everyone,

Thanks for the feedback you've provided to my query below. I'm glad I'm 
not the only one who thought of this, and a lot of you raised very good 
points I hadn't thought about. While I've been following parallel 
filesystems for years, I have very little experience actually managing 
them up to this point. (My BG/P came with a GPFS filesystem for /scratch, 
but everything was already set up before I got here, so I've only had to 
deal with it when something breaks.)

You've all convinced me that this may not be an ideal arrangement, 
but if I go this route, GPFS might be a better fit for this 
than Lustre (mainly because Chris Samuels has proven it *is* possible 
with GPFS, and GPFS has snapshotting).

Joe Landman, as always, has provided a wealth of information, and the 
rest of you have pointed out other potential pitfalls with this approach.

Thanks again for the feedback, and please keep the conversation going.

Prentice

On 12/23/2014 12:12 PM, Prentice Bisbal wrote:
> Beowulfers,
>
> I have limited experience managing parallel filesystems like GPFS or 
> Lustre. I was discussing putting /home and /usr/local for my cluster 
> on a GPFS or Lustre filesystem, in addition to using it just for 
> /scratch. I've never done this before, but it doesn't seem like all 
> that bad an idea. My logic for this is the following:
>
> 1. Users often try to run programs from /home, which leads to 
> errors, no matter how many times I tell them not to do that. This 
> would make the system more user-friendly. I could use quotas/policies 
> to 'steer' them to other filesystems if needed.
>
> 2. Having one storage system to manage is much better than 3.
>
> 3. Profit?
>
> Anyway, another person in the conversation felt that this would be 
> bad, because if someone was running a job that would hammer the 
> filesystem, it would make the filesystem unresponsive, and keep other 
> people from logging in and doing work. I'm not buying this concern for 
> the following reasons:
>
> If a job can hammer your parallel filesystem so that the login nodes 
> become unresponsive, you've got bigger problems, because that means 
> other jobs can't run on the cluster, and the job hitting the 
> filesystem hard has probably slowed down to a crawl, too.
>
> I know there are some concerns with the stability of parallel 
> filesystems, so if someone wants to comment on the dangers of that, 
> too, I'm all ears. I think that the relative instability of parallel 
> filesystems compared to NFS would be the biggest concern, not 
> performance.
>