[Beowulf] distributed file storage solution?

Tue Dec 12 09:01:52 PST 2006

On Mon, Dec 11, 2006 at 05:53:58PM -0800, Bill Broadley wrote:
> Lustre:
> * client server
> * scales extremely well, seems popular on the largest of clusters.
> * Can survive hardware failures assuming more than 1 block server is
>   connected to each set of disks
> * unix only.
> * relatively complex.
> 
> PVFS2:
> * Client server
> * scales well
> * can not survive a block server death.
> * unix only
> * relatively simple.
> * designed for use within a cluster.

Hi Bill

As a member of the PVFS project I just wanted to comment on your
description of our file system.  I would say that PVFS is every bit as
fault tolerant as Lustre.  The redundancy models of the two file
systems are pretty similar: both file systems rely on shared storage
and high availability software to continue operating in the face of
disk failure.  What Lustre has done a much better job of than we have
is documenting the HA process.  This is one of our (PVFS) areas of
focus in the near term.

We may not have documented the process in enough detail, but one can
definitely set up PVFS servers with links to shared storage and make
use of things like IP takeover to deliver resiliency in the face of
disk failure.  PVFS has had this ability for several years now (PVFS
users can check out 'pvfs2-ha.pdf' in our source for a starting
point).

> So the end result (from my skewed perspective) is:
> * Lustre and PVFS2 are popular in clusters for sharing files in larger
>   clusters where more than single file server worth of bandwidth is
>   required.  Both I believe scale well with bandwidth but only allow
>   for a single metadata server so will ultimately scale only as far
>   as single machine for metadata intensive workloads (such as lock
>   intensive, directory intensive, or file creation/deletion
>   intensive workloads).  Granted this also allows for exotic
>   hardware solutions (like solid state storage) if you really need
>   the performance.

PVFS v2 has offered multiple metadata servers for some time now.  Our
metadata operations scale well with the number of metadata servers.
You are absolutely correct that PVFS metadata performance is dependent
on hardware, but you need not get so exotic as solid state to see high
metadata rates.  The OSC PVFS deployment has servers with RAID and
fast disks, and can deliver quite high metadata rates.

Another point I'd like to make about PVFS is how well-suited it is for
MPI-IO applications.  The ROMIO MPI-IO implementation (the basis for
many MPI-IO implementations) contains a highly efficient PVFS driver.
This driver speaks directly to PVFS servers, bypassing the kernel.  It
also contains optimizations for collective metadata operations and
noncontiguous I/O.  Applications making use of MPI-IO, or of
higher-level libraries built on top of MPI-IO such as parallel-netcdf
or (when configured correctly) HDF5, are likely to see quite good
performance when running on PVFS.
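
To give a feel for why the noncontiguous and collective optimizations
matter, here is a rough sketch in plain Python (no MPI, no PVFS).  It
assumes a hypothetical layout where fixed-size blocks of a shared file
are dealt round-robin to processes; the function names and the block
size are made up for illustration, and this is not how ROMIO is
implemented.  Each process alone touches scattered pieces of the file,
but the requests of all processes taken together cover one contiguous
range, which is the kind of pattern a collective call can exploit.

```python
def strided_extents(rank, nprocs, block_bytes, nblocks):
    """(offset, length) pairs for the blocks owned by `rank` when
    `nblocks` blocks are dealt round-robin to `nprocs` processes.
    Hypothetical layout, chosen only for illustration."""
    return [(b * block_bytes, block_bytes)
            for b in range(rank, nblocks, nprocs)]


def coalesce(extents):
    """Merge adjacent or overlapping (offset, length) extents."""
    merged = []
    for off, length in sorted(extents):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # Touches or overlaps the previous extent: grow it.
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged


if __name__ == "__main__":
    nprocs, block_bytes, nblocks = 4, 1024, 8

    # One process on its own: several small, separated requests.
    print(coalesce(strided_extents(1, nprocs, block_bytes, nblocks)))

    # All processes together: the pieces join into a single range.
    everyone = [e for r in range(nprocs)
                for e in strided_extents(r, nprocs, block_bytes, nblocks)]
    print(coalesce(everyone))
```

In a real MPI-IO program the per-process pattern would be described
with a file view and written with a collective call, and the library
would decide how to combine the requests.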

> Hopefully others will expand and correct the above.

Happy to do so!  

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B