PBFS: pros & contras
laytonjb at bellsouth.net
Mon Jun 17 18:13:18 PDT 2002
Ivan Oleynik wrote:
> We are trying to understand the importance of PBFS for the cluster
> environment in order to finalize our cluster configuration.
I guess I'll take a stab at this one. I'm assuming you mean PVFS,
Parallel Virtual FileSystem.
> People are saying that PBFS tremendously boosts parallel IO, especially for
> MPI type applications. At the same time, we have heard about several
> technical problems with PBFS related to its stability. In particular,
> beowulf engineers tell us that we have to have 2 IDE hard drives per node
> for the system to survive in case of failure of one of the 2 IDE HDs.
First, PVFS is intended as a high speed scratch filesystem where you
read and write data while your code is running. Then, typically, once the
code is done, you stream the data off of PVFS to some stable filesystem.
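To make that last step concrete, here is a minimal sketch of streaming results off scratch to a stable filesystem once the run is done. It uses only the Python standard library and nothing specific to PVFS; the function names and the scratch/stable directory arguments are hypothetical, and the checksum comparison is my own addition, not something PVFS requires.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_out(scratch_dir, stable_dir):
    """Copy every regular file under scratch_dir into stable_dir.

    The directory layout is preserved and each copy is verified by
    checksum. Both arguments are hypothetical mount points, e.g. a
    scratch mount and an NFS home directory.
    """
    scratch_dir = Path(scratch_dir)
    stable_dir = Path(stable_dir)
    copied = []
    for src in sorted(p for p in scratch_dir.rglob("*") if p.is_file()):
        dst = stable_dir / src.relative_to(scratch_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies data plus timestamps
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"checksum mismatch for {src}")
        copied.append(dst)
    return copied
```

You would call something like stage_out() at the end of the job script, and only clean out the scratch area after it returns without raising.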
During PVFS file IO, if a node goes down (for whatever reason),
the data may or may not be lost. If the hard drive on a node that is
part of PVFS dies, well, you've lost your data (hence the people
telling you that you need to run RAID-1, i.e. mirroring, on the nodes).
You can continue to use PVFS, but data won't be written to the node that
is down. However, if something like the NIC or power supply died,
then when the node comes back up, you will be able to access your
data again (there are a few gotchas in there, but generally this is true).
Just remember that the intent of PVFS is a high-speed scratch filesystem
when you run your codes. It is not intended to be a general purpose
filesystem (you can't really run binaries out of the filesystem for example).
You have to decide for yourself if you need to add some things to your
cluster to improve your odds of recovering PVFS data. You can run
RAID-1 (mirroring) on the nodes, add failover redundant power supplies in the node,
break up PVFS into several smaller PVFS filesystems across subsets
of the nodes, etc. It's up to you to decide. However, let me say that in
2 years of 24/7 operations of our 64 node cluster, we have lost 2 HDs
and 1 NIC. The two HDs were lost in the first couple of months, the NIC
died a little later. Not bad in my opinion for commodity hardware.
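If you want to turn those numbers into rough odds, here is a back-of-the-envelope sketch. It assumes one drive per node and a constant failure rate, neither of which I stated above; since both of our HDs died in the first couple of months, a constant rate is a crude model. The function names are hypothetical.

```python
import math


def failure_rate_per_unit_year(failures, units, years):
    """Observed failures per unit per year."""
    return failures / (units * years)


def prob_any_failure(rate_per_unit_year, units, run_hours):
    """Chance that at least one of `units` fails during a run.

    Assumes independent failures at a constant rate (exponential model).
    """
    expected_failures = rate_per_unit_year * units * run_hours / 8760.0
    return 1.0 - math.exp(-expected_failures)


# 2 HDs lost in 2 years on 64 nodes, assuming one drive per node.
hd_rate = failure_rate_per_unit_year(failures=2, units=64, years=2)

# Odds that some drive in the cluster dies during a 24-hour run.
p_day = prob_any_failure(hd_rate, units=64, run_hours=24)

print(f"HD failures per drive-year: {hd_rate:.4f}")
print(f"P(any HD failure in a 24 h run): {p_day:.2%}")
```

Under those assumptions a 24-hour run across all 64 nodes has well under a 1% chance of losing a drive, which is the kind of number to weigh against the cost of a second disk in every node.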
> Although we are mostly involved in scientific computing and not going to
> do any Gigabytes data mining, we are trying to have as universal system as
> possible in order to address properly the new type of problems that can
> appear on our horizon in the future.
> I would appreciate some input about the importance of PBFS for beowulf
> type systems, especially for the MPI type applications. How difficult
> is it to keep it up and running? Is there an easy way to turn it off in
> case we don't want it?
PVFS is very easy to set up and use. It is built on top of an existing
filesystem, so you can put the data and metadata directories wherever
you want (there are a couple of things you need to pay attention
to, such as choosing partitions that are pretty close to the same size
with about the same amount of free space). You can read the docs at
the URL above to see how easy it is.
Deciding if PVFS is important for your applications is up to you. Do you
do lots of IO during the runs? How big are the files? Do you checkpoint
your application? If so, how often? How many nodes do you usually
run across? Is IO a fairly large percentage of your total run time?
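One way to answer that last question is a quick estimate like the sketch below. All the numbers in it are hypothetical placeholders (checkpoint size, checkpoint count, bandwidths, compute time), not measurements from our cluster or from PVFS; plug in your own.

```python
def io_fraction(bytes_per_checkpoint, n_checkpoints,
                bandwidth_bytes_per_s, compute_seconds):
    """Fraction of total run time spent writing checkpoints.

    Assumes IO and compute do not overlap.
    """
    io_seconds = bytes_per_checkpoint * n_checkpoints / bandwidth_bytes_per_s
    return io_seconds / (io_seconds + compute_seconds)


# Hypothetical run: ten 2 GB checkpoints and one hour of pure compute.
checkpoint_bytes = 2e9
n_checkpoints = 10
compute_seconds = 3600.0

# Two hypothetical aggregate write bandwidths, in bytes per second.
slow = io_fraction(checkpoint_bytes, n_checkpoints, 10e6, compute_seconds)
fast = io_fraction(checkpoint_bytes, n_checkpoints, 100e6, compute_seconds)

print(f"IO share at 10 MB/s:  {slow:.1%}")
print(f"IO share at 100 MB/s: {fast:.1%}")
```

If the IO share comes out small even at the slow bandwidth, a parallel filesystem probably won't buy you much; if it is a third of the run, it is worth a serious look.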
MPI-IO built on top of PVFS can produce some remarkable IO rates
(rivaling some of the ASCI systems). Here are some links to some PVFS
I hope this helps. If not, please don't hesitate to post again or follow-up
> Ivan I. Oleynik E-mail : oleynik at chuma.cas.usf.edu
> Department of Physics
> University of South Florida
> 4202 East Fowler Avenue Tel : (813) 974-8186
> Tampa, Florida 33620-5700 Fax : (813) 974-5813