Because XFS is BETTER (Re: opinion on XFS)

Jeff Layton laytonjb at
Fri May 10 05:30:42 PDT 2002

Donald Becker wrote:

> On Thu, 9 May 2002, Eray Ozkural wrote:
> > On Thursday 09 May 2002 04:48, Donald Becker wrote:
> > > On Thu, 9 May 2002, Eray Ozkural wrote:
> > Okay. Now that may indeed be the case, because I never used XFS code
> > prior to the one based on 2.4.x release. It does seem to be very
> > stable at the moment, though, so perhaps you can give it a whirl
> > again.
> > I trust your knowledge of the kernel more than any other person on the
> > list, so maybe you can tell us, in your opinion, which filesystem is
> > truly the best in an I/O intensive environment (parallel database/IR
> > algorithms, etc.)

I think Don's response is right on target. I also agree with his
filesystem "hierarchy" below. We've been using PVFS in a
test mode for over a year and it works very well. We're installing
some additional hard drives to really put it into production (and
modifying our codes to use it fully).

I like to look towards the future a bit, and my crystal ball tells
me that a new technology will really help a true cluster filesystem
become a reality. iSCSI is coming around (finally). This will
potentially allow you to connect disparate SCSI disks within a
cluster. Of course, the filesystem on top of these connected disks
is another story. IBM is really pushing iSCSI, as is a small company
that I really like - Consensys Raidzone. Consensys will have a nice
iSCSI device ready in a few months with full filesystem support. A
new cluster that uses GigE for an iSCSI interconnect could see some
decent speed. If you are running the parallel job communication
across the same interconnect you will lose some performance, but it
could be a price/performance winner.
New motherboards will have multiple interconnects built into the
MB so you can split out iSCSI communications and job communication.
Motherboards like the new P4 boards with the ServerWorks GC-LE
chipset have built-in FastE and GigE (an interesting experiment
would be to test swapping the roles of the GigE and FastE in an
iSCSI cluster with your code). These boards also have PCI-X slots!
Also imagine putting
something like Scali or Myrinet in the PCI-X, using GigE for iSCSI,
and FastE for something else (backup job interconnect or admin
traffic or high bandwidth node monitoring). OK, I know I'm getting
ahead of myself but I think it's amazing to see this kind of thing available
today on commodity systems.
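To see why splitting storage traffic from job traffic matters, here is a
back-of-envelope shell sketch. The bandwidth figures are nominal line
rates and the job-traffic demand is a made-up number, not a measurement:

```shell
# Nominal line rates in Mbit/s (assumptions; real throughput is lower)
gige=1000   # GigE link that would carry iSCSI
faste=100   # FastE link for admin/monitoring traffic
job=400     # hypothetical job-communication demand in Mbit/s

# Shared interconnect: iSCSI competes with job traffic on one GigE link
echo "shared GigE left for iSCSI: $((gige - job)) Mbit/s"
# Split interconnects: iSCSI gets the whole GigE link to itself
echo "split  GigE left for iSCSI: $gige Mbit/s"
```

Crude, but it shows the shape of the tradeoff: on a shared link your
storage bandwidth is whatever the job traffic leaves behind.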

There are other options as well. Bluearc has a very nice and very
fast NAS box with a new "GFS-like" filesystem that can sit on top
of several of their boxes.

This probably won't do you much good right now if you have an
existing cluster with IDE drives (or even SCSI drives). Using
something like PVFS for temporary data storage and then streaming
your data from PVFS to other data storage (NAS, SAN, etc.) seems
like the way to take advantage of existing hardware.
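The staging pattern above is simple to script. Here is a minimal sketch;
the job name and directory layout are hypothetical, and local directories
stand in for the real PVFS and NAS mount points so the sketch is runnable:

```shell
# Sketch: stage job output from fast PVFS scratch to persistent storage.
# Real mount points would be something like /mnt/pvfs and an NFS-mounted
# NAS volume; local directories stand in for them here.
SCRATCH=./pvfs_scratch/job42
ARCHIVE=./nas_archive/job42
mkdir -p "$SCRATCH" "$ARCHIVE"
echo "simulation output" > "$SCRATCH/out.dat"

# Copy off the scratch area, then free the fast storage for the next job.
cp -R "$SCRATCH"/. "$ARCHIVE"/ && rm -rf "$SCRATCH"
ls "$ARCHIVE"
```

The point of the `&&` is that the scratch copy is only deleted if the
copy to persistent storage actually succeeded.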

If you want speed, I've toyed with the idea of putting solid-state
storage devices in each node of a cluster and then using PVFS on
top of them! I've seen NVRAM boards with up to 2 Gigs of memory
available for 64-bit PCI slots. String these together across a few
hundred nodes connected with GigE and we're talking hyperspace!
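The aggregate numbers behind that idea are easy to sketch; the node
count and per-node capacity below are hypothetical:

```shell
# Hypothetical cluster: one 2 GB NVRAM board per node, 256 nodes
nodes=256
gb_per_node=2
total=$((nodes * gb_per_node))
echo "aggregate solid-state capacity: ${total} GB"
```

Half a terabyte of solid-state storage with no seek latency, striped
by PVFS across the cluster - that is the "hyperspace" part.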

I'm sorry if I've gotten ahead of myself, but this is too much fun!

Good Luck!


> Oh, I'm not the right person to ask about filesystems.  I used to know
> about them, but that was long ago.  There are two things that I do know:
>     Last year's conventional wisdom is now completely wrong, and
>     We don't yet have a good general-purpose cluster file system.
> The critical issues of the '80s, such as using the disk geometry
> information, optimizing the packing of small files, and bitmaps vs. free
> lists are all irrelevant today.  NFS is slow because it was designed to
> work around file servers that crashed several times a day.  NFS killed
> the faster RFS protocol just as Sun file servers changed into
> exceptionally reliable machines.
> In the mid-90s there were people loudly supporting UFS and its
> synchronous metadata updates, even while their filesystems were silently
> losing data during crashes.  But the directory structure was consistent.
> For the Scyld cluster system we decided to be completely filesystem
> agnostic.  That's not because we consider the filesystem unimportant.
> It's because we consider it vitally important, both for performance and
> scalability.
> The problem is that there is no single file system that can give us the
> single-system consistency combined with broad high performance on low
> cost hardware.  We decided that the filesystems would have to be matched
> to the hardware configuration and application's needs. The systems
> integrator and administrator can configure the system, without changing
> the architecture or user interaction.  We make this easy with
> single-point driver installation (kernel drivers exist only on the
> master) and single point, one-time administration (/etc/beowulf/fstab
> supports macros, the "cluster" netgroups and per-node exceptions).
> While our base slave node architecture is diskless, we recommend
>    A local disk for swap space and a node-local filesystem.
>    NFS mounting master:/home
>    PVFS for large temporary intermediate files.
>    Sistina GFS with a fibre channel SAN for consistent databases.
> The reasoning behind this is that
>     Local disk is the least expensive bandwidth you can get.
>        But it has version skew problem if you use it for persistent data.
>     NFS (especially v2) is great for _reading_ small (<8KB) configuration
>        files.  But avoid it for writing, executables and large files.
>     PVFS is the world's fastest cluster file system, but only works well
>       for carefully laid out large files.
>     GFS is great for transaction consistency and general purpose
>       large-site storage, but there is an (inevitable?) $ and
>       scalability/performance cost for its semantics.
> Bottom line: until we have the do-it-all cluster filesystem, we have to
> provide a reasonable default and interchangeable tools to do the job.
> --
> Donald Becker                           becker at
> Scyld Computing Corporation   
> 410 Severn Ave. Suite 210               Second Generation Beowulf Clusters
> Annapolis MD 21403                      410-990-9993
> _______________________________________________
> Beowulf mailing list, Beowulf at
> To change your subscription (digest mode or unsubscribe) visit
