opinion on XFS

Wed May 8 13:03:46 PDT 2002

On Wed, 8 May 2002, Todd Merritt wrote:

> "Roger L. Smith" wrote:
> > 
> > I'm pretty confident of the fact that it was ext2.  We were running RedHat
> > 7.0 (or 7.1, I don't remember which) at the time.  We started having very
> > serious file corruption issues with the data partition on the head node
> > when the cluster was running under heavy loads  (we had 324 processors at
> > the time, running MPI jobs under nearly 100% utilization).
> > 
> We were running our news server on a 2.0.x kernel with ext2 for years
> without any real trouble.  We upgraded it to 7.2, but had to run ext2
> because our backup solution did not yet support ext3, and we
> had crashes every 1-2 days.  Perhaps there are some issues with using ext2 on
> a 2.4.x kernel ?

Or there may have been issues with RH 7.0, or (as I suggested) the
particular hardware combination of the server.  We routinely beat the
crap out of e.g. RAIDs (md on top of ide or scsi, all ext2 filesystems
but spread out across multiple disks), single disk systems, file and
mail and web servers, workstations, database servers, and have had no
"fundamental" difficulty with ext2 from 1.0.x kernels on, although there
have been kernel releases and snapshots with problems nearly anywhere,
so why not ext2?  We had zero problems with 6.1 and 6.2; 7.0 kind of
sucked; 7.1 was good again (but early 2.4 kernels had a few issues); 7.2
has been quite good; and I have high hopes for 7.3, as the latest
kernel supports the UDMA 133 Promise IDE controllers out of the box.

We're currently testing a RAID built with them.  We test the RAIDs by
running system exercisers on them that create, write, and read files on
systems at load averages up into the tens and higher for days and
longer.  Usually if we have trouble, it turns out to be things like:

  a) Loose cable (IDE or SCSI).  A loose cable is one of the most
insidious of all evils, as it can "work" fine except when it finally
doesn't and corrupts the hell out of things.
  b) Too long or incorrectly installed cables.  Our local vendor has a
whole shelf box full of 36" UDMA-100 IDE cables.  This is wonderfully
humorous, given that UDMA cables are supposed to be 18" or less.  Their
18" cables are 18" to the nearer connector, more like 24" to the far
connector.  When I asked them about this, they simply replied that
although they were longer than spec, they'd never had anyone report a
failure.  Of course, most of their customers are running WinXX, which
does NOT exactly put a heavy load on the system, and if a WinXX system
DOES crash and ruin the disk, who blames the cable?  They crash and
corrupt all the time
anyway.  One also has to resist the temptation to install the cable with
a dangling end even if it means rearranging your physical layout or
twisting your cable into an odd (but not TOO odd, ribbon cables don't
like creased folds!) angle -- the cable MUST have a drive at the end to
terminate it.  SCSI chains also require termination, have length
restrictions, and so forth, and this too can create problems for the
unwary if an ignorant vendor builds their system.
  c) Bad BIOS settings.
  d) Other poisonous hardware or system specific problems.

We ALWAYS look into systems when we get them, and even from our local
vendor whom we "trust" and who builds boxes for us to our
microspecifications we get all sorts of crap installations -- cables
hooked up incorrectly to the motherboard, dangling connections, overlong
cables (especially in e.g.  super tower cases where they put the drives
up in the 5.25" bays and have to stretch to reach the IDE controllers).

Do NOT trust vendors' "burn in" -- they generally don't do anything like
properly exercise system components; often they are just supplying power
to a system for 24 hours or so to see if anything critical goes pop (as
it often enough does; this isn't a terrible way to begin :-).  If they
burn in with WinXX, they probably cannot exercise the system -- it
doesn't have the tools.  A turnkey linux vendor COULD burn in with good
tools and really work out the system, but even that won't show
everything -- sometimes cables come loose being carried in from the car
after leaving the vendor.
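
For anyone who wants a starting point, here is a minimal sketch of the
kind of exerciser I mean: it creates files, writes random data, reads it
back, and compares checksums.  This is NOT the tool we actually run; the
function name, file names, sizes, and pass count are all made up for
illustration.  Note also that a read immediately after a write may well
be served from the page cache rather than the disk, so a real burn-in
would use far more data than RAM and run for days.

```python
import hashlib
import os
import tempfile


def exercise(directory, n_files=8, file_size=1 << 20, passes=3):
    """Create, write, read back, and delete files in `directory`.

    Returns the list of paths whose contents failed verification.
    All names and defaults here are hypothetical, for illustration only.
    """
    failures = []
    for p in range(passes):
        written = {}
        # Write phase: random data, forced out to the device with fsync.
        for i in range(n_files):
            path = os.path.join(directory, f"exercise_{p}_{i}.dat")
            data = os.urandom(file_size)
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            written[path] = hashlib.sha256(data).hexdigest()
        # Read phase: compare each file against the checksum of what
        # was written, then remove it so the next pass starts clean.
        for path, expected in written.items():
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
            if actual != expected:
                failures.append(path)
            os.remove(path)
    return failures


if __name__ == "__main__":
    # Point this at a scratch directory on the filesystem under test;
    # a temporary directory is used here only so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as scratch:
        bad = exercise(scratch)
        print("files failing verification:", bad)
```

Run several copies in parallel on the target filesystem to push the load
average up; any path in the returned list is a file that did not read
back the way it was written.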

In every case where we've encountered burn in trouble with a server or
workstation disk, replacing the disk(s), reconfiguring, recabling, or
some other hardware tweak (rarely, using a different disk or controller
altogether) has solved it.

The reason I find it hard to doubt the general stability of ext2 proper
is that it would be very difficult even to estimate, to the nearest
power of 10, the number of bytes that have flowed into and out of ext2
filesystems on servers and workstations without serious problems --
numbers like 10^24 to 10^30 or even more come to mind.  We have
hundreds of MB in constant flow in project
space servers (well, now it is ext3, but ext3 is ext2 plus a journal --
one can back off ext3 to ext2 without doing "anything" to the
filesystem, just as one can convert a fs from ext2 to ext3 without
losing data), and we're one of hundreds of sites like this, some with
far larger data banks in far greater churn.  Some can cycle terabytes of
e.g.  image data in and out as fast as a bank of systems can generate it
-- they make >>movies<< with linux render farms, most likely on top of
ext2 filesystems.  If ext2 had any really serious, fundamental bugs like
those that are implied here, linux on a default of ext2 would simply
never have been taken seriously, instead of being nearly universal.

That said, I don't doubt those folks who claim that XFS solved ext2
corruption problems.  Even if the real problem WAS e.g. the cabling, it
could have been a very subtle timing problem that only arose when data
rates exceeded some critical threshold that might well have been
different for XFS.  Or XFS may have had a write cycle that "repaired"
some of the hard errors generated by the original corrupted write.  In
that sense perhaps ext2 was less robust than it might have been, but the
real problem might still have been hardware.

Or yes, perhaps ext2 has a very deep bug in it somewhere that we just
haven't reached the right area of systems phase space to tweak.  Could
be; it just seems less likely than the alternatives.

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu