Need comments about cluster file systems

Donald Becker becker at
Thu Nov 14 15:01:05 PST 2002

On Thu, 14 Nov 2002 hanzl at wrote:

> > > using something like Coda or InterMezzo (for systems where
> > > local disks are quicker than network card).
> > ... The key to good performance is not exceeding the semantic
> > requirements of the application by too much, and the best systems are so
> > transparent to the end user that they don't Need A Capitalized Name.
> > 
> > Our cluster system, for example, uses a specialized whole-file-caching
> > filesystem internally. ...
> I am looking for any working opensource solution for persistent file
> caching. Please enlighten me if you know any. I looked hard, but I
> might have missed something.

Ours is Open Source, but we don't document it or provide it separately.
It's an internal part of our system, not visible to users.

The simplest, best mental model of our system is how we describe it
architecturally: a system that doesn't require a filesystem, either a
network file system or on a local disk.  The mounted filesystems are
selected to maximize application performance by matching the requirements
of the specific applications.  We make that selection easy by not
requiring an initial filesystem.

Of course the reality is that we do have an internal file system.  It's
a whole-file-caching system that defaults to caching into RAM, with
hooks for programs or modules that do whole file reads (no writes) and
cache cleaning.
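
To make the idea concrete, here is a minimal sketch of a read-only,
whole-file cache held in RAM.  It is an illustration only, not the code in
our system: the class name, the `fetch` hook, the `max_bytes` budget and
the least-recently-used cleaning policy are all assumptions made for the
example.

```python
from collections import OrderedDict
from typing import Callable


class WholeFileCache:
    """Read-only cache that keeps entire files in RAM (illustrative only)."""

    def __init__(self, fetch: Callable[[str], bytes], max_bytes: int) -> None:
        self.fetch = fetch          # hypothetical hook: reads one whole file
        self.max_bytes = max_bytes  # hypothetical RAM budget for cached data
        self.files: "OrderedDict[str, bytes]" = OrderedDict()
        self.used = 0               # bytes currently held in the cache

    def read(self, path: str) -> bytes:
        """Return the whole file, fetching it only on the first access."""
        if path in self.files:
            self.files.move_to_end(path)  # mark as most recently used
            return self.files[path]
        data = self.fetch(path)           # whole-file read, never a partial one
        self.files[path] = data
        self.used += len(data)
        self.clean()
        return data

    def clean(self) -> None:
        """Cache-cleaning hook: drop least recently used files over budget.

        The most recently read file is always kept, even if it alone
        exceeds the budget.
        """
        while self.used > self.max_bytes and len(self.files) > 1:
            _, dropped = self.files.popitem(last=False)
            self.used -= len(dropped)
```

A caller would pass whatever whole-file reader suits the node, for example
a function that pulls the file from a master node, and then call
`cache.read(path)` wherever it would otherwise open the file.  There is no
write path, which is what keeps the consistency problem trivial.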

> > > (For sure others will point you to PVFS, which IMHO makes sense only
> > > if network card is quicker than local disk.)

With Gigabit Ethernet, the network is once again faster than a local
disk.  With medium size clusters the network bandwidth is even slightly
less expensive than disk bandwidth.  But in general you should count on
local disk bandwidth being the least expensive I/O around.
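
A back-of-the-envelope comparison can be sketched as follows.  The Gigabit
Ethernet figure is the raw line rate (1000 Mbit/s, i.e. 125 MB/s, before
protocol overhead); the disk figure is only a placeholder assumption and
should be replaced with a number measured on your own hardware.

```python
GIGE_MB_S = 1000 / 8       # Gigabit Ethernet raw line rate: 125 MB/s
ASSUMED_DISK_MB_S = 40.0   # placeholder assumption, not a measured value


def transfer_seconds(size_mb: float, bandwidth_mb_s: float) -> float:
    """Time to move size_mb megabytes at a sustained bandwidth."""
    return size_mb / bandwidth_mb_s


def faster_path(size_mb: float,
                disk_mb_s: float = ASSUMED_DISK_MB_S,
                net_mb_s: float = GIGE_MB_S) -> str:
    """Name the path that moves the data sooner, ignoring latency and cost."""
    net = transfer_seconds(size_mb, net_mb_s)
    disk = transfer_seconds(size_mb, disk_mb_s)
    return "network" if net < disk else "local disk"
```

This ignores latency, contention from other nodes sharing the network, and
price, so it answers only the raw-speed half of the question.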

> I quite like PVFS, but I think it does not solve the problem. AFAIK it
> can get speed of NIC. I want the speed of local harddisk, which is
> much bigger with my hardware. But again, I am ready for any
> enlightenment.

We ship PVFS with our software system, and some of our hardware partners
provide it to customers.  There are access patterns that work
extraordinarily well, and other access patterns where you look to see if
you didn't accidentally mount a floppy disk.  It's not a general purpose
tool, but works great for some applications.

> I do not know any details about GFS but I will be happy to learn them
> if anybody tells me that it can give me what I want.

Same thing: it's exactly right for some applications, not fast enough or
scalable enough for others.  It provides tight record-level consistency,
which is a difficult thing to do well.

> I did not say Coda or InterMezzo are great solutions. They are just
> the only solution I found. Bad ones, yes. Coda is too big, InterMezzo
> is not finished.

I don't think InterMezzo is a Bad solution.  I just think that
"working and actively being improved" should get extra credit.

(Yes, I am biased.  I keep hearing that our old releases "aren't nearly
as good as the stuff that X has announced".  Matt O'Keefe, Rob Ross and
Walt Ligon must feel the same way when Lustre gets top billing after
InterMezzo was never finished.)

Donald Becker				becker at
Scyld Computing Corporation
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993
