[Beowulf] since we are talking about file systems ...

Robert G. Brown rgb at phy.duke.edu
Tue Jan 24 17:57:13 PST 2006

On Wed, 25 Jan 2006, PS wrote:

> You're absolutely correct, of course, in what you state: Google doesn't 
> access millions of files in a split second. But Google finds one file in 
> millions in a split second.
> I still think, though, that indexing makes sense on a PC - with practically 
> any underlying file system. More than ten years ago I used a large  (for that 
> time) database of about 200 million files - which was CDROM based; the file 
> system was PC Dos and the indexing was done on top of that in a sort of front 
> end.  To find/access any one of the 200 mio files on the slow CDROM took 
> never longer than 0.8 seconds. So, we didn't touch the "kernel" of dos at all 
> and still got fast results. I think a similar approach should work on a Linux 
> system. In my opinion this could and should be done outside of the kernel. 
> What do you think?

Wow, 200 million files on a CDROM that only holds 6-700 million bytes?
As James Bond used to say "that's a neat trick...";-)

I think that you're confusing files (which are particular constructs or
abstractions to the kernel, especially in Unix where "everything is a
file" to the kernel) and database records, which are typically IN a
single (indexed) file, sometimes a very large file.  Back on the
original PC, the "kernel" (as you put it:-) actually had both
abstractions -- a sequential access file model and a "random access"
file model that was fairly obviously designed with databases in mind.  I
vaguely remember doing a bit of BASICA programming with both paradigms
for one reason or another -- probably writing a personal DB to index my
record collection or something.

In most implementations of Unix, the file abstraction is associated with
certain structures such that 200 million EMPTY files would occupy more
than three bytes each.  If you look at the returned struct from an
fstat, it looks (to me, anyway, I'm not bothering to write a code
fragment and measure via sizeof:-) like it is some 50 or 60 bytes long
all by itself.  That is the tip of a fairly deep iceberg.

However, I think that we're all in agreement here -- for any given
application that has extensive data management requirements, competent
code design requires that one treat the data management problem
seriously and consider what is known in computer science about ways of
managing the data efficiently and design a suitable vehicle for doing
so.  In some (quite a lot, really) cases the "best" solution will be
something like a real SQL database (gasp) because the people who design
databases spend a lot of very high powered FTE uberprogrammer effort
designing their programs, but they come at the cost of learning to use
real database programs.  In others it will be worth designing a
customized hash of some sort or another.  In most cases it won't be a
good idea to hack the solution into the kernel;-)


> Paul
> Robert G. Brown wrote:
>> On Sun, 22 Jan 2006, PS wrote:
>>> Indexing is the key;  observe how Google accesses millions of files in 
>>> split seconds; this could easily be achieved in a PC file system.
>> I think that you mean the right thing, but you're saying it in a very
>> confusing way.
>> 1) Google doesn't access millions of files in a split second, it AFAIK
>> accesses relatively few files that are hashes (on its "index server")
>> that lead to URLs in a split second WITHOUT actually traversing millions
>> of alternatives (as you say, indexing is the key:-).  File access
>> latency on a physical disk makes the former all but impossible without
>> highly specialized kernel hacks/hooks, ramdisks, caches, disk arrays,
>> and so on.  Even bandwidth would be a limitation if one assumes block
>> I/O with a minimum block size of 4K -- 4K x 1M -> 4 Gigabytes/second
>> (note BYTES, not bits) exceeds the bandwidth of pretty much any physical
>> medium except maybe memory.
>> 2) It cannot "easily" be achieved in a PC file system, if by that you
>> mean building an actual filesystem (at the kernel level) that supports
>> this sort of access.  There is a lot more to a scalable, robust,
>> journalizeable filesystem than directory lookup capabilities.  A lot of
>> Google's speed comes from being able to use substantial parallelism on a
>> distributed server environment with lots of data replication and
>> redundancy, a thing that is impossible for a PC filesystem with a number
>> of latency and bandwidth bottlenecks at different points in the dataflow
>> pathways towards what is typically a single physical disk on a single
>> e.g.  PCI-whatever channel.
>> I think that what you mean (correctly) is that this is something that
>> "most" user/programmers would be better off trying to do in userspace on
>> top of any general purpose, known reliable/robust/efficient PC
>> filesystem, using hashes customized to the application.  When I first
>> read your reply, though, I read it very differently as saying that it
>> would be easy to build a linux filesystem that actually permits millions
>> of files per second to be accessed and that this is what Google does
>> operationally.
>>    rgb
>>> Paul
>>> Joe Landman wrote:
>>>> Methinks I lost lots of folks with my points ...
>>>> Major thesis is that on well designed hardware/software/filesystems, 
>>>> 50000 files is not a problem for accesses (though from a management point 
>>>> of view it is a nightmare).  For poorly designed/implemented file systems 
>>>> it is a nightmare.
>>>> Way back when in the glory days of SGI, I seem to remember xfs being 
>>>> tested with millions of files per directory (though don't hold me to that 
>>>> old memory).  Call this hearsay at this moment.
>>>> A well designed and implemented file system shouldn't bog you down as you 
>>>> scale out in size, even if you shouldn't.  Its sort of like your car.  If 
>>>> you go beyond 70 MPH somewhere in the US that supports such speeds, your 
>>>> transmission shouldn't just drop out because you hit 71 MPH.
>>>> Graceful degradation is a good thing.
>>>> Joe

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

More information about the Beowulf mailing list