[Beowulf] since we are talking about file systems ...

Tue Jan 17 12:03:22 PST 2006

On Tue, 17 Jan 2006, Joe Landman wrote:

> I created a simple perl code to create lots of small files in a pre-existing 
> directory named "dir" below the current directory.  This code runs like this
>
> 	dualcore:/local/files # ./files.pl 50000
> 	Creating N=50000 files
> 	Creating N files took 20 wallclock secs ( 0.57 usr + 13.99 sys = 14.56 CPU) seconds
>
> then looking at the files
>
> 	dualcore:/local/files # time ls dir | wc -l
> 	50002
> 	real    0m0.131s
> 	user    0m0.094s
> 	sys     0m0.040s
>
> also doesn't take much time.  Then again, this may be due to caching, the md0 
> raid device the filesystem is on, or any number of other things.

Try opening and closing.  ls doesn't really "look" at the files; it just
lists the contents of the directory file, so it is bw limited on one
large directory file.  You want to force a stat of the actual files
(which reads each inode, with its owner, group, and mode bits; ls -l
additionally hits the password and group files, however they are
served, to turn the IDs into names) and then open them, so that you time
the permission check and the latency associated with the creation of the
kernel structures necessary to read from the file.  Or go ahead and
actually write some data (say a 1K block or so) to the file on create
and then read it back in on the input pass.
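
A minimal sketch of that test in C, assuming a POSIX system and an
existing directory.  The function names (create_files, read_files) and
the f<N>.dat naming are made up for illustration; they are not taken
from the perl script being discussed.  The create pass writes one 1K
block per file, and the read pass stats, opens, reads, and closes each
file while timing the whole loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: create n files <dir>/f<i>.dat, each holding one
 * 1K block.  Returns n on success, -1 on the first error. */
int create_files(const char *dir, int n)
{
    char path[4096], block[1024];

    for (size_t i = 0; i < sizeof block; i++)
        block[i] = 'x';

    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof path, "%s/f%d.dat", dir, i);
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        ssize_t w = write(fd, block, sizeof block);
        close(fd);
        if (w != (ssize_t)sizeof block)
            return -1;
    }
    return n;
}

/* Hypothetical helper: stat, open, read, and close each of the n files.
 * Stores the elapsed seconds for the whole pass in *secs.
 * Returns the total number of bytes read, or -1 on error. */
long read_files(const char *dir, int n, double *secs)
{
    char path[4096], block[1024];
    struct stat st;
    long total = 0;
    double t0 = now_sec();

    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof path, "%s/f%d.dat", dir, i);
        if (stat(path, &st) != 0)
            return -1;
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t r = read(fd, block, sizeof block);
        close(fd);
        if (r < 0)
            return -1;
        total += r;
    }
    *secs = now_sec() - t0;
    return total;
}
```

Calling create_files("dir", 50000) and then read_files("dir", 50000,
&secs) right afterwards will most likely time the page cache rather than
the disk, so the number is only meaningful once caching has been dealt
with.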

Although I agree, it >>is<< very difficult to futz caching on a linux
box.  It really really works, and will screw up disk benchmarks in a
heartbeat unless you are really really careful.  IIRC, lmbench has some
very nice file benchmarks that show e.g. create latencies and so on, and
then there is bonnie.  Or you can play around by hand and figure out
what order you have to create/touch/modify files and then try to read
them to defeat caching.  I was always amazed by the difference between
creating a file and reading it the first time and then reading it a
second time.  Seconds can shrink to an interval too small to register
properly, as that second read is generally a straight memory transfer.
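
One way to see the first-read/second-read difference by hand is
sketched below, assuming Linux and a POSIX C library.  The helper names
are hypothetical.  posix_fadvise() with POSIX_FADV_DONTNEED is not
something mentioned above; it is just one way to ask the kernel to drop
a single file's cached pages, and it is only a hint, so the kernel is
free to ignore it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: read the whole file once and return the elapsed
 * seconds, or -1.0 on error. */
double timed_read(const char *path)
{
    char buf[65536];
    struct timespec t0, t1;
    ssize_t r;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((r = read(fd, buf, sizeof buf)) > 0)
        ;                           /* discard the data, we only time it */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (r < 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Hypothetical helper: ask the kernel to drop cached pages for one
 * file.  Advisory only.  Returns 0 on success, -1 if the file cannot
 * be opened, or a positive error number from posix_fadvise(). */
int evict_from_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    fsync(fd);                      /* dirty pages cannot be dropped */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```

Calling evict_from_cache() and then timed_read() twice on the same file
should show the effect if the hint was honored: the first call goes to
the disk, the second is served from memory.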

> What's interesting about this is the amount of wasted space more than 
> anything.
>
> Each file is on the order of 21 bytes or less.  50000 of them should be about 
> 1 MB.  Right?

IIRC, each file occupies a minimum of one 4K block, so 50K of them
should be about 200MB.  Which it is, depending on what "M" means today.
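
The arithmetic can be checked with a few lines of C.  This is a sketch
with hypothetical function names; it assumes space is handed out in
whole blocks and that st_blocks counts 512-byte units, as it does on
Linux.

```c
#include <sys/stat.h>

/* Bytes occupied by a file of `size` bytes when space is allocated in
 * whole blocks of `block_size` bytes.  An empty file takes none. */
long long allocated_bytes(long long size, long long block_size)
{
    long long blocks = (size + block_size - 1) / block_size; /* round up */
    return blocks * block_size;
}

/* Bytes the filesystem reports as allocated to `path`.
 * Assumes st_blocks is in 512-byte units.  Returns -1 on error. */
long long on_disk_bytes(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_blocks * 512;
}
```

For the numbers in question, 50000 * allocated_bytes(21, 4096) is
204,800,000 bytes: 204.8 MB counting in powers of ten, or about 195.3M
counting in units of 2^20 bytes.  The gap to the reported 198M is
presumably the directory file itself, though that is a guess.  With 1K
blocks the same files would take 51,200,000 bytes.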

Note that this is a tunable parameter.  See "man mke2fs".  You can make
the minimum block size as small as 1K (the -b option), at least if your
hardware supports it.  You can also play with the number of inodes your
fs will support (-i or -N), usually at the same time you play with the
minimum block size: if you run out of inodes before you run out of disk
you're screwed, but if you run out of disk before you run out of inodes
you waste at least some disk.  There are some other tunable parameters,
and there is hdparm as well for the drive itself.
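
Before retuning anything it may help to see what an existing filesystem
was built with.  A small sketch using POSIX statvfs() follows; the
function name is hypothetical, and f_frsize is taken here as the
allocation block size, which is an assumption that holds for ext2/ext3
but not necessarily everywhere.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical helper: print the block size and the inode and block
 * counts of the filesystem that holds `path`.
 * Returns 0 on success, -1 on error. */
int print_fs_geometry(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return -1;

    printf("block size:   %lu bytes\n", (unsigned long)vfs.f_frsize);
    printf("total inodes: %llu\n", (unsigned long long)vfs.f_files);
    printf("free inodes:  %llu\n", (unsigned long long)vfs.f_ffree);
    printf("free blocks:  %llu\n", (unsigned long long)vfs.f_bfree);
    return 0;
}
```

Comparing free inodes against free blocks while the disk fills up with
small files shows which of the two is going to run out first.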

The defaults are set for "normal" usage, a mix of file sizes from large
to small.  If a disk is always going to be used for many 1 block files,
this tuning is likely not optimal.  If it is going to be used for three
enormous files it is likely not optimal (although it is also likely that
you'll never do a measurement fine enough to notice).

> 	dualcore:/local/files/dir # ls -alF f10011.dat
> 	-rw-r--r--  1 root root 21 Jan 17 13:10 f10011.dat
> 	dualcore:/local/files/dir # du -h .
> 	198M    .
>
> ext3 isn't any better, giving about 197M.

This will be true on most filesystems for performance reasons, although
there may be fs (like ext*) that permit you SOME latitude for tuning
according to expected usage.

There is a really, really lovely document Sun wrote back in the 80's
called something like "The Sun Server Configuration and Capacity
Planning Guide", that I have a copy of somewhere that Sun gave me (I was
their best friend in those halcyon days of SunOS) -- I even used to have
a postscript image of it but cannot find it to check the copyright to
see if I can safely post it.  However, it had all sorts of very
practical advice on how to optimize disk layout to exploit e.g.
geometry (outside is faster than inside), how to tune number of inodes
against minimum file size, and the differences between data-intensive
(bw dominant) and inode-intensive (latency dominant) I/O.

It would be interesting to see how a FAT filesystem mounted under native
linux behaves on the same test.  Not interesting enough for me to burn
time testing, but interesting...;-)

The only other comment is that while perl IS quite efficient these days,
I'd only really trust C for this kind of benchmarking in general.
Otherwise there is a largely unpredictable part of any timings that is
due to perl's overhead.  What you're really timing is system calls and
the thin library layer over them, e.g. fstat, fopen, fclose, fprintf,
fscanf (or open/close/read/write if you prefer your i/o uncooked).  It
helps to have complete control over the buffers and data structures you
are using during the I/O process, as perl's data types are simple on the
surface but unknowably complex underneath, and indirection adds to
latency (noticeably/significantly for small timings).

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu