[Beowulf] Small files
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Fri Jun 13 11:37:16 PDT 2014
On 6/13/14, 6:17 AM, "Skylar Thompson" <skylar.thompson at gmail.com> wrote:
>We've recently implemented a quota of 1 million files per 1TB of
So you're penalizing people with files smaller than 1 Mbyte?
>And yes, we had to clean up a number of groups' and
>individuals' spaces before implementing that. There seems to be a trend
>in the bioinformatics community for using the filesystem as a database.
>I think it's enabled partly by a lack of knowledge of scaling and
>speedup in the community, since so much stuff still runs on laptops and
>desktops. I'd really like to teach a basic scientific computing class at
>work to address those concepts, but that would take more time than I
>have right now.
I've always advocated using the file system as a database, in the sense of
"lots of little files, one for each data blob", where a "data blob" is
bigger than a few bytes, but perhaps in the hundreds or thousands of bytes or more:
1) Rather than spend time implementing some sort of database, the file
system is already there
2) The file system is likely better optimized for whatever platform it is
running on. It runs "closer to the metal", and hopefully is tightly
integrated with things like caching and operating system tricks.
3) The file system is optimized for allocation and deallocation of space,
so I don't have to write that, or hope that my "database engine of choice"
does it right.
4) Backup and restore of parts of the data is straightforward without
needing any special utilities (e.g. file timestamps give modification dates, etc.).
Back in the late 80s, early 90s, I built a (large at the time) system
which stored variable length data (on the order of a few kbytes) in
individual files in the MS-DOS file system (with some cleverness hashing
names to directory names, to provide a quasi-balanced tree). I
did this after going through all the popular database apps of the time
(Paradox, dBase, FoxPro, Access, etc.) and found that just using DOS was
faster and consumed less space on disk.
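That name-hashing trick can be sketched in a few lines of Python. This is my own illustration, not the original DOS implementation: the MD5 choice, the 256-way fanout, and the names `blob_path`/`root`/`key` are all assumptions.

```python
import hashlib
import os

def blob_path(root, key, fanout=256):
    """Hash a blob's key into one of `fanout` subdirectories, spreading
    files evenly so no single directory grows unmanageably large."""
    digest = hashlib.md5(key.encode()).hexdigest()
    subdir = digest[:2]  # two hex chars = 256 buckets
    return os.path.join(root, subdir, key)

# blob_path("/data/blobs", "record-000123") always maps the same key
# to the same bucket, so lookups need no index of their own.
```

The same key always hashes to the same bucket, so retrieval is just recomputing the path, with no separate index to maintain.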
OK, I won't claim that the PC database engines of the time were
particularly optimized. However, the fact remains that, at any given time,
more time and energy has been spent on making file open/close in the file
system faster than on making the database program faster. Most database
app work goes into improving things like search capability, scripting
languages, and databasey things (doing relational operations, etc.).
And you can get bitten: Novell preallocated directory space on the server,
and coming in with an application that required tens of thousands of small
files was NOT the usual use case at the time.
Right now, I'm doing a lot of things like finite element models of EM
propagation. What more natural way to store each time step than in a
directory or file named with the time step? When I want to make a movie of
the data, it's easy to write a script that iterates through all the
directories and calls the "turn data cube into .png image" program (and
it's almost EP (embarrassingly parallel) to put it on a cluster). And
then, finally, lots of animation generators are perfectly happy to take a
series of frames as FRAME001.png, FRAME002.png, FRAME003.png, etc.
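The iterate-and-convert step above might look like this Python sketch. The `steps/t*` directory layout and the `cube2png` converter name are hypothetical stand-ins for whatever the real simulation output and tool are called.

```python
import glob
import subprocess

def frame_jobs(steps):
    """Pair each time-step directory with a numbered frame name.
    Each (step, frame) job is independent, so the conversions can be
    run serially, in parallel, or farmed out to a cluster."""
    return [(step, f"FRAME{i:03d}.png")
            for i, step in enumerate(sorted(steps), start=1)]

# Hypothetical usage, assuming a "cube2png" converter on the PATH:
# for step, frame in frame_jobs(glob.glob("steps/t*")):
#     subprocess.run(["cube2png", step, frame], check=True)
```

Sorting the directory names first is what keeps the frame numbering consistent with simulation time.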
I'd much rather do this than try to figure out some sort of database
engine where I'd create records of frames, etc.
Likewise, I'm collecting experimental data from a sensor, and each "data
take" (a half dozen files totaling about 1 Mbyte: raw data, an image from
a camera, cal data, etc.) winds up in a directory named
/sensorname/yyyymmdd/yyyymmdd hhmmss/ (from which you can guess that
there are no more than 86,400 data takes in a day).
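Building such a per-take directory path is a one-liner. This sketch assumes the scheme above (one-second timestamp resolution, hence the 86,400-takes-per-day ceiling); the function and parameter names are my own.

```python
import os
from datetime import datetime

def take_dir(sensor, when):
    """Build /sensorname/yyyymmdd/yyyymmdd hhmmss/ for one data take.
    One-second timestamp resolution caps a day at 86,400 takes."""
    day = when.strftime("%Y%m%d")
    stamp = when.strftime("%Y%m%d %H%M%S")
    return os.path.join("/", sensor, day, stamp)

# take_dir("radiometer", datetime(2014, 6, 13, 11, 37, 16))
# -> "/radiometer/20140613/20140613 113716"
```

Because the date is embedded in the path, "give someone a day's data" is just copying one yyyymmdd subtree.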
Moving logical groupings of data around is easy. If I want to give
someone a day's data, it's easy. If I want to copy the data off the
sensor and into a large repository using sneaker net, it's easy. If I want
to reprocess all the data in an EP way, it's easy.
Sure, I could try and define a big database, and store the raw data and
intermediate products in the database, but it would be tedious and
painful, and inevitably would require extra work when things changed
format, or I wanted to add a data item.
Further, everyone I work with would ALSO have to have that database
software installed. And, of course, all those collaborators are running on
their desktop or laptop machines, typically on a subset of the data.