[Beowulf] filesystem metadata mining tools

Sat Sep 12 16:02:10 PDT 2009

On 9/12/09 8:10 AM, "Rahul Nabar" <rpnabar at gmail.com> wrote:

> As the number of total files on our server was exploding (~2.5 million
> / 1 Terabyte) I wrote a simple shell script that used find to tell me
> which users have how many. So far so good.
> 
> But I want to drill down more:
> 
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale)
> 
> *What is the most common file (or filename)?
> 
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
> 
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
> 
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
> 
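
Most of the distributions asked about above can come out of a single pass
over the tree that records size and timestamps per file. What follows is a
minimal sketch in Python, not a tuned tool: the function names (scan,
extension_counts, space_share_of_largest, age_histogram) and the age bucket
edges are my own assumptions, and grouping by filename extension is only a
crude stand-in for real file-type detection.

```python
import bisect
import os
import stat
import time
from collections import Counter


def scan(root):
    """Yield (path, size_bytes, mtime, atime) for each regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield path, st.st_size, st.st_mtime, st.st_atime


def extension_counts(records):
    """Count files per lowercase extension (a rough proxy for file type)."""
    return Counter(
        os.path.splitext(path)[1].lower() or "(none)" for path, *_ in records
    )


def space_share_of_largest(sizes, fraction=0.2):
    """Return the share of total bytes held by the largest `fraction` of files."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


def age_histogram(timestamps, now=None, edges_days=(30, 180, 365)):
    """Bucket timestamps by age in days.

    Bucket 0 is younger than the first edge, and the last bucket is older
    than the final edge. The default edges are arbitrary (an assumption).
    """
    now = time.time() if now is None else now
    hist = Counter()
    for t in timestamps:
        age_days = (now - t) / 86400.0
        hist[bisect.bisect_right(edges_days, age_days)] += 1
    return hist
```

Feeding the mtime column to age_histogram gives the age distribution, and
feeding the atime column gives the access picture, with the caveat that
atime is only meaningful if the filesystem is not mounted with noatime.
With millions of files it may be worth writing the records to disk once
and running the summaries against that file.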

Another useful application for such a tool would be getting better KLOC
counts for source code trees.  Our trees have a lot of duplication among
branches (e.g., everyone has a "test.c" for unit tests alongside their
modules, and all of them are pretty similar).
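
As a rough illustration of both the duplicate hunt and the KLOC idea, here
is a sketch in Python that groups files by size first and hashes only the
candidates that share a size, then counts lines once per distinct file
content. The function names and the default suffix list are assumptions of
mine. It catches only byte-identical files, so the "pretty similar" test.c
variants would still be counted separately.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def regular_files(root):
    """Yield paths of regular files under root, skipping symlinks."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def find_duplicates(root):
    """Return lists of paths whose contents are byte-identical."""
    by_size = defaultdict(list)
    for path in regular_files(root):
        try:
            by_size[os.path.getsize(path)].append(path)
        except OSError:
            continue  # file vanished between listing and stat
    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate; skip hashing
        for path in paths:
            by_content[(size, file_digest(path))].append(path)
    return [sorted(paths) for paths in by_content.values() if len(paths) > 1]


def unique_line_count(root, suffixes=(".c", ".h")):
    """Count lines in matching files, counting each distinct content once."""
    seen = set()
    total = 0
    for path in regular_files(root):
        if not path.endswith(suffixes):
            continue
        digest = file_digest(path)
        if digest in seen:
            continue  # identical copy already counted
        seen.add(digest)
        with open(path, "rb") as f:
            total += sum(1 for _ in f)
    return total
```

The size-first grouping matters at the scale of millions of files because
it avoids reading any file whose size is unique. Hard links to the same
inode will show up as duplicates here, which may or may not be what you
want.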