[Beowulf] filesystem metadata mining tools

Sat Sep 12 08:10:43 PDT 2009

As the total number of files on our server was exploding (~2.5 million
files / 1 terabyte), I wrote a simple shell script that uses find to tell
me how many files each user has. So far so good.

But I want to drill down more:

* Are there lots of duplicate files? I suspect so: things like job
submission scripts, which users copy rather than link. (fdupes seems puny
for a job of this scale.)

* What is the most common file (or filename)?

* A distribution of file types (executables; NetCDF; movies; text) and
their prevalence.

* A distribution of file age and prevalence (to know how much of this
material is archivable). Same for frequency of access, e.g. based on the
last-access timestamp.

* A file size versus file count plot, i.e. is 20% of the space occupied by
80% of the files? etc.
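For the duplicate question in the first bullet, one common approach is to group files by size first and hash only the files that share a size with at least one other file, so most of the tree is never read. The following is a minimal sketch under that assumption, not a tuned tool; the function names `file_digest` and `find_duplicates` are hypothetical, and the choice of SHA-256 is arbitrary.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted path lists) with identical contents."""
    # Pass 1: metadata only. Group regular files by size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(st.st_mode):
                by_size[st.st_size].append(path)

    # Pass 2: hash only files whose size is shared with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[file_digest(path)].append(path)
            except OSError:
                continue
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Note that hard-linked paths are reported as duplicates by this sketch even though they occupy the space only once; comparing `st_dev` and `st_ino` would be one way to filter them out.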

I've used cushion plots in the past (SequoiaView; pydirstat), but those
seem more desktop-oriented than suitable for a job like this.

Essentially I want to data mine my file usage to strategize. Are there any
tools for this? Writing a new find each time seems laborious.
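One way to avoid a fresh find for every question is to walk the tree once, store one row of metadata per file in a small database, and then ask each question as a query. The sketch below assumes Python's standard sqlite3 module; the table layout and the names `build_catalog`, `age_histogram`, and `space_share_of_largest` are hypothetical. The file extension is used as a crude stand-in for file type, and access times are only meaningful if the filesystem is not mounted with noatime.

```python
import os
import sqlite3
import stat


def iter_rows(root):
    """Yield one metadata tuple per regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(st.st_mode):
                ext = os.path.splitext(name)[1].lower()
                yield (path, name, ext, st.st_uid,
                       st.st_size, st.st_mtime, st.st_atime)


def build_catalog(root, db_path=":memory:"):
    """Walk root once and load the metadata into an SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT, name TEXT, ext TEXT, uid INTEGER, "
        "size INTEGER, mtime REAL, atime REAL)"
    )
    con.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?, ?)",
                    iter_rows(root))
    con.commit()
    return con


# Each question from the list becomes a query instead of a new find.
MOST_COMMON_NAMES = (
    "SELECT name, COUNT(*) AS n FROM files "
    "GROUP BY name ORDER BY n DESC, name LIMIT 10"
)
BY_EXTENSION = (
    "SELECT ext, COUNT(*) AS n, SUM(size) AS bytes FROM files "
    "GROUP BY ext ORDER BY n DESC, ext"
)
FILES_PER_USER = (
    "SELECT uid, COUNT(*) AS n, SUM(size) AS bytes FROM files "
    "GROUP BY uid ORDER BY n DESC, uid"
)


def age_histogram(con, now):
    """Return (age_in_whole_years, file_count, bytes) rows by mtime."""
    return con.execute(
        "SELECT CAST((? - mtime) / 86400.0 / 365.0 AS INTEGER) AS age_years, "
        "COUNT(*), SUM(size) FROM files "
        "GROUP BY age_years ORDER BY age_years",
        (float(now),),
    ).fetchall()


def space_share_of_largest(con, fraction):
    """Return the share of total bytes held by the largest `fraction` of files."""
    sizes = [r[0] for r in con.execute(
        "SELECT size FROM files ORDER BY size DESC")]
    total = sum(sizes)
    if total == 0:
        return 0.0
    top = sizes[:max(1, int(len(sizes) * fraction))]
    return sum(top) / total
```

For example, `build_catalog("/home", "catalog.db")` followed by `con.execute(BY_EXTENSION).fetchall()` would give counts and bytes per extension, and `space_share_of_largest(con, 0.2)` would answer the 80/20 question directly; the path `/home` is only an illustration.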

I suspect this kind of forensics might also help identify anomalies in
usage across users that could be indicative of other maladies, e.g. a user
whose runaway job wrote a 500 GB file.
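A runaway-job check of this kind could be sketched as follows: for each user, find the single largest file and flag it when it is both large in absolute terms and a big share of that user's total usage. The names `scan` and `flag_outliers` and both thresholds are assumptions chosen for illustration, not recommended values.

```python
import os
import stat
from collections import defaultdict


def scan(root):
    """Yield (uid, size, path) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(st.st_mode):
                yield st.st_uid, st.st_size, path


def flag_outliers(records, min_bytes=100 * 2**30, share=0.5):
    """Return (uid, size, path) for each user's largest file when it is at
    least min_bytes and at least `share` of that user's total bytes."""
    totals = defaultdict(int)
    largest = {}
    for uid, size, path in records:
        totals[uid] += size
        if uid not in largest or size > largest[uid][0]:
            largest[uid] = (size, path)

    flagged = [
        (uid, size, path)
        for uid, (size, path) in largest.items()
        if size >= min_bytes and size >= share * totals[uid]
    ]
    # Biggest offenders first.
    return sorted(flagged, key=lambda r: r[1], reverse=True)
```

Calling `flag_outliers(scan("/home"))` would list candidates for a closer look; the path is again only an example, and a flagged file is a hint rather than proof of a problem.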

Essentially are there any "filesystem metadata mining tools"?

-- 
Rahul