[Beowulf] filesystem metadata mining tools

Tue Sep 15 11:26:56 PDT 2009

I have used Perl in the past to gather summaries of file usage like
this. The details are fuzzy (it was a couple of years ago), but I think I
did a 'find -ls' to a text file and then used Perl to parse the file
and add up the various statistics. I wasn't gathering as many statistics
as you, but it was pretty easy to write, even for a novice Perl programmer
like me.
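
A minimal sketch of that approach might look like the following. It is
written in Python rather than Perl, and it is not the original script.
It assumes the listing was produced with GNU 'find -ls' (for example,
'find /home -ls > listing.txt', where the path and file name are only
placeholders) and that the owner is the fifth field and the size in bytes
the seventh. The function name summarize is made up for this example.

```python
import sys
from collections import defaultdict


def summarize(lines):
    """Tally (file count, total bytes) per owner from `find -ls` output.

    Assumed field layout (GNU find): inode, blocks, permissions, links,
    owner, group, size, month, day, time-or-year, path.
    """
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        # Limit the split so a path containing spaces stays in one field.
        fields = line.split(None, 10)
        # Skip short lines and device nodes (their size field is "major,").
        if len(fields) < 11 or not fields[6].isdigit():
            continue
        perms, owner, size = fields[2], fields[4], int(fields[6])
        if not perms.startswith("-"):
            continue  # count regular files only
        stats[owner][0] += 1
        stats[owner][1] += size
    return {owner: (count, total) for owner, (count, total) in stats.items()}


if __name__ == "__main__":
    # Usage: python summarize.py listing.txt
    with open(sys.argv[1], errors="replace") as listing:
        report = summarize(listing)
    # Print owners sorted by total bytes, largest first.
    for owner, (count, total) in sorted(report.items(), key=lambda kv: -kv[1][1]):
        print(f"{owner:<12} {count:>10} files {total:>16} bytes")
```

Other statistics (by extension, by age, by size bucket) could be added
by keying the same kind of tally on a different field.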

Prentice

Rahul Nabar wrote:
> As the total number of files on our server was exploding (~2.5 million
> / 1 Terabyte), I wrote a simple shell script that used find to tell me
> which users have how many. So far so good.
> 
> But I want to drill down more:
> 
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts, which users copy rather than link, etc. (fdupes seems puny for
> a job of this scale.)
> 
> *What is the most common file (or filename)?
> 
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
> 
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
> 
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
> 
> I've used cushion plots in the past (SequoiaView; pydirstat) but those
> seem more desktop oriented than suitable for a job like this.
> 
> Essentially I want to data mine my file usage to strategize. Are there any
> tools for this? Writing a new find each time seems laborious.
> 
> I suspect forensics might also help identify anomalies in usage across
> users which might be indicative of other maladies, e.g. a user who had a
> runaway job write a 500GB file.
> 
> Essentially, are there any "filesystem metadata mining tools"?
>