[Beowulf] filesystem metadata mining tools
rpnabar at gmail.com
Sat Sep 12 08:10:43 PDT 2009
As the total number of files on our server was exploding (~2.5 million
files / 1 terabyte), I wrote a simple shell script that used find to tell me
how many files each user has. So far so good.
But I want to drill down more:
* Are there lots of duplicate files? I suspect so: things like job submission
scripts, which users copy rather than link. (fdupes seems puny for
a job of this scale.)
* What is the most common file (or filename)?
* A distribution of filetypes (executables; netCDF; movies; text) and
* A distribution of file age and prevalence (to know how much of this
material is archivable). Same for frequency of access; i.e. maybe the last
* A file size versus number plot. i.e. Is 20% of space occupied by 80% of
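
The duplicate-file question in the list above can be sketched in a few lines. This is only a minimal sketch, not an existing tool: the function names (`file_digest`, `find_duplicates`) are made up for illustration, and it assumes the tree is readable by whoever runs it. It groups files by size first, so that only files sharing a size with another file are ever read and hashed.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted path lists) with identical content.

    Hypothetical helper: empty files are ignored, and hard links to an
    inode that was already seen are skipped, since they are not copies.
    """
    by_size = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                continue
            inode = (st.st_dev, st.st_ino)
            if inode in seen_inodes:
                continue
            seen_inodes.add(inode)
            by_size[st.st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[file_digest(path)].append(path)
            except OSError:
                continue
        groups.extend(sorted(p) for p in by_hash.values() if len(p) > 1)
    return groups
```

On a tree of a few million files most of the time goes into the initial walk; the hashing step only touches files whose size collides with another file's.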
I've used cushion plots in the past (SequoiaView; pydirstat), but those
seem more desktop-oriented than suitable for a job like this.
Essentially I want to data mine my file usage to strategize. Are there any
tools for this? Writing a new find each time seems laborious.
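
One way around writing a new find for every question is to walk the tree once, store the metadata in a small database, and then ask each question as a query. The following is a rough sketch under stated assumptions rather than a finished tool: the table layout and the function names (`build_index`, `files_per_user`, and so on) are invented for illustration, and it assumes file names are valid UTF-8.

```python
import os
import sqlite3
import stat


def iter_file_rows(root):
    """Yield one metadata tuple per regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            ext = os.path.splitext(name)[1].lower()
            yield (path, name, ext, st.st_uid, st.st_size,
                   st.st_mtime, st.st_atime)


def build_index(root, db_path=":memory:"):
    """Walk root once and load the metadata into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, ext TEXT, uid INTEGER, "
        "size INTEGER, mtime REAL, atime REAL)"
    )
    # executemany consumes the generator, so rows are streamed, not buffered.
    conn.executemany(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?, ?, ?)",
        iter_file_rows(root),
    )
    conn.commit()
    return conn


def files_per_user(conn):
    """Rows of (uid, file count, total bytes), busiest owner first."""
    return conn.execute(
        "SELECT uid, COUNT(*) AS n, SUM(size) FROM files "
        "GROUP BY uid ORDER BY n DESC"
    ).fetchall()


def most_common_names(conn, limit=10):
    """Rows of (filename, count), most frequent first."""
    return conn.execute(
        "SELECT name, COUNT(*) AS n FROM files "
        "GROUP BY name ORDER BY n DESC, name LIMIT ?",
        (limit,),
    ).fetchall()


def extension_distribution(conn):
    """Rows of (extension, file count, total bytes)."""
    return conn.execute(
        "SELECT ext, COUNT(*) AS n, SUM(size) FROM files "
        "GROUP BY ext ORDER BY n DESC, ext"
    ).fetchall()


def bytes_not_modified_since(conn, cutoff_epoch):
    """Total bytes in files last modified before cutoff_epoch (Unix time)."""
    return conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM files WHERE mtime < ?",
        (cutoff_epoch,),
    ).fetchone()[0]
```

Passing a file name as `db_path` (for example a hypothetical `files.db`) keeps the index on disk, so later questions become one more SQL query instead of another full scan. The extension is only a crude stand-in for file type, and the atime column is only meaningful if the filesystem is not mounted with noatime.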
I suspect forensics might also help identify anomalies in usage across
users which might be indicative of other maladies, e.g. a user whose
runaway job wrote a 500 GB file.
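
The runaway-job case could be caught with something as simple as the sketch below. It is an illustration only: the function names and the idea of a fixed size threshold are assumptions, not an established method. It lists the largest files overall and groups every file above a chosen size by owner.

```python
import heapq
import os
import stat


def iter_regular_files(root):
    """Yield (size_in_bytes, uid, path) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield st.st_size, st.st_uid, path


def largest_files(root, n=20):
    """Return the n largest files as (size, uid, path), biggest first."""
    return heapq.nlargest(n, iter_regular_files(root))


def oversized_by_user(root, threshold_bytes):
    """Map uid -> list of (size, path) for files at or above the threshold."""
    flagged = {}
    for size, uid, path in iter_regular_files(root):
        if size >= threshold_bytes:
            flagged.setdefault(uid, []).append((size, path))
    for hits in flagged.values():
        hits.sort(reverse=True)  # biggest offender first
    return flagged
```

Note that `st_size` is the apparent size; for sparse files the space actually allocated can be much smaller, and on Linux `st_blocks * 512` would be the figure to compare against.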
Essentially are there any "filesystem metadata mining tools"?