[Beowulf] filesystem metadata mining tools

Skylar Thompson skylar at cs.earlham.edu
Sat Sep 12 11:34:25 PDT 2009

Rahul Nabar wrote:
> As the total number of files on our server was exploding (~2.5 million
> files / 1 terabyte), I
> wrote a simple shell script that used find to tell me which users have how
> many. So far so good.
> But I want to drill down more:
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale)
> *What is the most common file (or filename)
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access, i.e. maybe the last
> access stamp.
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
> I've used cushion plots in the past (SequoiaView; pydirstat) but those
> seem more desktop-oriented than suitable for a job like this.
> Essentially I want to data mine my file usage to strategize. Are there any
> tools for this? Writing a new find each time seems laborious.
> I suspect forensics might also help identify anomalies in usage across
> users which might be indicative of other maladies. e.g. a user who had a
> runaway job write a 500GB file etc.
> Essentially are there any "filesystem metadata mining tools"?
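One way to avoid writing a new find for each question is to dump the metadata once and aggregate offline; a minimal sketch, assuming GNU find (the mount point, output file, and field choices are illustrative):

```shell
# One pass over the tree: owner, size in bytes, mtime as epoch, path.
# /home and /tmp/fsmeta.tsv are illustrative; adjust to your filesystem.
find /home -type f -printf '%u\t%s\t%T@\t%p\n' > /tmp/fsmeta.tsv

# Files and bytes per user, answered from the same dump -- no rescanning.
awk -F'\t' '{n[$1]++; b[$1]+=$2}
            END {for (u in n) printf "%s\t%d files\t%d bytes\n", u, n[u], b[u]}' \
    /tmp/fsmeta.tsv
```

The same TSV can feed size histograms, age distributions, or a sort | uniq -c over basenames for the "most common filename" question, so the expensive tree walk happens only once.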
What OS is this on? If you have dtrace available, you can use it to at
least gather data on new files coming in, which could narrow your search
scope considerably. It obviously doesn't directly answer your question,
but it might make the existing tools easier to use.
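Where dtrace isn't available, a crude stand-in is a marker file plus find's -newer test, so each scan only touches files changed since the last run; a sketch, assuming GNU find and touch (the paths are illustrative):

```shell
# The marker records when we last scanned; /scratch is an illustrative tree.
marker=/var/tmp/last-scan
[ -e "$marker" ] || touch -d '1970-01-01' "$marker"

# List only files modified since the previous scan, then reset the marker.
find /scratch -type f -newer "$marker" -printf '%u %s %p\n'
touch "$marker"
```

This misses files that are created and deleted between scans, which dtrace would catch, but it keeps repeat scans cheap.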

Depending on what filesystem you have, you might be able to query the
filesystem itself for this data. On GPFS, for instance, you can write a
policy that would move all files older than, say, three months to a
different storage pool. You can then run that policy in a preview mode
to see what files would have been moved. The policy scan on GPFS is
quite a bit faster than running a find against the entire filesystem, so
it's a definite win.
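For reference, a GPFS age policy of that shape might look like the following sketch; the pool names and threshold are illustrative, not a tested rule:

```
/* Migrate files whose last access is older than ~90 days.
   'system' and 'archive' are illustrative pool names. */
RULE 'age_out' MIGRATE FROM POOL 'system' TO POOL 'archive'
  WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '90' DAYS
```

Running it with something like mmapplypolicy /gpfs/fs0 -P age.pol -I test (the filesystem path and rule file name are illustrative) should report what would be moved without moving anything, which is the preview mode described above.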

-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/

