[Beowulf] filesystem metadata mining tools

Bruno Coutinho coutinho at dcc.ufmg.br
Sat Sep 12 12:59:57 PDT 2009


This tool can do part of what you want:
http://www.chiark.greenend.org.uk/~sgtatham/agedu/

This one displays files by size and colors them by type:
http://gdmap.sourceforge.net/

agedu can perhaps handle a tree of this size, but gdmap is more desktop
oriented.
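
If you try agedu, the workflow is a scan pass followed by a report;
roughly like this (untested here, and /home is just a placeholder):

  # index the tree once; this writes agedu.dat in the current directory
  agedu -s /home
  # text report of usage broken down by last-access age
  agedu -t /home
  # or browse the same data interactively in a local web server
  agedu -w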


2009/9/12 Skylar Thompson <skylar at cs.earlham.edu>

> Rahul Nabar wrote:
> > As the total number of files on our server was exploding (~2.5 million
> > files / 1 terabyte), I wrote a simple shell script that used find to
> > tell me how many files each user has. So far so good.
> >
> > But I want to drill down more:
> >
> > *Are there lots of duplicate files? I suspect so. Stuff like job
> > submission scripts which users copy rather than link, etc. (fdupes
> > seems puny for a job of this scale)
> >
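
For the duplicates question: when fdupes chokes at this scale, a two-pass
approach usually helps, grouping by size first and checksumming only the
files whose sizes collide. An untested sketch with GNU find/awk/xargs
(/home is a placeholder):

  # pass 1: record size and path for every file
  find /home -type f -printf '%s\t%p\n' > /tmp/allfiles
  # only sizes that occur more than once can hide duplicates
  awk -F'\t' '{n[$1]++} END {for (s in n) if (n[s] > 1) print s}' \
      /tmp/allfiles > /tmp/dupsizes
  # pass 2: checksum the candidates only, then group identical hashes
  # (breaks on filenames that contain newlines)
  awk -F'\t' 'NR==FNR {dup[$1]=1; next} dup[$1] {print $2}' \
      /tmp/dupsizes /tmp/allfiles \
    | xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate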
> > *What is the most common file (or filename)?
> >
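
The most common filename is a one-liner if basenames are enough:

  # tally basenames across the tree, most frequent first
  find /home -type f -printf '%f\n' | sort | uniq -c | sort -rn | head -20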
> > *A distribution of filetypes (executables; netcdf; movies; text) and
> > prevalence.
> >
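
For filetypes, extensions lie, so file(1) is more honest; on 2.5 million
files you probably want to sample rather than classify everything.
Something like this (untested):

  # classify a random sample of 10000 files and tally the types
  find /home -type f | shuf -n 10000 | xargs -d '\n' file -b \
    | cut -d, -f1 | sort | uniq -c | sort -rn | head -20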
> > *A distribution of file age and prevalence (to know how much of this
> > material is archivable). Same for frequency of access, i.e. maybe the
> > last access stamp.
> >
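
GNU find can emit timestamps directly (%A@ is atime as seconds since the
epoch; use %T@ for mtime instead), so a crude last-access histogram is
easy, assuming the filesystem is not mounted noatime:

  now=$(date +%s)
  # bucket files into 30-day bins by last access time
  find /home -type f -printf '%A@\n' \
    | awk -v now="$now" '{h[int((now - $1) / 2592000)]++}
        END {for (b in h) print b*30 " days:", h[b] " files"}' \
    | sort -n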
> > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> > files? etc.
> >
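
And the 80/20 question falls out of sorting sizes largest-first and
printing the cumulative share of bytes at each decile. Untested sketch:

  # what fraction of the bytes do the biggest N% of files hold?
  find /home -type f -printf '%s\n' | sort -rn \
    | awk '{total += $1; sz[NR] = $1}
        END {step = int(NR / 10); if (step < 1) step = 1;
             for (i = 1; i <= NR; i++) {
               cum += sz[i];
               if (i % step == 0)
                 printf "%3.0f%% of files = %5.1f%% of bytes\n",
                        100 * i / NR, 100 * cum / total}}'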
> > I've used cushion plots in the past (SequoiaView; pydirstat) but those
> > seem more desktop oriented than suitable for a job like this.
> >
> > Essentially I want to data mine my file usage to strategize. Are there
> > any tools for this? Writing a new find each time seems laborious.
> >
> > I suspect forensics might also help identify anomalies in usage across
> > users which might be indicative of other maladies. e.g. a user who had a
> > runaway job write a 500GB file etc.
> >
> > Essentially are there any "filesystem metadata mining tools"?
> >
> >
> What OS is this on? If you have dtrace available you can use that to at
> least gather data on new files coming in, which could reduce your search
> scope considerably. It obviously doesn't directly answer your question,
> but it might make it easier to use the existing tools.
>
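
(For the archives: the canonical DTrace one-liner for watching opens looks
something like the line below; it logs every open(2), so you would still
need to filter for creations by testing the oflag argument.)

  # print process name and path for each open as it happens
  dtrace -n 'syscall::open*:entry { printf("%s %s", execname, copyinstr(arg0)); }'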
> Depending on what filesystem you have you might be able to query the
> filesystem itself for this data. On GPFS, for instance, you can write a
> policy that would move all files older than, say, three months to a
> different storage pool. You can then run that policy in a preview mode
> to see what files would have been moved. The policy scan on GPFS is
> quite a bit faster than running a find against the entire filesystem, so
> it's a definite win.
>
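
(A preview run of such a policy might look roughly like this; syntax from
memory, so check the mmapplypolicy docs for your GPFS release, and the
pool names and mount point are placeholders. Put this in old-files.pol:)

  RULE 'archive_old' MIGRATE FROM POOL 'system' TO POOL 'archive'
    WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '90' DAYS

(and then evaluate it without moving anything:)

  # -I test reports what would move, without moving it
  mmapplypolicy /gpfs/fs0 -P old-files.pol -I test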
> --
> -- Skylar Thompson (skylar at cs.earlham.edu)
> -- http://www.cs.earlham.edu/~skylar/
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>