[Beowulf] dedupe filesystem

Nifty Tom Mitchell niftyompi at niftyegg.com
Tue Jun 2 11:51:57 PDT 2009


On Tue, Jun 02, 2009 at 06:39:40PM +0100, Ashley Pittman wrote:
> On Tue, 2009-06-02 at 12:34 -0400, Michael Di Domenico wrote:
> > does anyone have an opinion on dedup'ing files on a filesystem, but
> > not in the context of backups?  I did a google search for a program,
> > but only seemed to find the big players in the context of backups and
> > block levels.  i just need a file level check and report.
> 
> I'm not sure I understand the question, if it's a case of looking for
> duplicate files on a filesystem I use fdupes
> 
> http://premium.caribe.net/~adrian2/fdupes.html
> 
> > Is scanning the filesystem and md5'ing the files really the best (or
> > only) way to do this?
> 
> Fdupes scans the filesystem looking for files where the size matches, if
> it does it md5's them checking for matches and if that matches it
> finally does a byte-by-byte compare to be 100% sure.  As a result it can
> take a while on filesystems with lots of duplicate files.
> 
> There is another test it could do after checking the sizes and before
> the full md5, it could compare the first say Kb which should mean it
> would run quicker in cases where there are lots of files which match in
> size but not content but anyway I digress.
> 
> Ashley Pittman.
> 

Not really a digression... this is a performance-oriented list.
Below is my back-pocket solution for finding things like multiple copies
of big .iso files.  As you indicate, it can be dog slow.

The very hard part is knowing what to do once a duplicate has been found,
so I just page through all the duplicates with less.

Another difficult part is meta characters (spaces and the like) in file
names, thus the -print0 on find and the matching -0 on xargs.
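
To make the -print0 point concrete, here is a small sketch (not part of
the script itself) that counts how many arguments xargs hands to a
command with and without NUL separators.  The helper names
count_args_plain and count_args_nul are made up for illustration; GNU
find and xargs are assumed.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers showing why -print0 / -0 matter.

# Newline-separated output: xargs splits on any whitespace, so a file
# name containing a space is handed over as two separate arguments.
count_args_plain() {
    find "$1" -type f -print | xargs sh -c 'echo $#' sh
}

# NUL-separated output: each file name arrives as exactly one argument.
count_args_nul() {
    find "$1" -type f -print0 | xargs -0 sh -c 'echo $#' sh
}
```

In a directory holding a single file called "two words.iso", the plain
variant reports one argument more than the NUL variant, and md5sum would
then be asked to hash files that do not exist.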

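#!/bin/bash
# find-duplicate -- released GPL
# Lists groups of files with identical md5sums (GNU find, xargs, uniq assumed).
SIZER=' -size +10240k'    # only look at files larger than 10240 KiB
#SIZER=""                 # uncomment to look at files of every size
DIRLIST=". "              # directories to scan
# $DIRLIST and $SIZER are deliberately unquoted so they split into words.
# -print0 / -0 keep odd file names intact; -r skips md5sum if nothing matched.
# grep drops the md5sum of an empty file and any path containing LemonGrassWigs.
find $DIRLIST -type f $SIZER -print0 | xargs -0 -r md5sum |
	grep -E -v "d41d8cd98f00b204e9800998ecf8427e|LemonGrassWigs" |
	sort > /tmp/looking4duplicates
# Compare only the 32-character hash and show every member of each group.
uniq --check-chars=32 --all-repeated=prepend /tmp/looking4duplicates | less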
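
Ashley's suggestion (match on size first, then on the first KB, and only
then on the full md5) could look roughly like the sketch below.  It is
untuned and rests on a few assumptions: GNU find and coreutils, file
names without newlines or tabs, and the made-up function names
repeated_keys and find_dupes.  It skips the final byte-by-byte compare
that fdupes does and skips empty files.

```shell
#!/bin/bash
# Sketch only: size -> md5 of first 1024 bytes -> full md5sum.
# Assumes GNU find/coreutils and file names without newlines or tabs.

# Pass through only the lines whose first tab-separated field (the key)
# occurs more than once in the input.
repeated_keys() {
    awk -F'\t' '{ n[$1]++; line[NR] = $0; key[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (n[key[i]] > 1) print line[i] }'
}

# find_dupes DIR...: print "md5sum  path" for every duplicated file,
# with a blank line between groups.
find_dupes() {
    # Stage 1: key on size in bytes; a unique size cannot be a duplicate.
    find "$@" -type f -size +0 -printf '%s\t%p\n' |
    repeated_keys |
    # Stage 2: key on size plus the md5 of the first 1024 bytes.
    while IFS=$'\t' read -r size path; do
        head_sum=$(head -c 1024 -- "$path" | md5sum | cut -c1-32)
        printf '%s-%s\t%s\n' "$size" "$head_sum" "$path"
    done |
    repeated_keys |
    # Stage 3: full md5sum, but only for the survivors.
    cut -f2- |
    tr '\n' '\0' |
    xargs -0 -r md5sum |
    sort |
    uniq --check-chars=32 --all-repeated=separate
}
```

Usage would be something like "find_dupes . | less".  Whether the extra
stage pays off depends on how many same-sized files differ early on; for
a pile of identical .iso images every file still gets read in full.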

-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?



