[Beowulf] dedupe filesystem

Joe Landman landman at scalableinformatics.com
Wed Jun 3 06:18:19 PDT 2009


It might be worth noting that dedup is not intended for high performance
file systems ... the cost of computing the hashes is huge.  Dedup is
used *primarily* to prevent filling up expensive file systems of
limited size (e.g. SAN units with "fast" disks).  For this crowd,
20-30TB is a huge system, and very expensive.  Dedup (in theory) gives
these file systems greater effective storage density, and also allows
for faster DR, faster backups, and so on ... assuming that Dedup is
meaningful for the files stored.

It's fine for slower directories, but Dedup usually involves a hardware
or software layer which isn't cheap.

Arguably, Dedup is more of a tactical effort on the part of the big
storage vendors to reduce the outflow of their customers to less
expensive storage modalities and products.  It works well in some
specific cases (data with lots of replication) and poorly in many
others; think of trying to zip up a binary file with very little in the
way of repeating patterns.  Dedup is roughly akin to RLE, but with a
shared database of blocks, using hash keys to represent specific
blocks.  If your data has lots of identical blocks, Dedup can save you
lots of space: point each such block at the dictionary with its hash
key, and when you read that block, pull it back out of the dictionary.
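
To make that concrete, here is a toy sketch (mine, not any particular
vendor's implementation) of the idea: fixed-size blocks, SHA-256 hash
keys, and an in-memory Python dict standing in for the shared block
store.  A real product also has to worry about hash collisions,
persistence, and locking.

import hashlib

BLOCK_SIZE = 4096
block_store = {}   # hash key -> block bytes ("the dictionary")

def dedup_write(data):
    # Return the list of hash keys for this data; store only new blocks.
    keys = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()  # the hash cost, paid on every write
        block_store.setdefault(key, block)       # identical blocks stored once
        keys.append(key)
    return keys

def dedup_read(keys):
    # Rebuild the data by pulling each block out of the dictionary.
    return b"".join(block_store[key] for key in keys)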

This is how many of the backup folks get their claimed 99% compression BTW.

They don't get that in general for a random collection of different
files.  They get it for data sets with the kind of redundancy Dedup can
actually exploit.

Another technique is storing the original plus forward diffs, or the
current version plus backward diffs.  So if a block differs from one
you already hold in only a couple of bytes, store a pointer to the
original and a small diff instead of a second full copy.
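
A rough sketch of that idea, again hypothetical and assuming
equal-length blocks: record only the byte positions that changed, and
rebuild the block from the original plus the diff.

def make_diff(original, modified):
    # Record just the byte positions that changed (assumes equal length).
    return [(i, b) for i, (a, b) in enumerate(zip(original, modified)) if a != b]

def apply_diff(original, diff):
    # Reconstruct the modified block from the original plus the diff.
    block = bytearray(original)
    for i, b in diff:
        block[i] = b
    return bytes(block)

base    = b"the quick brown fox jumps over the lazy dog"
variant = b"the quick brown fox jumps over the lazy cat"
diff = make_diff(base, variant)            # three changed bytes
assert apply_diff(base, diff) == variant   # tiny diff instead of a full copy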

The problem with this (TANSTAAFL) is that your dictionary (hash->block 
lookup) becomes your bottleneck (probably want a real database for 
this), and that this can fail spectacularly in the face of a) high data 
rates, b) minimal file similarity, c) many small operations on files.

If you are going to have Dedup anywhere, your backup tier is a good
place for it.

Just my $0.02.

Joe

Michael Di Domenico wrote:
> On Tue, Jun 2, 2009 at 1:39 PM, Ashley Pittman <ashley at pittman.co.uk> wrote:
>> I'm not sure I understand the question, if it's a case of looking for
>> duplicate files on a filesystem I use fdupes
>>
>> http://premium.caribe.net/~adrian2/fdupes.html
> 
> Fdupes is indeed the type of app I was looking for.  I did run into
> one catch with it though, on first run it trounced down into a NetApp
> snapshot directory.  Dupes galore...
> 
> It would be nice if it kept a log too, so that if the files are the
> same on a second go around it didn't have to md5 every file all over
> again.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
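
Re: Michael's note above about not re-md5'ing every file on a second
pass: an untested sketch of one way a tool could do it is to cache each
checksum keyed on the file's size and mtime, and only re-hash when the
stat info changes (md5cache.json is just a made-up name for the cache
file):

import hashlib, json, os

CACHE_FILE = "md5cache.json"   # hypothetical cache location

def load_cache():
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def md5_cached(path, cache):
    st = os.stat(path)
    stamp = [st.st_size, st.st_mtime]
    entry = cache.get(path)
    if entry and entry[0] == stamp:
        return entry[1]                     # unchanged since last run
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    cache[path] = [stamp, h.hexdigest()]
    return cache[path][1]

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)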


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


