[Beowulf] Re: dedupe Filesystem
Lawrence Stewart
stewart at serissa.com
Wed Jun 3 05:24:19 PDT 2009
I know a little bit about this from a time before SiCortex.
The big push for deduplication came from disk-to-disk backup
companies. As you can imagine, there is a huge advantage for
deduplication if the problem you are trying to solve is backing up a
thousand desktops.
For that purpose, whole file duplicate detection works great.
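Whole-file detection is basically "hash every file, group the
matches". A minimal sketch (the paths, hash choice, and directory
walk are only for illustration):

    #!/usr/bin/env python3
    # Toy whole-file duplicate finder: hash file contents and group
    # paths whose digests collide.
    import hashlib
    import os
    import sys
    from collections import defaultdict

    def file_digest(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                buf = f.read(chunk)
                if not buf:
                    break
                h.update(buf)
        return h.hexdigest()

    groups = defaultdict(list)
    for root, _, names in os.walk(sys.argv[1]):
        for name in names:
            path = os.path.join(root, name)
            groups[file_digest(path)].append(path)

    for digest, paths in groups.items():
        if len(paths) > 1:
            # a backup system would store one copy and point the
            # rest at it
            print(digest, *paths)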
The next big problem is handling incremental backups. Making them run
fast is important. And some applications, um, Outlook, have huge
files (PST files) that change in minor ways every time you touch them.
The big win here is the ability to detect and handle duplication at
the block or sub-block level. This can have enormous performance
advantages for incremental backups of those 1000 huge PST files.
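To see why plain fixed-size blocks don't get you there, hash a file
in fixed 4 KB chunks and then insert one byte near the front: every
block boundary after the insert shifts, so nearly all the block
hashes change even though almost none of the data did. A toy sketch
(block size and test data are arbitrary):

    import hashlib
    import random

    def block_hashes(data, block=4096):
        # Hash fixed-offset, fixed-size blocks of a byte string.
        return [hashlib.sha256(data[i:i + block]).hexdigest()
                for i in range(0, len(data), block)]

    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(40960))
    edited = original[:100] + b"\x00" + original[100:]  # one-byte insert

    a, b = set(block_hashes(original)), block_hashes(edited)
    print(sum(h in a for h in b), "of", len(b), "block hashes survive")
    # prints "0 of 11 ..." -- one inserted byte invalidated everything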
The technology for detecting sub-block level duplication is called
"shingling" or "rolling hashes". The rolling hash goes back to Rabin's
fingerprinting scheme (big surprise I guess!), and the shingling work
was done by Mark Manasse, Andrei Broder, and colleagues at DEC. It is
wicked clever stuff.
The same schemes are used now for finding plagiarism among pages on
the internet.
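The rolling-hash trick, roughly: keep a hash of the last few dozen
bytes as you slide through the file, and declare a chunk boundary
wherever the low bits of that hash hit a fixed pattern. Boundaries
then depend on content rather than offsets, so an insertion only
disturbs the chunk it lands in and everything downstream
re-synchronizes. Here is a minimal content-defined-chunking sketch;
the window size, mask, chunk limits, and polynomial are made-up
illustration values, not anyone's production parameters:

    import hashlib
    import random

    WINDOW = 48              # bytes in the sliding window
    MASK = 0x1FFF            # ~one boundary per 8 KB on random data
    PRIME, MOD = 1000003, (1 << 61) - 1
    MIN_CHUNK, MAX_CHUNK = 2048, 32768

    def chunks(data):
        # Cut wherever the rolling hash of the last WINDOW bytes has
        # its low bits all zero (with min/max chunk-size guards).
        pow_out = pow(PRIME, WINDOW - 1, MOD)
        h, start, out = 0, 0, []
        for i, byte in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * pow_out) % MOD
            h = (h * PRIME + byte) % MOD
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                out.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            out.append(data[start:])
        return out

    def chunk_hashes(data):
        return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(200000))
    edited = original[:1000] + b"\x00" + original[1000:]  # one-byte insert

    a, b = chunk_hashes(original), chunk_hashes(edited)
    print(len(a & b), "of", len(b), "chunk hashes survive")
    # almost every chunk survives; essentially only the chunk
    # containing the insert changes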
I probably don't need to remind anyone here that deduplication on a
live filesystem (as opposed to backups) can have really bad
performance effects. Imagine if you have to move the disk arms around
for every block of every file. Modern filesystems do well at keeping
files contiguous and often keep all the files of a directory nearby.
That locality gets trashed by deduplication. This
won't matter if the problem is making backups smaller or making
incrementals run faster, but it is not good for the performance of a
live filesystem.
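A back-of-the-envelope calculation shows how bad that gets. These
figures are assumptions for a single ordinary 7200 rpm drive (8 ms
per random access, 100 MB/s streaming), not measurements:

    # Rough, assumed disk figures -- illustration only.
    seek_ms = 8.0            # average seek + rotational latency
    stream_mb_s = 100.0      # sustained sequential read rate
    file_mb = 1024           # read back one 1 GB file
    block_kb = 4             # dedupe granularity

    sequential_s = file_mb / stream_mb_s
    blocks = file_mb * 1024 // block_kb
    random_s = blocks * seek_ms / 1000.0

    print("contiguous read:         %5.0f s" % sequential_s)  # ~10 s
    print("one seek per 4 KB block: %5.0f s" % random_s)      # ~35 minutes

Even if only a fraction of the blocks end up needing a seek, the gap
is a couple of orders of magnitude.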
-Larry/thinking about what to do next