[Beowulf] dedupe filesystem

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Wed Jun 3 08:37:01 PDT 2009


On Wed, 3 Jun 2009, Joe Landman wrote:

> It might be worth noting that dedup is not intended for high 
> performance file systems ... the cost of computing the hash(es) 
> is(are) huge.

Some file systems do (or claim to do) checksumming for data integrity 
purposes; this seems to me like the perfect place to add the 
computation of a hash - with the data already in cache (needed for the 
checksumming anyway), the computation should be fast. This would allow 
runtime detection of duplicates, but would make detecting duplicates 
between file systems, or for backup, more cumbersome, as the hashes 
would need to be exported somehow from the file system.
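A minimal sketch of what I mean, in Python rather than file system 
code; the 4 KB block size, the choice of SHA-256, and the in-memory 
dict standing in for an on-disk index are all illustrative assumptions 
of mine, not any particular file system's design:

  import hashlib

  BLOCK_SIZE = 4096   # illustrative block size
  index = {}          # digest -> block location; stand-in for an on-disk index

  def write_block(data: bytes, location: int) -> bytes:
      # One hash over the cached data serves both purposes:
      # the integrity checksum stored with the block, and the dedup key.
      digest = hashlib.sha256(data).digest()
      if digest in index:
          # Duplicate content: a real FS would reference index[digest]
          # instead of writing the block again.
          return digest
      index[digest] = location
      return digest

The point being that the hash is computed exactly once, while the data 
is hot in cache, instead of a separate dedup pass re-reading 
everything from disk.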

One issue that was not mentioned yet is the strength/length of the 
hash. Within one file system, the limits on the number of blocks, 
files, file sizes, etc. are known, so the hash length can be chosen 
such that the probability of a collision is negligible. Across an 
arbitrarily large number of blocks/files, as can exist on a machine or 
network with many large devices or file systems, the same guarantee no 
longer holds.
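To put rough numbers on that (my own back-of-the-envelope arithmetic, 
not from the earlier posts): by the birthday bound, the chance of any 
collision among N random b-bit hashes is roughly N^2 / 2^(b+1), so 
pooling more blocks under one index degrades the margin 
quadratically:

  def collision_prob(n_blocks: int, hash_bits: int) -> float:
      # Birthday-bound approximation: P(collision) ~= n^2 / 2^(bits+1)
      return n_blocks ** 2 / 2 ** (hash_bits + 1)

  # One 16 TB file system with 4 KB blocks: 2^32 blocks.
  print(collision_prob(2 ** 32, 256))   # ~8e-59, negligible
  # A whole network pooling 2^48 blocks into a single dedup index:
  print(collision_prob(2 ** 48, 256))   # ~3e-49, ten orders of magnitude worse
  print(collision_prob(2 ** 48, 128))   # ~1e-10 with a shorter 128-bit hash

With a hash as long as SHA-256 the probability stays negligible in 
practice, but the point stands: the bound one can state depends on the 
population size, which is no longer known once hashes cross file 
system boundaries.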

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de


