[Beowulf] dedupe filesystem
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Wed Jun 3 08:37:01 PDT 2009
On Wed, 3 Jun 2009, Joe Landman wrote:
> It might be worth noting that dedup is not intended for high
> performance file systems ... the cost of computing the hash(es)
> is(are) huge.
Some file systems do (or claim to do) checksumming for data integrity
purposes; this seems to me like the perfect place to add the
computation of a hash - with the data already in cache (needed for
checksumming anyway), the computation should be fast. This would allow
runtime detection of duplicates, but would make detection of
duplicates between file systems or for backup more cumbersome, as the
hashes would need to be exported somehow from the file system.
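As a rough sketch of what I mean (hypothetical code, not any real
file system - the block size, block store and index below are all
made up):

    import hashlib

    BLOCK_SIZE = 4096   # assumed block size
    blocks = {}         # block_id -> raw data (stand-in for the block store)
    dedup_index = {}    # digest -> block_id of the first copy seen

    def write_block(block_id, data):
        # Digest computed while the data is still in cache, as it
        # would be for the integrity checksum anyway.
        digest = hashlib.sha256(data).digest()
        existing = dedup_index.get(digest)
        if existing is not None and blocks[existing] == data:
            # Duplicate found at write time: reference the existing
            # block instead of storing a second copy. The byte-wise
            # compare guards against a hash collision.
            return existing
        dedup_index[digest] = block_id
        blocks[block_id] = data
        return block_id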
One issue that was not mentioned yet is the strength/length of the
hash: within one file system, the limits on the number of blocks,
files, file sizes, etc. are known, and the hash can be chosen long
enough that collisions are practically impossible. With an
arbitrarily large number of blocks/files, as can be found on a
machine or network with many large devices or file systems, the same
guarantee no longer holds.
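A quick birthday-bound estimate shows the effect; the hash length and
pool sizes below are only illustrative:

    # Approximate collision probability for n blocks under a b-bit
    # hash: n*(n-1)/2 / 2^b (birthday bound).
    def collision_prob(n_blocks, hash_bits):
        return n_blocks * (n_blocks - 1) / 2 / 2.0**hash_bits

    # One 16 TB file system with 4 KB blocks: ~2^32 blocks.
    print(collision_prob(2**32, 128))  # ~3e-20, negligible
    # A network-wide pool a million times larger (~16 EB):
    print(collision_prob(2**52, 128))  # ~3e-8, no longer "impossible"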
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de