[Beowulf] dedupe filesystem

Greg Lindahl lindahl at pbm.com
Mon Jun 8 13:52:48 PDT 2009


>> It might be worth noting that dedup is not intended for high  
>> performance file systems ... the cost of computing the hash(es)  
>> is(are) huge.
>
> Some file systems do (or claim to do) checksumming for data integrity  
> purposes, this seems to me like the perfect place to add the computation 
> of a hash - with data in cache (needed for checksumming anyway), the 
> computation should be fast.

Filesystems may call it a "checksum" but it's usually a hash. We use a
Jenkins hash, which is fast and a lot better than, say, the TCP
checksum. But it's a lot weaker than an expensive hash.

If your dedup falls back to byte-by-byte comparisons, a weak hash may
well be good enough.
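A minimal sketch of that scheme, with `zlib.crc32` standing in for a fast
weak hash (the names and structure here are illustrative, not any real
filesystem's code): blocks whose weak hashes collide are compared
byte-by-byte before being treated as duplicates, so hash collisions can
never cause data corruption.

```python
import zlib

def dedupe_blocks(blocks):
    """Deduplicate byte blocks using a fast, weak hash plus a
    byte-by-byte comparison to rule out hash collisions."""
    by_hash = {}    # weak hash -> indices into `unique`
    unique = []     # distinct block contents
    mapping = []    # for each input block, index into `unique`
    for block in blocks:
        h = zlib.crc32(block)
        for idx in by_hash.get(h, []):
            if unique[idx] == block:  # byte-by-byte fallback
                mapping.append(idx)
                break
        else:
            # No match (or a pure hash collision): store as new block.
            by_hash.setdefault(h, []).append(len(unique))
            mapping.append(len(unique))
            unique.append(block)
    return unique, mapping
```

With a strong cryptographic hash (e.g. SHA-256) some systems skip the
byte comparison and trust the hash alone; the fallback is what buys you
the freedom to use a much cheaper hash.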

-- greg
