[Beowulf] dedupe filesystem

Fri Jun 5 12:09:40 PDT 2009

On Jun 5, 2009, at 1:12 PM, Joe Landman wrote:

> Lux, James P wrote:
>
> It only looks at raw blocks.  If they have the same hash signatures  
> (think like MD5 or SHA ... hopefully with fewer collisions), then  
> they are duplicates.
>
>> maybe a better model is a  “data compression” algorithm on the fly.
>
> Yup this is it, but on the fly is the hard part.  Doing this  
> comparison is computationally very expensive.  The hash calculations  
> are not cheap by any measure.  You most decidedly do not wish to do  
> this on the fly ...
>
>> And for that, it’s all about trading between cost of storage space,  
>> retrieval time, and computational effort to run the algorithm.
>
> Exactly.

I think the hash calculations are pretty cheap, actually.  I just
timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per
second on one core (reading from the disk cache).  That is
substantially faster than the disk transfer rate.  If you have a
parallel filesystem, you can parallelize the hashes as well.
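
For anyone who wants to repeat the measurement without sha1sum, here is
a rough sketch using Python's hashlib.  It hashes an in-memory buffer,
so it measures only the hash cost on one core, not disk I/O.  The
function names and the 4 KB block size are assumptions made up for
illustration; they do not come from any real dedupe filesystem.  The
second function shows the block-signature comparison discussed above in
its simplest form.

```python
import hashlib
import time

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def sha1_throughput_mb_per_s(total_bytes=64 * 1024 * 1024,
                             chunk_size=1024 * 1024):
    """Hash total_bytes of in-memory zeros; return megabytes per second."""
    chunk = bytes(chunk_size)
    h = hashlib.sha1()
    start = time.perf_counter()
    for _ in range(total_bytes // chunk_size):
        h.update(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6


def find_duplicate_blocks(data, block_size=BLOCK_SIZE):
    """Return {digest: [offsets]} for every block digest seen more than once."""
    seen = {}
    for offset in range(0, len(data), block_size):
        digest = hashlib.sha1(data[offset:offset + block_size]).digest()
        seen.setdefault(digest, []).append(offset)
    return {d: offs for d, offs in seen.items() if len(offs) > 1}


if __name__ == "__main__":
    print(f"SHA-1: {sha1_throughput_mb_per_s():.0f} MB/s on one core")
```

The number will vary with the CPU and the hash implementation, so treat
it as a sanity check against your own disk transfer rate rather than a
benchmark.  A real implementation would presumably also compare block
contents before merging, since matching digests alone do not rule out a
collision.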

-L