[Beowulf] dedupe filesystem

Lawrence Stewart stewart at serissa.com
Thu Jun 25 11:44:57 PDT 2009


Rahul Nabar wrote:
>
>
> On Tue, Jun 2, 2009 at 12:39 PM, Ashley Pittman <ashley at pittman.co.uk 
> <mailto:ashley at pittman.co.uk>> wrote:
>
>     Fdupes scans the filesystem looking for files where the size
>     matches, if
>     it does it md5's them checking for matches and if that matches it
>     finally does a byte-by-byte compare to be 100% sure.
>
>
> Why is a full byte-by-byte comparison needed even after a md5 sum 
> matches? I know there is a vulnerability in md5 but that's more of a 
> security thing and by random chance super unlikely , right?
>
> Or, why not use another checksum that is as yet not vulnerable? SHA1? 
> SHA2? etc.? Or are they way too expensive to compute?
>
> Just curious....
>
> -- 
> Rahul
> ------------------------------------------------------------------------
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>   
It is a tradeoff between the size of the hash and the amount of storage 
it takes.  Due to the "birthday problem," when you use an N bit hash, 
the probability of an erroneous hash collision rises to about 1/2 when
the number of blocks in the file system rises to 2**(N/2).  If you have 
64 bit hashes and 1024 byte blocks, for example, this happens with about 
4 TB of data, which is not much.  You could choose to use 128 bit 
hashes, and cut down this problem by another factor of 2**32, but it may 
work out cheaper to do the real compare and be sure, one time out of 
2**32 than to double the storage for hashes.

If you <are> going to depend on hash collisions meaning the data is 
identical, then you really need 256 bit hashes, probably, because you 
want the probability of an erroneous collision to be much smaller than 
other sources of data loss, which ought to be down in the 2**-45 area. 
(1e-15 ish).

The security properties of the hashes are not so important, just their 
randomness.

-Larry





More information about the Beowulf mailing list