[Beowulf] dedupe filesystem

Sun Jun 7 21:55:10 PDT 2009

Lawrence Stewart wrote:

[...]

>> Yup this is it, but on the fly is the hard part.  Doing this 
>> comparison is computationally very expensive.  The hash calculations 
>> are not cheap by any measure.  You most decidedly do not wish to do 
>> this on the fly ...

The assumption of a high performance disk/file system is implicit here.

>>
>>> And for that, it’s all about trading between cost of storage space, 
>>> retrieval time, and computational effort to run the algorithm.
>>
>> Exactly.
> 
> 
> I think the hash calculations are pretty cheap, actually.  I just timed 
> sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per second, on 
> one core (from the disk cache).  That is substantially faster than the 
> disk transfer rate.  If you have a parallel filesystem, you can 
> parallize the hashes as well.

Disk transfer rates are 100-120 MB/s these days.  For high performance 
local file systems (N * disks), the data rates you showed for sha1 won't 
cut it.  Especially if they have multiple hash signatures computed in 
order to avoid hash collisions (ala MD5 et al).  The probability of two 
different hashes having two or more unique blocks that have a collision 
in the same manner is somewhat smaller than the probability that a 
single hash function has two (or more) unique blocks that have a 
collision.  So you compute two or more in different ways, to reduce the 
probability of that collision.

For laughs, I just tried this on our lab server.  We have a 500 MB/s 
file system attached to it, so we aren't so concerned about data IO 
bottlenecks.

First run, not in cache:

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat 
CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.02user 12.88system 0:47.22elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
8901280inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0  -

second run, in cache:
landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat 
CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.01user 8.79system 0:37.47elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0  -

The null version of the hash, just writing to /dev/null:

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat 
CentOS-5.3-x86_64-bin-DVD.iso | cat - > /dev/null
0.00user 8.37system 0:10.96elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

So ~11 seconds of the results here are dealing with the pipes and 
internal unix bits.

Removing the pipe

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat 
CentOS-5.3-x86_64-bin-DVD.iso  > /dev/null
0.00user 4.95system 0:04.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

so its about 6 seconds cost to use the pipe when data is in cache,

and the raw filesystem w/o cache

landman at crunch:/data/big/isos/centos/centos5$ dd 
if=CentOS-5.3-x86_64-bin-DVD.iso of=/dev/null iflag=direct bs=32M
135+1 records in
135+1 records out
4557455360 bytes (4.6 GB) copied, 8.72511 s, 522 MB/s

Where I am getting at with this, is every added layer costs you time, 
whether or not the data is in cache or on disk.  Anything getting in 
path of moving data to disk, every pipe, any hash/checksum you compute 
is going to cost you.  The more hashes/checksums you compute, the more 
time you spend computing.  And the less time moving data (for a fixed 
window of time).  Which lowers your effective bandwidth.

Of course, that isn't the only thing you have to worry about. You have 
to do a hash table lookup as well for dedup.  While hash table lookups 
are pretty efficient, you may need to queue up a hash table insertion if 
the block wasn't found.  And you will need lots of ram to hold these 
hash tables.  Parallel hash table insertion is possible.  Parallel 
lookup is possible.  Get N machines operating on one query should be N 
times faster if you can partition your table N ways.  Get M simultaneous 
queries going, with table inserts, etc. ... well, maybe not so much fast.

This is part of the reason that Dedup is used in serial backup pathways, 
usually with dedicated controller boxes.  You can leverage software to 
do this, but you are going to allocate lots of resources to get the hash 
table computation, lookup and table management going fast.

My point was that its hard to do this on the fly in general.  You are 
right that some aspects can be easilydone in parallel.  But there are 
elements that can't be, and the hash computations are still expensive on 
modern processors.

Haven't tried it on the Nehalem yet, but I don't expect much difference.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615