[Beowulf] dedupe filesystem
Joe Landman
landman at scalableinformatics.com
Sun Jun 7 21:55:10 PDT 2009
Lawrence Stewart wrote:
[...]
>> Yup this is it, but on the fly is the hard part. Doing this
>> comparison is computationally very expensive. The hash calculations
>> are not cheap by any measure. You most decidedly do not wish to do
>> this on the fly ...
The assumption of a high-performance disk/file system is implicit here.
>>
>>> And for that, it’s all about trading between cost of storage space,
>>> retrieval time, and computational effort to run the algorithm.
>>
>> Exactly.
>
>
> I think the hash calculations are pretty cheap, actually. I just timed
> sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per second, on
> one core (from the disk cache). That is substantially faster than the
> disk transfer rate. If you have a parallel filesystem, you can
> parallelize the hashes as well.
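That single-stream number is easy enough to check without touching the
disk at all, by hashing a RAM-resident buffer. A rough sketch in Python
using hashlib (the buffer and chunk sizes here are arbitrary choices):

import hashlib, time

# Hash a buffer that is already in memory, so the disk never enters the
# picture; reports single-core SHA-1 throughput.
buf = b"\0" * (256 * 1024 * 1024)    # 256 MB of zeros, in RAM
chunk = 1024 * 1024                  # feed the hash 1 MB at a time

h = hashlib.sha1()
t0 = time.time()
for off in range(0, len(buf), chunk):
    h.update(buf[off:off + chunk])
elapsed = time.time() - t0
print("sha1: %.1f MB/s" % (len(buf) / (1024.0 * 1024.0) / elapsed))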
Disk transfer rates are 100-120 MB/s these days. For high-performance
local file systems (N * disks), the data rates you showed for sha1 won't
cut it, especially if multiple hash signatures are computed in order to
avoid hash collisions (a la MD5 et al.). The probability that two
different hash functions both collide on the same pair of unique blocks
is considerably smaller than the probability that a single hash function
collides on some pair of unique blocks. So you compute two or more
digests in different ways, to reduce the probability of that collision.
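To make that concrete, here is a minimal sketch (Python; the 128 KB
block size and the sha1/md5 pair are arbitrary choices, not a
recommendation) of computing two digests per block and using the pair
as the dedup key, so two different blocks would have to collide in both
hashes at once to be mistaken for duplicates:

import hashlib

BLOCK = 128 * 1024  # arbitrary block size for this sketch

def block_keys(path):
    # Yield (offset, (sha1_digest, md5_digest)) for each block of a file.
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            key = (hashlib.sha1(data).digest(), hashlib.md5(data).digest())
            yield offset, key
            offset += len(data)

Every block now costs two digest passes instead of one, which is
exactly the extra compute being traded for a lower collision risk.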
For laughs, I just tried this on our lab server. We have a 500 MB/s
file system attached to it, so we aren't so concerned about data IO
bottlenecks.
First run, not in cache:
landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat
CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.02user 12.88system 0:47.22elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
8901280inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0 -
Second run, in cache:
landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat
CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.01user 8.79system 0:37.47elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0 -
The null version of the hash, just writing to /dev/null:
landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat
CentOS-5.3-x86_64-bin-DVD.iso | cat - > /dev/null
0.00user 8.37system 0:10.96elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps
So ~11 seconds of the result here is spent dealing with the pipes and
internal Unix bits.
Removing the pipe:
landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat
CentOS-5.3-x86_64-bin-DVD.iso > /dev/null
0.00user 4.95system 0:04.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps
So it's about a 6 second cost to use the pipe when the data is in cache.
And the raw file system without cache:
landman at crunch:/data/big/isos/centos/centos5$ dd
if=CentOS-5.3-x86_64-bin-DVD.iso of=/dev/null iflag=direct bs=32M
135+1 records in
135+1 records out
4557455360 bytes (4.6 GB) copied, 8.72511 s, 522 MB/s
What I am getting at is that every added layer costs you time, whether
the data is in cache or on disk. Anything in the path of moving data to
disk (every pipe, every hash/checksum you compute) is going to cost you.
The more hashes/checksums you compute, the more time you spend computing
and the less time you spend moving data (for a fixed window of time),
which lowers your effective bandwidth.
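To put numbers on that from the runs above: 4,557,455,360 bytes through
sha1sum took 37.47 s from cache and 47.22 s from disk, or roughly 120
MB/s and 96 MB/s effective, versus ~522 MB/s for the raw read. On this
box the hash, not the disk, is the ceiling.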
Of course, that isn't the only thing you have to worry about. You have
to do a hash table lookup as well for dedup. While hash table lookups
are pretty efficient, you may need to queue up a hash table insertion if
the block wasn't found, and you will need lots of RAM to hold these hash
tables. Parallel hash table insertion is possible, as is parallel
lookup: getting N machines operating on one query should be roughly N
times faster if you can partition your table N ways. Get M simultaneous
queries going, with table inserts mixed in, etc. ... well, maybe not so
fast.
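As a toy illustration of the lookup/insert/partition piece (Python, with
in-memory dicts standing in for the real table; the partition count is
an arbitrary choice for this sketch):

from collections import deque

N_PARTITIONS = 4  # arbitrary for this sketch

class DedupIndex(object):
    # Toy dedup index: digest -> block location, split N ways by a byte
    # of the digest, with misses queued as deferred insertions.

    def __init__(self):
        self.partitions = [dict() for _ in range(N_PARTITIONS)]
        self.pending_inserts = deque()

    def _partition(self, digest):
        # route on the first byte of the digest
        return self.partitions[ord(digest[:1]) % N_PARTITIONS]

    def lookup(self, digest):
        # returns the stored block location, or None if unseen
        return self._partition(digest).get(digest)

    def queue_insert(self, digest, location):
        # defer the insertion so the write path isn't blocked on it
        self.pending_inserts.append((digest, location))

    def flush_inserts(self):
        while self.pending_inserts:
            digest, location = self.pending_inserts.popleft()
            self._partition(digest).setdefault(digest, location)

The single-node version is trivial; the pain starts when many writers
hit the same partitions and the insert queue at once.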
This is part of the reason that dedup is used in serial backup pathways,
usually with dedicated controller boxes. You can leverage software to
do this, but you are going to allocate lots of resources to get the hash
computation, lookup, and table management going fast.
My point was that it's hard to do this on the fly in general. You are
right that some aspects can easily be done in parallel. But there are
elements that can't be, and the hash computations are still expensive on
modern processors.
Haven't tried it on the Nehalem yet, but I don't expect much difference.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615