[Beowulf] dedupe filesystem

Lux, James P james.p.lux at jpl.nasa.gov
Fri Jun 5 09:55:35 PDT 2009


Isn't de-dupe just another flavor, conceptually, of a journaling file system? In many systems only a small part of a file actually changes each time, so saving "diffs" lets you reconstruct any arbitrary version in much less space (see the sketch below).
I guess de-dupe is a bit more aggressive than that, in that it can theoretically look for common "stuff" between unrelated files, so maybe a better model is a "data compression" algorithm running on the fly. And for that, it's all about trading off the cost of storage space, retrieval time, and the computational effort to run the algorithm. (Reliability factors into it a bit: compression removes redundancy, after all, but the de facto redundancy provided by keeping previous versions around isn't a good "system" solution, even if it's the one people actually use.)
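A minimal sketch of the diff-as-version-storage idea, using Python's difflib purely for illustration (no real journaling file system works exactly this way): keep one delta instead of two full copies, and reconstruct either version from it.

    import difflib

    v1 = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
    v2 = ["alpha\n", "beta\n", "GAMMA\n", "delta\n"]  # one line changed

    # Keep the delta instead of a second full copy.
    delta = list(difflib.ndiff(v1, v2))

    # Either version can be reconstructed from the delta alone.
    assert list(difflib.restore(delta, 1)) == v1
    assert list(difflib.restore(delta, 2)) == v2

Since common lines appear in the delta only once, the saving grows with how little actually changed between versions.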

I think one can make the argument that computation is always getting cheaper at a faster rate than storage density or speed (because of the physics limits on the storage...), so the "span" over which you can do compression can be arbitrarily increased over time. TIFF and fax do compression over a few bits. ZIP and its ilk do compression over kilobits or megabits (depending on whether they build a custom symbol table). Dedupe is presumably doing compression over gigabits and terabits, although I assume there's a granularity at some point: a dedupe system looks at symbols that are, say, 512 bytes long, as opposed to ZIP looking at 8-bit symbols, or Group 4 fax looking at 1-bit symbols.
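In that spirit, here is a toy block-level dedupe (the 512-byte block size and SHA-256 content hashing are my assumptions, not any particular product's design): hash fixed-size blocks and store each unique block exactly once.

    import hashlib

    BLOCK_SIZE = 512  # assumed granularity; real systems vary

    def dedupe(data: bytes):
        """Split data into fixed-size blocks; keep each unique block once."""
        store = {}   # block hash -> block contents
        recipe = []  # ordered hashes needed to rebuild the original
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            h = hashlib.sha256(block).hexdigest()
            store.setdefault(h, block)  # duplicates are stored only once
            recipe.append(h)
        return store, recipe

    def rebuild(store, recipe) -> bytes:
        return b"".join(store[h] for h in recipe)

    data = b"A" * 2048 + b"B" * 512  # five blocks, only two unique
    store, recipe = dedupe(data)
    assert rebuild(store, recipe) == data
    assert len(store) == 2

The trade-off mentioned above shows up directly: the store shrinks, but every read now costs a hash lookup per block.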

Hierarchical storage is really optimizing along a different axis than compression. It's more like a cache than compression: make the "average time to get to the next bit you need" smaller, rather than "make a smaller number of bits".

Granted, for a lot of systems, "time to get a bit" is proportional to "number of bits".
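A toy sketch of that cache-like axis (hypothetical, not any real HSM product): a small fast tier in front of a large slow tier, with blocks promoted on access and the least-recently-used block demoted when the fast tier fills.

    from collections import OrderedDict

    class TwoTierStore:
        """Toy hierarchical store: small fast tier over a big slow tier."""

        def __init__(self, fast_capacity=4):
            self.fast = OrderedDict()  # LRU-ordered fast tier (e.g. disk)
            self.slow = {}             # unbounded slow tier (e.g. tape)
            self.fast_capacity = fast_capacity

        def put(self, key, value):
            self.slow[key] = value

        def get(self, key):
            if key in self.fast:               # fast hit
                self.fast.move_to_end(key)
                return self.fast[key]
            value = self.slow[key]             # slow fetch...
            self.fast[key] = value             # ...then promote
            if len(self.fast) > self.fast_capacity:
                self.fast.popitem(last=False)  # demote the "dormant" block
            return value

Where the dedupe sketch shrinks the number of bits, this one only shrinks the average time to reach them; "dormant" here is simply "least recently used".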

On 6/5/09 8:00 AM, "Joe Landman" <landman at scalableinformatics.com> wrote:

John Hearns wrote:
> 2009/6/5 Mark Hahn <hahn at mcmaster.ca>:
>> I'm not sure - is there some clear indication that one level of storage is
>> not good enough?

I hope I pointed this out before, but dedup is all about reducing the
need for the less expensive 'tier'.  Tiered storage has some merits,
especially in the 'infinite size' storage realm.  Take some things
offline, leave the things you need online until they go dormant.  Define
dormant on your own terms.
