[Beowulf] dedupe filesystem

Fri Jun 5 10:12:39 PDT 2009

Lux, James P wrote:
> Isn’t de-dupe just another flavor, conceptually, of a journaling file 
> system..in the sense that in many systems, only a small part of the file 
> actually changes each time, so saving “diffs” allows one to reconstruct 
> any arbitrary version with much smaller file space.

Its really more conceptually like RLE (run length encoding) or simple 
compression where you start with a pattern and a dictionary, and point 
out where in the file that pattern repeats.

> I guess the de-dupe is a bit more aggressive than that, in that it 
> theoretically can look for common “stuff” between unrelated files, so 

It only looks at raw blocks.  If they have the same hash signatures 
(think like MD5 or SHA ... hopefully with fewer collisions), then they 
are duplicates.

>  maybe a better model is a  “data compression” algorithm on the fly. 

Yup this is it, but on the fly is the hard part.  Doing this comparison 
is computationally very expensive.  The hash calculations are not cheap 
by any measure.  You most decidedly do not wish to do this on the fly ...

>  And for that, it’s all about trading between cost of storage space, 
> retrieval time, and computational effort to run the algorithm. 

Exactly.

>  (Reliability factors into it a bit.. Compression removes redundancy, 
> after all, but the defacto redundancy provided by having previous 
> versions around isn’t a good “system” solution, even if it’s the one 
> people use)

:)

You get a direct CBA comparison between buying the N+1th disk, and the 
time/effort/money to perform this computation.  In the end, the former wins.

> I think one can make the argument that computation is always getting 
> cheaper, at a faster rate than storage density or speed (because of the 
> physics limits on the storage...), so the “span” over which you can do 
> compression can be arbitrarily increased over time. TIFF and FAX do 
> compression over a few bits. Zip and it’s ilk do compression over 
> kilobits or megabits (depending on whether they build a custom symbol 
> table).  Dedupe is doing compression over Gigabits and terabits, 
> presumably (although I assume that there’s a granularity at some point.. 
> A dedupe system looks at symbols that are, say, 512 bytes long, as 
> opposed to ZIP looking at 8bit symbols, or Group4 Fax looking at 1 bit 
> symbols.

Most Dedup are over blocks, and I think most are doing 512 bytes or 4k 
bytes.

The point is that even if theoretically computation is getting cheaper, 
hash computations (the ones without collisions, as collisional hashes 
are ... um ... not good for Dedup), the calculation of the hash is still 
a significant bit of computation.

One well suited for an accelerator.  Which is why the Dedup market seems 
to be "flooded" with accelerators (which I think are little more than 
FPGAs implementing some hash computation algorithm)

> The hierarchical storage is really optimizing along a different axis 
> than compression.  It’s more like cache than compression.. Make the 
> “average time to get to the next bit you need” smaller rather than “make 
> smaller number of bits”

Basically yes ... though HSM is all about driving down the cost of the 
large pool as low as possible.  Tape is still used, and lots of people 
make arguments for tape.  But as John pointed out Spectra logic is 
marketing a SATA eating robot, so I think the days of tape are likely 
more numbered than before.

A brief anecdote.  In 1989, a fellow graduate student was leaving for 
another school.  He was taking his data with him.  He spooled up a Vax 
8650 unit with a tape.  I asked him why this over other media.  His 
response was, you can read a Vax tape anywhere.

In 2009, twenty years later, I think he might have a different take on 
this.  I put all my bits onto floppys when I left there, and moved the 
important ones to spinning rust.  I can still read the floppies.  I 
doubt he can still read the tapes.

The point is that tape folks talk about longevity.  But this makes a 
number of important assumptions about the media, the drives, and 
availability of replacement drives, which, as my advisor in graduate 
school discovered after her drive died, are not necessarily correct or 
accurate.

> Granted, for a lot of systems, “time to get a bit” is proportional to 
> “number of bits”

Yup.  But that initial latency can be huge.

While the cost of computation is decreasing rapidly, I'll argue that the 
cost of storage is decreasing as fast if not faster.  This has 
implications for which mode is preferable ... n-plication onto 
decreasing cost media, or computation to minimize the cost of the ... 
cheap media footprint.  The CBA doesnt favor Dedup in the long term, and 
though does favor HSM ... even cloud storage.

The issues there are bandwidth, bandwidth, and, you guessed it, bandwidth.

> 
> On 6/5/09 8:00 AM, "Joe Landman" <landman at scalableinformatics.com> wrote:
> 
>     John Hearns wrote:
>     >  2009/6/5 Mark Hahn <hahn at mcmaster.ca>:
>     > > I'm not sure - is there some clear indication that one level of
>     storage is
>     > > not good enough?
> 
>     I hope I pointed this out before, but Dedup is all about reducing the
>     need for the less expensive 'tier'.  Tiered storage has some merits,
>     especially in the 'infinite size' storage realm.  Take some things
>     offline, leave things you need online until they go dormant.  Define
>     dormant on your own terms.
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615