[Beowulf] dedupe filesystem

Joe Landman landman at scalableinformatics.com
Fri Jun 5 08:00:09 PDT 2009

John Hearns wrote:
> 2009/6/5 Mark Hahn <hahn at mcmaster.ca>:
>> I'm not sure - is there some clear indication that one level of storage is
>> not good enough?

I hope I pointed this out before, but Dedup is all about reducing the 
need for the less expensive 'tier'.  Tiered storage has some merits, 
especially in the 'infinite size' storage realm.  Take some things 
offline, leave things you need online until they go dormant.  Define 
dormant on your own terms.
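
As a rough sketch of what "define dormant on your own terms" could look like in practice: the snippet below walks a directory tree and lists files nobody has touched in a given number of days. The function name find_dormant, the max_idle_days parameter, and the one-year default are all assumptions for illustration, not any particular HSM product's policy engine. Access times are unreliable on filesystems mounted with noatime, so it takes the later of atime and mtime.

```python
import os
import time

SECONDS_PER_DAY = 86400

def find_dormant(root, max_idle_days=365, now=None):
    """Return sorted paths under root idle for more than max_idle_days.

    'Idle' here is a hypothetical policy: neither read (atime) nor
    written (mtime) since the cutoff. Adjust to your own definition.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    dormant = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            # atime may be stale on noatime mounts, so also consider mtime.
            last_touched = max(st.st_atime, st.st_mtime)
            if last_touched < cutoff:
                dormant.append(path)
    return sorted(dormant)
```

The resulting list would be the candidates to push offline (or to the cheaper tier); everything else stays online.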

> That is well worthy of a debate.

Tiered storage makes sense in the sense of HSMs.  Not so much otherwise 
(for HPC ... and increasingly for business).

> As the list knows, I am a fan of HSMs - for the very good reason of
> having good experience with them.
> There are still arguments made that 'front line tier = fast
> SCSI/fibrechannel disk' and 'second line and lower tier = SATA', and
> the sales types say SATA is slower and less reliable.

These arguments are still being made by many folks with vested interests 
in the expensive FC solutions.  This is where Dedup plays.  Reduce the 
need for the second tier, and you will get less pressure to drop your 

The added benefit is that backups should take less time, DR can take 
less time.  And fewer of those meddling smaller storage vendors with big 
and honking fast disks need be around their turf ...

> Mark, you make the very good point that the world is changing (or
> indeed has changed) and you should be looking at an infinitely
> expandable disk based setup - just add more disks into the slot, more
> JBODs, whatever.

Yeah, change happens.  Those who resist inevitable change will be on the 
dustheap in short order, if their business model/requirements don't adapt.

The world's largest data repository doesn't do dedup, or use 'tiered' 
storage.  Rather they embrace duplication (n-plication actually), and 
'flat' single tiers.  There is a reason for this, and it is driven by 
cost and performance.

> Actually, as a complete aside here I have been looking at Virtual Tape
> Libraries. One of the Spectralogic models actually eats SATA drives
> just like they are tapes -

I had read about this unit.  Will have to speak with them at some point :)

> I'm now going to counter your argument - let's say we have an
> expensive parallel filestore such as Panasas.  Or maybe Lustre.
> So your researchers work on a new project, and need new storage. But
> they have old projects lying around.
> They argue they might revisit them, they might need this data, someone
> might take on a PhD student to trawl through it,

Yeah ... when I started in my studies, I reworked older calcs, and 
revisited 1-5 year old data.  Having easy access to this data is 
important.  Access should be transparent.  I don't mind waiting a short 
while (seconds) for the initial access, but subsequent access needs to 
be fast.

> or you are in the movie business and your movie has premiered yet
> there is a director's cut scheduled for next year...

That is one of the markets we are looking at.

> OK, so you can add more Panasas. Cue salesman buying in a large bucket
> of glee to rub his hands in.
> I agree my argument holds less water with Lustre.

Look at it this way ... what is the cost/benefit to the movie-company to 
buy/build expensive storage and build tiers, as compared to much less 
expensive replicated/HSMed storage?  I think the writing is clearly on 
the wall on this.  Lots of the folks in this industry will disagree, but 
follow what the customers are actually doing.


ps:  [commercial content] We have a stake in this stuff, so standard 
bias disclaimers apply to my post.  See today's InsideHPC for more ... 
http://insidehpc.com [/commercial content]

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
