[Beowulf] dedupe filesystem

Mark Hahn hahn at mcmaster.ca
Fri Jun 5 06:52:55 PDT 2009


>> have tiered storage today, but in the future i can see a need to have
>> a storage pool with SATA and a storage pool with SAS or faster drives
>> in it.

IMO, this is a dubious assertion.  I bought a couple of incredibly cheap
desktop disks for home use a couple of weeks ago: just Seagate 7200.12's.
these are of the latest 500G/platter generation, so they have high density
and thus high bandwidth:
http://www.sharcnet.ca/~hahn/7200.12.png

sure, your application may require low latency.  but bandwidth is easy.

>> Some of the researchers where I am, work on data for months.

my organization's current policy is to be fairly stingy with /home and /work,
neither of which has any timeout.  /scratch currently has a 1-month timeout,
which unfortunately tends to be too short to encourage use.

>> Is this something better solved with pre/post-amble copies or through
>> policies?

we currently have a periodic crawler that collects data on each filesystem,
hashing each file so that people can't game the timeouts with touch.
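
for illustration only, here is a minimal sketch of how such a crawler might
work.  it is not the actual crawler: the function names (file_digest, crawl,
expired), the choice of SHA-256, and keeping the previous crawl's results in
a plain dict are all assumptions.  the idea is that a file's "last changed"
time only advances when its content hash changes, so touch alone does not
reset the clock.

```python
import hashlib
import os
import time


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def crawl(root, previous, now=None):
    """Walk root and return {path: (digest, last_changed)}.

    previous is the mapping returned by the last crawl (hypothetical
    storage; a real crawler would persist it somewhere).  A file whose
    digest is unchanged keeps its old last_changed time, whatever its
    mtime says.
    """
    now = time.time() if now is None else now
    current = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                digest = file_digest(path)
            except OSError:
                continue  # unreadable, or removed while we were crawling
            old = previous.get(path)
            if old is not None and old[0] == digest:
                current[path] = old  # same content: the clock keeps running
            else:
                current[path] = (digest, now)  # new or modified content
    return current


def expired(current, now, timeout_seconds):
    """Return the sorted paths whose content has not changed within the timeout."""
    return sorted(
        path
        for path, (_digest, changed) in current.items()
        if now - changed > timeout_seconds
    )
```

each periodic run would call crawl() with the previous run's result, then
expired() with the filesystem's timeout (about 30 days for /scratch).  hashing
every file costs a full read of the filesystem per crawl, which is the price
of not trusting mtime.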

> The best of both worlds would certainly be a central, fast storage filesystem,
> coupled with a hierarchical storage management system.

I'm not sure - is there some clear indication that one level of storage is 
not good enough?

> Oh wait, it might exist already... Well, at least it's in the works: Sun and
> CEA are working on implementing such an HSM for Lustre 2.0. See
> http://wiki.lustre.org/images/8/8b/AurelienDegremont.pdf for details.

this seems like a bad design to me.  I would think (and I'm reasonably
familiar with Lustre, though not an internals expert) that if you're going to 
touch Lustre interfaces at all, you should simply add cheaper, higher-density
OSTs, and make more intelligent placement/migration heuristics.  I guess that 
CEA already has a vast investment in some existing HSM, so can't do this.

regards, mark hahn


