[Beowulf] dedupe filesystem

John Hearns hearnsj at googlemail.com
Wed Jun 3 03:10:05 PDT 2009

2009/6/3 Bogdan Costescu <Bogdan.Costescu at iwr.uni-heidelberg.de>:
> > I beg to differ, at least in the academic environment where I come from.
> Imagine these 2 scenarios:
> 1. copying between users
You make good points, and I agree.

> 2. copying between machines
>   Data is stored on a group file server or on the cluster where it
>   was created, but needs to be copied somewhere else for a more
>   efficient (mostly from I/O point of view) analysis. A copy is made,
>   but later on people don't remember why the copy was made and if
>   there was any kind of modification to the data. Sometimes the
>   results of the analysis (which can be very small compared with the
>   actual data) are stored there as well, making the whole set look
>   like a "package" worthy of being stored together, independent of
>   the original data. This "package" can be copied back (so the two
>   copies live in the same file system) or can remain separate (which
>   can make it harder to detect as copies).

This is something we should explore on this list.
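As an aside, hunting down such stray copies after the fact can be scripted. Here is a minimal, hypothetical sketch (the function names and the size-then-hash strategy are mine, not something discussed in the thread) that groups files by size first, then confirms real duplicates with a SHA-256 digest:

```python
# Hypothetical sketch: find duplicate files under a directory tree by
# grouping on file size first (cheap), then confirming candidates with
# a SHA-256 digest (expensive, so only run on same-size files).
import hashlib
import os
from collections import defaultdict

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        for path in paths:
            by_hash[sha256_of(path)].append(path)
    # keep only digests shared by two or more files
    return {h: ps for h, ps in by_hash.items() if len(ps) > 1}
```

Of course this only catches bit-identical copies - the "package" case Bogdan describes, where a copy has small results appended next to it, would still slip through.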

Quite often the architecture of storage is a secondary consideration,
in the rush to get a Shiny New Fast machine on site and working.

In HPC, there are a lot of advantages to a central clustered
filesystem, where you can prepare your input data,
run the simulation, then visualize the results at the end.

I do agree with you that there are situations where you transfer the
data to faster storage before running on it -
I am thinking of one particular case right now!
I also agree with you that you then have the danger of 'squirreling
away' copies on the fast storage, and forgetting why they are there.
The systems administrator must put strong policies in place for this -
data left on the fast storage gets deleted after N weeks.
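Such a purge policy is easy to sketch. The following is a hypothetical example (the function name, the `weeks` parameter, and the `dry_run` safety flag are my own inventions): it walks a scratch tree and lists, or optionally deletes, regular files whose modification time is older than the cutoff.

```python
# Hypothetical sketch of an "N-week" scratch purge: list (and, if
# dry_run is False, delete) regular files under a scratch tree whose
# modification time is older than the cutoff.
import os
import time

def purge_old_files(scratch, weeks=4, dry_run=True):
    cutoff = time.time() - weeks * 7 * 24 * 3600
    removed = []
    for dirpath, _, names in os.walk(scratch):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                removed.append(path)
                if not dry_run:
                    os.remove(path)
    return removed
```

In practice many sites do the same thing with a nightly cron job around find(1); the point is that the policy is announced and enforced mechanically, not left to users' memories.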

> I do mean all these in an HPC environment - the analysis mentioned before can
> involve reading, multiple times, files ranging from tens of GB to TB (for the
> moment...). Even if the analysis itself doesn't run as a parallel job,
> several (many) such jobs can run at the same time looking for different
> parameters. [ the above scenarios actually come from practice - not
> imagination - and are written with molecular dynamics simulations in mind ]
> Also don't forget backup - an HPC resource is usually backed up, to avoid
> loss of data which was obtained with precious CPU time (and maybe an
> expensive interconnect, memory, etc).
> --
> Bogdan Costescu
> IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
> Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
> E-mail: bogdan.costescu at iwr.uni-heidelberg.de