[Beowulf] dedupe filesystem
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Wed Jun 3 20:33:32 PDT 2009
On Wed, 3 Jun 2009, John Hearns wrote:
> Quite often the architecture of storage is a secondary consideration,
> in the rush to get a Shiny New Fast machine on site and working.
Well, I've seen it ignored even outside of that rush, in the design
phase. I confess to being guilty of this myself, but I learn from my
mistakes :-)
> In HPC, there are a lot of advantages in a central clustered
> filesystem, where you can prepare your input data, run the
> simulation, then at the end visualize the data.
In theory this sounds nice, but in practice it can prove to be a bit
more difficult, with the human factor most often being the main culprit
(just like in the scenarios I presented earlier). Administrative
issues (who owns what) can seriously limit the possibility of
coupling the HPC and visualisation resources, often leading to
duplication of data. Stupid sysadmins or policies can leave the HPC
resource with only very basic text editors or terminal settings, so
users create the input set on their own workstations and
constantly copy it over. Only the actual running of the simulation
can be tightly linked to the clustered file system...
> The systems administrator must put in place strong policies on this -
> leave your data on the fast storage, it gets deleted after N weeks.
I see duplication of data in almost all cases as a human behaviour
problem, not a technical one. It needs human behaviour solutions
rather than technical ones, so policies are a good solution. But I would
argue that educating the users is an even better one: teach them
why copying data is bad and give them easy ways of safely sharing
data with their collaborators. Not only will they keep the file
systems emptier, they will also thank you for the reduced
effort needed to manage the ever-increasing amounts of data. Such a
solution is, however, only feasible for smaller groups. An HPC
center offering services to several (many) universities, for example,
won't be able to convince all its users to take the time to think
about data management, and a virtual sucker rod is not as efficient
as a real one, so policies would still be required...
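
As a rough illustration of the "deleted after N weeks" policy quoted
above, here is a minimal Python sketch. Nothing in it comes from this
thread: the function name find_expired, the choice of modification
time as the age criterion, and the report-only behaviour (it lists
candidates and deletes nothing) are all assumptions made for the
example. A real site would have to decide those points for itself.

```python
import os
import stat
import time
from pathlib import Path

SECONDS_PER_WEEK = 7 * 24 * 3600


def find_expired(root, max_age_weeks, now=None):
    """List regular files under root not modified for max_age_weeks.

    Assumption: "age" means time since last modification (mtime).
    This only reports candidates; it never deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_weeks * SECONDS_PER_WEEK
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                expired.append(path)
    return sorted(expired)
```

Calling find_expired("/scratch", 4) would return the files under a
(hypothetical) /scratch directory untouched for more than four weeks.
Mailing that list to the owners before anything is removed fits the
education argument above better than a silent purge.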
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de