[Beowulf] are there any known attempts to apply hadoop BigData techniques to weather modelling?
Ellis H. Wilson III
ellis at cse.psu.edu
Tue Feb 17 14:16:53 PST 2015
On 02/17/2015 04:56 PM, Prentice Bisbal wrote:
> Why do you think 'Big Data' techniques would be applicable to this?
>
> A large amount of data != big data.
Heh. Let's not pretend like 'big data' means anything of substance now :D.
> 'Big Data' techniques are typically for finding trends in unstructured
> data from multiple sources, whereas the output of scientific simulations
> is usually from a single source in some sort of structured format. I
> just don't see any applicability here whatsoever.
I would argue this is a bit too narrow. That might be the typical use
case, but there is no reason why Hadoop and MapReduce couldn't be used
to do simple filtering of scientific simulation output. If you were
looking for places in a huge output file where temperature falls within
some set of ranges and elevation has a specific value, I could
certainly see value in applying an easily programmable scaling
framework to basically "smart grep" through your data. Hadoop/MR could
certainly help you do that.
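To make the "smart grep" idea a bit more concrete, here is a minimal
sketch of the map side of such a filter in Python. The record layout
(one comma-separated line of lat, lon, elevation, temperature), the
threshold values, and the names mapper and main are all assumptions
made up for illustration; real simulation output would need its own
parsing.

```python
import sys

# Hypothetical record layout (an assumption, not a real model format):
#   lat,lon,elevation_m,temperature_k
T_MIN, T_MAX = 270.0, 280.0   # assumed temperature window, in kelvin
ELEVATION = 1500.0            # assumed elevation of interest, in metres


def mapper(lines):
    """Yield ("match", record) for every record that passes the filter."""
    for line in lines:
        record = line.strip()
        fields = record.split(",")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        try:
            lat, lon, elev, temp = map(float, fields)
        except ValueError:
            continue  # skip header rows or non-numeric fields
        if T_MIN <= temp <= T_MAX and elev == ELEVATION:
            yield "match", record


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records line by line and write tab-separated key/value pairs."""
    for key, value in mapper(stdin):
        stdout.write(f"{key}\t{value}\n")
```

Calling main() at the bottom of the script would make it read lines on
stdin and write key<TAB>value pairs on stdout, which is the convention
Hadoop Streaming expects from a mapper. Since nothing needs to be
aggregated, this kind of filter could plausibly run as a map-only job.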
Many output formats for scientific data are, as you mentioned,
well-structured (HDF5, for example). That doesn't mean you have a good
file system or a good parallel programming paradigm for doing
stupid-simple things with the data afterwards. You just have a good
container format. Hadoop could provide the other bits you need. A paper
from the HDF Group
actually does a decent job of pointing out these kinds of differences,
how you might get HDF5 containers in and out of HDFS and what impacts
performance:
http://www.hdfgroup.org/HDF5/faq/hadoop.html
As they note in the paper, a recent project called SciHadoop (I was
lucky enough to talk in the same slot as its author at SC a year back)
works directly with NetCDF-formatted files, so that could be another
option. Whether or not the source is available for SciHadoop is beyond
my knowledge, but a quick Google search would likely give you that
answer.
If you are asking, "should I do weather simulation using Hadoop or some
other big data framework," my answer is a resounding NO. MR has VERY
different (often far more limited) semantics and guarantees than other
parallel programming paradigms, and you will almost certainly get
burned if you try to shove a climate-shaped peg through the square hole
that is MR. This is probably what Prentice was getting at.
Best,
ellis