[Beowulf] Interesting POV about Hadoop
Joe Landman
landman at scalableinformatics.com
Tue Jun 3 08:55:03 PDT 2014
On 06/03/2014 11:32 AM, Lockwood, Glenn wrote:
> What the article seems to slide past is that not all data fits neatly
> into column-oriented or graph databases. Hadoop is all about
> handling unstructured data, so its real utility lies in ETL, not the
> analytics part. It's a complementary capability to all these
> databases that he's saying are better than MapReduce. Yes, databases
> are better at doing database things than non-databases. The bigger
> problem is getting data into a structure where they'll fit into those
> databases, and you often can't just use another database to do that.
One of the more interesting things here is that this ETL issue can be
not only a non-trivial cost in time and effort, but also the
rate-limiting step for some types of analysis.
That said, Hadoop isn't about "handling" unstructured data; it's about
providing an infrastructure for processing that unstructured data. The
reduced ETL is part of the processing step, one often (massively)
overlooked in many projects. Not just database and analytics projects,
but really *all* projects.
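To make the "ETL as part of the processing" point concrete, here is a
minimal schema-on-read sketch (plain Python; the log format and field
names are made up): the raw lines stay on disk as-is, and structure is
imposed at read time, inside the processing step, rather than in a
separate load phase.

    # schema-on-read: impose structure at processing time, not in a
    # separate ETL/load phase.  The log format here is hypothetical.
    import re
    import sys

    LINE = re.compile(r'(?P<host>\S+) \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

    def records(stream):
        """Parse raw, unstructured lines into dicts; skip lines that don't fit."""
        for line in stream:
            m = LINE.match(line)
            if m:
                yield m.groupdict()

    for rec in records(sys.stdin):
        print(rec["host"], rec["req"])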
The real costs of making use of a technology often impose some
interesting constraints on analytics. SQL and SQL-like systems require
some level of structure to be applied somewhere, even if it's hidden
from the user. Normalizing databases costs processing and IO time.
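As a toy illustration of that cost (SQLite standing in for a real RDBMS;
the schema and data are made up), normalization buys consistency at the
price of an extra lookup on every load and a join on every query:

    # toy normalization cost: two tables plus a join at query time,
    # versus one flat table.  Schema and data are hypothetical.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE requests (host_id INTEGER REFERENCES hosts(id), url TEXT);
    """)
    # load path: every row costs an extra lookup/insert into hosts
    db.execute("INSERT OR IGNORE INTO hosts(name) VALUES (?)", ("node01",))
    hid = db.execute("SELECT id FROM hosts WHERE name=?", ("node01",)).fetchone()[0]
    db.execute("INSERT INTO requests VALUES (?, ?)", (hid, "/index.html"))
    # query path: the structure is reassembled via a join, costing CPU/IO
    for name, url in db.execute("""SELECT h.name, r.url FROM requests r
                                   JOIN hosts h ON h.id = r.host_id"""):
        print(name, url)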
The important question, in the bigger picture, is whether the cost of
using the capability is worth the effort. And the comparison is to the
null hypothesis (i.e., don't change anything).
The costs of using GPUs/accelerators are manifold, but the benefits, for
properly structured code, can be quite large. The costs are rewriting
the code and the testing time and effort.
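As a rough, CPU-side analogy for what "properly structured" means (NumPy
standing in for an actual accelerator; saxpy is just the classic
example), the rewrite is typically from a scalar loop to a data-parallel
form:

    # the rewrite cost in miniature: a scalar loop restructured into a
    # data-parallel form that an accelerator could exploit.  NumPy is
    # standing in for a real GPU kernel here.
    import numpy as np

    def saxpy_loop(a, x, y):              # original serial structure
        out = [0.0] * len(x)
        for i in range(len(x)):
            out[i] = a * x[i] + y[i]
        return out

    def saxpy_vec(a, x, y):               # restructured, data-parallel
        return a * x + y                  # one bulk array operation

    x, y = np.arange(1000.0), np.ones(1000)
    assert np.allclose(saxpy_loop(2.0, x, y), saxpy_vec(2.0, x, y))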
The cost of using SQL/SQL-like systems is the ETL, the system design,
etc. The benefits are many, but these costs are potentially very high
for unstructured data.
The cost of using Hadoop and other MapReduce technologies is, again,
code changes and overall design issues (there are few ACID databases on
the NoSQL side, though some exist). The benefit is (nearly)
embarrassingly parallel performance, which is great for certain use
cases.
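The canonical illustration is streaming word count, where the map and
reduce steps are plain scripts over stdin/stdout and the framework
handles the distribution; a minimal sketch (two separate files, in the
style Hadoop Streaming expects):

    # mapper.py -- emit (word, 1) pairs; runs unchanged on every input split
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sum counts per word; input arrives sorted by key
    import sys
    from itertools import groupby
    pairs = (line.rsplit("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda p: p[0]):
        print(word, sum(int(count) for _, count in group))

The same pair of scripts can be tested locally with "cat input.txt |
python mapper.py | sort | python reducer.py", which is exactly the shape
Hadoop Streaming fans out across nodes.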
The SQL folks are adding capabilities such as JSON elements (PostgreSQL,
etc.). The NoSQL folks are adding SQL-like front ends for interactive
use. There are distributed SQL engines (our friends at XtremeData, and
others).
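For a taste of what the convergence looks like on the SQL side (SQLite's
built-in JSON functions standing in here for PostgreSQL's json/jsonb
support, assuming a SQLite build with JSON compiled in, which has been
the default for some years; the document shape is made up):

    # a SQL engine querying semi-structured JSON in place -- SQLite's
    # JSON1 functions standing in for PostgreSQL's json/jsonb support
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (doc TEXT)")
    db.execute("INSERT INTO events VALUES (?)",
               ('{"host": "node01", "load": 3.2}',))
    # no up-front normalization: structure is pulled out per query
    print(db.execute(
        "SELECT json_extract(doc, '$.host'), json_extract(doc, '$.load') "
        "FROM events").fetchone())     # -> ('node01', 3.2)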
Basically, I am noting that some aspects of the functionality are
converging.
But the cost of using a particular technology lies in all elements of
the technology's implementation, not just its core. If a SQL engine can
do a terrific job on an analysis, but takes 100+ hours to ETL and load
the data, then do indexing, joins, scans, and queries, while the
equivalent, less formalized Hadoop-like MapReduce engine takes 10 hours
to bring the data in and less than 10 hours to query ... yeah, that's
not going to make a strong case for the SQL engine.
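Back of the envelope, with those (made-up) numbers:

    # time-to-first-answer, hours, using the illustrative numbers above;
    # SQL query time is assumed small, which only strengthens the point
    sql_hours    = 100 + 5      # >= 100 for ETL/load/index, plus queries
    hadoop_hours = 10 + 10      # ingest + MapReduce queries
    print(f"SQL engine: >={sql_hours} h    Hadoop-like: ~{hadoop_hours} h")

A 5x or larger gap in time-to-first-answer swamps almost any per-query
advantage.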
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615