[Beowulf] Interesting POV about Hadoop
Joe Landman
landman at scalableinformatics.com
Tue Jun 3 08:55:03 PDT 2014
On 06/03/2014 11:32 AM, Lockwood, Glenn wrote:
> What the article seems to slide past is that not all data fits neatly
> into column-oriented or graph databases. Hadoop is all about
> handling unstructured data, so its real utility lies in ETL, not the
> analytics part. It's a complementary capability to all these
> databases that he's saying are better than MapReduce. Yes, databases
> are better at doing database things than non-databases. The bigger
> problem is getting data into a structure where they'll fit into those
> databases, and you often can't just use another database to do that.
One of the more interesting things here is that this ETL issue can be
not only a non-trivial cost in time and effort, but also the
rate-limiting step for some types of analysis.
That said, Hadoop isn't about "handling" unstructured data; it's about
providing an infrastructure for processing that unstructured data. The
reduced ETL is part of the processing step, one often (massively)
overlooked in many projects. Not just database and analytics projects,
but really *all* projects.
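To make the "ETL as part of the processing" point concrete, here is a
minimal schema-on-read sketch (plain Python; the log format and field
names are made up): the raw lines stay on disk as-is, and structure is
imposed at read time, inside the processing step, rather than in a
separate load phase.

    # schema-on-read: impose structure at processing time, not in a
    # separate ETL/load phase.  The log format here is hypothetical.
    import re
    import sys

    LINE = re.compile(r'(?P<host>\S+) \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

    def records(stream):
        """Parse raw, unstructured lines into dicts; skip lines that don't fit."""
        for line in stream:
            m = LINE.match(line)
            if m:
                yield m.groupdict()

    for rec in records(sys.stdin):
        print(rec["host"], rec["req"])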
The real costs of making use of a technology often impose some
interesting constraints on analytics. SQL and SQL-like systems require
some level of structure to be applied somewhere, even if it's hidden
from the user. Normalizing databases costs processing and IO time.
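As a toy illustration of that cost (SQLite standing in for a real RDBMS;
the schema and data are made up), normalization buys consistency at the
price of an extra lookup on every load and a join on every query:

    # toy normalization cost: two tables plus a join at query time,
    # versus one flat table.  Schema and data are hypothetical.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE requests (host_id INTEGER REFERENCES hosts(id), url TEXT);
    """)
    # load path: every row costs an extra lookup/insert into hosts
    db.execute("INSERT OR IGNORE INTO hosts(name) VALUES (?)", ("node01",))
    hid = db.execute("SELECT id FROM hosts WHERE name=?", ("node01",)).fetchone()[0]
    db.execute("INSERT INTO requests VALUES (?, ?)", (hid, "/index.html"))
    # query path: the structure is reassembled via a join, costing CPU/IO
    for name, url in db.execute("""SELECT h.name, r.url FROM requests r
                                   JOIN hosts h ON h.id = r.host_id"""):
        print(name, url)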
The important question, in the bigger picture, is whether the cost of
using the capability is worth the effort. And the comparison is to the
null hypothesis (i.e., don't change anything).
The costs of using GPUs/accelerators are manifold, but the benefits, for
properly structured code, can be quite large. The costs are rewriting
the code and the testing time and effort.
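As a rough, CPU-side analogy for what "properly structured" means (NumPy
standing in for an actual accelerator; saxpy is just the classic
example), the rewrite is typically from a scalar loop to a data-parallel
form:

    # the rewrite cost in miniature: a scalar loop restructured into a
    # data-parallel form that an accelerator could exploit.  NumPy is
    # standing in for a real GPU kernel here.
    import numpy as np

    def saxpy_loop(a, x, y):              # original serial structure
        out = [0.0] * len(x)
        for i in range(len(x)):
            out[i] = a * x[i] + y[i]
        return out

    def saxpy_vec(a, x, y):               # restructured, data-parallel
        return a * x + y                  # one bulk array operation

    x, y = np.arange(1000.0), np.ones(1000)
    assert np.allclose(saxpy_loop(2.0, x, y), saxpy_vec(2.0, x, y))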
The cost of using SQL/SQL-like systems is the ETL, the system design,
etc. The benefits are many, but these costs are potentially very high
for unstructured data.
The cost of using Hadoop and other MapReduce technologies is, again,
code changes and overall design issues (there are few ACID databases on
the NoSQL side, though some exist). The benefit is (nearly)
embarrassingly parallel performance, which is great for certain use
cases.
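The canonical illustration is streaming word count, where the map and
reduce steps are plain scripts over stdin/stdout and the framework
handles the distribution; a minimal sketch (two separate files, in the
style Hadoop Streaming expects):

    # mapper.py -- emit (word, 1) pairs; runs unchanged on every input split
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sum counts per word; input arrives sorted by key
    import sys
    from itertools import groupby
    pairs = (line.rsplit("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda p: p[0]):
        print(word, sum(int(count) for _, count in group))

The same pair of scripts can be tested locally with "cat input.txt |
python mapper.py | sort | python reducer.py", which is exactly the shape
Hadoop Streaming fans out across nodes.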
The SQL folks are adding capabilities such as JSON elements (PostgreSQL,
etc.). The NoSQL folks are adding SQL-like front ends for interactive
use. There are distributed SQL engines (our friends at XtremeData, and
others).
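For a taste of what the convergence looks like on the SQL side (SQLite's
built-in JSON functions standing in here for PostgreSQL's json/jsonb
support, assuming a SQLite build with JSON compiled in, which has been
the default for some years; the document shape is made up):

    # a SQL engine querying semi-structured JSON in place -- SQLite's
    # JSON1 functions standing in for PostgreSQL's json/jsonb support
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (doc TEXT)")
    db.execute("INSERT INTO events VALUES (?)",
               ('{"host": "node01", "load": 3.2}',))
    # no up-front normalization: structure is pulled out per query
    print(db.execute(
        "SELECT json_extract(doc, '$.host'), json_extract(doc, '$.load') "
        "FROM events").fetchone())     # -> ('node01', 3.2)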
Basically, I am noting that some aspects of the functionality are
converging.
But the cost of using a particular technology lies in all elements of
the technology's implementation, not just its core. If a SQL engine can
do a terrific job on an analysis, but takes 100+ hours to ETL and load
the data, then do indexing, joins, scans, and queries, while the
equivalent, less formalized Hadoop-like MapReduce engine takes 10 hours
to bring the data in and less than 10 hours to query ... yeah, that's
not going to make a strong case for the SQL engine.
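Back of the envelope, with those (made-up) numbers:

    # time-to-first-answer, hours, using the illustrative numbers above;
    # SQL query time is assumed small, which only strengthens the point
    sql_hours    = 100 + 5      # >= 100 for ETL/load/index, plus queries
    hadoop_hours = 10 + 10      # ingest + MapReduce queries
    print(f"SQL engine: >={sql_hours} h    Hadoop-like: ~{hadoop_hours} h")

A 5x or larger gap in time-to-first-answer swamps almost any per-query
advantage.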
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615