Watch the recorded event at: http://info.datameer.com/Slideshare-Economics-SQL-Hadoop.html
As organizations clamor to utilize their new investments in Hadoop ecosystems AND leverage their existing analytical infrastructures, many rush to integrate SQL as a data access layer to leverage existing skill sets and get started faster.
However, this approach relegates Hadoop to a data management and processing platform rather than the storage and compute engine optimized for analytical workloads it was purpose-built to be.
These slides, by EMA and Datameer, discuss the technical limitations of SQL on Hadoop and propose alternative ways to fully maximize Hadoop investments.
You will understand:
* how SQL negates the inherent benefits of Hadoop
* why technological paradigm changes can sometimes be good
* use cases where SQL on Hadoop makes sense
According to 2012 EMA research, Online Archiving, or “Hadumping,” is phase “zero” of most Big Data initiatives
It teaches internal teams about the data's delivery and structure
How to interact with the data
How to apply data to business cases, as opposed to simply a technology project
It is where you start when:
“you don’t know what you don’t know…”
2013 EMA research shows that over half of Big Data projects have online archiving at ‘In Operation’ status
‘In Operation’ means in production or running as a pilot project, with hands on keyboards and software installed.
Over 4 in 10 respondents say “Economics” is a business reason for the Online Archiving use case.
These organizations are attempting to lower their operational costs
Moving beyond SELECT * requires a facility that manages and tracks metadata
SELECT * FROM tablename is the rough equivalent of cat filename
SQL starts to become truly “special” when you use a query such as:
SELECT t.columnA, s.columnB, s.columnC FROM tablename t, tablename s
WHERE t.columnZ = s.columnX
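The contrast above can be sketched in a few lines of Python with SQLite. The table and column names ("users", "events") are hypothetical, chosen only for illustration: SELECT * is a raw dump, while the join depends on the engine's metadata knowing both tables' columns.

```python
import sqlite3

# Illustrative tables; names are made up, not from the talk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bo")])
cur.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "login"), (1, "click"), (2, "login")])

# SELECT * is roughly "cat": a raw dump needing no metadata beyond column order.
dump = cur.execute("SELECT * FROM events").fetchall()

# The join is where SQL earns its keep, and it only works because the
# engine's metadata tracks both tables' columns and types.
joined = cur.execute(
    "SELECT u.name, e.action FROM users u JOIN events e ON u.user_id = e.user_id"
).fetchall()
print(dump)
print(joined)
```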
NoSQL and specifically Hadoop have focused on the ability to be flexible in data storage often at the expense of metadata management
SQL doesn’t cope with an “or” data structure, where a field can take one shape or another (image on right)
SQL works best with a defined data structure (image on right)
When you ask Hive a question it doesn’t understand, you get an error message.
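A small sketch of that “or” problem, with made-up field names. Schema-on-read tolerates records whose shape varies, while a fixed-schema SQL engine (SQLite here standing in for Hive) errors out the moment you ask about a column it was never told about:

```python
import json
import sqlite3

# Two records whose shape varies: "ref" is optional, an "or" in the structure.
records = [
    '{"user": "ana", "ref": "ad"}',
    '{"user": "bo"}',
]

# Schema-on-read tolerates the variation; missing fields just come back empty.
parsed = [json.loads(r).get("ref") for r in records]

# A fixed-schema SQL engine does not: ask about an undeclared column and,
# much like Hive, it answers with an error.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user TEXT)")
try:
    conn.execute("SELECT ref FROM logs")
    failed = False
except sqlite3.OperationalError:
    failed = True
print(parsed)
print("query failed:", failed)
```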
In 2013, EMA research found Big Data initiatives used the following datasets:
Machine generated (JSON, XML, etc.): almost 40%
Process mediated (structured): just under 30%
Human sourced (emails, texts): over 30%
Over 30% of respondents indicate that a lack of self-service data access (SQL) is a challenge to operating a Hadoop platform
Nearly 40% of respondents say a lack of SQL data access is a challenge to operating a NoSQL platform
Each of these findings indicates that while you “CAN” run certain applications on Hadoop, SQL-based data access is a high concern.
Big Data environments aren’t just for EDW replacement as some would say
There are multiple use cases
Operational
Analytical
Exploratory
Nearly 3 of 10 respondents in 2013 research say that they are using Exploratory or Discovery use cases
Just under 50% of respondents say operational costs (including staff headcount) are a challenge to operating a discovery platform.
3 in 10 respondents want to utilize product features and functions to speed their skills acquisition. Often these are the features they feel most comfortable with: interfaces and processes they use every day, MS Excel being an example.
Nearly 4 out of 10 respondents indicate new skills development is a challenge to operating a discovery platform
When you are using exploratory or discovery use cases, you need flexibility… applying a hard schema (structured) presupposes particular questions AND answers.
Square wooden peg and round wooden hole – not a lot of give.
Being able to apply a schema or structure at query time, a late-binding schema, enables the best method of discovery
Flexible schema at the time of processing…. Sausage grinder
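The late-binding idea above can be sketched as follows: raw records are stored as-is, and a schema (here just a list of fields) is applied only at read time. Field names are illustrative, not from the talk.

```python
import json

# Raw records stored with no schema applied up front.
raw_lines = [
    '{"user": "ana", "action": "login", "device": "mobile"}',
    '{"user": "bo", "action": "click"}',
    '{"user": "cy", "action": "login", "device": "web"}',
]

def project(lines, fields):
    """Apply a schema (a field list) at read time, tolerating gaps."""
    return [tuple(json.loads(l).get(f) for f in fields) for l in lines]

# The same raw data answers different questions under different schemas,
# which is exactly what discovery use cases need.
by_device = project(raw_lines, ("user", "device"))
by_action = project(raw_lines, ("user", "action"))
print(by_device)
print(by_action)
```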
2013 EMA research says
Over 30% of respondents use late binding schemas when processing data
Nearly a third use multiple approaches
Over 10% don’t apply a schema at all…
“Only” about one third of respondents are using external technical resources to bridge their skills gaps, owing to the cost of outside consultants versus existing staff
“Free as in Speech” or “Free as in Beer”… Big Data is “Free as a Free Puppy”
Over 40% of respondents say Economics is a business reason for the Online Archiving use case
Back to Metadata….
Over one third of respondents indicate a shortage of technical metadata is a challenge to operating a discovery platform. Applying that technical metadata layer takes manual effort and thus additional headcount. When you link this to ‘only’ a 1% increase in big data budgets from 2013 to 2014 for Hadoop implementations, it becomes important to put Hadoop platforms to their best use.
36% say time to implement is a challenge to operating a Hadoop platform
43% say operational costs are a challenge to operating a discovery platform (linked to the 1% increase in big data operational budgets from 2013 to 2014)
Over one third of respondents cite a lack of skills to manage multi-structured data platforms as an obstacle to implementation (the top answer)