The document discusses the evolving relationship between data warehouse (DW) and Hadoop implementations. It notes that DW vendors are incorporating Hadoop capabilities while the Hadoop ecosystem is growing to include more DW-like functions. Major DW vendors will likely continue playing a key role by acquiring successful new entrants or incorporating their technologies. The optimal approach involves a hybrid model that leverages the strengths of DWs and Hadoop, with queries determining where data resides and processing occurs. SQL-on-Hadoop architectures aim to bridge the two worlds by bringing SQL and DW tools to Hadoop.
2. Why this topic?
Note on terminology: in this context, RDBMS (relational databases) are synonymous with DW (data warehouses).
Data Warehouse vendors are evolving to incorporate the best
Hadoop has to offer. Similarly, the Hadoop ecosystem is growing
to include capabilities previously available only to large scale (MPP)
DW platforms.
Understanding the trends and alternatives helps your organization identify the most effective long-term solution.
Launched in TDWI forum on LinkedIn
(See http://bit.ly/Hybrid-DW-Hadoop)
Who are the winners in the race for the ultimate hybrid
DBMS-Hadoop implementation?
As described in http://bit.ly/Hybrid-RDBMS-Hadoop, the industry is moving to a hybrid DBMS-Hadoop model that leverages the best of both worlds. (Microsoft, for example, is building its Polybase with cost-based query optimization that decides whether to push processing to the Hadoop data nodes or the PDW compute nodes.)
Which vendors do you see as the current leaders in this
race? And for the visionaries and philosophers among you...
How do you see it ultimately shaking out?
This group extends the TDWI community online and is designed to foster peer networking and discussion of key issues relevant to business intelligence and data warehousing managers.

TDWI (The Data Warehousing Institute™) provides education, training, certification, news, and research for executives and information technology (IT) professionals worldwide. Founded in 1995, TDWI is the premier educational institute for business intelligence and data warehousing. Our Web site is www.tdwi.org.
3. Relationship between Hadoop & DW implementations
To leverage the strengths of each platform, traditionally...
- Hadoop is used for storage & transformation (specifically, ELT) of vast volumes of raw data
- ...while the DW is used for analytics on a subset of the processed data
Extract from source systems → Load to DW → Transform in place → Reporting & OLAP using traditional tools
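The traditional split above can be sketched end to end. This is a toy illustration, not vendor code: SQLite stands in for the DW, and the table and event names are invented for the example.

```python
import sqlite3

# Hypothetical stand-in for the DW: an in-memory SQLite database.
dw = sqlite3.connect(":memory:")

# Extract: raw events pulled from a source system (illustrative data).
raw_events = [("2013-05-01", "click", 3), ("2013-05-01", "view", 10),
              ("2013-05-02", "click", 5)]

# Load: land the raw data in the DW first (the "L" before the "T" in ELT).
dw.execute("CREATE TABLE raw_events (day TEXT, kind TEXT, n INTEGER)")
dw.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform in place: aggregate with SQL inside the DW itself, so
# reporting and OLAP tools can query the result directly.
dw.execute("""CREATE TABLE daily_clicks AS
              SELECT day, SUM(n) AS clicks
              FROM raw_events WHERE kind = 'click'
              GROUP BY day""")

print(dict(dw.execute("SELECT day, clicks FROM daily_clicks")))
```

In the hybrid pattern discussed later, the raw-event storage and the heavy transformation would move to Hadoop, while the small aggregate stays in the DW.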
4. Why do DW vendors care about Hadoop?
...And why not just ignore it as a special-use-case solution?
1. Compelling price point to store high volume data, especially
if it’s not needed for real-time access
2. Has become the de facto standard for ambitious big data
projects
3. HDFS and MapReduce are becoming mature and stable
technologies (in development since 2005), despite the fact
that the rest of the ecosystem is still rapidly evolving
4. DBMS vendors have been missing a scalable distributed
file system, such as HDFS, to provide capability to store
and manipulate variable and unstructured data
5. Do companies with an MPP DW still need Hadoop?
For companies that have an MPP (Massively Parallel Processing) data
warehouse, such as Teradata or Netezza...
Couldn’t the DW platform do everything Hadoop could do?
Yes and no.
1. Yes, you can store a lot of unstructured data in text fields, simulate operations on key-value pairs, and scale processing capacity horizontally, just like Hadoop.
2. But the types of processing that can be done against the DW are more limited than with Hadoop. (Although some DW vendors now allow open source tools, including MapReduce and R, to crunch the data.)
3. And the cost of storing high volumes of data – especially for low-frequency, high-latency operations – is much lower for Hadoop. So adding Hadoop to the mix can keep the size of the more expensive DW platform in check.
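Point 3 is back-of-the-envelope arithmetic. The per-terabyte costs below are purely hypothetical placeholders; only the shape of the comparison matters.

```python
# Illustrative only: the per-terabyte figures are hypothetical, chosen to
# show the shape of the trade-off, not real vendor pricing.
DW_COST_PER_TB = 20000.0      # fully loaded MPP DW cost, $/TB (assumed)
HADOOP_COST_PER_TB = 1000.0   # commodity Hadoop cluster cost, $/TB (assumed)

def platform_cost(hot_tb, cold_tb, use_hadoop_for_cold):
    """Total storage cost when cold, low-frequency data is offloaded (or not)."""
    if use_hadoop_for_cold:
        return hot_tb * DW_COST_PER_TB + cold_tb * HADOOP_COST_PER_TB
    return (hot_tb + cold_tb) * DW_COST_PER_TB

# 100 TB total, of which only 10 TB is hot, frequently queried data.
dw_only = platform_cost(10, 90, use_hadoop_for_cold=False)
hybrid = platform_cost(10, 90, use_hadoop_for_cold=True)
print(dw_only, hybrid)  # the hybrid keeps the expensive DW tier small
```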
6. What paths are DW vendors taking?
DW vendors can choose from three typical paths for responding to competing technologies:

1. Ignore: hope it's a fad that goes away. (None of the major vendors see this as a viable option.)
2. Reactionary: interoperate with existing Hadoop products. (This seems to be the most common path for established commercial vendors.)
3. Proactive: embrace Hadoop and contribute to extending the ecosystem. (This seems to be the approach for new entrants competing in a targeted niche.)
7. What will happen to major DW vendors?
It’s safe to say that major commercial vendors like
IBM, Oracle, Teradata and Microsoft
will continue to be key players.
1. Each one already has a product roadmap involving some way of responding to or incorporating the Hadoop ecosystem.
2. Many of the most successful new entrants will be acquired by these vendors and incorporated into their product lines.
3. This pattern can be seen in the historical evolution of revolutionary or niche technologies, such as columnar databases, in-memory databases, and self-service BI capabilities.
8. Partnerships & Alliances
Most large commercial DBMS vendors have partnerships with
specific Hadoop distribution developers.
- Oracle (Big Data Appliance)
- IBM Netezza
- Microsoft (HDInsight)
- Teradata & AsterData appliance
- Greenplum (prior to GreenplumHD)
9. Possible phases of a DW platform to evolve into a hybrid
There are many ways to get to the end goal, but here’s a possible evolution
path for commercial DW vendors to hybrid DBMS-Hadoop solutions
- Independent: DW and Hadoop operate independently, storing completely different data sets depending on size and structure, and processing them in completely different ways.
- Batch data movement: there's an efficient method to shuttle data back and forth between DW and Hadoop, focused primarily on loading a subset of data into the DW for analytics. Transformations typically happen in the context of ELT, rather than ETL.
- Integrated storage: queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, it is pulled into the DW prior to executing the query against it.
- Optimized processing in place: queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, the query is executed in place within Hadoop, possibly after converting the logic to MapReduce. The result is brought back to the DW.
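The difference between the last two phases can be sketched as a routing rule. The catalog, table names, and plan strings below are invented for illustration only.

```python
# Hypothetical catalog mapping each table to the platform where it resides.
CATALOG = {"sales": "dw", "clickstream": "hadoop"}

def plan_query(table, phase):
    """Return an illustrative execution plan for a query against `table`."""
    location = CATALOG[table]
    if location == "dw":
        return "execute in DW"
    if phase == "integrated_storage":
        # Phase 3: pull the Hadoop-resident data into the DW, then query it.
        return "copy from Hadoop to DW, then execute in DW"
    # Phase 4: convert the logic and run it where the data lives.
    return "convert to MapReduce, execute on Hadoop, return result to DW"

print(plan_query("clickstream", "integrated_storage"))
print(plan_query("clickstream", "optimized_in_place"))
```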
10. How about Hadoop distribution vendors?
Hadoop distribution vendors like
Cloudera, Hortonworks and MapR
are developing products to add DW capabilities
and bridge the gap between the two worlds
Their solution:
- Improve on the limitations of MapReduce
- Reduce the data silos and the overhead of moving data between DW and Hadoop
11. What’s wrong with MapReduce?
MapReduce is the original and purest processing environment for Hadoop. But it's not ideal, for a number of reasons:
- Performance: the long lag between query and results makes it ill-suited for interactive analytics
- Skills: it requires a high degree of Java skill for processing data, while existing staff with SQL skills would be underutilized
- Tooling: it doesn't leverage a company's existing investment in ETL, reporting, and analytics tools
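To make the skills point concrete, here is the classic word-count job expressed in the MapReduce model. On a real cluster this would be Java Mapper/Reducer classes distributed across nodes; this single-process Python sketch only illustrates the programming model.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort then reduce: group pairs by key and sum the counts.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

The same job in SQL is a one-line `GROUP BY` with `COUNT(*)`, which is exactly the gap the SQL-on-Hadoop engines discussed next aim to close.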
12. Why not use DW-Hadoop connectors?
The original approach of integrating DW with Hadoop – using connectors to move data back and forth – is not ideal: it introduces the costs and inefficiencies of dealing with data silos. To solve this problem, a more practical "SQL-on-Hadoop" architecture is being adopted.

Typically, SQL-on-Hadoop capabilities include:
- Interactive analytical queries (read-only)
- Parallelism / distributed processing
- Efficient joins across multiple tables
- ANSI SQL compliance
- Query caching
- Ability to use existing ETL, OLAP and reporting tools from commercial vendors
SQL-on-Hadoop players:
- Cloudera's Impala
- Hortonworks' Stinger (faster Hive via ORCFile & Tez)
- Apache Drill (supported by MapR)
- Hadapt / HadoopDB
- Greenplum's HAWQ (on Pivotal HD)
- Teradata's SQL-H (on Aster/PostgreSQL)
13. DW & Hadoop vendors approach from opposite directions
DW vendors start with their relational DBMS platform... and add interactivity with Hadoop:
- Initially, all processing might occur on the DBMS, with Hadoop being used for storage
- Ultimately evolves to pushing processing to the Hadoop cluster
- Examples: Microsoft PDW / Polybase, Greenplum / Pivotal HD

Hadoop vendors start with their Hadoop distribution... and add DBMS features:
- This is also known as "SQL-on-Hadoop"
- Add a query optimizer, real-time capabilities, etc.
- Examples: Cloudera Impala, Hortonworks Stinger, Hadapt

In the end, both of these approaches might converge on very similar hybrid solutions.
14. Utopian vision
Ultimately, ideally the user doesn’t know
(and doesn’t care) where the data is stored and
how it’s processed, as made possible by using...
Single toolset: use a single set of ETL, reporting and analytics tools, regardless of where the data resides.
Automated optimization: the system automatically decides between RDBMS and Hadoop...
- Where to store data, based on its structure and directives on how it's to be used
- Where to push processing, based on query cost optimization
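The two automated decisions can be sketched as a pair of rules. The thresholds and cost figures below are invented for the example; a real optimizer would use detailed statistics.

```python
def choose_storage(structured, accesses_per_day):
    """Where to store a data set, based on structure and usage directives.
    The one-access-per-day cutoff is an arbitrary illustrative threshold."""
    if structured and accesses_per_day >= 1:
        return "rdbms"
    return "hadoop"

def choose_processing(rdbms_cost, hadoop_cost):
    """Where to push processing, based on estimated query cost
    (stand-in for real cost-based optimization)."""
    return "rdbms" if rdbms_cost <= hadoop_cost else "hadoop"

print(choose_storage(structured=True, accesses_per_day=50))    # rdbms
print(choose_storage(structured=False, accesses_per_day=0.1))  # hadoop
print(choose_processing(rdbms_cost=120, hadoop_cost=45))       # hadoop
```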
15. Evolution of SQL-on-Hadoop: Operational data store
Eventually, Hadoop implementations might evolve into
supporting “operational” transactions
- Ability to handle workloads that power websites and applications
- Transaction-oriented write capability, rather than the read-only access of analytical queries
- "ACID" database capabilities, including concurrency, distributed transactional support, and guarantees of data consistency
16. Microsoft's Approach
Microsoft's strategy involves the Polybase initiative and the ability to leverage its extensive range of BI tools.

Polybase is the hybrid RDBMS-Hadoop platform, which spans queries across HDFS on HDInsight (Hortonworks' distribution that runs on Windows) and Microsoft's MPP DW platform, SQL Server PDW (Parallel Data Warehouse).
Polybase development is phased:
- Phase 1: parallel data transfer between SQL Server Compute Nodes and Hadoop Data Nodes, but all processing is done on the DBMS
- Phase 2: use the query optimizer to decide where to process jobs; selectively push work to Hadoop by converting queries to MapReduce

Integration with Microsoft's BI stack, including:
- Reporting (SQL Server Reporting Services)
- OLAP (SQL Server Analysis Services)
- Self-service BI (Excel, PowerPivot, Power View)
17. Microsoft's Approach (cont.)
The advantages of going the Microsoft route are numerous...
- Make use of Hadoop using familiar, high-productivity self-service BI tools:
  - Excel itself can handle data extracts from Hadoop
  - PowerPivot for large-scale data exploration using the xVelocity in-memory analytics engine
  - Power View for ad hoc visualization in SSRS, accessible via SharePoint or Excel
  - In Office 2013, PowerPivot and Power View are natively integrated with Excel
- Leverage existing and widely available .NET developers
- Management of a Hadoop cluster (using Apache Ambari) is integrated with Microsoft System Center, already used by IT operators for database management
- Deliver tighter security through integration with Windows Server Active Directory
- Cloud-based Hadoop available through the Windows Azure HDInsight Service
- Interactive access to Hadoop through Hortonworks' Hive ODBC driver
18. Microsoft's Approach (cont.)
And some of the disadvantages include...
- Licensing costs for both the Windows Server instances that run HDInsight nodes and the SQL Server nodes associated with Polybase
- Uncertain performance and adoption of the HDInsight distribution
- Many of the advantages in integration and leveraging resources don't apply for non-Microsoft shops
19. Greenplum’s Pivotal HD
Greenplum’s Pivotal HD implements
But...
Greenplum on top of HDFS
Proprietary technology means
Yet still capable of running MapReduce
vendor lock-in and
jobs if needed
inability to take advantage of a vibrant
developer community
More mature than many of its rivals
Capabilities extend well beyond those of Licenses are expensive and...
open source distributions
Open source alternatives (Impala,
Drill, Shark) are becoming available
20. Example SQL-on-Hadoop: HadoopDB / Hadapt
Architecture of Hadapt (diagram; Hive appears as the SQL query layer):
- Hadapt is a commercialized version of Daniel Abadi's HadoopDB project
- For structured data, each Data Node uses a DBMS (Postgres or VectorWise) instead of HDFS
- Load balancing and performance optimization on nodes