The document discusses the evolving relationship between data warehouse (DW) and Hadoop implementations. It notes that DW vendors are incorporating Hadoop capabilities while the Hadoop ecosystem is growing to include more DW-like functions. Major DW vendors will likely continue playing a key role by acquiring successful new entrants or incorporating their technologies. The optimal approach involves a hybrid model that leverages the strengths of DWs and Hadoop, with queries determining where data resides and processing occurs. SQL-on-Hadoop architectures aim to bridge the two worlds by bringing SQL and DW tools to Hadoop.
2. Why this topic?
Note on terminology: in this context, RDBMS (relational databases) are synonymous with DW (data warehouses).
Data Warehouse vendors are evolving to incorporate the best
Hadoop has to offer. Similarly, the Hadoop ecosystem is growing
to include capabilities previously available only to large scale (MPP)
DW platforms.
Understanding the trends and alternatives helps your organization identify the most effective long-term solution.
Launched in TDWI forum on LinkedIn
(See http://bit.ly/Hybrid-DW-Hadoop)
Who are the winners in the race for the ultimate hybrid
DBMS-Hadoop implementation?
As described in http://bit.ly/Hybrid-RDBMS-Hadoop, the industry is moving to a hybrid DBMS-Hadoop model that leverages the best of both worlds. (Microsoft, for example, is building its Polybase with cost-based query optimization that decides whether to push processing to the Hadoop data nodes or the PDW compute nodes.)
Which vendors do you see as the current leaders in this
race? And for the visionaries and philosophers among you...
How do you see it ultimately shaking out?
This group extends the TDWI community online and is designed to foster peer networking and discussion of key issues relevant to business intelligence and data warehousing managers.

TDWI (The Data Warehousing Institute™) provides education, training, certification, news, and research for executives and information technology (IT) professionals worldwide. Founded in 1995, TDWI is the premier educational institute for business intelligence and data warehousing. Our Web site is www.tdwi.org.
3. Relationship between Hadoop & DW implementations
To leverage the strengths of each platform, traditionally...
- Hadoop is used for storage & transformation (specifically, ELT) of vast volumes of raw data
- ...while the DW is used for analytics on a subset of the processed data
Extract from source systems → Load to DW → Transform in place → Reporting & OLAP using traditional tools
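The traditional split above can be sketched end to end. This is a toy illustration, not vendor code: SQLite stands in for the DW, and the table and event names are invented for the example.

```python
import sqlite3

# Hypothetical stand-in for the DW: an in-memory SQLite database.
dw = sqlite3.connect(":memory:")

# Extract: raw events pulled from a source system (illustrative data).
raw_events = [("2013-05-01", "click", 3), ("2013-05-01", "view", 10),
              ("2013-05-02", "click", 5)]

# Load: land the raw data in the DW first (the "L" before the "T" in ELT).
dw.execute("CREATE TABLE raw_events (day TEXT, kind TEXT, n INTEGER)")
dw.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform in place: aggregate with SQL inside the DW itself, so
# reporting and OLAP tools can query the result directly.
dw.execute("""CREATE TABLE daily_clicks AS
              SELECT day, SUM(n) AS clicks
              FROM raw_events WHERE kind = 'click'
              GROUP BY day""")

print(dict(dw.execute("SELECT day, clicks FROM daily_clicks")))
```

In the hybrid pattern discussed later, the raw-event storage and the heavy transformation would move to Hadoop, while the small aggregate stays in the DW.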
4. Why do DW vendors care about Hadoop?
...And why not just ignore it as a special-use-case solution?
1. Compelling price point to store high volume data, especially
if it’s not needed for real-time access
2. Has become the de facto standard for ambitious big data
projects
3. HDFS and MapReduce are becoming mature and stable
technologies (in development since 2005), despite the fact
that the rest of the ecosystem is still rapidly evolving
4. DBMS vendors have been missing a scalable distributed
file system, such as HDFS, to provide capability to store
and manipulate variable and unstructured data
5. Do companies with an MPP DW still need Hadoop?
For companies that have an MPP (Massively Parallel Processing) data
warehouse, such as Teradata or Netezza...
Couldn’t the DW platform do everything Hadoop could do?
Yes and no.
1. Yes, you can store a lot of unstructured data in text fields, simulate operations on key-value pairs, and scale processing capacity horizontally, just like Hadoop.
2. But the types of processing that can be done against the DW are more limited than with Hadoop. (Although some DW vendors now allow open source tools, including MapReduce and R, to crunch the data.)
3. And the cost of storing high volumes of data – especially for low-frequency, high-latency operations – is much lower for Hadoop. So adding Hadoop to the mix can keep the size of the more expensive DW platform in check.
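Point 3 is back-of-the-envelope arithmetic. The per-terabyte costs below are purely hypothetical placeholders; only the shape of the comparison matters.

```python
# Illustrative only: the per-terabyte figures are hypothetical, chosen to
# show the shape of the trade-off, not real vendor pricing.
DW_COST_PER_TB = 20000.0      # fully loaded MPP DW cost, $/TB (assumed)
HADOOP_COST_PER_TB = 1000.0   # commodity Hadoop cluster cost, $/TB (assumed)

def platform_cost(hot_tb, cold_tb, use_hadoop_for_cold):
    """Total storage cost when cold, low-frequency data is offloaded (or not)."""
    if use_hadoop_for_cold:
        return hot_tb * DW_COST_PER_TB + cold_tb * HADOOP_COST_PER_TB
    return (hot_tb + cold_tb) * DW_COST_PER_TB

# 100 TB total, of which only 10 TB is hot, frequently queried data.
dw_only = platform_cost(10, 90, use_hadoop_for_cold=False)
hybrid = platform_cost(10, 90, use_hadoop_for_cold=True)
print(dw_only, hybrid)  # the hybrid keeps the expensive DW tier small
```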
6. What paths are DW vendors taking?
DW vendors can choose from three typical paths for responding to competing technologies:

1. Ignore: hope it's a fad that goes away. (None of the major vendors see this as a viable option.)
2. Reactionary: interoperate with existing Hadoop products. (This seems to be the most common path for established commercial vendors.)
3. Proactive: embrace Hadoop and contribute to extending the ecosystem. (This seems to be the approach for new entrants competing in a targeted niche.)
7. What will happen to major DW vendors?
It’s safe to say that major commercial vendors like
IBM, Oracle, Teradata and Microsoft
will continue to be key players.
1. Each one already has a product roadmap involving some way of responding to or incorporating the Hadoop ecosystem.
2. Many of the most successful new entrants will be acquired by these vendors and incorporated into their product lines.
3. This pattern can be seen in the historical evolution of revolutionary or niche technologies, such as columnar databases, in-memory databases, and self-service BI capabilities.
8. Partnerships & Alliances
Most large commercial DBMS vendors have partnerships with
specific Hadoop distribution developers.
- Oracle (Big Data Appliance)
- IBM Netezza
- Microsoft (HDInsight)
- Teradata & AsterData appliance
- Greenplum (prior to GreenplumHD)
9. Possible phases of a DW platform to evolve into a hybrid
There are many ways to get to the end goal, but here’s a possible evolution
path for commercial DW vendors to hybrid DBMS-Hadoop solutions
- Independent: DW and Hadoop operate independently, storing completely different data sets depending on size and structure, and processing them in completely different ways.
- Batch data movement: there's an efficient method to shuttle data back and forth between DW and Hadoop, focused primarily on loading a subset of data into the DW for analytics. Transformations typically happen in the context of ELT, rather than ETL.
- Integrated storage: queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, it is pulled into the DW prior to executing the query against it.
- Optimized processing in place: queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, the query is executed in place within Hadoop, possibly after converting the logic to MapReduce. The result is brought back to the DW.
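The difference between the last two phases can be sketched as a routing rule. The catalog, table names, and plan strings below are invented for illustration only.

```python
# Hypothetical catalog mapping each table to the platform where it resides.
CATALOG = {"sales": "dw", "clickstream": "hadoop"}

def plan_query(table, phase):
    """Return an illustrative execution plan for a query against `table`."""
    location = CATALOG[table]
    if location == "dw":
        return "execute in DW"
    if phase == "integrated_storage":
        # Phase 3: pull the Hadoop-resident data into the DW, then query it.
        return "copy from Hadoop to DW, then execute in DW"
    # Phase 4: convert the logic and run it where the data lives.
    return "convert to MapReduce, execute on Hadoop, return result to DW"

print(plan_query("clickstream", "integrated_storage"))
print(plan_query("clickstream", "optimized_in_place"))
```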
10. How about Hadoop distribution vendors?
Hadoop distribution vendors like
Cloudera, Hortonworks and MapR
are developing products to add DW capabilities
and bridge the gap between the two worlds
Their solution:
- Improve on the limitations of MapReduce
- Reduce the data silos and the overhead of moving data between DW and Hadoop
11. What’s wrong with MapReduce?
MapReduce is the original and purest processing environment for Hadoop. But it's not ideal, for a number of reasons:
- Performance: the long lag between query and results makes it ill-suited for interactive analytics
- Skills: it requires a high degree of Java skill for processing data, while existing staff with SQL skills would be underutilized
- Tooling: it doesn't leverage a company's existing investment in ETL, reporting, and analytics tools
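To make the skills point concrete, here is the classic word-count job expressed in the MapReduce model. On a real cluster this would be Java Mapper/Reducer classes distributed across nodes; this single-process Python sketch only illustrates the programming model.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort then reduce: group pairs by key and sum the counts.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

The same job in SQL is a one-line `GROUP BY` with `COUNT(*)`, which is exactly the gap the SQL-on-Hadoop engines discussed next aim to close.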
12. Why not use DW-Hadoop connectors?
The original approach of integrating DW with Hadoop – using connectors to move data back and forth – is not ideal: it introduces the costs and inefficiencies of dealing with data silos. To solve this problem, a more practical "SQL-on-Hadoop" architecture is being adopted.

Typically, SQL-on-Hadoop capabilities include:
- Interactive analytical queries (read-only)
- Parallelism / distributed processing
- Efficient joins across multiple tables
- ANSI SQL compliance
- Query caching
- Ability to use existing ETL, OLAP and reporting tools from commercial vendors
SQL-on-Hadoop players:
- Cloudera's Impala
- Hortonworks' Stinger (faster Hive via ORCFile & Tez)
- Apache Drill (supported by MapR)
- Hadapt / HadoopDB
- Greenplum's HAWQ (on Pivotal HD)
- Teradata's SQL-H (on Aster/PostgreSQL)
13. DW & Hadoop vendors approach from opposite directions
DW vendors start with their relational DBMS platform... and add interactivity with Hadoop:
- Initially, all processing might occur on the DBMS, with Hadoop being used for storage
- Ultimately evolves to pushing processing to the Hadoop cluster
- Examples: Microsoft PDW / Polybase, Greenplum / Pivotal HD

Hadoop vendors start with their Hadoop distribution... and add DBMS features:
- This is also known as "SQL-on-Hadoop"
- Add a query optimizer, real-time capabilities, etc.
- Examples: Cloudera Impala, Hortonworks Stinger, Hadapt

In the end, both of these approaches might converge on very similar hybrid solutions.
14. Utopian vision
Ultimately, ideally the user doesn’t know
(and doesn’t care) where the data is stored and
how it’s processed, as made possible by using...
Single toolset: use a single set of ETL, reporting and analytics tools, regardless of where the data resides.
Automated optimization: the system automatically decides between RDBMS and Hadoop...
- Where to store data, based on its structure and directives on how it's to be used
- Where to push processing, based on query cost optimization
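The two automated decisions can be sketched as a pair of rules. The thresholds and cost figures below are invented for the example; a real optimizer would use detailed statistics.

```python
def choose_storage(structured, accesses_per_day):
    """Where to store a data set, based on structure and usage directives.
    The one-access-per-day cutoff is an arbitrary illustrative threshold."""
    if structured and accesses_per_day >= 1:
        return "rdbms"
    return "hadoop"

def choose_processing(rdbms_cost, hadoop_cost):
    """Where to push processing, based on estimated query cost
    (stand-in for real cost-based optimization)."""
    return "rdbms" if rdbms_cost <= hadoop_cost else "hadoop"

print(choose_storage(structured=True, accesses_per_day=50))    # rdbms
print(choose_storage(structured=False, accesses_per_day=0.1))  # hadoop
print(choose_processing(rdbms_cost=120, hadoop_cost=45))       # hadoop
```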
15. Evolution of SQL-on-Hadoop: Operational data store
Eventually, Hadoop implementations might evolve into
supporting “operational” transactions
- Ability to handle workloads that power websites and applications
- Transaction-oriented write capability, rather than the read-only access of analytical queries
- "ACID" database capabilities, including concurrency, distributed transactional support, and guarantees of data consistency
16. Microsoft's Approach
Microsoft's strategy involves the Polybase initiative and the ability to leverage its extensive range of BI tools.

Polybase is the hybrid RDBMS-Hadoop platform, which spans queries across HDFS on HDInsight (Hortonworks' distribution that runs on Windows) and Microsoft's MPP DW platform, SQL Server PDW (Parallel Data Warehouse).
Polybase development is phased:
- Phase 1: parallel data transfer between SQL Server Compute Nodes and Hadoop Data Nodes, but all processing is done on the DBMS
- Phase 2: use the query optimizer to decide where to process jobs; selectively push work to Hadoop by converting queries to MapReduce

Integration with Microsoft's BI stack, including:
- Reporting (SQL Server Reporting Services)
- OLAP (SQL Server Analysis Services)
- Self-service BI (Excel, PowerPivot, Power View)
17. Microsoft's Approach (cont.)
The advantages of going the Microsoft route are numerous...
- Make use of Hadoop using familiar, high-productivity self-service BI tools:
  - Excel itself can handle data extracts from Hadoop
  - PowerPivot for large-scale data exploration using the xVelocity in-memory analytics engine
  - Power View for ad hoc visualization in SSRS, accessible via SharePoint or Excel
  - In Office 2013, PowerPivot and Power View are natively integrated with Excel
- Leverage existing and widely available .NET developers
- Management of a Hadoop cluster (using Apache Ambari) is integrated with Microsoft System Center, already used by IT operators for database management
- Deliver tighter security through integration with Windows Server Active Directory
- Cloud-based Hadoop available through the Windows Azure HDInsight Service
- Interactive access to Hadoop through Hortonworks' Hive ODBC driver
18. Microsoft's Approach (cont.)
And some of the disadvantages include...
- Licensing costs for both the Windows Server instances that run HDInsight nodes and the SQL Server nodes associated with Polybase
- Uncertain performance and adoption of the HDInsight distribution
- Many of the advantages in integration and leveraging resources don't apply for non-Microsoft shops
19. Greenplum’s Pivotal HD
Greenplum’s Pivotal HD implements
But...
Greenplum on top of HDFS
Proprietary technology means
Yet still capable of running MapReduce
vendor lock-in and
jobs if needed
inability to take advantage of a vibrant
developer community
More mature than many of its rivals
Capabilities extend well beyond those of Licenses are expensive and...
open source distributions
Open source alternatives (Impala,
Drill, Shark) are becoming available
20. Example SQL-on-Hadoop: HadoopDB / Hadapt
Architecture of Hadapt (diagram; Hive appears as the SQL query layer):
- Hadapt is a commercialized version of Daniel Abadi's HadoopDB project
- For structured data, each Data Node uses a DBMS (Postgres or VectorWise) instead of HDFS
- Load balancing and performance optimization on nodes