The future of hybrid Data Warehouse-Hadoop implementations

David Portnoy
Datalytx, Inc.
312.970.9740
http://LinkedIn.com/in/DavidPortnoy

© Copyright 2013 David Portnoy and Datalytx, Inc.

Why this topic?

Note on terminology used: in this context, RDBMS (relational databases) are synonymous with DW (data warehouses).

• Data Warehouse vendors are evolving to incorporate the best Hadoop has to offer. Similarly, the Hadoop ecosystem is growing to include capabilities previously available only to large-scale (MPP) DW platforms.

• Understanding the trends and alternatives helps your organization identify the most effective long-term solution.

• Launched in the TDWI forum on LinkedIn (see http://bit.ly/Hybrid-DW-Hadoop)

Who are the winners in the race for the ultimate hybrid DBMS-Hadoop implementation?

As described in http://bit.ly/Hybrid-RDBMS-Hadoop, the industry is moving to a hybrid DBMS-Hadoop model that leverages the best of both worlds. (Microsoft, for example, is building its Polybase with cost-based query optimization that decides whether to push processing to the Hadoop data nodes or the PDW compute nodes.)

Which vendors do you see as the current leaders in this race? And for the visionaries and philosophers among you... how do you see it ultimately shaking out?

This group extends the TDWI community online and is designed to foster peer networking and discussion of key issues relevant to business intelligence and data warehousing managers.

TDWI (The Data Warehousing Institute™) provides education, training, certification, news, and research for executives and information technology (IT) professionals worldwide. Founded in 1995, TDWI is the premier educational institute for business intelligence and data warehousing. Our Web site is www.tdwi.org.
Relationship between Hadoop & DW implementations

To leverage the strengths in each platform, traditionally...

• Hadoop is used for storage & transformation (specifically, ELT) of vast volumes of raw data

• ...while the DW is used for analytics on a subset of the processed data

Flow: Extract from source systems → Load to DW → Transform in place → Reporting & OLAP using traditional tools
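The ELT flow above can be sketched in a few lines, using Python's built-in sqlite3 as a stand-in for the DW engine (the table names and sample data are illustrative, not from the deck):

```python
import sqlite3

# Extract: raw records pulled from a hypothetical source system
raw_events = [
    ("2013-05-01", "page_view", 3),
    ("2013-05-01", "click", 1),
    ("2013-05-02", "page_view", 7),
]

conn = sqlite3.connect(":memory:")  # stand-in for the DW
cur = conn.cursor()

# Load: land the raw data as-is in a staging table (the "L" before the "T")
cur.execute("CREATE TABLE stg_events (event_date TEXT, event_type TEXT, cnt INTEGER)")
cur.executemany("INSERT INTO stg_events VALUES (?, ?, ?)", raw_events)

# Transform in place: aggregate inside the engine, the defining step of ELT
cur.execute("""
    CREATE TABLE dw_daily_totals AS
    SELECT event_date, SUM(cnt) AS total
    FROM stg_events
    GROUP BY event_date
""")

print(dict(cur.execute("SELECT * FROM dw_daily_totals ORDER BY event_date")))
```

The point of the pattern is that the transform runs where the data already sits, instead of in a separate ETL tier.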
Why do DW vendors care about Hadoop?

...And why not just ignore it as a special use case solution?

1. Compelling price point for storing high-volume data, especially if it’s not needed for real-time access
2. Has become the de facto standard for ambitious big data projects
3. HDFS and MapReduce are becoming mature and stable technologies (in development since 2005), even though the rest of the ecosystem is still rapidly evolving
4. DBMS vendors have been missing a scalable distributed file system, such as HDFS, with which to store and manipulate variable and unstructured data
Do companies with an MPP DW still need Hadoop?

For companies that have an MPP (Massively Parallel Processing) data warehouse, such as Teradata or Netezza... couldn’t the DW platform do everything Hadoop could do?

Yes and no.

1. Yes, you can store a lot of unstructured data in text fields, simulate operations on key-value pairs, and scale processing capacity horizontally, just like Hadoop.
2. But the types of processing that can be run against the DW are more limited than with Hadoop. (Although some DW vendors now allow open source tools, including MapReduce and R, to crunch the data.)
3. And the cost of storing high volumes of data – especially for low-frequency, high-latency operations – is much lower with Hadoop. So adding Hadoop to the mix can keep the size of the more expensive DW platform in check.
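Point 1 above — simulating key-value operations inside a relational engine — can be sketched with sqlite3 standing in for the MPP DW (the schema and helper names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simulate a key-value store in a relational table: one TEXT column for the
# key, one for an opaque value (e.g. a JSON blob of unstructured data)
cur.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

def put(key, value):
    # Upsert, mimicking a key-value store's put()
    cur.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

def get(key):
    # Point lookup, mimicking a key-value store's get()
    row = cur.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

put("user:42", '{"name": "Ada", "visits": 7}')
print(get("user:42"))
```

This works, but it also illustrates the slide's caveat: the engine only sees an opaque string, so any processing richer than lookups requires pulling the value out — which is where Hadoop-style processing has the edge.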
What paths are DW vendors taking?

DW vendors can choose from 3 typical paths for responding to competing technologies:

1. Ignore: Hope it’s a fad that goes away. (None of the major vendors see this as a viable option.)
2. Reactionary: Interoperate with existing Hadoop products. (This seems to be the most common path for established commercial vendors.)
3. Proactive: Embrace Hadoop and contribute to extending the ecosystem. (This seems to be the approach for new entrants competing in a targeted niche.)
What will happen to major DW vendors?

It’s safe to say that major commercial vendors like IBM, Oracle, Teradata and Microsoft will continue to be key players.

1. Each one already has a product roadmap involving some way of responding to or incorporating the Hadoop ecosystem.
2. Many of the most successful new entrants will be acquired by these vendors and incorporated into their product lines.
3. This pattern can be seen in the historical evolution of revolutionary or niche technologies, such as columnar databases, in-memory databases, and self-service BI capabilities.
Partnerships & Alliances

Most large commercial DBMS vendors have partnerships with specific Hadoop distribution developers:

• Oracle (Big Data Appliance)
• IBM Netezza
• Microsoft (HDInsight)
• Teradata & AsterData appliance
• Greenplum (prior to GreenplumHD)
Possible phases of a DW platform to evolve into a hybrid

There are many ways to get to the end goal, but here’s a possible evolution path for commercial DW vendors to hybrid DBMS-Hadoop solutions:

1. Independent: DW and Hadoop operate independently, storing completely different data sets depending on size and structure and processing them in completely different ways.

2. Batch data movement: There’s an efficient method to shuttle data back and forth between DW and Hadoop, focused primarily on loading a subset of data into the DW for analytics. Transformations typically happen in the context of ELT, rather than ETL.

3. Integrated storage: Queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, it pulls the data into the DW prior to executing the query against it.

4. Optimized processing in place: Queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, the query is executed in place within Hadoop, possibly after converting the logic to MapReduce. The result is brought back to the DW.
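The last two phases hinge on a router that knows where each data set lives and weighs the cost of moving it against the cost of pushing the query down. A toy sketch of that decision, with a made-up catalog and cost threshold (not any vendor's actual optimizer):

```python
# Illustrative catalog: where each table lives and roughly how big it is
CATALOG = {
    "orders": {"location": "dw", "rows": 1_000_000},
    "clickstream": {"location": "hadoop", "rows": 10_000_000_000},
}

def plan_query(table, selectivity):
    """Return a plan for scanning `table`, keeping `selectivity` of its rows."""
    meta = CATALOG[table]
    if meta["location"] == "dw":
        return "execute_in_dw"
    # Data is on Hadoop: pulling it into the DW costs roughly the number of
    # rows moved; beyond a (toy) threshold, running the query in place as a
    # MapReduce job is the cheaper option.
    rows_moved = meta["rows"] * selectivity
    return "pull_into_dw" if rows_moved < 1_000_000 else "push_down_as_mapreduce"

print(plan_query("orders", 0.5))           # data already in the DW
print(plan_query("clickstream", 0.00001))  # tiny result: cheaper to move it
print(plan_query("clickstream", 0.3))      # huge result: run it in place
```

Real optimizers (e.g. the cost-based approach Polybase is described as taking) use far richer statistics, but the shape of the decision is the same.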
How about Hadoop distribution vendors?

Hadoop distribution vendors like Cloudera, Hortonworks and MapR are developing products to add DW capabilities and bridge the gap between the two worlds.

Their solution:
• Improve on the limitations of MapReduce
• Reduce the data silos and the overhead of moving data between DW and Hadoop
What’s wrong with MapReduce?

MapReduce is the legacy and purest processing environment for Hadoop. But it’s not ideal, for a number of reasons:

• Performance: The long lag between query and results makes it difficult to do interactive analytics
• Requires a high degree of Java skill for processing data; existing staff with SQL skills would be underutilized
• Doesn’t leverage the company’s existing investment in ETL, reporting and analytics tools
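For readers who haven't written one, the map/shuffle/reduce model being contrasted with SQL here can be simulated in a few lines of plain Python (a single-process illustration, not Hadoop's Java API):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(grouped):
    # Reducer: sum the values collected for each key
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["Hadoop stores raw data", "Hadoop processes raw data in place"]

# Shuffle: group every emitted (key, value) pair by key
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

print(reduce_phase(grouped)["hadoop"])  # 2
```

Even this word count takes three explicit phases; the equivalent SQL is a one-line GROUP BY — which is exactly the skills-and-productivity gap the slide is pointing at.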
Why not use DW-Hadoop connectors?

The original approach of integrating DW with Hadoop using connectors to move data back and forth is not ideal. It introduces the costs and inefficiencies of dealing with data silos. To solve this problem, a more practical “SQL-on-Hadoop” architecture is being adopted.

Typically, SQL-on-Hadoop capabilities include:
• Interactive analytical queries (read-only)
• Parallelism / distributed processing
• Efficient joins across multiple tables
• ANSI SQL compliance
• Query caching
• Ability to use existing ETL, OLAP and reporting tools from commercial vendors

SQL-on-Hadoop players:
• Cloudera’s Impala
• Hortonworks’ Stinger (faster Hive via ORCFile & Tez)
• Apache Drill (supported by MapR)
• Hadapt / HadoopDB
• Greenplum’s HAWQ (on Pivotal HD)
• Teradata’s SQL-H (on Aster/PostgreSQL)
DW & Hadoop vendors approach from opposite directions

Relational DBMS → (add interactivity) → Hybrid ← (add DBMS features) ← Hadoop

DW vendors start with their relational DBMS platform... and add interactivity to Hadoop:
• Initially all processing might occur on the DBMS, with Hadoop being used for storage
• Ultimately evolves to pushing processing to the Hadoop cluster
• Examples: Microsoft PDW / Polybase, Greenplum / Pivotal HD

Hadoop vendors start with their Hadoop distribution... and add DBMS features:
• This is also known as “SQL-on-Hadoop”
• Add a query optimizer, real-time capabilities, etc.
• Examples: Cloudera Impala, Hortonworks Stinger, Hadapt

In the end, both of these approaches might end up with very similar hybrid solutions.
Utopian vision

Ultimately, the user ideally doesn’t know (and doesn’t care) where the data is stored and how it’s processed, as made possible by using...

• Single toolset: Use a single set of ETL, reporting and analytics tools, regardless of where the data resides

• Automated optimization: The DW automatically decides between RDBMS and Hadoop...
  • Where to store data, based on its structure and directives on how it’s to be used
  • Where to push processing, based on query cost optimization
Evolution of SQL-on-Hadoop: Operational data store

Eventually, Hadoop implementations might evolve into supporting “operational” transactions:

• Ability to handle the workloads that power websites and applications
• Transaction-oriented write capability, rather than the read-only access of analytical queries
• “ACID” database capabilities, including concurrency, distributed transactional support, and guarantees of data consistency
Microsoft's Approach

Microsoft's strategy involves the Polybase initiative and the ability to leverage its extensive range of BI tools.

• Polybase is the hybrid RDBMS-Hadoop platform, which spans queries across:
  • HDFS on HDInsight (Hortonworks’ distribution that runs on Windows), and
  • Microsoft’s MPP DW platform, SQL Server PDW (Parallel Data Warehouse)

• Polybase development is phased:
  • Phase 1: Parallel data transfer between SQL Server Compute Nodes and Hadoop Data Nodes, but all processing is done on the DBMS
  • Phase 2: Use the query optimizer to decide where to process jobs; selectively push work to Hadoop by converting queries to MapReduce

• Integration with Microsoft’s BI stack, including:
  • Reporting (SQL Server Reporting Services)
  • OLAP (SQL Server Analysis Services)
  • Self-service BI (Excel, PowerPivot, Power View)
Microsoft's Approach (cont.)

The advantages of going the Microsoft route are numerous...

• Make use of Hadoop through familiar, high-productivity self-service BI tools:
  • Excel itself can handle data extracts from Hadoop
  • PowerPivot for large-scale data exploration using the xVelocity in-memory analytics engine
  • Power View for ad-hoc visualization in SSRS, accessible via SharePoint or Excel
  • In Office 2013, PowerPivot and Power View are natively integrated with Excel
• Leverage existing and widely available .NET developers
• Management of a Hadoop cluster (using Apache Ambari) is integrated with Microsoft System Center, already used by IT operators for database management
• Deliver tighter security through integration with Windows Server Active Directory
• Cloud-based Hadoop available through the Windows Azure HDInsight Service
• Interactive access to Hadoop through Hortonworks' Hive ODBC driver
Microsoft's Approach (cont.)

And some of the disadvantages include...

• Licensing costs for both the Windows Server instances that run HDInsight nodes and the SQL Server nodes associated with Polybase
• Uncertain performance and adoption of the HDInsight distribution
• Many of the advantages in integration and leveraging resources don’t apply to non-Microsoft shops
Greenplum’s Pivotal HD

Greenplum’s Pivotal HD implements Greenplum on top of HDFS:
• Yet still capable of running MapReduce jobs if needed
• More mature than many of its rivals
• Capabilities extend well beyond those of open source distributions

But...
• Proprietary technology means vendor lock-in and an inability to take advantage of a vibrant developer community
• Licenses are expensive, and open source alternatives (Impala, Drill, Shark) are becoming available
Example SQL-on-Hadoop: HadoopDB / Hadapt

Architecture of Hadapt:

• Hadapt is a commercialized version of Daniel Abadi's HadoopDB project
• For structured data, each Data Node uses a DBMS (Postgres or VectorWise) instead of HDFS
• Load balancing and performance optimization on nodes
• Hive serves as the SQL layer

Example SQL-on-Hadoop: BigSQL

BigSQL is PostgreSQL implemented on Hadoop

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
DataWorks Summit
 
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopChicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Cloudera, Inc.
 

Was ist angesagt? (20)

Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
The Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the SameThe Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the Same
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
 
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopChicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
The Warranty Data Lake – After, Inc.
The Warranty Data Lake – After, Inc.The Warranty Data Lake – After, Inc.
The Warranty Data Lake – After, Inc.
 
Building a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveBuilding a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's Perspective
 
How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 

Andere mochten auch

Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data WarehouseHybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
DataWorks Summit
 
Teradata vs-exadata
Teradata vs-exadataTeradata vs-exadata
Teradata vs-exadata
Louis liu
 
adage-factpack-neustar-final
adage-factpack-neustar-finaladage-factpack-neustar-final
adage-factpack-neustar-final
angielynncul
 

Andere mochten auch (20)

Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse Platforms
 
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data WarehouseHybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
Information Virtualization: Query Federation on Data Lakes
Information Virtualization: Query Federation on Data LakesInformation Virtualization: Query Federation on Data Lakes
Information Virtualization: Query Federation on Data Lakes
 
Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems
 
Agile Business Intelligence
Agile Business IntelligenceAgile Business Intelligence
Agile Business Intelligence
 
Netezza vs Teradata vs Exadata
Netezza vs Teradata vs ExadataNetezza vs Teradata vs Exadata
Netezza vs Teradata vs Exadata
 
Open Data Discoverability
Open Data DiscoverabilityOpen Data Discoverability
Open Data Discoverability
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
Microsoft and Hortonworks Delivers the Modern Data Architecture for Big Data
Microsoft and Hortonworks Delivers the Modern Data Architecture for Big DataMicrosoft and Hortonworks Delivers the Modern Data Architecture for Big Data
Microsoft and Hortonworks Delivers the Modern Data Architecture for Big Data
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Teradata vs-exadata
Teradata vs-exadataTeradata vs-exadata
Teradata vs-exadata
 
KEYMUSIC customer case by Arlanet, From Bricks to Clicks
KEYMUSIC customer case by Arlanet, From Bricks to ClicksKEYMUSIC customer case by Arlanet, From Bricks to Clicks
KEYMUSIC customer case by Arlanet, From Bricks to Clicks
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data Warehouse
 
adage-factpack-neustar-final
adage-factpack-neustar-finaladage-factpack-neustar-final
adage-factpack-neustar-final
 
Scaling self service on Hadoop
Scaling self service on HadoopScaling self service on Hadoop
Scaling self service on Hadoop
 
Expand a Data warehouse with Hadoop and Big Data
Expand a Data warehouse with Hadoop and Big DataExpand a Data warehouse with Hadoop and Big Data
Expand a Data warehouse with Hadoop and Big Data
 

Ähnlich wie Hybrid Data Warehouse Hadoop Implementations

Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure
Rajesh Angadi
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
Silicon Halton
 
Hadoop training kit from lcc infotech
Hadoop   training kit from lcc infotechHadoop   training kit from lcc infotech
Hadoop training kit from lcc infotech
lccinfotech
 

Ähnlich wie Hybrid Data Warehouse Hadoop Implementations (20)

Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in Delhi
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
How pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architectureHow pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architecture
 
Actian DataFlow Whitepaper
Actian DataFlow WhitepaperActian DataFlow Whitepaper
Actian DataFlow Whitepaper
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big Data Hadoop Training- Multisoft Systems
Big Data Hadoop Training- Multisoft SystemsBig Data Hadoop Training- Multisoft Systems
Big Data Hadoop Training- Multisoft Systems
 
Hadoop training kit from lcc infotech
Hadoop   training kit from lcc infotechHadoop   training kit from lcc infotech
Hadoop training kit from lcc infotech
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
 

Mehr von David Portnoy

Mehr von David Portnoy (7)

DDOD framework infographic
DDOD framework infographicDDOD framework infographic
DDOD framework infographic
 
Impact of DDOD on Data Quality - White House 2016
Impact of DDOD on Data Quality -  White House 2016Impact of DDOD on Data Quality -  White House 2016
Impact of DDOD on Data Quality - White House 2016
 
Industry Uses of HHS Data
Industry Uses of HHS DataIndustry Uses of HHS Data
Industry Uses of HHS Data
 
DDOD for FOIA organizations
DDOD for FOIA organizationsDDOD for FOIA organizations
DDOD for FOIA organizations
 
Intro to Demand-Driven Open Data for Data Owners
Intro to Demand-Driven Open Data for Data OwnersIntro to Demand-Driven Open Data for Data Owners
Intro to Demand-Driven Open Data for Data Owners
 
Intro to Demand Driven Open Data for Data Users
Intro to Demand Driven Open Data for Data UsersIntro to Demand Driven Open Data for Data Users
Intro to Demand Driven Open Data for Data Users
 
Case Study in Linked Data and Semantic Web: Human Genome
Case Study in Linked Data and Semantic Web: Human GenomeCase Study in Linked Data and Semantic Web: Human Genome
Case Study in Linked Data and Semantic Web: Human Genome
 

Hybrid Data Warehouse Hadoop Implementations

  • 1. The future of hybrid Data Warehouse-Hadoop implementations - - David Portnoy Datalytx, Inc. 312.970.9740 http://LinkedIn.com/in/DavidPortnoy - - - © Copyright 2013 David Portnoy and Datalytx, Inc. - - -
  • 2. Why this topic? Note on terminology used In this context RDBMS (relational databases) are synonymous with DW (data warehouses)  Data Warehouse vendors are evolving to incorporate the best Hadoop has to offer. Similarly, the Hadoop ecosystem is growing to include capabilities previously available only to large scale (MPP) DW platforms.  Understanding the trends and alternatives helps your organization identify the most effective long term solution  Launched in TDWI forum on LinkedIn (See http://bit.ly/Hybrid-DW-Hadoop) Who are the winners in the race for the ultimate hybrid DBMS-Hadoop implementation? As described in http://bit.ly/Hybrid-RDBMS-Hadoop, the industry is going to a hybrid DBMS-Hadoop model that leverages the best of both worlds. (Microsoft for example is building its Polybase with cost based query optimization that decides whether to push processing to the Hadoop data nodes or the PDW compute nodes.) Which vendors do you see as the current leaders in this race? And for the visionaries and philosophers among you... How do you see it ultimately shaking out? This group extends the TDWI community online and is designed to foster peer network and discussion of key issues relevant to business intelligence and data warehousing managers. TDWI (The Data Warehousing Institute ™) provides education, training, certification, news, and research for executives and information technology (IT) professionals worldwide. Founded in 1995, TDWI is the premier educational institute for business intelligence and data warehousing. Our Web site is www.tdwi.org.
  • 3. Relationship between Hadoop & DW implementations To leverage the strengths in each platform, traditionally...  Hadoop is used for storage & transformation (specifically, ELT) of vast volumes of raw data  ...while DW is used for analytics on a subset of the processed data Extract from source systems Load to DW Transform in place DW Reporting & OLAP using traditional tools
  • 4. Why do DW vendors care about Hadoop? ...And why not just ignore it as a special use case solution? 1. Compelling price point to store high volume data, especially if it’s not needed for real-time access 2. Has become the de facto standard for ambitious big data projects 3. HDFS and MapReduce are becoming mature and stable technologies (in development since 2005), despite the fact that the rest of the ecosystem is still rapidly evolving 4. DBMS vendors have been missing a scalable distributed file system, such as HDFS, to provide capability to store and manipulate variable and unstructured data
  • 5. Do companies with an MPP DW still need Hadoop? For companies that have an MPP (Massively Parallel Processing) data warehouse, such as Teradata or Netezza... Couldn’t the DW platform do everything Hadoop could do? Yes and no. 1. Yes, you can store a lot of unstructured data in text fields, simulate operations of key-value pairs, and scale processing capacity horizontally, just like Hadoop 2. But the types of processing that can be done against the DW is more limited than with Hadoop. (Although some DW vendors now allow open source tools, including MapReduce and R, to crunch the data.) 3. And the cost of storing high volumes of data – especially for low usage frequency and high latency operations – is much lower for Hadoop. So adding Hadoop to the mix can keep the size of the more expensive DW platform in check.
  • 6. What paths are DW vendors taking? DW vendors can choose from 3 typical paths for responding to competing technologies 1. Ignore: Hope it’s a fad that goes away None of the major vendors see this as a viable option 2. Reactionary: Interoperate with existing Hadoop products This seems to be the most common path for established commercial vendors 3. Proactive: Embrace Hadoop and contribute to extending the ecosystem This seems to be the approach for new entrants competing in a targeted niche
  • 7. What will happen to major DW vendors? It’s safe to say that major commercial vendors like IBM, Oracle, Teradata and Microsoft will continue to be key players. 1. Each one already has a product roadmap involving some way of responding to or incorporating the Hadoop ecosystem. 2. Many of the most successful new entrants will be acquired by these vendors and incorporated into their product lines. 3. This pattern can me seen in the historical evolution of revolutionary or niche technologies, such as columnar databases, in-memory databases, and self-service BI capabilities.
  • 8. Partnerships & Alliances Most large commercial DBMS vendors have partnerships with specific Hadoop distribution developers. Oracle (Big Data Appliance) IBM Netezza Microsoft (HDInsight) Teradata & AsterData appliance Greenplum (Prior to GreenplumHD)
  • 9. Possible phases of a DW platform to evolve into a hybrid There are many ways to get to the end goal, but here’s a possible evolution path for commercial DW vendors to hybrid DBMS-Hadoop solutions Phase Description Independent DW and Hadoop operate independently, storing completely different data sets depending on size and structure and processing them in completely different ways. Batch data movement There’s an efficient method to shuttle data back and forth between DW and Hadoop. Focused primarily on loading a subset of data into the DW for analytics. Transformations typically happen in context of ELT, rather than ETL. Integrated storage Queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If data resides on Hadoop, it pulls data into DW prior to executing a query against it. Optimized processing in place Queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If data resides on Hadoop, the query is executed in place within Hadoop, possibly after converting the logic to MapReduce. The result is brought back to the DW.
  • 10. How about Hadoop distribution vendors? Hadoop distribution vendors like Cloudera, Hortonworks and MapR are developing products to add DW capabilities and bridge the gap between the two worlds. Their solution:
- Improve on the limitations of MapReduce
- Eliminate the data silos and the overhead of moving data between DW and Hadoop
  • 11. What's wrong with MapReduce? MapReduce is Hadoop's original, most fundamental processing framework, but it's not ideal for a number of reasons:
- Performance: The long lag between query and results makes interactive analytics difficult.
- Skills: Processing data requires a high degree of Java skill; existing resources with SQL skills would be underutilized.
- Tooling: It doesn't leverage a company's existing investment in ETL, reporting and analytics tools.
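The skills gap above is easy to see with the classic word-count example. What a SQL user writes as `SELECT word, COUNT(*) FROM docs GROUP BY word` must be spelled out as explicit map, shuffle/sort and reduce steps. A toy in-process sketch (real MapReduce jobs are Java programs running against HDFS):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map step: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Reduce step: sum the 1s emitted for a single word.
    return (word, sum(counts))

def word_count(lines):
    # Run the mapper over every line, then sort by key (the
    # shuffle/sort phase) so each word's pairs are adjacent.
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )
```

Three functions and an explicit sort to express one line of SQL; this gap is exactly what the SQL-on-Hadoop products discussed next aim to close.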
  • 12. Why not use DW-Hadoop connectors? The original approach of integrating the DW with Hadoop, using connectors to move data back and forth, is not ideal: it introduces the costs and inefficiencies of maintaining data silos. To solve this problem, a more practical "SQL-on-Hadoop" architecture is being adopted. Typically, SQL-on-Hadoop capabilities include:
- Interactive analytical queries (read-only)
- Parallelism / distributed processing
- Efficient joins across multiple tables
- ANSI SQL compliance
- Query caching
- Ability to use existing ETL, OLAP and reporting tools from commercial vendors
SQL-on-Hadoop players:
- Cloudera's Impala
- Hortonworks' Stinger (faster Hive via ORCFile & Tez)
- Apache Drill (supported by MapR)
- Hadapt / HadoopDB
- Greenplum's HAWQ (on Pivotal HD)
- Teradata's SQL-H (on Aster/PostgreSQL)
  • 13. DW & Hadoop vendors approach from opposite directions.
DW vendors start with their relational DBMS platform and add interactivity with Hadoop:
- Initially all processing might occur on the DBMS, with Hadoop used only for storage
- Ultimately this evolves toward pushing processing to the Hadoop cluster
- Examples: Microsoft PDW / Polybase, Greenplum / Pivotal HD
Hadoop vendors start with their Hadoop distribution and add DBMS features:
- This is also known as "SQL-on-Hadoop"
- Add a query optimizer, real-time capabilities, etc.
- Examples: Cloudera Impala, Hortonworks Stinger, Hadapt
In the end, both approaches might converge on very similar hybrid solutions.
  • 14. Utopian vision. Ideally, the user ultimately doesn't know (and doesn't care) where the data is stored or how it's processed, made possible by:
- Single toolset: Use a single set of ETL, reporting and analytics tools, regardless of where the data resides
- Automated optimization: The system automatically decides between the RDBMS and Hadoop:
  - Where to store data, based on its structure and directives on how it's to be used
  - Where to push processing, based on query cost optimization
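The "where to push processing" decision above can be sketched as a toy cost model: compare moving the raw data to the DW against processing in place on Hadoop and moving only the result. All cost constants here are illustrative assumptions, not drawn from any real optimizer.

```python
def plan(rows_scanned, rows_returned,
         hadoop_cpu_factor=1.5,        # assumed: Hadoop CPU cost per row
         transfer_cost_per_row=2.0):   # assumed: cost to ship a row across
    """Pick the cheaper strategy under a crude per-row cost model."""
    # Strategy A: move the raw data to the DW, then process it there.
    move_then_process = (rows_scanned * transfer_cost_per_row
                         + rows_scanned)
    # Strategy B: process in place on Hadoop, move only the result.
    process_in_place = (rows_scanned * hadoop_cpu_factor
                        + rows_returned * transfer_cost_per_row)
    return "hadoop" if process_in_place < move_then_process else "dw"
```

Under these assumed constants, a highly selective aggregate (scan a million rows, return ten) is pushed to Hadoop, while a query that returns most of what it scans runs on the DW; real optimizers refine the same trade-off with statistics rather than fixed factors.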
  • 15. Evolution of SQL-on-Hadoop: Operational data store. Eventually, Hadoop implementations might evolve to support "operational" transactions:
- Ability to handle the workloads that power websites and applications
- Transaction-oriented write capability, rather than the read-only access of analytical queries
- "ACID" database capabilities, including concurrency, distributed transactional support, and guarantees of data consistency
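The "ACID" bar mentioned above is what analytical Hadoop engines of this era lacked. Atomicity, for instance, can be illustrated with any transactional store; a toy sqlite3 example (table and account names are made up) where a failed transfer rolls back in full:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute(
            "UPDATE accounts SET balance = balance + 50 WHERE name = 'x'")
        # The credit matched no row, so the transfer must fail as a whole.
        if conn.execute("SELECT changes()").fetchone()[0] == 0:
            raise ValueError("no such account")
except ValueError:
    pass  # the debit of 'a' was rolled back along with everything else

balance_a = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0]
```

After the failed transfer, account `a` still holds its original 100: the partial debit never became visible. Providing this guarantee across a distributed Hadoop cluster is the hard part of the evolution described above.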
  • 16. Microsoft's Approach. Microsoft's strategy involves the Polybase initiative and the ability to leverage its extensive range of BI tools.
- Polybase is the hybrid RDBMS-Hadoop platform, which spans queries across:
  - HDFS on HDInsight (Hortonworks' distribution that runs on Windows), and
  - Microsoft's MPP DW platform, SQL Server PDW (Parallel Data Warehouse)
- Polybase development is phased:
  - Phase 1: Parallel data transfer between SQL Server Compute Nodes and Hadoop Data Nodes, but all processing is done on the DBMS
  - Phase 2: Use the query optimizer to decide where to process jobs; selectively push work to Hadoop by converting queries to MapReduce
- Integration with Microsoft's BI stack, including:
  - Reporting (SQL Server Reporting Services)
  - OLAP (SQL Server Analysis Services)
  - Self-service BI (Excel, PowerPivot, Power View)
  • 17. Microsoft's Approach (cont.) The advantages of going the Microsoft route are numerous:
- Make use of Hadoop through familiar, high-productivity self-service BI tools:
  - Excel itself can handle data extracts from Hadoop
  - PowerPivot for large-scale data exploration using the xVelocity in-memory analytics engine
  - Power View for ad-hoc visualization in SSRS, accessible via SharePoint or Excel
  - In Office 2013, PowerPivot and Power View are natively integrated with Excel
- Leverage existing and widely available .NET developers
- Management of a Hadoop cluster (using Apache Ambari) is integrated with Microsoft System Center, already used by IT operators for database management
- Deliver tighter security through integration with Windows Server Active Directory
- Cloud-based Hadoop is available through the Windows Azure HDInsight Service
- Interactive access to Hadoop through Hortonworks' Hive ODBC driver
  • 18. Microsoft's Approach (cont.) And some of the disadvantages include:
- Licensing costs for both the Windows Server instances that run HDInsight nodes and the SQL Server nodes associated with Polybase
- Uncertain performance and adoption of the HDInsight distribution
- Many of the advantages in integration and in leveraging resources don't apply to non-Microsoft shops
  • 19. Greenplum's Pivotal HD. Greenplum's Pivotal HD implements Greenplum on top of HDFS:
- Yet still capable of running MapReduce jobs if needed
- More mature than many of its rivals
- Capabilities extend well beyond those of open source distributions
But...
- Proprietary technology means vendor lock-in and an inability to take advantage of a vibrant developer community
- Licenses are expensive, and...
- Open source alternatives (Impala, Drill, Shark) are becoming available
  • 20. Example SQL-on-Hadoop: HadoopDB / Hadapt. Hadapt is a commercialized version of Daniel Abadi's HadoopDB project. [Diagram: architecture of Hadapt, including Hive]
- For structured data, each data node uses a DBMS (Postgres or VectorWise) instead of HDFS
- Load balancing and performance optimization across nodes
  • 21. Example SQL-on-Hadoop: BigSQL BigSQL is PostgreSQL implemented on Hadoop