Building the Integrated Data Warehouse with Oracle Database and Hadoop

Gwen Shapira, Senior Consultant
13 years with a pager
Oracle ACE Director
Oak Table member
Senior consultant for Pythian
@gwenshap
http://www.pythian.com/news/author/shapira/
shapira@pythian.com




Pythian
Recognized Leader:
•   Global industry-leader in remote database administration services and
    consulting for Oracle, Oracle Applications, MySQL and Microsoft SQL Server

•   Work with over 165 multinational companies such as LinkShare Corporation,
    IGN Entertainment, CrowdTwist, TinyCo and Western Union to help manage
    their complex IT deployments

Expertise:
•   One of the world’s largest concentrations of dedicated, full-time DBA
    expertise. We employ 7 Oracle ACEs/ACE Directors, are heavily involved in
    the MySQL community, drive the MySQL Professionals Group, and sit on the
    IOUG Advisory Board for MySQL.

•   Hold 7 Specializations under Oracle Platinum Partner program, including
    Oracle Exadata, Oracle GoldenGate & Oracle RAC




Agenda
• What is Big Data?
• Why do we care about Big Data?
• Why your DWH needs Hadoop
• Examples of Hadoop in the DWH
• How to integrate Hadoop into your DWH
• Avoiding major pitfalls




What is Big Data?
Doesn’t Matter.

We are here to discuss architectures, not to define market segments.




What Does Matter?

Some data types are a bad fit for RDBMS.
Some problems are a bad fit for RDBMS.

We can call them BIG if you want.
Data Warehouses have always been BIG.
Given enough skill and money, Oracle can do anything.

Let's talk about efficient solutions.




When Does an RDBMS Make No Sense?
• Storing images and video
• Processing images and video
• Storing and processing other large files
 •   PDFs, Excel files
• Processing large blocks of natural language text
 •   Blog posts, job ads, product descriptions
• Processing semi-structured data
 •   CSV, JSON, XML, log files
 •   Sensor data




When Does an RDBMS Make No Sense?
• Ad-hoc, exploratory analytics
• Integrating data from external sources
• Data cleanup tasks
• Very advanced analytics (machine learning)




New Data Sources
• Blog posts
• Social media
• Images
• Videos
• Logs from web applications
• Sensors


They all have large potential value, but they are an awkward fit for traditional data warehouses.



Your DWH needs Hadoop
Big Problems with Big Data
• It is:
 • Unstructured
 • Unprocessed
 • Un-aggregated
 • Un-filtered
 • Repetitive
 • Low quality
 • And generally messy
Oh, and there is a lot of it.

Technical Challenges
• Storage capacity, storage throughput, pipeline throughput → Scalable storage
• Processing power, parallel processing → Massive parallel processing
• System integration, data analysis → Ready-to-use tools
Hadoop Principles
• Bring Code to Data
• Share Nothing
Hadoop in a Nutshell


• HDFS: a replicated, distributed big-data file system
• Map-Reduce: a framework for writing massively parallel jobs
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to maximize parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem




Hadoop Limitations
• Hadoop is scalable but not fast
• Batteries not included
• Instrumentation not included either
• Well-known reliability limitations




Hadoop In the Data Warehouse
Use Cases and Customer Stories
ETL for Unstructured Data


Logs (web servers, app servers, clickstreams) → Flume → Hadoop (cleanup, aggregation, long-term storage) → DWH (BI, batch reports)
ETL for Structured Data

OLTP (Oracle, MySQL, Informix…) → Sqoop, Perl → Hadoop (transformation, aggregation, long-term storage) → DWH (BI, batch reports)
Bring the World into Your Datacenter




Rare Historical Report




Find Needle in Haystack




We are not doing SQL anymore




Connecting the (big) Dots
• Sqoop
• Queries
Sqoop is Flexible Import
• Select <columns> from <table> where <condition>
• Or <write your own query>
• Split column
• Parallel
• Incremental
• File formats
Sqoop Import Examples
• sqoop import --connect
  jdbc:oracle:thin:@//dbserver:1521/masterdb
  --username hr --table emp
  --where "start_date > '01-01-2012'"

• sqoop import --connect
  jdbc:oracle:thin:@//dbserver:1521/masterdb
  --username myuser
  --table shops --split-by shop_id
  --num-mappers 16
  (the split column must be indexed or the table partitioned
   to avoid 16 full table scans)
Less Flexible Export
• 100-row batch inserts
• Commit every 100 batches
• Parallel export
• Merge vs. Insert
Example:
sqoop export
--connect jdbc:mysql://db.example.com/foo
--table bar
--export-dir /results/bar_data



FUSE-DFS
• Mount HDFS on Oracle server:
 •   sudo yum install hadoop-0.20-fuse
 •   hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
• Use external tables to load data into Oracle (see the sketch below)
• File Formats may vary
• All ETL best practices apply




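Once the mount is in place, loading is ordinary external-table ETL. A minimal sketch, assuming hypothetical names throughout (directory object hdfs_mount_dir, external table clicks_ext, target table clicks, mount path, comma-delimited part files):

-- Directory object pointing at the FUSE mount (path is hypothetical)
CREATE OR REPLACE DIRECTORY hdfs_mount_dir AS '/mnt/hdfs/user/etl/clicks';

-- External table over the files Hadoop produced
CREATE TABLE clicks_ext (
  session_id  VARCHAR2(64),
  page_url    VARCHAR2(4000),
  hit_ts      VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_mount_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('part-00000', 'part-00001')
)
REJECT LIMIT UNLIMITED;

-- Load the warehouse table (clicks is a hypothetical target) as with any external-table source
INSERT /*+ APPEND */ INTO clicks SELECT * FROM clicks_ext;

From here the usual external-table ETL practices (validation, parallel DML, CTAS) apply, as the slide notes.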
Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database

• NEW:
 •   Support for Avro
 •   Support for compression codecs




Oracle Direct Connector to HDFS
• Create external tables over files in HDFS (see the sketch below)
• PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
• All the features of External Tables
• Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
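The connector exposes HDFS files through a regular external table whose access parameters name the hdfs_stream preprocessor. A minimal sketch, assuming hypothetical table, column, directory object, and location-file names (the .loc location files list the HDFS paths to read, and HDFS_BIN_PATH is a directory object pointing at the directory containing hdfs_stream):

CREATE TABLE sales_hdfs_ext (
  sale_id  VARCHAR2(20),
  amount   VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY sales_ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR HDFS_BIN_PATH:hdfs_stream   -- streams HDFS content into the access driver
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales_hdfs_1.loc', 'sales_hdfs_2.loc')
)
PARALLEL 2
REJECT LIMIT UNLIMITED;

Queries against sales_hdfs_ext then read straight out of HDFS at query time, so all the usual external-table loading and querying patterns apply.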
Oracle SQL Connector for HDFS
• Map-Reduce Java program
• Creates an external table
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression




How not to Fail
Data That Belong in RDBMS




Prepare for Migration




Use Hadoop Efficiently
• Understand your bottlenecks:
 •   CPU, storage or network?
• Reduce use of temporary data:
 •   All data is over the network
 •   Written to disk in triplicate.
• Eliminate unbalanced workloads
• Offload work to RDBMS
• Fine-tune optimization with Map-Reduce



Your Data is NOT as BIG as you think




Getting Started
• Pick a business problem
• Acquire data
• Use the right tool for the job
• Hadoop can start on the cheap
• Integrate the systems
• Analyze data
• Get operational


Thank you & Q&A
    To contact us…

          sales@pythian.com

          1-866-PYTHIAN



    To follow us…
          http://www.pythian.com/news/

          http://www.facebook.com/pages/The-Pythian-Group/

          http://twitter.com/pythian

          http://www.linkedin.com/company/pythian





Editor's Notes

  1. We are a managed service AND a solution provider of elite database and system administration skills in Oracle, MySQL and SQL Server.
  2. Trying to separate data into “small” and “big” is not a useful segmentation. Let's look at structure, processing and data sources instead.
  3. I am going to show a lot of examples of how Hadoop is used to store and process data. I don’t want anyone to tell me “But I can do it in Oracle”. I know you can, and so can I. But there is no point in using Oracle where it is less efficient than other solutions or where it doesn’t have any specific advantage.
  4. Big data is not called big data because it fits well on a thumb drive. It requires a lot of storage, partially because it’s a lot of data, and partially because it is unstructured, unprocessed, un-aggregated, repetitive and generally messy.
  5. The ideas are simple: 1. Data is big, code is small. It is far more efficient to move the small code to the big data than vice versa. If you have a pillow and a sofa in a room, you typically move the pillow to the sofa, not vice versa. But many developers are too comfortable with the select-then-process anti-pattern. This principle is in place to help with the throughput challenges. 2. Sharing is nice, but safe sharing of data typically means locking, queueing, bottlenecks and race conditions. It is notoriously difficult to get concurrency right, and even if you do, it is slower than the alternative. Hadoop works around the whole thing. This principle is in place to deal with the parallel processing challenges.
  6. The default block size is 64MB. You can place any data file in HDFS; later processing can find the meaning in the data.
  7. Many projects fail because people imagine a very rosy image of Hadoop. They think they can just throw all the data there and it will magically and quickly become value. Such misguided expectations also happen with other platforms and doom other projects too. To be successful with Hadoop, we need to be realistic about it.
  8. Note that while this presentation shows use cases where Hadoop is used to enhance the enterprise data warehouse, there are many examples where Hadoop is the backend of an entire product. Search engines and recommendation engines are such examples. While those are important use cases, they are out of scope for this presentation.
  9. Much more controversial, especially when going from Oracle to Oracle. The data is clearly structured, so why can’t we use an RDBMS (on either the OLTP or the DW side) to process it? Hadoop makes sense: when structured data needs integration with unstructured data before loading; when the ETL part of the process doesn’t scale (if 24h of data processing takes more than 24h, the choice is either a bigger database or more Hadoop nodes, and if the rest of the database workload scales, Hadoop is an attractive option); and when Hadoop replaces a homegrown Perl file-processing system, since it is more centralized, easier to manage and scales better.
  10. There is a lot of interesting data that is not generated by your company: listings of businesses in specific locations, connections in social media. The data may be unstructured, semi-structured or even structured, but it isn’t structured in the way your DWH expects and needs. We need a landing pad for cleanup, pre-processing, aggregating, filtering and structuring, and Hadoop is perfect for this: mappers can scrape data from websites efficiently, map-reduce jobs clean up and process the data, and then the results are loaded into your DWH.
  11. We want the top 3 items bought by left-handed women between the ages of 21 and 23, on November 15, 1998. How long will it take you to answer this question? For one of my customers, the answer is 25 minutes. As data grows older, it usually becomes less valuable to the business, and it gets aggregated and shelved off to tape or other cheap storage. This means that for many organizations, answering detailed questions about events that happened more than a few months ago is impossible or at least very challenging. The business learned to never ask those questions, because the answer is “you can’t”. Hadoop combines cheap storage and massive processing power; this allows us to store the detailed history of our business and to generate reports about it. And once the answer to questions about history is “you will have your data in 25 minutes” instead of “impossible”, the questions turn out to be less rare than we assumed.
  12. 7 petabytes of log file data; 3 lines point to the security hole that allowed a break-in last week. Your DWH has aggregated information from the logs. Maybe. Hadoop is very cost-effective for storing data: lots of cheap disks, and it is easy to throw data in without pre-processing. Search the data when you need it.
  13. Pythian is a remote DBA company. Many customers feel a bit anxious when they let people they haven’t even met into their most critical databases. One of the ways Pythian deals with this problem is by continuously recording the screen of the VM that DBAs use to connect to customer environments. Our customers have access to those videos and can replay them to check what the DBAs were doing. Our system also allows text search in the video. Perhaps you want to know if we ever issued “drop table” on the system before a critical table disappeared? Or perhaps you want to see how we handled ORA-4032 so you can learn how to do it yourself in the future? OCR of screen video capture from Pythian’s privileged access surveillance system: Flume streams raw frames from the video capture; a Map-Reduce job runs OCR on the frames and produces text; a Map-Reduce job identifies text changes from frame to frame and produces a text stream with a timestamp of when it was on the screen; other Map-Reduce jobs mine the text (and keystrokes) for insights: credit card patterns, sensitive commands (like DROP TABLE), root access, unusual activity patterns.
  14. Wake up everyone, this is the meat and potatoes of the presentation. How do we integrate the Hadoop potatoes with the DWH meat?
  15. It is often said that the best way to succeed is to avoid failure for long enough. Here is some advice that will help your Hadoop projects avoid failure.
  16. If the data is structured, especially if it arrives from a relational database, it is highly likely that a relational database will process it more efficiently than Hadoop. After all, RDBMSs were built for this, with many features to support data-processing tasks. OLTP workloads don’t work with Hadoop at all; just don’t try. Anything real-time will not work well with Hadoop. Most BI tools don’t integrate with Hadoop at the moment.
  17. Taking an ETL process that used to be in an RDBMS and dropping it on Hadoop by exporting whole tables with Sqoop and using Hive to process the data is unlikely to be any faster. Getting value out of Hadoop involves evaluating the work, understanding bottlenecks and finding the most efficient solution, whether with Hadoop, a relational database or both.
  18. Bad schema design is not big data. Using 8-year-old hardware is not big data. Not having a purging policy is not big data. Not configuring your database and operating system correctly is not big data. Poor data filtering is not big data either. Keep the data you need and use, in a way that you can actually use it. If doing this requires cutting-edge technology, excellent! But don’t tell me you need NoSQL because you don’t purge data and have un-optimized PL/SQL running on 10-year-old hardware.