SlideShare a Scribd company logo
1 of 34
Download to read offline
Extending the Enterprise Data Warehouse with Hadoop

                 Robert Lancaster and Jonathan Seidman
                                    Hadoop World 2011
                                     November 8 | 2011
Who We Are


•  Robert Lancaster
  –  Solutions Architect, Hotel Supply Team
  –  rlancaster@orbitz.com
  –  @rob1lancaster
  –  Co-organizer of Chicago Big Data and Chicago Machine
     Learning Study Group.
•  Jonathan Seidman
  –  Lead Engineer, Business Intelligence/Big Data Team
  –  Co-founder/organizer of Chicago Hadoop User Group and
     Chicago Big Data
  –  jseidman@orbitz.com
  –  @jseidman

                                                             page 2
Launched in 2001




                   Over 160 million
                   bookings



                                      page 3
Some History…




                page 4
In 2009…


•  The Machine Learning team is formed to improve site
   performance. For example, improving hotel search results.
•  This required access to large volumes of behavioral data for
   analysis.
   –  Fortunately, the required data was collected in session data
      stored in web analytics logs.




                                                                     page 5
The Problem…


•  The only archive of the required data went back about two
   weeks.




       Non-transactional Data       Transactional data
           (e.g. searches)          (e.g. bookings) and
                                      aggregated Non-
                                     transactional data



                                  Data Warehouse



                                                               page 6
Hadoop Provided a Solution…




          Detailed non-
       transactional data
     (what every user sees,
           clicks, etc.)
                              Transactional data
                              (e.g. bookings) and
                                aggregated Non-
                               transactional data

                              Data Warehouse



          Hadoop

                                                    page 7
Deploying Hadoop Enabled Multiple Applications…

100.00%
                                                              Queries
90.00%
80.00%                                                        Searches
     71.67%
70.00%
60.00%
50.00%
40.00%                                      34.30%
     31.87%
30.00%
20.00%
10.00%
                                            2.78%
 0.00%
          1   2   3   4   5   6   7   8   9 10 11 12 13 14 15 16 17 18 19 20




                                                                               page 8
And Useful Analyses…




                       page 9
But Brought New Challenges…


•  Most of these efforts are driven by development teams.
•  The challenge now is unlocking the value of this data for non-
   technical users.




                                                                    page 10
In Early 2011…


•  Big Data team is formed under Business Intelligence team at
   Orbitz Worldwide.
•  Allows the Big Data team to work more closely with the data
   warehouse and BI teams.
•  Reflects the importance of big data to the future of the
   company.




                                                                 page 11
A View Shared Beyond Orbitz…


“We strongly believe that Hadoop is the nucleus of the next-
generation cloud EDW…”

 “…but that promise is still three to five years from fruition.”*




                  *James Kobielus, Forrester Research,
                  “Hadoop, Is It Soup Yet?”


                                                            page 12
Two Primary Ways We Use Hadoop to Complement the EDW


•  Extraction and transformation of data for loading into the data
   warehouse – “ETL”.
•  Off-loading of analysis from the data warehouse.




                                                                     page 13
ETL Example: Proposed Dimensional Model




      Raw logs          Hadoop            Dimensional
                                             model




                                                        page 14
ETL Example: Click Data Processing




Web
                                               Data
Server	

 Web                                           Cleansing
  Web
 Server	

   Logs     ETL           DW         (Stored            DW
  Servers
                                               procedure)

                            Several hours of processing     ~20% original
                                                              data size

                    Current Processing in Data Warehouse


                                                                       page 15
ETL Example: Click Data Processing


•  Moving to Hadoop will facilitate:
   –  Remove load from the data warehouse.
   –  Adding additional attributes for processing.
   –  Allow processing to be run more frequently.


 Web                                   Data
 Server	

  Web                                  Cleansing
   Web
  Server	

   Logs   Hadoop            (MapReduce)   DW
   Servers




                Proposed Processing in Hadoop


                                                          page 16
Analysis Example: Geo-Targeting Ads


•  Facilitated analysis that allows for more personalized ad
   content.
•  Allowed marketing team to analyze over a years worth of
   search data.
•  Provided analysis that was difficult to perform in the data
   warehouse.




                                                                 page 17
BI Vendors Are Working on Hadoop Integration

Both big (relatively)…




                                               page 18
And small…




             page 19
Example Processing Pipeline for Web Analytics Data




                                                     page 20
Example Use Case: Selection Errors




                                     page 21
Use Case – Selection Errors: Introduction


                                 •  Multiple points of entry.
                                 •  Multiple paths through site.
                                 •  Goal: tie events together to
                                    form picture of customer
                                    behavior.




                                                                page 22
Use Case – Selection Errors: Processing




                                          page 23
Use Case – Selection Errors: Visualization




                                             page 24
Example Use Case: Beta Data




                              page 25
Use Case – Beta Data: Introduction




                              •  Hotel Sort Optimization
                              •  Compare A vs. B
                              •  Web Analytics Data
                                 –  What user saw.
                                 –  How user behaved
                              •  Server Log Data
                                 –  Sorting behavior used.




                                                             page 26
Use Case – Beta Data Processing




                                  page 27
Use Case – Beta Data: Visualization




                                      page 28
Example Use Case: RCDC




                         page 29
Use Case – RCDC: Introduction


•  Understand and improve cache behavior.
•  Improve “coverage”
  –  Traditionally search 1 page of hotels at a time.
  –  Get “just enough” information to present to consumers.
  –  Increase amount of availability information we have when
     consumer performs a search.
•  Data needed to support needs beyond reporting.




                                                                page 30
Use Case – RCDC: Processing




                              page 31
Use Case – RCDC: Visualization




                                 page 32
Conclusions


•  Hadoop market is still immature, but growing quickly. Better
   tools are on the way.
   –  Look beyond the usual (enterprise) suspects. Many of the
      most interesting companies in the big data space are small
      startups.
•  Hadoop will not replace your data warehouse, but any
   organization with a large data warehouse should at least be
   exploring Hadoop as a complement to their BI infrastructure.




                                                                   page 33
Conclusions


•  Work closely with your existing data management teams.
   –  Your idea of what constitutes big data might quickly
      diverge from theirs.
•  The flip-side to this is that Hadoop can be an excellent tool to
   off-load resource-consuming jobs from your data warehouse.




                                                                      page 34

More Related Content

What's hot

Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
DataWorks Summit
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
Edureka!
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
Hari Shankar Sreekumar
 

What's hot (20)

BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics?
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
NTT Data - Shinichi Yamada - Hadoop World 2010
NTT Data - Shinichi Yamada - Hadoop World 2010NTT Data - Shinichi Yamada - Hadoop World 2010
NTT Data - Shinichi Yamada - Hadoop World 2010
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
VMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware Hadoop
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant Twin
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Building a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveBuilding a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's Perspective
 

Viewers also liked

Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 

Viewers also liked (8)

Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 

Similar to Extending the Data Warehouse with Hadoop - Hadoop world 2011

Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopChicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Cloudera, Inc.
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
Raghu Kashyap
 
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
email2jl
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Cloudera, Inc.
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud
 

Similar to Extending the Data Warehouse with Hadoop - Hadoop world 2011 (20)

Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
 
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopChicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve Loughran
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Webinar: How Banks Manage Reference Data with MongoDB
 Webinar: How Banks Manage Reference Data with MongoDB Webinar: How Banks Manage Reference Data with MongoDB
Webinar: How Banks Manage Reference Data with MongoDB
 
Creating a Next-Generation Big Data Architecture
Creating a Next-Generation Big Data ArchitectureCreating a Next-Generation Big Data Architecture
Creating a Next-Generation Big Data Architecture
 
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir Webinar
 
Pass bac jd_sm
Pass bac jd_smPass bac jd_sm
Pass bac jd_sm
 
Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data Solutions
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
Unlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeUnlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data Lake
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
 
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
 

More from Jonathan Seidman

More from Jonathan Seidman (10)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Extending the Data Warehouse with Hadoop - Hadoop world 2011

  • 1. Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster and Jonathan Seidman Hadoop World 2011 November 8 | 2011
  • 2. Who We Are •  Robert Lancaster –  Solutions Architect, Hotel Supply Team –  rlancaster@orbitz.com –  @rob1lancaster –  Co-organizer of Chicago Big Data and Chicago Machine Learning Study Group. •  Jonathan Seidman –  Lead Engineer, Business Intelligence/Big Data Team –  Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data –  jseidman@orbitz.com –  @jseidman page 2
  • 3. Launched in 2001 Over 160 million bookings page 3
  • 5. In 2009… •  The Machine Learning team is formed to improve site performance. For example, improving hotel search results. •  This required access to large volumes of behavioral data for analysis. –  Fortunately, the required data was collected in session data stored in web analytics logs. page 5
  • 6. The Problem… •  The only archive of the required data went back about two weeks. Non-transactional Data Transactional data (e.g. searches) (e.g. bookings) and aggregated Non- transactional data Data Warehouse page 6
  • 7. Hadoop Provided a Solution… Detailed non- transactional data (what every user sees, clicks, etc.) Transactional data (e.g. bookings) and aggregated Non- transactional data Data Warehouse Hadoop page 7
  • 8. Deploying Hadoop Enabled Multiple Applications… 100.00% Queries 90.00% 80.00% Searches 71.67% 70.00% 60.00% 50.00% 40.00% 34.30% 31.87% 30.00% 20.00% 10.00% 2.78% 0.00% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 page 8
  • 10. But Brought New Challenges… •  Most of these efforts are driven by development teams. •  The challenge now is unlocking the value of this data for non- technical users. page 10
  • 11. In Early 2011… •  Big Data team is formed under Business Intelligence team at Orbitz Worldwide. •  Allows the Big Data team to work more closely with the data warehouse and BI teams. •  Reflects the importance of big data to the future of the company. page 11
  • 12. A View Shared Beyond Orbitz… “We strongly believe that Hadoop is the nucleus of the next- generation cloud EDW…” “…but that promise is still three to five years from fruition.”* *James Kobielus, Forrester Research, “Hadoop, Is It Soup Yet?” page 12
  • 13. Two Primary Ways We Use Hadoop to Complement the EDW •  Extraction and transformation of data for loading into the data warehouse – “ETL”. •  Off-loading of analysis from the data warehouse. page 13
  • 14. ETL Example: Proposed Dimensional Model Raw logs Hadoop Dimensional model page 14
  • 15. ETL Example: Click Data Processing Web Data Server Web Cleansing Web Server Logs ETL DW (Stored DW Servers procedure) Several hours of processing ~20% original data size Current Processing in Data Warehouse page 15
  • 16. ETL Example: Click Data Processing •  Moving to Hadoop will facilitate: –  Remove load from the data warehouse. –  Adding additional attributes for processing. –  Allow processing to be run more frequently. Web Data Server Web Cleansing Web Server Logs Hadoop (MapReduce) DW Servers Proposed Processing in Hadoop page 16
  • 17. Analysis Example: Geo-Targeting Ads •  Facilitated analysis that allows for more personalized ad content. •  Allowed marketing team to analyze over a years worth of search data. •  Provided analysis that was difficult to perform in the data warehouse. page 17
  • 18. BI Vendors Are Working on Hadoop Integration Both big (relatively)… page 18
  • 19. And small… page 19
  • 20. Example Processing Pipeline for Web Analytics Data page 20
  • 21. Example Use Case: Selection Errors page 21
  • 22. Use Case – Selection Errors: Introduction •  Multiple points of entry. •  Multiple paths through site. •  Goal: tie events together to form picture of customer behavior. page 22
  • 23. Use Case – Selection Errors: Processing page 23
  • 24. Use Case – Selection Errors: Visualization page 24
  • 25. Example Use Case: Beta Data page 25
  • 26. Use Case – Beta Data: Introduction •  Hotel Sort Optimization •  Compare A vs. B •  Web Analytics Data –  What user saw. –  How user behaved •  Server Log Data –  Sorting behavior used. page 26
  • 27. Use Case – Beta Data Processing page 27
  • 28. Use Case – Beta Data: Visualization page 28
  • 29. Example Use Case: RCDC page 29
  • 30. Use Case – RCDC: Introduction •  Understand and improve cache behavior. •  Improve “coverage” –  Traditionally search 1 page of hotels at a time. –  Get “just enough” information to present to consumers. –  Increase amount of availability information we have when consumer performs a search. •  Data needed to support needs beyond reporting. page 30
  • 31. Use Case – RCDC: Processing page 31
  • 32. Use Case – RCDC: Visualization page 32
  • 33. Conclusions •  Hadoop market is still immature, but growing quickly. Better tools are on the way. –  Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups. •  Hadoop will not replace your data warehouse, but any organization with a large data warehouse should at least be exploring Hadoop as a complement to their BI infrastructure. page 33
  • 34. Conclusions •  Work closely with your existing data management teams. –  Your idea of what constitutes big data might quickly diverge from theirs. •  The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse. page 34