Extending the Enterprise Data Warehouse with Hadoop

                 Robert Lancaster and Jonathan Seidman
                                  Chicago Data Summit
                                         April 26 | 2011
Who We Are


•  Robert Lancaster
  –  Solutions Architect, Hotel Supply Team
  –  rlancaster@orbitz.com
  –  @rob1lancaster
•  Jonathan Seidman
  –  Lead Engineer, Business Intelligence/Big Data Team
  –  Co-founder/organizer of Chicago Hadoop User Group
     (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG)
  –  jseidman@orbitz.com
  –  @jseidman



                                                          page 2
Launched: 2001, Chicago, IL




                              page 3
Why are we using Hadoop?

 Stop me if you’ve heard this before…




                                        page 4
On Orbitz alone we do millions of searches and transactions
 daily, which leads to hundreds of gigabytes of log data
 every day.




                                                        page 5
Hadoop provides us with efficient, economical,
 scalable, and reliable storage and processing of
 these large amounts of data.


 $ per TB




                                                    page 6
And…


Hadoop places no constraints on how data is
 processed.




                                              page 7
Before Hadoop




                page 8
With Hadoop




              page 9
Access to this non-transactional data enables a number of
applications…




                                                            page 10
Optimizing Hotel Search




                          page 11
Recommendations




                  page 12
Page Performance Tracking




                            page 13
Cache Analysis
[Chart: percentage (0% to 100%) plotted against values 1 through 20.
 Series: Queries; Searches; Reverse Running Total (Searches);
 Reverse Running Total (Queries).]

•  72% of queries are singletons and make up nearly a third of total
   search volume (data labels: 71.67%, 31.87%).
•  A small number of queries (3%) make up more than a third of search
   volume (data labels: 2.78%, 34.30%).




                                                                                                                           page 14
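The cache analysis above comes down to counting how often each distinct query occurs. The slides do not show how the figures were computed, so the following is only a minimal sketch, assuming the input is a flat list of query strings taken from search logs. The function names `singleton_stats` and `head_share` are hypothetical.

```python
import math
from collections import Counter


def singleton_stats(searches):
    """Return (share of distinct queries seen once, share of searches they account for)."""
    counts = Counter(searches)
    distinct_queries = len(counts)
    total_searches = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / distinct_queries, singletons / total_searches


def head_share(searches, top_fraction):
    """Return the share of searches produced by the most frequent top_fraction of queries."""
    counts = sorted(Counter(searches).values(), reverse=True)
    head_size = max(1, math.ceil(top_fraction * len(counts)))
    return sum(counts[:head_size]) / sum(counts)


# Toy data (not from the slides): one popular query and three singletons.
sample = ["chicago", "chicago", "chicago", "paris", "rome", "oslo"]
query_share, search_share = singleton_stats(sample)
```

A high singleton share means many results are cached and never reused, while a large head share means a small cache of popular queries could serve much of the traffic.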
User Segmentation




                    page 15
All of this is great, but…


Most of these efforts are driven by development
 teams.


The challenge now is to unlock the value in this data
 by making it more available to the rest of the
 organization.




                                                        page 16
“Given the ubiquity of data in modern organizations, a data
warehouse can keep pace today only by being ‘magnetic’:
attracting all the data sources that crop up within an
organization regardless of data quality niceties.”*




             *MAD Skills: New Analysis Practices for Big Data


                                                              page 17
In a better world…





                        page 18
Integrating Hadoop with the Enterprise Data Warehouse

The goal is a unified view of the data, allowing us to use
the power of our existing tools for reporting and analysis.




                                                              page 20
BI vendors are working on integration with Hadoop…




                                                     page 21
And one more reporting tool…




                               page 22
Example Processing Pipeline for Web Analytics Data




                                                     page 23
Aggregating data for import into Data Warehouse




                                                  page 24
Example Use Case: Beta Data Processing




                                     page 25
Example Use Case – Beta Data Processing




                                          page 26
Example Use Case – Beta Data Processing Output




                                                 page 27
Example Use Case: RCDC Processing




                                    page 28
Example Use Case – RCDC Processing




                                     page 29
Example Use Case: Click Data Processing




                                      page 30
Click Data Processing – Current DW Processing




Web Servers → Logs → ETL → DW → Data Cleansing (stored procedure) → DW

•  ETL load into DW: 3 hours
•  Data cleansing (stored procedure): 2 hours
•  Cleansed data: ~20% of original data size



                                                                   page 31
Click Data Processing – New Hadoop Processing




Web Servers → Logs → HDFS → Data Cleansing (MapReduce) → DW




                                                 page 32
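The slides do not show the cleansing job itself. As a rough illustration of moving the cleansing step out of a warehouse stored procedure and into a map-only job, here is a sketch written in the style of a Hadoop Streaming mapper. The log layout (four tab-separated fields) and the validation rules are assumptions made for the example, not the actual Orbitz format.

```python
# Hypothetical record layout: timestamp, session_id, url, http_status
EXPECTED_FIELDS = 4


def clean_record(line):
    """Return a normalized tab-separated record, or None if the line is unusable."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed line: wrong number of columns
    timestamp, session_id, url, status = fields
    if not session_id or not status.isdigit():
        return None  # missing session or non-numeric status code
    return "\t".join([timestamp, session_id, url, status])


def run_mapper(lines):
    """Yield only the records that survive cleansing (map-only, no reducer needed)."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            yield cleaned


# In a streaming job the mapper would read sys.stdin and print each record:
#   for record in run_mapper(sys.stdin): print(record)
raw = [
    "2011-04-26T10:00:00\t abc123 \t/hotels/search\t200\n",
    "truncated line\n",
    "2011-04-26T10:00:01\t\t/hotels/search\t200\n",
]
cleaned = list(run_mapper(raw))
```

Because each line is cleaned independently, the work parallelizes across the cluster, and only the reduced output needs to be loaded into the warehouse.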
Conclusions


•  The market is still immature, but Hadoop has already become a
   valuable business intelligence tool, and will become an
   increasingly important part of a BI infrastructure.
•  Hadoop won’t replace your EDW, but any organization with a
   large EDW should at least be exploring Hadoop as a
   complement to their BI infrastructure.
•  Use Hadoop to offload the time- and resource-intensive
   processing of large data sets so you can free up your data
   warehouse to serve user needs.
•  The challenge now is making Hadoop more accessible to
   non-developers. Vendors are addressing this, so expect rapid
   advancements in Hadoop accessibility.



                                                                page 33
Oh, and also…


•  Orbitz is looking for a Lead Engineer for the BI/Big Data team.
•  Go to http://careers.orbitz.com/ and search for IRC19035.




                                                                     page 34
References


•  MAD Skills: New Analysis Practices for Big Data, Jeffrey
   Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and
   Caleb Welton, 2009




                                                              page 35
