Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop


Cloudera, Inc.


Extending the Enterprise Data Warehouse with Hadoop. Robert Lancaster and Jonathan Seidman, Chicago Data Summit, April 26, 2011.



1. Extending the Enterprise Data Warehouse with Hadoop. Robert Lancaster and Jonathan Seidman, Chicago Data Summit, April 26, 2011

2. Who We Are

3. Launched: 2001, Chicago, IL

4. Why are we using Hadoop? Stop me if you’ve heard this before…

6. Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data. (Chart: $ per TB)

7. And… Hadoop places no constraints on how data is processed.

8. Before Hadoop

9. With Hadoop

10. Access to this non-transactional data enables a number of applications…

11. Optimizing Hotel Search

12. Recommendations

13. Page Performance Tracking

14. Cache Analysis: A small number of queries (3%) make up more than a third of search volume.

15. User Segmentation

16. All of this is great, but…

17. “Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being ‘magnetic’: attracting all the data sources that crop up within an organization regardless of data quality niceties.” (MAD Skills: New Analysis Practices for Big Data)

18. In a better world…

19. Integrating Hadoop with the Enterprise Data Warehouse. Robert Lancaster and Jonathan Seidman, Chicago Data Summit, April 26, 2011

20. The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis.

21. BI vendors are working on integration with Hadoop…

22. And one more reporting tool…

23. Example Processing Pipeline for Web Analytics Data

24. Aggregating data for import into Data Warehouse

25. Example Use Case: Beta Data Processing

26. Example Use Case – Beta Data Processing

27. Example Use Case – Beta Data Processing Output

28. Example Use Case: RCDC Processing

29. Example Use Case – RCDC Processing

30. Example Use Case: Click Data Processing

31. Click Data Processing – Current DW Processing. (Diagram labels: Web Servers, Web Server Logs, ETL, DW, Data Cleansing (Stored procedure), DW. Annotations: 3 hours, 2 hours, ~20% original data size.)

32. Click Data Processing – New Hadoop Processing. (Diagram labels: Web Servers, Web Server Logs, HDFS, Data Cleansing (MapReduce), DW.)

33. Conclusions

34. Oh, and also…

Editor's Notes

Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, CheapTickets, The Away Network, ebookers, and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM, and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz started in 1999; the Orbitz site launched in 2001.
You hear some benefits of Hadoop so many times that they almost become cliché, but based on our experience at Orbitz they’ve proven to be true, so they bear repeating.
On Orbitz alone we do millions of searches and transactions daily. All of this activity leads to extremely large volumes of data: hundreds of GB per day. Not all of this data has value; much of it is logged for historic reasons and is no longer useful, but much of it is valuable. In addition, there’s more data that we’re not currently capturing that we know has value.
This chart isn’t exactly an apples-to-apples comparison, but it provides some idea of the difference in cost per TB for the DW vs. Hadoop. Hadoop doesn’t provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn’t practical to keep before for economic and technical reasons.
Putting data into a DB or DWH requires having knowledge or making assumptions about how the data will be used. Either way you’re putting constraints around how the data is accessed and processed. With Hadoop each application can process the raw data in whatever way is required.
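As a minimal sketch of this idea (not Orbitz's actual code), the Python below leaves raw log lines untouched and lets two hypothetical applications apply their own parsing at read time. The tab-separated log format, the field order, and the sample values are all assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical raw log lines, stored as-is.
# Assumed layout: timestamp <TAB> browser <TAB> URL path.
RAW_LINES = [
    "2011-04-26T10:00:00\tSafari\t/hotels/chicago",
    "2011-04-26T10:00:05\tFirefox\t/flights/ord-lax",
    "2011-04-26T10:01:10\tSafari\t/hotels/boston",
]


def browser_counts(lines):
    """Application 1: interprets only the browser field."""
    return Counter(line.split("\t")[1] for line in lines)


def product_counts(lines):
    """Application 2: interprets only the first segment of the URL path."""
    return Counter(line.split("\t")[2].split("/")[1] for line in lines)


if __name__ == "__main__":
    print(browser_counts(RAW_LINES))
    print(product_counts(RAW_LINES))
```

Neither function required a schema to be fixed when the data was stored; each one imposes only the structure it needs.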
Our data warehouse contains a full archive of all transactions – every booking, refund, cancellation etc. Much valuable non-transactional data was just thrown away because it was uneconomical to store and didn’t necessarily have clear value.
Hadoop was deployed late 2009/early 2010 to begin collecting this non-transactional data. Orbitz has been using CDH for that entire period with great success. Much of this non-transactional data is contained in web analytics logs.
Having access to this data allows us to perform processing and analyses not previously possible.
Hadoop was first used to facilitate the machine learning team’s work. This team needed access to large amounts of data on user interaction in order to do things like optimize hotel ranking and show consumers hotels more closely matching their preferences.
Hadoop is used to crunch data for input to a system to recommend products to users.
Although we use third-party sites to monitor site performance, Hadoop allows the front end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources.
Hadoop collects and processes data for input to analyses to optimize cache performance.
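One input to a cache analysis like this is how concentrated search volume is in the most frequent queries. The sketch below is an illustration only: the function name and the toy query list are assumptions, and it does not reproduce the figures from the slide.

```python
from collections import Counter


def top_query_share(queries, top_fraction):
    """Return the share of total search volume produced by the most
    frequent `top_fraction` of distinct queries (always at least one)."""
    counts = Counter(queries)
    if not counts:
        return 0.0
    k = max(1, int(len(counts) * top_fraction))
    top_volume = sum(n for _, n in counts.most_common(k))
    return top_volume / sum(counts.values())


if __name__ == "__main__":
    # Toy data, invented for illustration: 5 distinct queries, 10 searches.
    sample = ["chicago"] * 5 + ["nyc"] * 2 + ["boston", "miami", "austin"]
    print(top_query_share(sample, 0.2))
```

A high share for a small `top_fraction` suggests that caching results for a small set of queries could serve a large part of the traffic.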
Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices than other users do.
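A comparison of this kind can be sketched as a per-segment mean and median. In the Python below, the segment labels and prices are invented placeholder values, not Orbitz data, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Placeholder (segment, clicked hotel price) pairs, invented for illustration.
CLICKS = [
    ("Safari", 200.0),
    ("Safari", 300.0),
    ("Firefox", 100.0),
    ("Firefox", 150.0),
    ("Firefox", 200.0),
]


def price_stats_by_segment(clicks):
    """Group clicked prices by segment and return {segment: (mean, median)}."""
    prices = defaultdict(list)
    for segment, price in clicks:
        prices[segment].append(price)
    return {seg: (mean(vals), median(vals)) for seg, vals in prices.items()}


if __name__ == "__main__":
    for seg, (avg, med) in price_stats_by_segment(CLICKS).items():
        print(seg, avg, med)
```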
MAD is an acronym for magnetic, agile, and deep. Agile: the ability to quickly integrate new data sources. Deep: the ability to perform sophisticated analyses.
This would facilitate access to all of our data through standard BI tools. In addition, most of our BI developers, not to mention users, work with SQL, ETL, etc.; they are not Java developers and won’t be writing MapReduce jobs. We haven’t yet achieved this data warehouse nirvana.
QlikView is used extensively for reporting at Orbitz. Although QlikView is working on enhancements to facilitate integration with tools such as Hadoop, there’s no direct integration. This is understandable, since QlikView uses an in-memory model, which presents a challenge when dealing with Hadoop-sized data. We can, however, use Hadoop to summarize data for export to QlikView.
This provides an example of a typical processing flow for the large volumes of non-transactional data we’re collecting. This processing allows us to convert large volumes of unstructured data into structured data that can be queried, extracted, etc. for further processing.
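The conversion step can be illustrated with a small parser that turns free-form log lines into structured records. This is a sketch under stated assumptions: the log layout (timestamp, session id, then a message containing key=value pairs) and all names are hypothetical and not taken from the talk.

```python
import re

# Assumed layout: <timestamp> <session id> <message with key=value pairs>
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<session>\S+)\s+(?P<message>.*)$")
PAIR_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Return a dict for a well-formed line, or None if it cannot be parsed."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    record = {"ts": m.group("ts"), "session": m.group("session")}
    # Every key=value pair found in the message becomes a column.
    record.update(dict(PAIR_RE.findall(m.group("message"))))
    return record


if __name__ == "__main__":
    print(parse_line("2011-04-26T10:00:00 s1 search dest=CHI nights=2"))
    print(parse_line("garbage"))
```

Lines that do not match the assumed layout return None, so a caller can count or discard them instead of failing.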
This type of processing also allows us to summarize large volumes of data into a data set that can be exported to the data warehouse, allowing us to query and report on that data using all of our standard BI tools.
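A summarization step of this kind might look like the sketch below, which collapses structured records into one row per (date, page) and renders the result as CSV for a bulk load. The record fields, the CSV columns, and the function names are assumptions for illustration; the actual export format is not described in the talk.

```python
import csv
import io
from collections import Counter


def summarize(records):
    """Count page views per (date, page) and return sorted summary rows."""
    counts = Counter((r["date"], r["page"]) for r in records)
    return sorted((d, p, n) for (d, p), n in counts.items())


def to_csv(rows):
    """Render summary rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["date", "page", "views"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    records = [
        {"date": "2011-04-26", "page": "/hotels"},
        {"date": "2011-04-26", "page": "/hotels"},
        {"date": "2011-04-26", "page": "/flights"},
    ]
    print(to_csv(summarize(records)), end="")
```

The summary is far smaller than the detail records, which is what makes loading it into the warehouse or an in-memory reporting tool practical.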
Still being implemented, but a good example of how Hadoop allows us to offload time- and resource-intensive processing from the data warehouse.
Processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure. Further downstream processing is done to generate final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse that could otherwise be used for reports, queries, etc.
The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce will allow us to take advantage of Hadoop’s efficiencies and greatly speed up the processing. This moves the “heavy lifting” of processing the relatively large data sets to Hadoop.
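To make the shape of a MapReduce cleansing job concrete, here is a small single-process sketch in Python. The cleansing rules (drop malformed lines, normalize the URL, keep the earliest click per session and URL), the tab-separated layout, and all names are assumptions for illustration; the talk does not describe the actual rules, and a real job would run the map and reduce steps on the cluster rather than in one process.

```python
from itertools import groupby


def map_clean(line):
    """Map step: discard malformed lines and normalize the rest.
    Yields ((session, url), timestamp) pairs."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return  # malformed record: discard
    ts, session, url = (p.strip() for p in parts)
    if not ts or not session or not url:
        return  # missing field: discard
    yield (session, url.lower()), ts


def reduce_dedupe(key, timestamps):
    """Reduce step: keep only the earliest click per (session, url)."""
    session, url = key
    return (min(timestamps), session, url)


def run(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    pairs = sorted(kv for line in lines for kv in map_clean(line))
    return [
        reduce_dedupe(key, [ts for _, ts in group])
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    ]


if __name__ == "__main__":
    raw = [
        "2011-04-26T10:00:05\ts1\t/Hotels",
        "2011-04-26T10:00:00\ts1\t/hotels",
        "bad line",
        "2011-04-26T10:01:00\ts2\t/flights",
    ]
    for row in run(raw):
        print(row)
```

Because each line is cleaned independently in the map step, the work can be spread across many nodes, and only the much smaller cleansed output needs to be loaded into the data warehouse.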