Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Extending the Data Warehouse with Hadoop - Hadoop World 2011
1. Extending the Enterprise Data Warehouse with Hadoop
Robert Lancaster and Jonathan Seidman
Hadoop World 2011
November 8 | 2011
2. Who We Are
• Robert Lancaster
– Solutions Architect, Hotel Supply Team
– rlancaster@orbitz.com
– @rob1lancaster
– Co-organizer of Chicago Big Data and Chicago Machine Learning Study Group
• Jonathan Seidman
– Lead Engineer, Business Intelligence/Big Data Team
– Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data
– jseidman@orbitz.com
– @jseidman
5. In 2009…
• The Machine Learning team is formed to improve site performance, for example by improving hotel search results.
• This required access to large volumes of behavioral data for analysis.
– Fortunately, the required data was collected in session data stored in web analytics logs.
6. The Problem…
• The only archive of the required data went back about two weeks.
[Diagram: non-transactional data (e.g. searches) is not retained, while transactional data (e.g. bookings) and aggregated non-transactional data flow into the data warehouse.]
7. Hadoop Provided a Solution…
[Diagram: detailed non-transactional data (what every user sees, clicks, etc.) is now stored in Hadoop, while transactional data (e.g. bookings) and aggregated non-transactional data continue to flow into the data warehouse.]
10. But Brought New Challenges…
• Most of these efforts are driven by development teams.
• The challenge now is unlocking the value of this data for non-technical users.
11. In Early 2011…
• The Big Data team is formed under the Business Intelligence team at Orbitz Worldwide.
• Allows the Big Data team to work more closely with the data warehouse and BI teams.
• Reflects the importance of big data to the future of the company.
12. A View Shared Beyond Orbitz…
“We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW…”
“…but that promise is still three to five years from fruition.”*
*James Kobielus, Forrester Research, “Hadoop, Is It Soup Yet?”
13. Two Primary Ways We Use Hadoop to Complement the EDW
• Extraction and transformation of data for loading into the data warehouse – “ETL”.
• Off-loading of analysis from the data warehouse.
15. ETL Example: Click Data Processing
[Diagram: current processing in the data warehouse. Web servers write web server logs; an ETL step loads them into the DW, where a stored-procedure data-cleansing step reduces the data to ~20% of its original size. The flow takes several hours of processing.]
16. ETL Example: Click Data Processing
• Moving to Hadoop will facilitate:
– Removing load from the data warehouse.
– Adding additional attributes for processing.
– Allowing processing to be run more frequently.
[Diagram: proposed processing in Hadoop. Web servers write web server logs; data cleansing runs in Hadoop (MapReduce), and the cleansed results are loaded into the DW.]
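The cleansing step proposed above might be written as a Hadoop Streaming-style mapper. The following is a minimal sketch in Python, assuming a hypothetical tab-delimited log layout (timestamp, session id, URL) rather than Orbitz's actual format:

```python
import sys

def clean_line(line):
    """Parse a tab-delimited log line into (timestamp, session_id, url);
    return a reduced record, or None to drop the line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None          # drop malformed lines
    timestamp, session_id, url = fields[0], fields[1], fields[2]
    if not session_id:
        return None          # drop records with no session (e.g. bot traffic)
    return "\t".join((timestamp, session_id, url))

def run_mapper():
    # Hadoop Streaming feeds log lines on stdin and collects stdout.
    for line in sys.stdin:
        cleaned = clean_line(line)
        if cleaned is not None:
            sys.stdout.write(cleaned + "\n")
```

In a Streaming job, `run_mapper` would be invoked as the `-mapper` script; because the filtering is stateless, the job scales across log files with no reducer needed.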
17. Analysis Example: Geo-Targeting Ads
• Facilitated analysis that allows for more personalized ad content.
• Allowed the marketing team to analyze over a year's worth of search data.
• Provided analysis that was difficult to perform in the data warehouse.
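The aggregation behind a geo-targeting analysis like this reduces to a group-and-count over search events, the kind of query the talk describes running in Hive. A minimal sketch in Python; the event fields (`user_city`, `market`) are hypothetical stand-ins for the real search-log schema:

```python
from collections import Counter

def searches_by_market(events):
    """Count searches per (user location, searched market) pair over
    a stream of search-log events."""
    counts = Counter()
    for e in events:
        counts[(e["user_city"], e["market"])] += 1
    return counts
```

A marketing report would then read the top pairs via `counts.most_common(n)` to decide, for example, which destination ads to show users in a given city.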
18. BI Vendors Are Working on Hadoop Integration
Both big (relatively)…
22. Use Case – Selection Errors: Introduction
• Multiple points of entry.
• Multiple paths through the site.
• Goal: tie events together to form a picture of customer behavior.
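The goal of tying events together into a picture of customer behavior is essentially sessionization. A minimal sketch, assuming (hypothetically) that each event carries a `session_id` and a timestamp `ts`:

```python
from collections import defaultdict

def sessionize(events):
    """Group raw click events by session id and sort each group by
    timestamp, reconstructing one customer's path through the site."""
    sessions = defaultdict(list)
    for e in events:
        sessions[e["session_id"]].append(e)
    return {sid: sorted(evts, key=lambda e: e["ts"])
            for sid, evts in sessions.items()}
```

In a MapReduce job the same shape appears naturally: the mapper keys each event by session id, and the reducer sorts one session's events to recover the path.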
23. Use Case – Selection Errors: Processing
24. Use Case – Selection Errors: Visualization
26. Use Case – Beta Data: Introduction
• Hotel Sort Optimization
• Compare A vs. B
• Web Analytics Data
– What the user saw.
– How the user behaved.
• Server Log Data
– Which sorting behavior was used.
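Comparing sort A against sort B ultimately reduces to a per-variant conversion metric computed over the combined web-analytics and server-log events. A hedged sketch, with a hypothetical event shape (`variant`, `type`):

```python
def conversion_by_variant(events):
    """Compute bookings-per-search for each sort variant (A vs. B)
    from a stream of tagged events."""
    stats = {}
    for e in events:
        s = stats.setdefault(e["variant"], {"search": 0, "booking": 0})
        if e["type"] in s:
            s[e["type"]] += 1
    return {v: (s["booking"] / s["search"] if s["search"] else 0.0)
            for v, s in stats.items()}
```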
30. Use Case – RCDC: Introduction
• Understand and improve cache behavior.
• Improve “coverage”:
– Traditionally we search one page of hotels at a time.
– Get “just enough” information to present to consumers.
– Increase the amount of availability information we have when a consumer performs a search.
• Data is needed to support needs beyond reporting.
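The “coverage” idea above can be quantified as the fraction of hotels in a search result whose availability data is already cached. A minimal sketch with hypothetical inputs (hotel id lists), not the actual RCDC metric:

```python
def cache_coverage(result_hotels, cached_hotels):
    """Fraction of hotels in a search result that already have
    availability data in the cache."""
    result = set(result_hotels)
    if not result:
        return 0.0
    return len(result & set(cached_hotels)) / len(result)
```

Tracking this ratio over time shows whether cache improvements actually increase the availability information on hand when a consumer searches.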
33. Conclusions
• The Hadoop market is still immature, but growing quickly. Better tools are on the way.
– Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
• Hadoop will not replace your data warehouse, but any organization with a large data warehouse should at least be exploring Hadoop as a complement to its BI infrastructure.
34. Conclusions
• Work closely with your existing data management teams.
– Your idea of what constitutes big data may quickly diverge from theirs.
• The flip side is that Hadoop can be an excellent tool for off-loading resource-intensive jobs from your data warehouse.