Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Extending the Data Warehouse with Hadoop - Hadoop World 2011
1. Extending the Enterprise Data Warehouse with Hadoop
Robert Lancaster and Jonathan Seidman
Hadoop World 2011
November 8 | 2011
2. Who We Are
• Robert Lancaster
– Solutions Architect, Hotel Supply Team
– rlancaster@orbitz.com
– @rob1lancaster
– Co-organizer of Chicago Big Data and Chicago Machine Learning Study Group
• Jonathan Seidman
– Lead Engineer, Business Intelligence/Big Data Team
– Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data
– jseidman@orbitz.com
– @jseidman
5. In 2009…
• The Machine Learning team is formed to improve site performance, for example by improving hotel search results.
• This required access to large volumes of behavioral data for analysis.
– Fortunately, the required data was collected in session data stored in web analytics logs.
6. The Problem…
• The only archive of the required data went back about two weeks.
[Diagram: non-transactional data (e.g. searches) is not retained, while transactional data (e.g. bookings) and aggregated non-transactional data flow into the data warehouse.]
7. Hadoop Provided a Solution…
[Diagram: detailed non-transactional data (what every user sees, clicks, etc.) is now stored in Hadoop, while transactional data (e.g. bookings) and aggregated non-transactional data continue to flow into the data warehouse.]
10. But Brought New Challenges…
• Most of these efforts are driven by development teams.
• The challenge now is unlocking the value of this data for non-technical users.
11. In Early 2011…
• The Big Data team is formed under the Business Intelligence team at Orbitz Worldwide.
• Allows the Big Data team to work more closely with the data warehouse and BI teams.
• Reflects the importance of big data to the future of the company.
12. A View Shared Beyond Orbitz…
“We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW…”
“…but that promise is still three to five years from fruition.”*
*James Kobielus, Forrester Research, “Hadoop, Is It Soup Yet?”
13. Two Primary Ways We Use Hadoop to Complement the EDW
• Extraction and transformation of data for loading into the data warehouse – “ETL”.
• Off-loading of analysis from the data warehouse.
15. ETL Example: Click Data Processing
[Diagram: current processing in the data warehouse. Web servers write web server logs; an ETL step loads them into the DW, where a stored-procedure data-cleansing step reduces the data to ~20% of its original size. The flow takes several hours of processing.]
16. ETL Example: Click Data Processing
• Moving to Hadoop will facilitate:
– Removing load from the data warehouse.
– Adding additional attributes for processing.
– Allowing processing to be run more frequently.
[Diagram: proposed processing in Hadoop. Web servers write web server logs; data cleansing runs in Hadoop (MapReduce), and the cleansed results are loaded into the DW.]
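The cleansing step proposed above might be written as a Hadoop Streaming-style mapper. The following is a minimal sketch in Python, assuming a hypothetical tab-delimited log layout (timestamp, session id, URL) rather than Orbitz's actual format:

```python
import sys

def clean_line(line):
    """Parse a tab-delimited log line into (timestamp, session_id, url);
    return a reduced record, or None to drop the line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None          # drop malformed lines
    timestamp, session_id, url = fields[0], fields[1], fields[2]
    if not session_id:
        return None          # drop records with no session (e.g. bot traffic)
    return "\t".join((timestamp, session_id, url))

def run_mapper():
    # Hadoop Streaming feeds log lines on stdin and collects stdout.
    for line in sys.stdin:
        cleaned = clean_line(line)
        if cleaned is not None:
            sys.stdout.write(cleaned + "\n")
```

In a Streaming job, `run_mapper` would be invoked as the `-mapper` script; because the filtering is stateless, the job scales across log files with no reducer needed.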
17. Analysis Example: Geo-Targeting Ads
• Facilitated analysis that allows for more personalized ad content.
• Allowed the marketing team to analyze over a year's worth of search data.
• Provided analysis that was difficult to perform in the data warehouse.
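The aggregation behind a geo-targeting analysis like this reduces to a group-and-count over search events, the kind of query the talk describes running in Hive. A minimal sketch in Python; the event fields (`user_city`, `market`) are hypothetical stand-ins for the real search-log schema:

```python
from collections import Counter

def searches_by_market(events):
    """Count searches per (user location, searched market) pair over
    a stream of search-log events."""
    counts = Counter()
    for e in events:
        counts[(e["user_city"], e["market"])] += 1
    return counts
```

A marketing report would then read the top pairs via `counts.most_common(n)` to decide, for example, which destination ads to show users in a given city.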
18. BI Vendors Are Working on Hadoop Integration
Both big (relatively)…
22. Use Case – Selection Errors: Introduction
• Multiple points of entry.
• Multiple paths through the site.
• Goal: tie events together to form a picture of customer behavior.
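The goal of tying events together into a picture of customer behavior is essentially sessionization. A minimal sketch, assuming (hypothetically) that each event carries a `session_id` and a timestamp `ts`:

```python
from collections import defaultdict

def sessionize(events):
    """Group raw click events by session id and sort each group by
    timestamp, reconstructing one customer's path through the site."""
    sessions = defaultdict(list)
    for e in events:
        sessions[e["session_id"]].append(e)
    return {sid: sorted(evts, key=lambda e: e["ts"])
            for sid, evts in sessions.items()}
```

In a MapReduce job the same shape appears naturally: the mapper keys each event by session id, and the reducer sorts one session's events to recover the path.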
23. Use Case – Selection Errors: Processing
24. Use Case – Selection Errors: Visualization
26. Use Case – Beta Data: Introduction
• Hotel Sort Optimization
• Compare A vs. B
• Web Analytics Data
– What the user saw.
– How the user behaved.
• Server Log Data
– Which sorting behavior was used.
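Comparing sort A against sort B ultimately reduces to a per-variant conversion metric computed over the combined web-analytics and server-log events. A hedged sketch, with a hypothetical event shape (`variant`, `type`):

```python
def conversion_by_variant(events):
    """Compute bookings-per-search for each sort variant (A vs. B)
    from a stream of tagged events."""
    stats = {}
    for e in events:
        s = stats.setdefault(e["variant"], {"search": 0, "booking": 0})
        if e["type"] in s:
            s[e["type"]] += 1
    return {v: (s["booking"] / s["search"] if s["search"] else 0.0)
            for v, s in stats.items()}
```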
30. Use Case – RCDC: Introduction
• Understand and improve cache behavior.
• Improve “coverage”:
– Traditionally we search one page of hotels at a time.
– Get “just enough” information to present to consumers.
– Increase the amount of availability information we have when a consumer performs a search.
• Data is needed to support needs beyond reporting.
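The “coverage” idea above can be quantified as the fraction of hotels in a search result whose availability data is already cached. A minimal sketch with hypothetical inputs (hotel id lists), not the actual RCDC metric:

```python
def cache_coverage(result_hotels, cached_hotels):
    """Fraction of hotels in a search result that already have
    availability data in the cache."""
    result = set(result_hotels)
    if not result:
        return 0.0
    return len(result & set(cached_hotels)) / len(result)
```

Tracking this ratio over time shows whether cache improvements actually increase the availability information on hand when a consumer searches.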
33. Conclusions
• The Hadoop market is still immature, but growing quickly. Better tools are on the way.
– Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
• Hadoop will not replace your data warehouse, but any organization with a large data warehouse should at least be exploring Hadoop as a complement to its BI infrastructure.
34. Conclusions
• Work closely with your existing data management teams.
– Your idea of what constitutes big data may quickly diverge from theirs.
• The flip side is that Hadoop can be an excellent tool for off-loading resource-intensive jobs from your data warehouse.