CONFIDENTIAL. COPYRIGHT © 2015 GODADDY INC. ALL RIGHTS RESERVED.
DATA WAREHOUSING
USING HADOOP
MICHELLE UFFORD
DATA PLATFORM @ GODADDY
10 JUNE 2015
DATA WAREHOUSING
IN HADOOP IS
EXTREMELY POWERFUL
Moving our data warehouse to Hadoop enabled greater data
integration & allowed us to support all phases of analytics
[Chart axes: Business Value, Data Set]
Data sets: Customer Sales, Web Clickstream, Social Media, Product Usage
Data types: Structured, Semi-Structured, Unstructured
Analytics Ascendency Model: Descriptive, Diagnostic, Predictive, Prescriptive
Today’s Agenda
• Project motivations
• Team principles
• Design patterns
• Batch processing
• New insights
• Tips & suggestions
PROJECT MOTIVATIONS
OR, “ARE YOU CRAZY?!”
DATA PLATFORM @ GODADDY
[Architecture diagram; labels recovered from the slide]
FEEDS: Sensors, Events, Logs
DATA VISUALIZATION: Tableau, Excel
STREAM PROCESSING: Storm, Spark
CORPORATE DATA: MySQL, SQL Server, Teradata
EXTERNAL DATA SOURCES: Public, Partnerships, Subscriptions, Google Analytics
SERVING PLATFORM: ElasticSearch, Cassandra, MySQL, SQL Server
APPLICATIONS & SERVICES: Personalization, Hosting, WSB, WebPro, Search, Etc.
Other components: OpenStack, Fast-DB Sync, Secure Ingress, DataCollector, Kafka, Pigsty, Ad Hocs
DATA WAREHOUSING @ GODADDY
OR, “THE METHOD TO OUR MADNESS”
TEAM PRINCIPLES
HOW WE GET WORK DONE
MAKE IT EASY
• data should be easy to
  • discover
  • understand
  • consume
  • maintain
• favor simplicity in both design & process
• automate!
DELIVER VALUE
• weekly Agile sprints
• focus incessantly on business needs & value
• be flexible
• deliver quickly
ENSURE QUALITY
• ‘Data First’ design
• data quality is critical to adoption
• automated data quality tests
• visibility into failures & warnings
• self-healing design
DATA WAREHOUSE – DESIGN DECISIONS
TO SUPPORT ALL TYPES OF ANALYTICS
• Basically, a variant of Kimball
• Wide, denormalized facts
• Minimize “reference tables” and “flat dimensions” to reduce the need for expensive joins
• Integrated, conformed dimensions
• Maintain data at the lowest granularity (including transactional, when available)
• Preserve source data in full fidelity
• Minimize derived data (e.g., store birthdate rather than age)
• Type 4 SCD (“history table”)
• Natural keys (instead of surrogate keys)
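As an illustration of the "minimize derived data" point, here is a minimal Python sketch, assuming the warehouse stores only the birthdate and that age is derived when queried. The function name `age_on` is hypothetical and not part of the original deck.

```python
from datetime import date


def age_on(birthdate: date, as_of: date) -> int:
    """Whole years between birthdate and as_of (hypothetical helper)."""
    years = as_of.year - birthdate.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years
```

Because the stored attribute is the birthdate, the same row yields a correct age for any reporting date without reprocessing.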
EXAMPLE DIMENSION TABLE – TRANSACTIONAL
DIM_CUSTOMER_TX
| ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code |
|---|---|---|---|---|---|---|---|
| 2015-06-01 9:00 AM | {ecomm, 2015-06-01 8:55 AM, update} | False | 111 | | 2007-07-02 | | N/A |
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 111 | | 2007-07-02 | USA | US |
| 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
| 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
EXAMPLE DIMENSION TABLE – SNAPSHOT
DIM_CUSTOMER
| ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code |
|---|---|---|---|---|---|---|
| 2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US |
| 2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA |
| 2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
TRANSACTIONAL VS. SNAPSHOT
TRANSACTIONAL TABLE
• Immutable, append-only
• Unique on [Transaction Timestamp + Natural Key]
• Records have logical deletion indicators
• Sourced from raw imported data
• Populated by Pig script (data engineer)
• New data is always appended
• Minimizes complexity
• Provides dynamic “point-in-time” query functionality
• Typically used for PA, ML, & SOX
• Optimized for ETL processes
SNAPSHOT TABLE
• Mutable “snapshot” that rolls up transactions
• Unique on [Natural Key]
• May either use logical deletion or exclude deleted records
• Sourced from the processed, transactional table
• Populated using an automated snapshotting process
• Replaces the prior snapshot each time it executes
• Automates complexity
• Provides historical visibility via “archives”
• Default data source for most queries & reports
• Optimized for querying
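The relationship between the two table types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`tx_ts`, `action`, `deleted`, `cust_id`) and the `snapshot` function are hypothetical, the rows mirror the fictitious DIM_CUSTOMER_TX example, and the deck itself describes Pig scripts and an automated snapshotting process rather than this code.

```python
from datetime import datetime

# Fictitious append-only transactional rows (field names are illustrative).
TX_ROWS = [
    {"tx_ts": datetime(2015, 6, 1, 8, 55), "action": "update", "deleted": False, "cust_id": 111},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "action": "update", "deleted": False, "cust_id": 222},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "action": "update", "deleted": False, "cust_id": 111},
    {"tx_ts": datetime(2015, 6, 1, 9, 22), "action": "insert", "deleted": False, "cust_id": 333},
    {"tx_ts": datetime(2015, 6, 1, 9, 37), "action": "delete", "deleted": True, "cust_id": 111},
]


def snapshot(rows, as_of=None):
    """Roll transactions up to the latest row per natural key.

    With as_of set, transactions after that moment are ignored, which gives
    a point-in-time view of the same data.
    """
    latest = {}
    for row in sorted(rows, key=lambda r: r["tx_ts"]):
        if as_of is not None and row["tx_ts"] > as_of:
            continue
        latest[row["cust_id"]] = row  # later transactions overwrite earlier ones
    return latest


current = snapshot(TX_ROWS)
earlier = snapshot(TX_ROWS, as_of=datetime(2015, 6, 1, 9, 0))
```

Here `current` holds one row per customer with customer 111 logically deleted, while `earlier` reconstructs the state at 9:00 AM, before customers 222 and 333 had any transactions.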
EXAMPLE FACT TABLE – APPEND-ONLY
FACT_USER_EVENT
| ETL Timestamp | Event Timestamp | Event Type | Customer ID | Location | Event JSON |
|---|---|---|---|---|---|
| 2015-06-01 9:00 AM | 2015-06-01 8:55 AM | wsb.login | 111 | UK | {"event":"wsb.login","cust_id":111,"wsb_id":579} |
| 2015-06-01 9:15 AM | 2015-06-01 9:14 AM | call.inbound | 222 | IN | {"event":"call.inbound","cust_id":222,"rep_id":25} |
| 2015-06-01 9:15 AM | 2015-06-01 9:14 AM | account.create | 333 | US | {"event":"account.create","cust_id":333} |
| 2015-06-01 9:30 AM | 2015-06-01 9:22 AM | wsb.config | 111 | UK | {"event":"wsb.config","cust_id":111,"wsb_id":579} |
| 2015-06-01 9:45 AM | 2015-06-01 9:37 AM | o365.provision | 222 | IN | {"event":"o365.provision","cust_id":222,"rep_id":25} |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
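To show how a consumer might read the semi-structured Event JSON column, here is a small Python sketch. The row layout, the key names in `row`, and the helper `event_attribute` are assumptions made for illustration; the payload is written with straight quotes so that it is valid JSON.

```python
import json

# Fictitious row modeled on the FACT_USER_EVENT example (keys are illustrative).
row = {
    "event_type": "wsb.login",
    "customer_id": 111,
    "location": "UK",
    "event_json": '{"event": "wsb.login", "cust_id": 111, "wsb_id": 579}',
}


def event_attribute(fact_row, key, default=None):
    """Read one attribute from the JSON payload, or default if it is absent."""
    payload = json.loads(fact_row["event_json"])
    return payload.get(key, default)
```

Attributes that only some event types carry, such as `rep_id`, simply come back as the default for rows that lack them, so the fact table needs no schema change when a new event type appears.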
BATCH PROCESSING
OR, “MAKING IT ALL HAPPEN”
LOGICAL DATA LAYERS
HDFS DRILL-DOWN
[Layer diagram; labels recovered from the slide]
Layers: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team)
Sources: Kafka, Fast-DB Sync, Stage VMs, External Data
Objects: Raw Event Data, Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view)
Data labels: Append-Only Data, Integrated Data, Pre-Aggregated Data, Transformed Data
Processing: Pig; Pig -or- Hive
Logical construct only!
Users & processes can consume from any layer
BATCH PROCESSING DRILL-DOWN
INCREMENTAL PATTERN
HDFS layers: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team)
Pigsty
• next(tx_date) = $date
• foreach destination server: prep()
• execute(script.pig)
  1. filter transactional source (tx_date=$date)
  2. store transactional to HDFS (tx_date=$date)
  3. store aggregations to HDFS (tx_date=$date)
  4. store to destination server(s)
• execute(data_quality_tests)
• if (tests=pass): merge(destination)
• replace(dim_customer_snapshot)
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
HDFS paths (partitioned by tx_date):
data-ingress / ecomm / customer_tx / tx_date=20150601, tx_date=20150602
data-mgmt / dim_customer_tx / tx_date=20150601, tx_date=20150602
data-rpt / new_customers / tx_date=20150601, tx_date=20150602
Snapshots: customer (snapshot), dim_customer (snapshot)
SERVING PLATFORM: MySQL, SQL Server, Cassandra
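The incremental steps on this slide can be sketched as a small Python driver. This is a sketch under stated assumptions, not Pigsty itself: `partition_path` and `run_increment` are hypothetical names, the four callables stand in for the Pig script, the data quality tests, the merge and the snapshot replacement, and skipping the snapshot replacement when tests fail is an assumption the slide does not spell out.

```python
from datetime import date


def partition_path(layer: str, *parts: str, tx_date: date) -> str:
    """Build a date-partitioned path such as data-mgmt/dim_customer_tx/tx_date=20150601."""
    return "/".join([layer, *parts, f"tx_date={tx_date:%Y%m%d}"])


def run_increment(tx_date, run_script, run_quality_tests, merge, replace_snapshot):
    """Process one tx_date: run the script, test the output, publish only on success."""
    output = run_script(tx_date)       # filter source and store results for this tx_date
    if not run_quality_tests(output):  # gate publishing on the data quality tests
        return False                   # assumption: nothing is published on failure
    merge(output)                      # merge into the destination server(s)
    replace_snapshot()                 # rebuild the snapshot last
    return True


# Usage with stand-in callables that only record what would happen.
calls = []
ok = run_increment(
    date(2015, 6, 1),
    run_script=lambda d: partition_path("data-mgmt", "dim_customer_tx", tx_date=d),
    run_quality_tests=lambda out: out.endswith("tx_date=20150601"),
    merge=lambda out: calls.append(("merge", out)),
    replace_snapshot=lambda: calls.append(("replace", "dim_customer_snapshot")),
)
```

Keying every step on a single `tx_date` partition is what makes a rerun of one day safe: the same partition is rewritten rather than appended to twice.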
BATCH PROCESSING EXAMPLE – PRE-DEPLOYMENT
DIM_CUSTOMER_TX – TRANSACTIONAL
| ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code |
|---|---|---|---|---|---|---|---|
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
| 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
| 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
DIM_CUSTOMER_TX – TRANSACTIONAL
| ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code | Gender |
|---|---|---|---|---|---|---|---|---|
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA | |
| 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A | |
| 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US | |
| 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:14 AM, deploy} | False | 222 | Female | 2010-06-06 | CA | CA | Female |
| 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:22 AM, deploy} | False | 333 | Male | 2015-06-01 | blah | N/A | Male |
| 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:37 AM, deploy} | True | 111 | | 2007-07-02 | USA | US | Female |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
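One way to read the post-deployment table is that a new column is rolled out by appending a fresh "deploy" row per customer rather than updating existing rows. The Python sketch below illustrates that reading only; the function `deploy_rows`, the field names, and the `derive` callback are hypothetical, and how the new Gender value is actually computed is not stated in the deck.

```python
from datetime import datetime


def deploy_rows(latest_rows, etl_ts, new_column, derive):
    """Append-only backfill: emit one new 'deploy' row per natural key.

    Existing rows are never modified; each new row copies the latest state,
    keeps the source transaction timestamp, and adds the new column.
    """
    out = []
    for row in latest_rows:
        new_row = dict(row)  # copy, so the original row stays untouched
        new_row["etl_ts"] = etl_ts
        new_row["action"] = "deploy"
        new_row[new_column] = derive(row)
        out.append(new_row)
    return out


# Fictitious example row; derive() here simply reuses the original value.
existing = [
    {"etl_ts": datetime(2015, 6, 1, 9, 15), "source_ts": datetime(2015, 6, 1, 9, 14),
     "action": "update", "cust_id": 222, "original_gender": "Female"},
]
appended = deploy_rows(existing, datetime(2015, 6, 1, 10, 0), "gender",
                       lambda r: r["original_gender"])
```

Because the pre-deployment rows remain in place, point-in-time queries against earlier ETL timestamps still return the old shape of the data.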
BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
DIM_CUSTOMER – SNAPSHOT
| ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code | Gender |
|---|---|---|---|---|---|---|---|
| 2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US | Female |
| 2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA | Female |
| 2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A | Male |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
REAL BUSINESS RESULTS
OR, “HADOOP: NOT JUST HYPE!”
HADOOP ENABLES GREATER & QUICKER DW VALUE
• Better use of data engineers
• data ingress is largely automated
• reduces (not eliminates) the traditional 75-80% of project time spent on ETL
• Well-suited for Agile
• source data is preserved in full fidelity
• minimizes permanence of design decisions
• roll out changes weekly
• Data integration
• access to the other 79.7% of the company’s data
• flexible data models using complex data types
• Single source of data processing
• Process once => export 0:N destinations; supports all data consumers
• Frees up expensive database resources
• Single enterprise solution for data quality, monitoring, etc.
PROCESSED DATA HAS GREATER REACH
Descriptive: What has / is happening?
Diagnostic: Why did it happen?
Predictive: What will likely happen?
Prescriptive: How can we make it happen?
The same attributes used for reporting can be inputs into PA models & ML algorithms!
Data source labels on the slide: primarily uses snapshots; primarily uses transactional; uses both snapshots & transactional
UNLOCKING NEW, ACTIONABLE INSIGHTS
[Chart axes: Complex Data, Complex Analysis; outcomes: Customer Experience, Business Value]
Insights: Churn Analysis, Customer Dashboard, New Attributes, Sentiment Analysis
Data combinations:
• Enterprise data + Clickstream data
• Enterprise data + Product data + Event data
• Enterprise data + External data = New enterprise data
• Enterprise data + Call transcripts
TIPS & LESSONS LEARNED
OR, “HOW TO BE AWESOME”
SUGGESTIONS TO IMPROVE YOUR HADOOP DW PROJECT
• Standardize on technology: Pig for ETL, Hive for analysis
• Focus on simplicity: if your data isn’t easy to use, you’ve failed
• Embrace flexibility: don’t shy away from complex data types
• Be predictable: use HCatalog & consistent naming standard
• Don’t be afraid of change: use data abstraction to minimize impact to consumers
• Do quick prototyping: use external tables & data extractions via Hive ODBC
• Democratize data: amazing insights can come from anywhere; embrace new data consumers
QUESTIONS?
THANK YOU FOR ATTENDING! 

Weitere ähnliche Inhalte

Was ist angesagt?

DATA @ NFLX (Tableau Conference 2014 Presentation)
DATA @ NFLX (Tableau Conference 2014 Presentation)DATA @ NFLX (Tableau Conference 2014 Presentation)
DATA @ NFLX (Tableau Conference 2014 Presentation)Blake Irvine
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiBrian Olsen
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Big Data Spain
 
Life is but a Stream
Life is but a StreamLife is but a Stream
Life is but a StreamDatabricks
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ NetflixMichelle Ufford
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixJerome Boulon
 
Building a Self-Service Big Data Pipeline
Building a Self-Service Big Data PipelineBuilding a Self-Service Big Data Pipeline
Building a Self-Service Big Data PipelineDataWorks Summit
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Spark Summit
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
 
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Databricks
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowGary Stafford
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark Summit
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Databricks
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, HadoopSenthil Pandurangan
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesSingleStore
 

Was ist angesagt? (20)

DATA @ NFLX (Tableau Conference 2014 Presentation)
DATA @ NFLX (Tableau Conference 2014 Presentation)DATA @ NFLX (Tableau Conference 2014 Presentation)
DATA @ NFLX (Tableau Conference 2014 Presentation)
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel Pedreschi
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 
Life is but a Stream
Life is but a StreamLife is but a Stream
Life is but a Stream
 
Scaling Data Quality @ Netflix
Scaling Data Quality @ NetflixScaling Data Quality @ Netflix
Scaling Data Quality @ Netflix
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ Netflix
 
Building a Self-Service Big Data Pipeline
Building a Self-Service Big Data PipelineBuilding a Self-Service Big Data Pipeline
Building a Self-Service Big Data Pipeline
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
 
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache Airflow
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming Architectures
 

Andere mochten auch

Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixData Con LA
 
Netflix-Using analytics to predict hits
Netflix-Using analytics to predict hitsNetflix-Using analytics to predict hits
Netflix-Using analytics to predict hitsGaurav Dutta
 
Use of Analytics by Netflix - Case Study
Use of Analytics by Netflix - Case StudyUse of Analytics by Netflix - Case Study
Use of Analytics by Netflix - Case StudySaket Toshniwal
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsBlake Irvine
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programminghye-jin-lee
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Sudhir Tonse
 
Building an Automated Database Deployment Pipeline
Building an Automated Database Deployment PipelineBuilding an Automated Database Deployment Pipeline
Building an Automated Database Deployment PipelineGrant Fritchey
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsDavid Portnoy
 
Guide to OKR (Objectives & Key Results)
Guide to OKR (Objectives & Key Results)Guide to OKR (Objectives & Key Results)
Guide to OKR (Objectives & Key Results)Mustansir Husain
 

Andere mochten auch (11)

Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Netflix-Using analytics to predict hits
Netflix-Using analytics to predict hitsNetflix-Using analytics to predict hits
Netflix-Using analytics to predict hits
 
Use of Analytics by Netflix - Case Study
Use of Analytics by Netflix - Case StudyUse of Analytics by Netflix - Case Study
Use of Analytics by Netflix - Case Study
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
 
Building an Automated Database Deployment Pipeline
Building an Automated Database Deployment PipelineBuilding an Automated Database Deployment Pipeline
Building an Automated Database Deployment Pipeline
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse Platforms
 
Guide to OKR (Objectives & Key Results)
Guide to OKR (Objectives & Key Results)Guide to OKR (Objectives & Key Results)
Guide to OKR (Objectives & Key Results)
 

Ähnlich wie Data Warehousing Patterns for Hadoop

Data Warehousing using Hadoop
Data Warehousing using HadoopData Warehousing using Hadoop
Data Warehousing using HadoopDataWorks Summit
 
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.Milomir Vojvodic
 
Case Study: Manheim Implements Test Data Management to Reduce Testing Time an...
Case Study: Manheim Implements Test Data Management to Reduce Testing Time an...Case Study: Manheim Implements Test Data Management to Reduce Testing Time an...
Case Study: Manheim Implements Test Data Management to Reduce Testing Time an...CA Technologies
 
Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...
Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...
Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...CA Technologies
 
Realtime Reporting with GoldenGate
Realtime Reporting with GoldenGateRealtime Reporting with GoldenGate
Realtime Reporting with GoldenGateEmtec Inc.
 
Achieving Agility and Scale for Your Data Lake - Talend
Achieving Agility and Scale for Your Data Lake - TalendAchieving Agility and Scale for Your Data Lake - Talend
Achieving Agility and Scale for Your Data Lake - TalendTalend
 
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...Dobo Radichkov
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
The State of Streaming Analytics: The Need for Speed and Scale
The State of Streaming Analytics: The Need for Speed and ScaleThe State of Streaming Analytics: The Need for Speed and Scale
The State of Streaming Analytics: The Need for Speed and ScaleVoltDB
 
Automate Hadoop Jobs with Real World Business Impact
Automate Hadoop Jobs with Real World Business ImpactAutomate Hadoop Jobs with Real World Business Impact
Automate Hadoop Jobs with Real World Business ImpactCA Technologies
 
In Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAPIn Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAPHBaseCon
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
 
KScope14 - Real-Time Data Warehouse Upgrade - Success Stories
KScope14 - Real-Time Data Warehouse Upgrade - Success StoriesKScope14 - Real-Time Data Warehouse Upgrade - Success Stories
KScope14 - Real-Time Data Warehouse Upgrade - Success StoriesMichael Rainey
 
Oracle Data Integration CON9737 at OpenWorld
Oracle Data Integration CON9737 at OpenWorldOracle Data Integration CON9737 at OpenWorld
Oracle Data Integration CON9737 at OpenWorldJeffrey T. Pollock
 
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ...
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ...MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ...
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ...MongoDB
 
Django as a Data Tool in the Enterprise - PyData New York 2015
Django as a Data Tool in the Enterprise - PyData New York 2015Django as a Data Tool in the Enterprise - PyData New York 2015
Django as a Data Tool in the Enterprise - PyData New York 2015trentoliphant
 
MongoDB IoT City Tour LONDON: Analysing the Internet of Things: Davy Nys, Pen...
MongoDB IoT City Tour LONDON: Analysing the Internet of Things: Davy Nys, Pen...MongoDB IoT City Tour LONDON: Analysing the Internet of Things: Davy Nys, Pen...
MongoDB IoT City Tour LONDON: Analysing the Internet of Things: Davy Nys, Pen...MongoDB
 

Data Warehousing Patterns for Hadoop

  • 1. CONFIDENTIAL. COPYRIGHT © 2015 GODADDY INC. ALL RIGHTS RESERVED. DATA WAREHOUSING USING HADOOP MICHELLE UFFORD DATA PLATFORM @ GODADDY 10 JUNE 2015
  • 2. DATA WAREHOUSING IN HADOOP IS EXTREMELY POWERFUL
       Moving our data warehouse to Hadoop enabled greater data integration & allowed us to support all phases of analytics.
       [Figure: Analytics Ascendency Model. Axes: Data Set, Business Value. Data set labels: Customer Sales, Web Clickstream, Social Media, Product Usage. Data types: Structured, Semi-Structured, Unstructured. Analytics phases: Descriptive, Diagnostic, Predictive, Prescriptive.]
       Today’s Agenda
       • Project motivations
       • Team principles
       • Design patterns
       • Batch processing
       • New insights
       • Tips & suggestions
  • 3. PROJECT MOTIVATIONS OR, “ARE YOU CRAZY?!”
  • 4. DATA PLATFORM @ GODADDY
       [Architecture diagram. Section titles: Feeds, Data Visualization, Stream Processing, Corporate Data, External Data Sources, Serving Platform, Applications & Services. Component labels: Sensors, Events, Logs; Tableau, Excel; Storm, Spark; MySQL, SQL Server, Teradata; Public, Partnerships, Subscriptions, Google Analytics; ElasticSearch, Cassandra; OpenStack; Personalization, Hosting, WSB, WebPro, Search, etc.; Fast-DB Sync, Secure Ingress, DataCollector, Pigsty, Ad Hocs, Kafka.]
  • 5. DATA WAREHOUSING @ GODADDY OR, “THE METHOD TO OUR MADNESS”
  • 6. TEAM PRINCIPLES • data should be easy to • discover • understand • consume • maintain • favor simplicity in both design & process • automate! HOW WE GET WORK DONE MAKE IT EASY • weekly Agile sprints • focus incessantly on business needs & value • be flexible • deliver quickly • ‘Data First’ design • data quality is critical to adoption • automated data quality tests • visibility into failures & warnings • self-healing design DELIVER VALUE ENSURE QUALITY
  • 7. DATA WAREHOUSE – DESIGN DECISIONS TO SUPPORT ALL TYPES OF ANALYTICS
       • Basically, a variant of Kimball
       • Wide, denormalized facts
       • Minimize “reference tables” and “flat dimensions” to reduce the need for expensive joins
       • Integrated, conformed dimensions
       • Maintain data at the lowest granularity (including transactional, when available)
       • Preserve source data in full fidelity
       • Minimize derived data (e.g. store birthdate rather than age)
       • Type 4 SCD (“history table”)
       • Natural keys (instead of surrogate keys)
  • 8. EXAMPLE DIMENSION TABLE – TRANSACTIONAL
       DIM_CUSTOMER_TX
       ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code
       2015-06-01 9:00 AM | {ecomm, 2015-06-01 8:55 AM, update} | False | 111 | | 2007-07-02 | | N/A
       2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA
       2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 111 | | 2007-07-02 | USA | US
       2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A
       2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
  • 9. EXAMPLE DIMENSION TABLE – SNAPSHOT
       DIM_CUSTOMER
       ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code
       2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US
       2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA
       2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
  • 10. TRANSACTIONAL VS. SNAPSHOT
       TRANSACTIONAL TABLE
       • Immutable, append-only
       • Unique on [Transaction Timestamp + Natural Key]
       • Records have logical deletion indicators
       • Sourced from raw imported data
       • Populated by Pig script (data engineer)
       • New data is always appended
       • Minimizes complexity
       • Provides dynamic “point-in-time” query functionality
       • Typically used for PA, ML, & SOX
       • Optimized for ETL processes
       SNAPSHOT TABLE
       • Mutable “snapshot” that rolls up transactions
       • Unique on [Natural Key]
       • May either use logical deletion or exclude deleted records
       • Sourced from the processed, transactional table
       • Populated using an automated snapshotting process
       • Replaces the prior snapshot each time it executes
       • Automates complexity
       • Provides historical visibility via “archives”
       • Default data source for most queries & reports
       • Optimized for querying
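
The relationship between the two table types on slide 10 can be sketched in Python. This is a minimal illustration only, not the automated snapshotting process the deck refers to: the row layout (`cust_id`, `tx_ts`, `deleted`, `country_code`) and the function name `build_snapshot` are assumptions made for the example, and the sample rows mirror the fictitious data from the transactional table slide.

```python
from datetime import datetime


def build_snapshot(tx_rows, as_of=None, keep_deleted=True):
    """Roll append-only transactional rows up to one row per natural key.

    tx_rows      -- iterable of dicts (hypothetical layout, see TX below)
    as_of        -- optional cutoff for a point-in-time view
    keep_deleted -- True: keep deleted keys flagged inactive (logical deletion)
                    False: exclude deleted keys from the snapshot
    """
    latest = {}
    for row in tx_rows:
        if as_of is not None and row["tx_ts"] > as_of:
            continue  # ignore transactions after the requested point in time
        current = latest.get(row["cust_id"])
        if current is None or row["tx_ts"] >= current["tx_ts"]:
            latest[row["cust_id"]] = row  # the most recent transaction wins

    snapshot = []
    for cust_id in sorted(latest):
        row = latest[cust_id]
        if row["deleted"] and not keep_deleted:
            continue
        snapshot.append({
            "cust_id": cust_id,
            "active": not row["deleted"],
            "country_code": row["country_code"],
        })
    return snapshot


# Fictitious rows modelled on the DIM_CUSTOMER_TX example.
TX = [
    {"cust_id": 111, "tx_ts": datetime(2015, 6, 1, 8, 55), "deleted": False, "country_code": "N/A"},
    {"cust_id": 222, "tx_ts": datetime(2015, 6, 1, 9, 14), "deleted": False, "country_code": "CA"},
    {"cust_id": 111, "tx_ts": datetime(2015, 6, 1, 9, 14), "deleted": False, "country_code": "US"},
    {"cust_id": 333, "tx_ts": datetime(2015, 6, 1, 9, 22), "deleted": False, "country_code": "N/A"},
    {"cust_id": 111, "tx_ts": datetime(2015, 6, 1, 9, 37), "deleted": True, "country_code": "US"},
]

if __name__ == "__main__":
    for row in build_snapshot(TX):
        print(row)
```

Passing `as_of` gives the “point-in-time” view that the transactional table makes possible, while the default call yields one row per customer, comparable to the snapshot table.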
  • 11. EXAMPLE FACT TABLE – APPEND-ONLY
       FACT_USER_EVENT
       ETL Timestamp | Event Timestamp | Event Type | Customer ID | Location | Event JSON
       2015-06-01 9:00 AM | 2015-06-01 8:55 AM | wsb.login | 111 | UK | {"event":"wsb.login","cust_id":111,"wsb_id":579}
       2015-06-01 9:15 AM | 2015-06-01 9:14 AM | call.inbound | 222 | IN | {"event":"call.inbound","cust_id":222,"rep_id":25}
       2015-06-01 9:15 AM | 2015-06-01 9:14 AM | account.create | 333 | US | {"event":"account.create","cust_id":333}
       2015-06-01 9:30 AM | 2015-06-01 9:22 AM | wsb.config | 111 | UK | {"event":"wsb.config","cust_id":111,"wsb_id":579}
       2015-06-01 9:45 AM | 2015-06-01 9:37 AM | o365.provision | 222 | IN | {"event":"o365.provision","cust_id":222,"rep_id":25}
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
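
The fact table above keeps the full event payload as JSON next to a few promoted columns. A rough Python sketch of that idea follows; the function name `to_fact_row` and the output key names are hypothetical, chosen only to mirror the column headings on the slide, and the deck does not say how the real rows are produced.

```python
import json


def to_fact_row(etl_ts, event_ts, location, raw_json):
    """Promote commonly queried attributes to columns and keep the raw JSON.

    Keeping the untouched payload preserves the source in full fidelity, so
    attributes such as wsb_id or rep_id remain available without a schema change.
    """
    payload = json.loads(raw_json)
    return {
        "etl_timestamp": etl_ts,
        "event_timestamp": event_ts,
        "event_type": payload["event"],     # promoted from the payload
        "customer_id": payload["cust_id"],  # promoted from the payload
        "location": location,
        "event_json": raw_json,             # original payload, unmodified
    }


if __name__ == "__main__":
    row = to_fact_row(
        "2015-06-01 9:00 AM",
        "2015-06-01 8:55 AM",
        "UK",
        '{"event":"wsb.login","cust_id":111,"wsb_id":579}',
    )
    print(row["event_type"], row["customer_id"])
```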
  • 12. BATCH PROCESSING OR, “MAKING IT ALL HAPPEN”
  • 13. HDFS DRILL-DOWN: LOGICAL DATA LAYERS
       [Diagram. Layers: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team). Inputs: Stage VMs, Kafka, Fast-DB Sync, Raw Event Data, External Data. Objects shown in the layers: Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view). Data labels: Append-Only Data, Transformed Data, Integrated Data, Pre-Aggregated Data. Processing labels: Pig, Pig -or- Hive.]
       Logical construct only! Users & processes can consume from any layer.
  • 14. BATCH PROCESSING DRILL-DOWN: INCREMENTAL PATTERN
       Pigsty
       • next(tx_date) = $date
       • foreach destination server prep()
       • execute(script.pig)
            1. filter transactional source (tx_date=$date)
            2. store transactional to HDFS (tx_date=$date)
            3. store aggregations to HDFS (tx_date=$date)
            4. store to destination server(s)
       • execute(data_quality_tests)
       • if (tests=pass) merge(destination)
       • replace(dim_customer_snapshot)
       HDFS paths shown:
       • data-ingress / ecomm / customer_tx / tx_date=20150601, tx_date=20150602
       • data-mgmt / dim_customer_tx / tx_date=20150601, tx_date=20150602
       • data-rpt / new_customers / tx_date=20150601, tx_date=20150602
       [Diagram labels: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team), HDFS, customer (snapshot), dim_customer (snapshot). Serving Platform: MySQL, SQL Server, Cassandra.]
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
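
The incremental pattern on slide 14 can be sketched as a single Python function. This is an illustrative stand-in under stated assumptions, not Pigsty itself: plain dictionaries play the role of the HDFS partitions and the destination server, and the names `run_incremental`, `SOURCE`, and `TESTS` are hypothetical. The point it shows is the ordering: write the `tx_date` partition, run the data quality tests, and merge into the destination only when they pass.

```python
def run_incremental(tx_date, source_rows, warehouse, destination, tests):
    """Process one tx_date partition and merge it only if all tests pass.

    warehouse and destination are dicts keyed by tx_date (in-memory stand-ins
    for partitioned storage), so re-running a date replaces its partition
    instead of appending duplicates.
    """
    # Filter the transactional source down to the partition being processed.
    batch = [row for row in source_rows if row["tx_date"] == tx_date]

    # Store the partition in the warehouse layer.
    warehouse[tx_date] = batch

    # Run the data quality tests before anything reaches consumers.
    failures = [name for name, check in tests.items() if not check(batch)]
    if failures:
        return {"tx_date": tx_date, "merged": False, "failures": failures}

    # Merge into the destination only after every test has passed.
    destination[tx_date] = batch
    return {"tx_date": tx_date, "merged": True, "failures": []}


# Fictitious inputs for demonstration purposes only.
SOURCE = [
    {"tx_date": "20150601", "cust_id": 111},
    {"tx_date": "20150601", "cust_id": 222},
    {"tx_date": "20150602", "cust_id": 333},
]

TESTS = {
    "non_empty": lambda rows: len(rows) > 0,
    "has_cust_id": lambda rows: all("cust_id" in row for row in rows),
}

if __name__ == "__main__":
    warehouse, destination = {}, {}
    print(run_incremental("20150601", SOURCE, warehouse, destination, TESTS))
    print(run_incremental("20150603", SOURCE, warehouse, destination, TESTS))
```

Keying both stores by `tx_date` is a design choice made for the sketch: it makes a re-run of the same date safe, which fits the partition-per-date layout of the HDFS paths on the slide.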
  • 15. BATCH PROCESSING EXAMPLE – PRE-DEPLOYMENT
       DIM_CUSTOMER_TX – TRANSACTIONAL
       ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code
       2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA
       2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A
       2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
  • 16. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
       DIM_CUSTOMER_TX – TRANSACTIONAL
       ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code | Gender
       2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
       2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
       2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
       2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:14 AM, deploy} | False | 222 | Female | 2010-06-06 | CA | CA | Female
       2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:22 AM, deploy} | False | 333 | Male | 2015-06-01 | blah | N/A | Male
       2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:37 AM, deploy} | True | 111 | | 2007-07-02 | USA | US | Female
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
  • 17. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
       DIM_CUSTOMER – SNAPSHOT
       ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code | Gender
       2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US | Female
       2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA | Female
       2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A | Male
       Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
  • 18. REAL BUSINESS RESULTS OR, “HADOOP: NOT JUST HYPE!”
  • 19. HADOOP ENABLES GREATER & QUICKER DW VALUE
       • Better use of data engineers
            • data ingress is largely automated
            • reduces (not eliminates) the traditional 75-80% of project time spent on ETL
       • Well-suited for Agile
            • full source data is preserved in full fidelity
            • minimizes permanence of design decisions
            • roll out changes weekly
       • Data integration
            • access to the other 79.7% of the company’s data
            • flexible data models using complex data types
       • Single source of data processing
            • Process once => export 0:N destinations; supports all data consumers
            • Frees up expensive database resources
            • Single enterprise solution for data quality, monitoring, etc.
  • 20. PROCESSED DATA HAS GREATER REACH
       • Descriptive: What has happened / is happening?
       • Diagnostic: Why did it happen?
       • Predictive: What will likely happen?
       • Prescriptive: How can we make it happen?
       The same attributes used for reporting can be inputs into PA models & ML algorithms!
       [Diagram legend: primarily uses snapshots; primarily uses transactional; uses both snapshots & transactional.]
  • 21. UNLOCKING NEW, ACTIONABLE INSIGHTS
       [Diagram. Axis labels: Customer Experience, Business Value, Complex Data, Complex Analysis.]
       Insights shown: Churn Analysis, Customer Dashboard, New Attributes, Sentiment Analysis
       Data combinations shown:
       • Enterprise data + Clickstream data
       • Enterprise data + Product data + Event data
       • Enterprise data + External data = New enterprise data
       • Enterprise data + Call transcripts
  • 22. TIPS & LESSONS LEARNED OR, “HOW TO BE AWESOME”
  • 23. SUGGESTIONS TO IMPROVE YOUR HADOOP DW PROJECT
       • Standardize on technology: Pig for ETL, Hive for analysis
       • Focus on simplicity: if your data isn’t easy to use, you’ve failed
       • Embrace flexibility: don’t shy away from complex data types
       • Be predictable: use HCatalog & consistent naming standards
       • Don’t be afraid of change: use data abstraction to minimize impact to consumers
       • Do quick prototyping: use external tables & data extractions via Hive ODBC
       • Democratize data: amazing insights can come from anywhere; embrace new data consumers
  • 24. QUESTIONS? THANK YOU FOR ATTENDING!