SlideShare a Scribd company logo
1 of 47
Download to read offline
Building a Data Warehouse for
Business Analytics using Spark SQL
Copyright Edmunds.com, Inc. (the “Company”). Edmunds and the Edmunds.com logo are registered
trademarks of the Company. This document contains proprietary and/or confidential information of the
Company. No part of this document or the information it contains may be used, or disclosed to any
person or entity, for any purpose other than advancing the best interests of the Company, and any
such disclosure requires the express approval of the Company. 
Blagoy Kaloferov
Software Engineer
06/15/2015
About me:
Blagoy Kaloferov
Big Data Software Engineer
About my company:
Edmunds.com is a car buying platform
18M+ unique visitors each month
Today’s talk
“It’s about time!”
 


Agenda
1.  Introduction and Architecture
2.  Building Next Gen DWH
3.  Automating Ad Revenue using Spark SQL
4.  Conclusion
Business Analytics at
Edmunds.com
Divisions:
DWH Engineers
Business Analysts
Statistics team
Two major groups:
Map Reduce / Spark developers
Analysts with advanced SQL skills
Spark SQL :
•  Simplified ETL and enhanced Visualization tools
•  Allows anyone in BA to quickly build new Data marts
•  Enabled a scalable POC to Production process for our
projects
Proposition
Data Ingestions / ETL
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
DWH Developers Business Analyst
Hadoop Cluster
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Spark
Spark SQL
Databricks
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Spark
Spark SQL
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Spark
Spark SQL
ETL
Our approach:
o  Spark SQL tables similar to
existing our Redshift tables
o  Best fit for us are Hive tables
pointing to S3 delimited data
o  Exposed hundreds of Spark SQL
tables
Exposing S3 data via
Spark SQL
•  S3 Datasets have thousands of directories:
( location	
  /year/month/day/hour	
  )
•  Every new S3 directory for each dataset has to be registered
Adding Latest Table Partitions
Adding Latest Table Partitions
Spark SQL tables
S3 dataset
S3 dataset
S3 dataset
S3 dataset
S3 dataset
S3 dataset
S3 dataset2015/05/31/01
2015/05/31/02
Spark SQL table
partitions
Utilities to for Spark SQL tables and S3:
•  Register valid partitions of Spark SQL tables with spark Hive Metastore
•  Create Last_X_ Days copy of any Spark SQL table in memory
Scheduled jobs:
•  Registers latest available directories for all Spark SQL tables
programmatically
•  Updates Last_3_ Days of core datasets in memory
Adding Latest Table PartitionsSpark SQL tables
“It’s about time!”
 


S3 and Spark SQL potential
o  Now that all S3 data is easily
accessible, there are a lot of
opportunities !
o  Anyone can ETL on prefixed
aggregates and create new Data Marts
Spark Cluster
Spark SQL tables
Last_3_days Tables
Utilities and UDF’s
Business
Intelligence
Tools
Faster Pipeline
Better Insights
Platfora Dashboards Pipeline
Optimization
o  Platfora is a Visualization Analytics Tool
o  Provides More than 200 dashboards for BA
o  Uses MapReduce to load aggregates
Source Dataset Build / Update Lens Dashboards
HDFS
S3
Joined
Datasets
MapReduce
jobs
Platfora Dashboards Pipeline
Optimization
Limitations:
o  We can not optimize the Platfora Map Reduce jobs
o  Defined Data Marts not available elsewhere
Source Dataset Build / Update Lens Dashboards
HDFS
S3
Joined
Datasets
MapReduce
jobs
Join	
  on	
  lead_id	
  ,	
  inventory_id,	
  	
  
visitor_id,	
  dealer_id	
  …	
  
	
  
Platfora Dealer Leads dataset
Use Case
o  Dealer Leads: Lead Submitter insights dataset
o  More than 40 Visual Dashboards are using Dealer Leads
Lead
Submitter Data
Region
Info
Transaction
Data
Vehicle
Info
Dealer
Info
Dealer Leads
Joined Dataset
Lead Categorization
Lead
Submitter
Insights
join
“It’s about time!”
 


Optimizing Dealer Leads Dataset
Dealer Leads Platfora Dataset stats:
o  300+ attributes
o  Usually takes 2-3 hours to build lens
o  Scheduled to build daily
“It’s about time!”
 


Optimizing Dealer Leads Dataset
How do we optimize it?
“It’s about time!”
 


Optimizing Dealer Leads Dataset
How do we optimize it?
1. Have Spark SQL do the work!
o  All required datasets are exposed as Spark SQL tables
o  Add new useful attributes
2. Make the ETL easy for anyone in
Business Analytics to do it themselves
o  Provide utilities and UDF’s so that aggregated data can
be exposed to Visualization tools
Dealer Leads Using Spark SQL
Demo
Dealer Leads Data Mart using Spark SQL
Demo
Expose all original 300+ attributes
Enhance: Join with site_traffic
Dealer Leads
Dataset
Traffic Data
Lead submitter journey
Entry page, page views, device …
aggregate_traffic_spark_sql
Dealer Leads Using Spark SQL
results
o  Spark SQL aggregation in 10 minutes.
o  Adds dimension attributes that were not available before
o  Platfora does not need to join aggregates
o  Significantly reduced latency
o  Dashboard refreshed every 2 hours instead of once per day.
Spark SQL
Dealer Leads
Lens
10 minutes 10 minutes
Dashboards
ETL and Visualization
takeaway
o  Now anyone in BA can perform and support ETL on their own
o  New Data marts can be exported to RDBMS
S3
New Data Marts
Using Spark SQL Redshift
Platfora
Tableau
Spark Cluster
Spark SQL tables
Last N days Tables
Utilities
Spark SQL
connector
ETL
load
Usual POC process
o  Business Analyst Project Prototype in SQL
o  Not scalable. Takes Ad Hoc resources from RDBMS
o  SQL to Map Reduce
o  Transition from two very different frameworks
o  MR do not always fit complicated business logic.
o  Supported only by Developers
POC with Spark SQL vision
“It’s about time!”
 


POC with Spark SQL vision
new POC process using Spark
o  A developer and BA can work together on the same
platform and collaborate using Spark
o  Its scalable
o  No need to switch frameworks when productionalizing
Ad Revenue Billing Use Case
Definitions:
Impression, CPM
Line Item, Order
Introduction
OEM Advertising on website
Introduction
OEM Advertising on website
Ad Revenue computed at the end of the month
using OEM provided impression data
Ad Revenue Billing Use Case
Impressions served * CPM != actual revenue
o  There are billing adjustment rules!
o  Each OEM has a set of unique rules that
determine the actual revenue.
o  Adjusting revenue numbers requires manual
user inputs from OEM’s Account Manager
Ad Revenue End of Month billing
•  Line Item groupings
Line Item adjustments’ examples
Box representing each example
ORIGINAL	
  Line	
  Item	
  	
  |	
  CPM	
  |	
  impressions	
  |	
  a?ributes
•  Line Item groupings
Line Item adjustments’ examples
Box representing each example
ORIGINAL	
  Line	
  Item	
  	
  |	
  CPM	
  |	
  impressions	
  |	
  a?ributes
SUPPORT	
  Line	
  Item	
  |	
  CPM	
  |	
  impressions
•  Line Item groupings
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
Combine data
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
impressions_served > Contract ?
Line	
  Item	
  	
  |	
  impressions|	
  Contract
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
Line	
  Item	
  	
  |	
  CAPPED_impressions|	
  Contract
Cap impression!
impressions_served > Contract ?
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
impressions_served > (X% * Contract) ?
Line	
  Item	
  	
  |	
  impressions|	
  Contract
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
impressions_served > (X% * Contract) ?
Adjust impression!
Line	
  Item	
  	
  |	
  ADJUSTED_impressions|	
  Contract
1_day_impr	
  |	
  CPM	
  |	
  a?ributes
1d,	
  7d,	
  MTD,	
  QTD,	
  YTD:	
  adjusted_impr	
  |	
  CPM	
  	
  
Billing Engine
Each Line Item
Process Vision:
Impressions served * CPM = actual revenue
Can we automate ad revenue calculation?
Automation Challenges
o  Many rules , user defined inputs, the logic changes
o  Need for scalable unified platform
o  Need for tight collaboration between OEM team,
Business Analysts and DWH developers
Can we automate ad revenue calculation?
Billing Rules Modeling Project
How do we develop it?
Ad Revenue
Billing Rules Modeling Project
How do we develop it?
Spark + Spark SQL approach
BA + Developers + OEM Account Team collaboration
Goal is an Ad Performance Dashboard
Adjusted Billing Ad RevenueOEM:
Project Architecture in Spark
Sum / Transform / Join Rows
Processing separated in phases where input / outputs are
Spark SQL tables
1_day_impr	
  |	
  CPM	
  |	
  a?ributes Spark SQL RowEach Line Item =
Project Architecture in Spark
Base Line Items
Spark SQL
Merged Line Items
Spark SQL
Adjusted Line Items
Spark SQL
Phase 1 Phase 2 Phase 3
Ad Performance
Tableau
Dashboard
Spark SQL
Project Architecture in Spark
Business Analysts
DFP / Other
Impressions
Line Item
Dimensions
Base Line Items
Spark SQL
join
aggregate
Phase 1
Merged Line Items
Spark SQL
Adjusted Line Items
Spark SQL
Phase 2 Phase 3
Ad Performance
Tableau
Dashboard
Spark SQL
Project Architecture in Spark
Business Analysts BA + Developer
DFP / Other
Impressions
Line Item
Dimensions
- Manual Groupings
- Other Inputs
Spark SQL
Line Item
Merging
Engine
OEM Account
Managers
Base Line Items
Spark SQL
Merged Line Items
Spark SQL
join
aggregate
Phase 1 Phase 2
Adjusted Line Items
Spark SQL
Phase 3
Ad Performance
Tableau
Dashboard
Spark SQL
Billing Rules
Engine
Project Architecture in Spark
Business Analysts BA + Developer
DFP / Other
Impressions
Line Item
Dimensions
- Manual Groupings
- Other Inputs
Spark SQL
Line Item
Merging
Engine
Ad Performance
Tableau
Business Analysts
OEM Account
Managers
Base Line Items
Spark SQL
Merged Line Items
Spark SQL
Adjusted Line Items
Spark SQL
join
aggregate
Phase 1 Phase 2 Phase 3 Dashboard
“It’s about time!”
 


Billing Rules Modeling
Achievements
o  Increased accuracy of revenue forecasts for BA
o  Cost savings by not having a dedicated team doing
manual adjustments
o  Monitor ad delivery rate for orders
o  Allows us to detect abnormalities in ad serving
o  Collaboration between BA and DWH Developers
Thank you!
Blagoy Kaloferov
bkaloferov@edmunds.com
Questions?
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

More Related Content

What's hot

The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
DATAVERSITY
 

What's hot (20)

Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Five Connectivity and Security Use Cases for Azure VNets
Five Connectivity and Security Use Cases for Azure VNetsFive Connectivity and Security Use Cases for Azure VNets
Five Connectivity and Security Use Cases for Azure VNets
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data Science
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
A presentation for Retail Sales Projects
A presentation for Retail Sales ProjectsA presentation for Retail Sales Projects
A presentation for Retail Sales Projects
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
Data Lake: A simple introduction
Data Lake: A simple introductionData Lake: A simple introduction
Data Lake: A simple introduction
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
ABD203_Real-Time Streaming Applications on AWS
ABD203_Real-Time Streaming Applications on AWSABD203_Real-Time Streaming Applications on AWS
ABD203_Real-Time Streaming Applications on AWS
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
 

Similar to Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

SharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- TrimantraSharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- Trimantra
Mihir G.
 
Designing a Future-proof API Program
Designing a Future-proof API ProgramDesigning a Future-proof API Program
Designing a Future-proof API Program
Pronovix
 

Similar to Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds) (20)

Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Racing for the Flexibility Integrating Aras into the IT Landscape
Racing for the Flexibility Integrating Aras into the IT LandscapeRacing for the Flexibility Integrating Aras into the IT Landscape
Racing for the Flexibility Integrating Aras into the IT Landscape
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Ebay OLAP Cube
Ebay OLAP CubeEbay OLAP Cube
Ebay OLAP Cube
 
RESUME_RASMI3.6
RESUME_RASMI3.6RESUME_RASMI3.6
RESUME_RASMI3.6
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
SaurabhKasyap
SaurabhKasyapSaurabhKasyap
SaurabhKasyap
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
SharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- TrimantraSharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- Trimantra
 
Rohit Resume
Rohit ResumeRohit Resume
Rohit Resume
 
Make xCommerce fit to you
Make xCommerce fit to youMake xCommerce fit to you
Make xCommerce fit to you
 
Designing a Future-proof API Program
Designing a Future-proof API ProgramDesigning a Future-proof API Program
Designing a Future-proof API Program
 
Oracle Analytics Cloud
Oracle Analytics CloudOracle Analytics Cloud
Oracle Analytics Cloud
 
QuantiQ TEQ Day : Dynamics NAV
QuantiQ TEQ Day : Dynamics NAVQuantiQ TEQ Day : Dynamics NAV
QuantiQ TEQ Day : Dynamics NAV
 
ConnectWise Reporting Tools Overview
ConnectWise Reporting Tools OverviewConnectWise Reporting Tools Overview
ConnectWise Reporting Tools Overview
 
Optimizing Innovation: Modular Toolchains that Enable Digital Transformations
Optimizing Innovation: Modular Toolchains that Enable Digital TransformationsOptimizing Innovation: Modular Toolchains that Enable Digital Transformations
Optimizing Innovation: Modular Toolchains that Enable Digital Transformations
 
Optimizing Innovation- Modular Toolchains that Enable Digital Transformations
Optimizing Innovation-  Modular Toolchains that Enable Digital TransformationsOptimizing Innovation-  Modular Toolchains that Enable Digital Transformations
Optimizing Innovation- Modular Toolchains that Enable Digital Transformations
 
Introducing Sitecore - The Experience Platform
Introducing Sitecore - The Experience PlatformIntroducing Sitecore - The Experience Platform
Introducing Sitecore - The Experience Platform
 
APIfying an ERP - ongoing saga
APIfying an ERP - ongoing sagaAPIfying an ERP - ongoing saga
APIfying an ERP - ongoing saga
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 

Recently uploaded (20)

Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 

Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

  • 1. Building a Data Warehouse for Business Analytics using Spark SQL Copyright Edmunds.com, Inc. (the “Company”). Edmunds and the Edmunds.com logo are registered trademarks of the Company. This document contains proprietary and/or confidential information of the Company. No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Company, and any such disclosure requires the express approval of the Company. Blagoy Kaloferov Software Engineer 06/15/2015
  • 2. About me: Blagoy Kaloferov Big Data Software Engineer About my company: Edmunds.com is a car buying platform 18M+ unique visitors each month Today’s talk
  • 3. “It’s about time!”  
 Agenda 1.  Introduction and Architecture 2.  Building Next Gen DWH 3.  Automating Ad Revenue using Spark SQL 4.  Conclusion
  • 4. Business Analytics at Edmunds.com Divisions: DWH Engineers Business Analysts Statistics team Two major groups: Map Reduce / Spark developers Analysts with advanced SQL skills
  • 5. Spark SQL : •  Simplified ETL and enhanced Visualization tools •  Allows anyone in BA to quickly build new Data marts •  Enabled a scalable POC to Production process for our projects Proposition
  • 6. Data Ingestions / ETL Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs DWH Developers Business Analyst Hadoop Cluster
  • 7. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster
  • 8. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster Spark Spark SQL Databricks
  • 9. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster Spark Spark SQL
  • 10. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster Spark Spark SQL ETL
  • 11. Our approach: o  Spark SQL tables similar to existing our Redshift tables o  Best fit for us are Hive tables pointing to S3 delimited data o  Exposed hundreds of Spark SQL tables Exposing S3 data via Spark SQL
  • 12. •  S3 Datasets have thousands of directories: ( location  /year/month/day/hour  ) •  Every new S3 directory for each dataset has to be registered Adding Latest Table Partitions Adding Latest Table Partitions Spark SQL tables S3 dataset S3 dataset S3 dataset S3 dataset S3 dataset S3 dataset S3 dataset2015/05/31/01 2015/05/31/02 Spark SQL table partitions
  • 13. Utilities to for Spark SQL tables and S3: •  Register valid partitions of Spark SQL tables with spark Hive Metastore •  Create Last_X_ Days copy of any Spark SQL table in memory Scheduled jobs: •  Registers latest available directories for all Spark SQL tables programmatically •  Updates Last_3_ Days of core datasets in memory Adding Latest Table PartitionsSpark SQL tables
  • 14. “It’s about time!”  
 S3 and Spark SQL potential o  Now that all S3 data is easily accessible, there are a lot of opportunities ! o  Anyone can ETL on prefixed aggregates and create new Data Marts Spark Cluster Spark SQL tables Last_3_days Tables Utilities and UDF’s Business Intelligence Tools Faster Pipeline Better Insights
  • 15. Platfora Dashboards Pipeline Optimization o  Platfora is a Visualization Analytics Tool o  Provides More than 200 dashboards for BA o  Uses MapReduce to load aggregates Source Dataset Build / Update Lens Dashboards HDFS S3 Joined Datasets MapReduce jobs
  • 16. Platfora Dashboards Pipeline Optimization Limitations: o  We can not optimize the Platfora Map Reduce jobs o  Defined Data Marts not available elsewhere Source Dataset Build / Update Lens Dashboards HDFS S3 Joined Datasets MapReduce jobs Join  on  lead_id  ,  inventory_id,     visitor_id,  dealer_id  …    
  • 17. Platfora Dealer Leads dataset Use Case o  Dealer Leads: Lead Submitter insights dataset o  More than 40 Visual Dashboards are using Dealer Leads Lead Submitter Data Region Info Transaction Data Vehicle Info Dealer Info Dealer Leads Joined Dataset Lead Categorization Lead Submitter Insights join
  • 18. “It’s about time!”  
 Optimizing Dealer Leads Dataset Dealer Leads Platfora Dataset stats: o  300+ attributes o  Usually takes 2-3 hours to build lens o  Scheduled to build daily
  • 19. “It’s about time!”  
 Optimizing Dealer Leads Dataset How do we optimize it?
  • 20. “It’s about time!”  
 Optimizing Dealer Leads Dataset How do we optimize it? 1. Have Spark SQL do the work! o  All required datasets are exposed as Spark SQL tables o  Add new useful attributes 2. Make the ETL easy for anyone in Business Analytics to do it themselves o  Provide utilities and UDF’s so that aggregated data can be exposed to Visualization tools
  • 21. Dealer Leads Using Spark SQL Demo Dealer Leads Data Mart using Spark SQL Demo Expose all original 300+ attributes Enhance: Join with site_traffic Dealer Leads Dataset Traffic Data Lead submitter journey Entry page, page views, device … aggregate_traffic_spark_sql
  • 22. Dealer Leads Using Spark SQL results o  Spark SQL aggregation in 10 minutes. o  Adds dimension attributes that were not available before o  Platfora does not need to join aggregates o  Significantly reduced latency o  Dashboard refreshed every 2 hours instead of once per day. Spark SQL Dealer Leads Lens 10 minutes 10 minutes Dashboards
  • 23. ETL and Visualization takeaway o  Now anyone in BA can perform and support ETL on their own o  New Data marts can be exported to RDBMS S3 New Data Marts Using Spark SQL Redshift Platfora Tableau Spark Cluster Spark SQL tables Last N days Tables Utilities Spark SQL connector ETL load
  • 24. Usual POC process o  Business Analyst Project Prototype in SQL o  Not scalable. Takes Ad Hoc resources from RDBMS o  SQL to Map Reduce o  Transition from two very different frameworks o  MR do not always fit complicated business logic. o  Supported only by Developers POC with Spark SQL vision
  • 25. “It’s about time!”  
 POC with Spark SQL vision new POC process using Spark o  A developer and BA can work together on the same platform and collaborate using Spark o  Its scalable o  No need to switch frameworks when productionalizing
  • 26. Ad Revenue Billing Use Case Definitions: Impression, CPM Line Item, Order Introduction OEM Advertising on website
  • 27. Introduction OEM Advertising on website Ad Revenue computed at the end of the month using OEM provided impression data Ad Revenue Billing Use Case
  • 28. Impressions served * CPM != actual revenue o  There are billing adjustment rules! o  Each OEM has a set of unique rules that determine the actual revenue. o  Adjusting revenue numbers requires manual user inputs from OEM’s Account Manager Ad Revenue End of Month billing
  • 29. •  Line Item groupings Line Item adjustments’ examples Box representing each example ORIGINAL  Line  Item    |  CPM  |  impressions  |  a?ributes
  • 30. •  Line Item groupings Line Item adjustments’ examples Box representing each example ORIGINAL  Line  Item    |  CPM  |  impressions  |  a?ributes SUPPORT  Line  Item  |  CPM  |  impressions
  • 31. •  Line Item groupings Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes Combine data
  • 32. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes impressions_served > Contract ? Line  Item    |  impressions|  Contract
  • 33. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes Line  Item    |  CAPPED_impressions|  Contract Cap impression! impressions_served > Contract ?
  • 34. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes impressions_served > (X% * Contract) ? Line  Item    |  impressions|  Contract
  • 35. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes impressions_served > (X% * Contract) ? Adjust impression! Line  Item    |  ADJUSTED_impressions|  Contract
  • 36. 1_day_impr  |  CPM  |  a?ributes 1d,  7d,  MTD,  QTD,  YTD:  adjusted_impr  |  CPM     Billing Engine Each Line Item Process Vision: Impressions served * CPM = actual revenue Can we automate ad revenue calculation?
  • 37. Automation Challenges o  Many rules , user defined inputs, the logic changes o  Need for scalable unified platform o  Need for tight collaboration between OEM team, Business Analysts and DWH developers Can we automate ad revenue calculation?
  • 38. Billing Rules Modeling Project How do we develop it?
  • 39. Ad Revenue Billing Rules Modeling Project How do we develop it? Spark + Spark SQL approach BA + Developers + OEM Account Team collaboration Goal is an Ad Performance Dashboard Adjusted Billing Ad RevenueOEM:
  • 40. Project Architecture in Spark Sum / Transform / Join Rows Processing separated in phases where input / outputs are Spark SQL tables 1_day_impr  |  CPM  |  a?ributes Spark SQL RowEach Line Item =
  • 41. Project Architecture in Spark Base Line Items Spark SQL Merged Line Items Spark SQL Adjusted Line Items Spark SQL Phase 1 Phase 2 Phase 3 Ad Performance Tableau Dashboard
  • 42. Spark SQL Project Architecture in Spark Business Analysts DFP / Other Impressions Line Item Dimensions Base Line Items Spark SQL join aggregate Phase 1 Merged Line Items Spark SQL Adjusted Line Items Spark SQL Phase 2 Phase 3 Ad Performance Tableau Dashboard
  • 43. Spark SQL Project Architecture in Spark Business Analysts BA + Developer DFP / Other Impressions Line Item Dimensions - Manual Groupings - Other Inputs Spark SQL Line Item Merging Engine OEM Account Managers Base Line Items Spark SQL Merged Line Items Spark SQL join aggregate Phase 1 Phase 2 Adjusted Line Items Spark SQL Phase 3 Ad Performance Tableau Dashboard
  • 44. Spark SQL Billing Rules Engine Project Architecture in Spark Business Analysts BA + Developer DFP / Other Impressions Line Item Dimensions - Manual Groupings - Other Inputs Spark SQL Line Item Merging Engine Ad Performance Tableau Business Analysts OEM Account Managers Base Line Items Spark SQL Merged Line Items Spark SQL Adjusted Line Items Spark SQL join aggregate Phase 1 Phase 2 Phase 3 Dashboard
  • 45. “It’s about time!”  
 Billing Rules Modeling Achievements o  Increased accuracy of revenue forecasts for BA o  Cost savings by not having a dedicated team doing manual adjustments o  Monitor ad delivery rate for orders o  Allows us to detect abnormalities in ad serving o  Collaboration between BA and DWH Developers