SlideShare ist ein Scribd-Unternehmen logo
1 von 47
Downloaden Sie, um offline zu lesen
Building a Data Warehouse for
Business Analytics using Spark SQL
Copyright Edmunds.com, Inc. (the “Company”). Edmunds and the Edmunds.com logo are registered
trademarks of the Company. This document contains proprietary and/or confidential information of the
Company. No part of this document or the information it contains may be used, or disclosed to any
person or entity, for any purpose other than advancing the best interests of the Company, and any
such disclosure requires the express approval of the Company. 
Blagoy Kaloferov
Software Engineer
06/15/2015
About me:
Blagoy Kaloferov
Big Data Software Engineer
About my company:
Edmunds.com is a car buying platform
18M+ unique visitors each month
Today’s talk
“It’s about time!”
 


Agenda
1.  Introduction and Architecture
2.  Building Next Gen DWH
3.  Automating Ad Revenue using Spark SQL
4.  Conclusion
Business Analytics at
Edmunds.com
Divisions:
DWH Engineers
Business Analysts
Statistics team
Two major groups:
Map Reduce / Spark developers
Analysts with advanced SQL skills
Spark SQL :
•  Simplified ETL and enhanced Visualization tools
•  Allows anyone in BA to quickly build new Data marts
•  Enabled a scalable POC to Production process for our
projects
Proposition
Data Ingestions / ETL
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
DWH Developers Business Analyst
Hadoop Cluster
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Spark
Spark SQL
Databricks
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Spark
Spark SQL
Data Ingestions / ETL
Reporting
Ad Hoc / Dashboards
Architecture for Analytics
Raw Data
Clickstream
Inventory
Dealer
Lead
Transaction
HDFS
aggregates
Map Reduce
jobs
Database
Redshift
Business
Intelligence
Tools
Platfora
Tableau
DWH Developers Business Analysts
Hadoop Cluster
Spark
Spark SQL
ETL
Our approach:
o  Spark SQL tables similar to
existing our Redshift tables
o  Best fit for us are Hive tables
pointing to S3 delimited data
o  Exposed hundreds of Spark SQL
tables
Exposing S3 data via
Spark SQL
•  S3 Datasets have thousands of directories:
( location	
  /year/month/day/hour	
  )
•  Every new S3 directory for each dataset has to be registered
Adding Latest Table Partitions
Adding Latest Table Partitions
Spark SQL tables
S3 dataset
S3 dataset
S3 dataset
S3 dataset
S3 dataset
S3 dataset
S3 dataset2015/05/31/01
2015/05/31/02
Spark SQL table
partitions
Utilities to for Spark SQL tables and S3:
•  Register valid partitions of Spark SQL tables with spark Hive Metastore
•  Create Last_X_ Days copy of any Spark SQL table in memory
Scheduled jobs:
•  Registers latest available directories for all Spark SQL tables
programmatically
•  Updates Last_3_ Days of core datasets in memory
Adding Latest Table PartitionsSpark SQL tables
“It’s about time!”
 


S3 and Spark SQL potential
o  Now that all S3 data is easily
accessible, there are a lot of
opportunities !
o  Anyone can ETL on prefixed
aggregates and create new Data Marts
Spark Cluster
Spark SQL tables
Last_3_days Tables
Utilities and UDF’s
Business
Intelligence
Tools
Faster Pipeline
Better Insights
Platfora Dashboards Pipeline
Optimization
o  Platfora is a Visualization Analytics Tool
o  Provides More than 200 dashboards for BA
o  Uses MapReduce to load aggregates
Source Dataset Build / Update Lens Dashboards
HDFS
S3
Joined
Datasets
MapReduce
jobs
Platfora Dashboards Pipeline
Optimization
Limitations:
o  We can not optimize the Platfora Map Reduce jobs
o  Defined Data Marts not available elsewhere
Source Dataset Build / Update Lens Dashboards
HDFS
S3
Joined
Datasets
MapReduce
jobs
Join	
  on	
  lead_id	
  ,	
  inventory_id,	
  	
  
visitor_id,	
  dealer_id	
  …	
  
	
  
Platfora Dealer Leads dataset
Use Case
o  Dealer Leads: Lead Submitter insights dataset
o  More than 40 Visual Dashboards are using Dealer Leads
Lead
Submitter Data
Region
Info
Transaction
Data
Vehicle
Info
Dealer
Info
Dealer Leads
Joined Dataset
Lead Categorization
Lead
Submitter
Insights
join
“It’s about time!”
 


Optimizing Dealer Leads Dataset
Dealer Leads Platfora Dataset stats:
o  300+ attributes
o  Usually takes 2-3 hours to build lens
o  Scheduled to build daily
“It’s about time!”
 


Optimizing Dealer Leads Dataset
How do we optimize it?
“It’s about time!”
 


Optimizing Dealer Leads Dataset
How do we optimize it?
1. Have Spark SQL do the work!
o  All required datasets are exposed as Spark SQL tables
o  Add new useful attributes
2. Make the ETL easy for anyone in
Business Analytics to do it themselves
o  Provide utilities and UDF’s so that aggregated data can
be exposed to Visualization tools
Dealer Leads Using Spark SQL
Demo
Dealer Leads Data Mart using Spark SQL
Demo
Expose all original 300+ attributes
Enhance: Join with site_traffic
Dealer Leads
Dataset
Traffic Data
Lead submitter journey
Entry page, page views, device …
aggregate_traffic_spark_sql
Dealer Leads Using Spark SQL
results
o  Spark SQL aggregation in 10 minutes.
o  Adds dimension attributes that were not available before
o  Platfora does not need to join aggregates
o  Significantly reduced latency
o  Dashboard refreshed every 2 hours instead of once per day.
Spark SQL
Dealer Leads
Lens
10 minutes 10 minutes
Dashboards
ETL and Visualization
takeaway
o  Now anyone in BA can perform and support ETL on their own
o  New Data marts can be exported to RDBMS
S3
New Data Marts
Using Spark SQL Redshift
Platfora
Tableau
Spark Cluster
Spark SQL tables
Last N days Tables
Utilities
Spark SQL
connector
ETL
load
Usual POC process
o  Business Analyst Project Prototype in SQL
o  Not scalable. Takes Ad Hoc resources from RDBMS
o  SQL to Map Reduce
o  Transition from two very different frameworks
o  MR do not always fit complicated business logic.
o  Supported only by Developers
POC with Spark SQL vision
“It’s about time!”
 


POC with Spark SQL vision
new POC process using Spark
o  A developer and BA can work together on the same
platform and collaborate using Spark
o  Its scalable
o  No need to switch frameworks when productionalizing
Ad Revenue Billing Use Case
Definitions:
Impression, CPM
Line Item, Order
Introduction
OEM Advertising on website
Introduction
OEM Advertising on website
Ad Revenue computed at the end of the month
using OEM provided impression data
Ad Revenue Billing Use Case
Impressions served * CPM != actual revenue
o  There are billing adjustment rules!
o  Each OEM has a set of unique rules that
determine the actual revenue.
o  Adjusting revenue numbers requires manual
user inputs from OEM’s Account Manager
Ad Revenue End of Month billing
•  Line Item groupings
Line Item adjustments’ examples
Box representing each example
ORIGINAL	
  Line	
  Item	
  	
  |	
  CPM	
  |	
  impressions	
  |	
  a?ributes
•  Line Item groupings
Line Item adjustments’ examples
Box representing each example
ORIGINAL	
  Line	
  Item	
  	
  |	
  CPM	
  |	
  impressions	
  |	
  a?ributes
SUPPORT	
  Line	
  Item	
  |	
  CPM	
  |	
  impressions
•  Line Item groupings
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
Combine data
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
impressions_served > Contract ?
Line	
  Item	
  	
  |	
  impressions|	
  Contract
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
Line	
  Item	
  	
  |	
  CAPPED_impressions|	
  Contract
Cap impression!
impressions_served > Contract ?
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
impressions_served > (X% * Contract) ?
Line	
  Item	
  	
  |	
  impressions|	
  Contract
•  Line Item groupings
•  Capping / Adjustments
Line Item adjustments’ examples
Box representing each example
MERGED	
  |	
  CPM	
  |	
  NEW_impressions	
  |	
  a?ributes
impressions_served > (X% * Contract) ?
Adjust impression!
Line	
  Item	
  	
  |	
  ADJUSTED_impressions|	
  Contract
1_day_impr	
  |	
  CPM	
  |	
  a?ributes
1d,	
  7d,	
  MTD,	
  QTD,	
  YTD:	
  adjusted_impr	
  |	
  CPM	
  	
  
Billing Engine
Each Line Item
Process Vision:
Impressions served * CPM = actual revenue
Can we automate ad revenue calculation?
Automation Challenges
o  Many rules , user defined inputs, the logic changes
o  Need for scalable unified platform
o  Need for tight collaboration between OEM team,
Business Analysts and DWH developers
Can we automate ad revenue calculation?
Billing Rules Modeling Project
How do we develop it?
Ad Revenue
Billing Rules Modeling Project
How do we develop it?
Spark + Spark SQL approach
BA + Developers + OEM Account Team collaboration
Goal is an Ad Performance Dashboard
Adjusted Billing Ad RevenueOEM:
Project Architecture in Spark
Sum / Transform / Join Rows
Processing separated in phases where input / outputs are
Spark SQL tables
1_day_impr	
  |	
  CPM	
  |	
  a?ributes Spark SQL RowEach Line Item =
Project Architecture in Spark
Base Line Items
Spark SQL
Merged Line Items
Spark SQL
Adjusted Line Items
Spark SQL
Phase 1 Phase 2 Phase 3
Ad Performance
Tableau
Dashboard
Spark SQL
Project Architecture in Spark
Business Analysts
DFP / Other
Impressions
Line Item
Dimensions
Base Line Items
Spark SQL
join
aggregate
Phase 1
Merged Line Items
Spark SQL
Adjusted Line Items
Spark SQL
Phase 2 Phase 3
Ad Performance
Tableau
Dashboard
Spark SQL
Project Architecture in Spark
Business Analysts BA + Developer
DFP / Other
Impressions
Line Item
Dimensions
- Manual Groupings
- Other Inputs
Spark SQL
Line Item
Merging
Engine
OEM Account
Managers
Base Line Items
Spark SQL
Merged Line Items
Spark SQL
join
aggregate
Phase 1 Phase 2
Adjusted Line Items
Spark SQL
Phase 3
Ad Performance
Tableau
Dashboard
Spark SQL
Billing Rules
Engine
Project Architecture in Spark
Business Analysts BA + Developer
DFP / Other
Impressions
Line Item
Dimensions
- Manual Groupings
- Other Inputs
Spark SQL
Line Item
Merging
Engine
Ad Performance
Tableau
Business Analysts
OEM Account
Managers
Base Line Items
Spark SQL
Merged Line Items
Spark SQL
Adjusted Line Items
Spark SQL
join
aggregate
Phase 1 Phase 2 Phase 3 Dashboard
“It’s about time!”
 


Billing Rules Modeling
Achievements
o  Increased accuracy of revenue forecasts for BA
o  Cost savings by not having a dedicated team doing
manual adjustments
o  Monitor ad delivery rate for orders
o  Allows us to detect abnormalities in ad serving
o  Collaboration between BA and DWH Developers
Thank you!
Blagoy Kaloferov
bkaloferov@edmunds.com
Questions?
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

Weitere ähnliche Inhalte

Was ist angesagt?

Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018Amazon Web Services Korea
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation Brett VanderPlaats
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Emerging Trends in Data Engineering
Emerging Trends in Data EngineeringEmerging Trends in Data Engineering
Emerging Trends in Data EngineeringAnanth PackkilDurai
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...Amazon Web Services Korea
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기
[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기
[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기NAVER D2
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesEric Kavanagh
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDATAVERSITY
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveYingjun Wu
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Dr. Arif Wider
 
Stream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMRStream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMRCicero Joasyo Mateus de Moura
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?Juhong Park
 

Was ist angesagt? (20)

Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Emerging Trends in Data Engineering
Emerging Trends in Data EngineeringEmerging Trends in Data Engineering
Emerging Trends in Data Engineering
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기
[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기
[225]빅데이터를 위한 분산 딥러닝 플랫폼 만들기
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Stream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMRStream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMR
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
 

Ähnlich wie Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j
 
Racing for the Flexibility Integrating Aras into the IT Landscape
Racing for the Flexibility Integrating Aras into the IT LandscapeRacing for the Flexibility Integrating Aras into the IT Landscape
Racing for the Flexibility Integrating Aras into the IT LandscapeAras
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Ebay OLAP Cube
Ebay OLAP CubeEbay OLAP Cube
Ebay OLAP Cubebfowles
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Neo4j
 
SharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- TrimantraSharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- TrimantraMihir G.
 
Make xCommerce fit to you
Make xCommerce fit to youMake xCommerce fit to you
Make xCommerce fit to youRobert Debowski
 
Designing a Future-proof API Program
Designing a Future-proof API ProgramDesigning a Future-proof API Program
Designing a Future-proof API ProgramPronovix
 
QuantiQ TEQ Day : Dynamics NAV
QuantiQ TEQ Day : Dynamics NAVQuantiQ TEQ Day : Dynamics NAV
QuantiQ TEQ Day : Dynamics NAVQuantiQ Technology
 
Optimizing Innovation- Modular Toolchains that Enable Digital Transformations
Optimizing Innovation-  Modular Toolchains that Enable Digital TransformationsOptimizing Innovation-  Modular Toolchains that Enable Digital Transformations
Optimizing Innovation- Modular Toolchains that Enable Digital TransformationsTasktop
 
Optimizing Innovation: Modular Toolchains that Enable Digital Transformations
Optimizing Innovation: Modular Toolchains that Enable Digital TransformationsOptimizing Innovation: Modular Toolchains that Enable Digital Transformations
Optimizing Innovation: Modular Toolchains that Enable Digital TransformationsDevOps.com
 
Introducing Sitecore - The Experience Platform
Introducing Sitecore - The Experience PlatformIntroducing Sitecore - The Experience Platform
Introducing Sitecore - The Experience PlatformAdrian IORGU
 
APIfying an ERP - ongoing saga
APIfying an ERP - ongoing sagaAPIfying an ERP - ongoing saga
APIfying an ERP - ongoing sagaMarjukka Niinioja
 

Ähnlich wie Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds) (20)

Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Racing for the Flexibility Integrating Aras into the IT Landscape
Racing for the Flexibility Integrating Aras into the IT LandscapeRacing for the Flexibility Integrating Aras into the IT Landscape
Racing for the Flexibility Integrating Aras into the IT Landscape
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Ebay OLAP Cube
Ebay OLAP CubeEbay OLAP Cube
Ebay OLAP Cube
 
RESUME_RASMI3.6
RESUME_RASMI3.6RESUME_RASMI3.6
RESUME_RASMI3.6
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
SaurabhKasyap
SaurabhKasyapSaurabhKasyap
SaurabhKasyap
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
SharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- TrimantraSharePoint Project Potfolio- Trimantra
SharePoint Project Potfolio- Trimantra
 
Rohit Resume
Rohit ResumeRohit Resume
Rohit Resume
 
Make xCommerce fit to you
Make xCommerce fit to youMake xCommerce fit to you
Make xCommerce fit to you
 
Designing a Future-proof API Program
Designing a Future-proof API ProgramDesigning a Future-proof API Program
Designing a Future-proof API Program
 
Oracle Analytics Cloud
Oracle Analytics CloudOracle Analytics Cloud
Oracle Analytics Cloud
 
QuantiQ TEQ Day : Dynamics NAV
QuantiQ TEQ Day : Dynamics NAVQuantiQ TEQ Day : Dynamics NAV
QuantiQ TEQ Day : Dynamics NAV
 
ConnectWise Reporting Tools Overview
ConnectWise Reporting Tools OverviewConnectWise Reporting Tools Overview
ConnectWise Reporting Tools Overview
 
Optimizing Innovation- Modular Toolchains that Enable Digital Transformations
Optimizing Innovation-  Modular Toolchains that Enable Digital TransformationsOptimizing Innovation-  Modular Toolchains that Enable Digital Transformations
Optimizing Innovation- Modular Toolchains that Enable Digital Transformations
 
Optimizing Innovation: Modular Toolchains that Enable Digital Transformations
Optimizing Innovation: Modular Toolchains that Enable Digital TransformationsOptimizing Innovation: Modular Toolchains that Enable Digital Transformations
Optimizing Innovation: Modular Toolchains that Enable Digital Transformations
 
Introducing Sitecore - The Experience Platform
Introducing Sitecore - The Experience PlatformIntroducing Sitecore - The Experience Platform
Introducing Sitecore - The Experience Platform
 
APIfying an ERP - ongoing saga
APIfying an ERP - ongoing sagaAPIfying an ERP - ongoing saga
APIfying an ERP - ongoing saga
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 

Kürzlich hochgeladen (17)

The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 

Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

  • 1. Building a Data Warehouse for Business Analytics using Spark SQL Copyright Edmunds.com, Inc. (the “Company”). Edmunds and the Edmunds.com logo are registered trademarks of the Company. This document contains proprietary and/or confidential information of the Company. No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Company, and any such disclosure requires the express approval of the Company. Blagoy Kaloferov Software Engineer 06/15/2015
  • 2. About me: Blagoy Kaloferov Big Data Software Engineer About my company: Edmunds.com is a car buying platform 18M+ unique visitors each month Today’s talk
  • 3. “It’s about time!”  
 Agenda 1.  Introduction and Architecture 2.  Building Next Gen DWH 3.  Automating Ad Revenue using Spark SQL 4.  Conclusion
  • 4. Business Analytics at Edmunds.com Divisions: DWH Engineers Business Analysts Statistics team Two major groups: Map Reduce / Spark developers Analysts with advanced SQL skills
  • 5. Spark SQL : •  Simplified ETL and enhanced Visualization tools •  Allows anyone in BA to quickly build new Data marts •  Enabled a scalable POC to Production process for our projects Proposition
  • 6. Data Ingestions / ETL Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs DWH Developers Business Analyst Hadoop Cluster
  • 7. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster
  • 8. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster Spark Spark SQL Databricks
  • 9. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster Spark Spark SQL
  • 10. Data Ingestions / ETL Reporting Ad Hoc / Dashboards Architecture for Analytics Raw Data Clickstream Inventory Dealer Lead Transaction HDFS aggregates Map Reduce jobs Database Redshift Business Intelligence Tools Platfora Tableau DWH Developers Business Analysts Hadoop Cluster Spark Spark SQL ETL
  • 11. Our approach: o  Spark SQL tables similar to existing our Redshift tables o  Best fit for us are Hive tables pointing to S3 delimited data o  Exposed hundreds of Spark SQL tables Exposing S3 data via Spark SQL
  • 12. •  S3 Datasets have thousands of directories: ( location  /year/month/day/hour  ) •  Every new S3 directory for each dataset has to be registered Adding Latest Table Partitions Adding Latest Table Partitions Spark SQL tables S3 dataset S3 dataset S3 dataset S3 dataset S3 dataset S3 dataset S3 dataset2015/05/31/01 2015/05/31/02 Spark SQL table partitions
  • 13. Utilities to for Spark SQL tables and S3: •  Register valid partitions of Spark SQL tables with spark Hive Metastore •  Create Last_X_ Days copy of any Spark SQL table in memory Scheduled jobs: •  Registers latest available directories for all Spark SQL tables programmatically •  Updates Last_3_ Days of core datasets in memory Adding Latest Table PartitionsSpark SQL tables
  • 14. “It’s about time!”  
 S3 and Spark SQL potential o  Now that all S3 data is easily accessible, there are a lot of opportunities ! o  Anyone can ETL on prefixed aggregates and create new Data Marts Spark Cluster Spark SQL tables Last_3_days Tables Utilities and UDF’s Business Intelligence Tools Faster Pipeline Better Insights
  • 15. Platfora Dashboards Pipeline Optimization o  Platfora is a Visualization Analytics Tool o  Provides More than 200 dashboards for BA o  Uses MapReduce to load aggregates Source Dataset Build / Update Lens Dashboards HDFS S3 Joined Datasets MapReduce jobs
  • 16. Platfora Dashboards Pipeline Optimization Limitations: o  We can not optimize the Platfora Map Reduce jobs o  Defined Data Marts not available elsewhere Source Dataset Build / Update Lens Dashboards HDFS S3 Joined Datasets MapReduce jobs Join  on  lead_id  ,  inventory_id,     visitor_id,  dealer_id  …    
  • 17. Platfora Dealer Leads dataset Use Case o  Dealer Leads: Lead Submitter insights dataset o  More than 40 Visual Dashboards are using Dealer Leads Lead Submitter Data Region Info Transaction Data Vehicle Info Dealer Info Dealer Leads Joined Dataset Lead Categorization Lead Submitter Insights join
  • 18. “It’s about time!”  
 Optimizing Dealer Leads Dataset Dealer Leads Platfora Dataset stats: o  300+ attributes o  Usually takes 2-3 hours to build lens o  Scheduled to build daily
  • 19. “It’s about time!”  
 Optimizing Dealer Leads Dataset How do we optimize it?
  • 20. “It’s about time!”  
 Optimizing Dealer Leads Dataset How do we optimize it? 1. Have Spark SQL do the work! o  All required datasets are exposed as Spark SQL tables o  Add new useful attributes 2. Make the ETL easy for anyone in Business Analytics to do it themselves o  Provide utilities and UDF’s so that aggregated data can be exposed to Visualization tools
  • 21. Dealer Leads Using Spark SQL Demo Dealer Leads Data Mart using Spark SQL Demo Expose all original 300+ attributes Enhance: Join with site_traffic Dealer Leads Dataset Traffic Data Lead submitter journey Entry page, page views, device … aggregate_traffic_spark_sql
  • 22. Dealer Leads Using Spark SQL results o  Spark SQL aggregation in 10 minutes. o  Adds dimension attributes that were not available before o  Platfora does not need to join aggregates o  Significantly reduced latency o  Dashboard refreshed every 2 hours instead of once per day. Spark SQL Dealer Leads Lens 10 minutes 10 minutes Dashboards
  • 23. ETL and Visualization takeaway o  Now anyone in BA can perform and support ETL on their own o  New Data marts can be exported to RDBMS S3 New Data Marts Using Spark SQL Redshift Platfora Tableau Spark Cluster Spark SQL tables Last N days Tables Utilities Spark SQL connector ETL load
  • 24. Usual POC process o  Business Analyst Project Prototype in SQL o  Not scalable. Takes Ad Hoc resources from RDBMS o  SQL to Map Reduce o  Transition from two very different frameworks o  MR do not always fit complicated business logic. o  Supported only by Developers POC with Spark SQL vision
  • 25. “It’s about time!”  
 POC with Spark SQL vision new POC process using Spark o  A developer and BA can work together on the same platform and collaborate using Spark o  Its scalable o  No need to switch frameworks when productionalizing
  • 26. Ad Revenue Billing Use Case Definitions: Impression, CPM Line Item, Order Introduction OEM Advertising on website
  • 27. Introduction OEM Advertising on website Ad Revenue computed at the end of the month using OEM provided impression data Ad Revenue Billing Use Case
  • 28. Impressions served * CPM != actual revenue o  There are billing adjustment rules! o  Each OEM has a set of unique rules that determine the actual revenue. o  Adjusting revenue numbers requires manual user inputs from OEM’s Account Manager Ad Revenue End of Month billing
  • 29. •  Line Item groupings Line Item adjustments’ examples Box representing each example ORIGINAL  Line  Item    |  CPM  |  impressions  |  a?ributes
  • 30. •  Line Item groupings Line Item adjustments’ examples Box representing each example ORIGINAL  Line  Item    |  CPM  |  impressions  |  a?ributes SUPPORT  Line  Item  |  CPM  |  impressions
  • 31. •  Line Item groupings Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes Combine data
  • 32. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes impressions_served > Contract ? Line  Item    |  impressions|  Contract
  • 33. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes Line  Item    |  CAPPED_impressions|  Contract Cap impression! impressions_served > Contract ?
  • 34. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes impressions_served > (X% * Contract) ? Line  Item    |  impressions|  Contract
  • 35. •  Line Item groupings •  Capping / Adjustments Line Item adjustments’ examples Box representing each example MERGED  |  CPM  |  NEW_impressions  |  a?ributes impressions_served > (X% * Contract) ? Adjust impression! Line  Item    |  ADJUSTED_impressions|  Contract
  • 36. 1_day_impr  |  CPM  |  a?ributes 1d,  7d,  MTD,  QTD,  YTD:  adjusted_impr  |  CPM     Billing Engine Each Line Item Process Vision: Impressions served * CPM = actual revenue Can we automate ad revenue calculation?
  • 37. Automation Challenges o  Many rules , user defined inputs, the logic changes o  Need for scalable unified platform o  Need for tight collaboration between OEM team, Business Analysts and DWH developers Can we automate ad revenue calculation?
  • 38. Billing Rules Modeling Project How do we develop it?
  • 39. Ad Revenue Billing Rules Modeling Project How do we develop it? Spark + Spark SQL approach BA + Developers + OEM Account Team collaboration Goal is an Ad Performance Dashboard Adjusted Billing Ad RevenueOEM:
  • 40. Project Architecture in Spark Sum / Transform / Join Rows Processing separated in phases where input / outputs are Spark SQL tables 1_day_impr  |  CPM  |  a?ributes Spark SQL RowEach Line Item =
  • 41. Project Architecture in Spark Base Line Items Spark SQL Merged Line Items Spark SQL Adjusted Line Items Spark SQL Phase 1 Phase 2 Phase 3 Ad Performance Tableau Dashboard
  • 42. Spark SQL Project Architecture in Spark Business Analysts DFP / Other Impressions Line Item Dimensions Base Line Items Spark SQL join aggregate Phase 1 Merged Line Items Spark SQL Adjusted Line Items Spark SQL Phase 2 Phase 3 Ad Performance Tableau Dashboard
  • 43. Spark SQL Project Architecture in Spark Business Analysts BA + Developer DFP / Other Impressions Line Item Dimensions - Manual Groupings - Other Inputs Spark SQL Line Item Merging Engine OEM Account Managers Base Line Items Spark SQL Merged Line Items Spark SQL join aggregate Phase 1 Phase 2 Adjusted Line Items Spark SQL Phase 3 Ad Performance Tableau Dashboard
  • 44. Spark SQL Billing Rules Engine Project Architecture in Spark Business Analysts BA + Developer DFP / Other Impressions Line Item Dimensions - Manual Groupings - Other Inputs Spark SQL Line Item Merging Engine Ad Performance Tableau Business Analysts OEM Account Managers Base Line Items Spark SQL Merged Line Items Spark SQL Adjusted Line Items Spark SQL join aggregate Phase 1 Phase 2 Phase 3 Dashboard
  • 45. “It’s about time!”  
 Billing Rules Modeling Achievements o  Increased accuracy of revenue forecasts for BA o  Cost savings by not having a dedicated team doing manual adjustments o  Monitor ad delivery rate for orders o  Allows us to detect abnormalities in ad serving o  Collaboration between BA and DWH Developers