Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read
Jason Pohl, Data Solutions Engineer
Denny Lee, Technology Evangelist
About the speaker: Jason Pohl
Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
We are Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• Contributed 75% of the Spark code in 2014
• Created Databricks on top of Spark to make big data simple
Apache Spark Engine: Spark Core with Spark Streaming, Spark SQL, MLlib, and GraphX
• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries
Notable users that presented at Spark Summit 2015 San Francisco (Source: Slide 5 of Spark Community Update)
Traditional Data Warehousing Pain Points

Inelasticity of compute and storage resources
• Burst workloads require capacity planning for the maximum load
• A fixed-size DW forces compute and storage to scale linearly together (these are orthogonal problems)
• Expensive conundrum:
• If your DW is successful, you cannot easily expand
• If there is overcapacity, resources sit idle
Traditional Data Warehousing Pain Points

Rigid architecture that is difficult to change
• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built
• Rigidity means maintaining costly ETL pipelines
• Finite resources are spent continually augmenting pipelines to absorb new data
Traditional Data Warehousing Pain Points

Limited advanced analytics capabilities
• Users want more than what business intelligence and data warehousing provide
• More than just counts, aggregates, and trends
• They want forecasting with ML, segmentation, graph processing, etc.
Just-in-Time Data Warehousing

Scale resources on demand, with direct access to data sources
• Scale resources based on query load
• Separate compute and storage so either can scale independently
• Easily set up multiple clusters against the same data sources (a minimal sketch of direct access follows below)
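As a rough illustration of direct access, the sketch below reads Parquet data straight from object storage with PySpark and queries it with Spark SQL. The bucket path, table name, and column names are illustrative assumptions, not part of the webinar.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jit-dw-direct-access").getOrCreate()

# Read the shared data directly from object storage; any cluster pointed at
# the same path sees the same data (compute scales independently of storage).
sales = spark.read.parquet("s3a://example-bucket/warehouse/sales/")  # hypothetical path

# Expose it to SQL users without any pre-built warehouse schema.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT Product, SUM(Price) AS revenue FROM sales GROUP BY Product").show()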
Change Data Capture

What is it?
• A system that automatically captures changes in a source system (e.g. a transactional database) and applies those changes to a target system (e.g. a data warehouse).
• Important for data warehouses because it lets them record (and ultimately report on) any changes, e.g.:
• Customer A buys a pair of skis for $250 on 1/2/2015
• On 1/5/2015, we realize the purchase was actually $350, not $250
Change Data Capture
Source to Target

Source
ID   Date       Product   Price
101  1/1/2016   Skates    $80.00
102  1/2/2016   Skis      $250.00

Target (empty at first; after the initial load it mirrors the source)
ID   Date       Product   Price
101  1/1/2016   Skates    $80.00
102  1/2/2016   Skis      $250.00
Change Data Capture
Add new row

Source and Target (before: in sync)
ID   Date       Product   Price
101  1/1/2016   Skates    $80.00
102  1/2/2016   Skis      $250.00

Source and Target (after: row 103 is inserted in the source and captured in the target)
ID   Date       Product   Price
101  1/1/2016   Skates    $80.00
102  1/2/2016   Skis      $250.00
103  1/3/2016   Disc      $15.00
Change Data Capture
Update an existing row

Source and Target (before: in sync)
ID   Date       Product   Price
101  1/1/2016   Skates    $80.00
102  1/2/2016   Skis      $250.00
103  1/3/2016   Disc      $15.00

Source (after the price of row 102 is corrected to $350)
ID   Date       Product   Price
101  1/1/2016   Skates    $80.00
102  1/2/2016   Skis      $350.00
103  1/3/2016   Disc      $15.00
Change Data Capture
Update an existing row (tracked with a LastUpdated column)

Source (before)
ID   Date       Product   Price     LastUpdated
101  1/1/2016   Skates    $80.00    1/1/2016
102  1/2/2016   Skis      $250.00   1/2/2016
103  1/3/2016   Disc      $15.00    1/3/2016

Source (after the correction: row 102 becomes $350.00 with LastUpdated 1/5/2016)
ID   Date       Product   Price     LastUpdated
101  1/1/2016   Skates    $80.00    1/1/2016
102  1/2/2016   Skis      $350.00   1/5/2016
103  1/3/2016   Disc      $15.00    1/3/2016

Target (after CDC: the changed row is picked up via its newer LastUpdated value)
ID   Date       Product   Price     LastUpdated
101  1/1/2016   Skates    $80.00    1/1/2016
102  1/2/2016   Skis      $250.00   1/2/2016
103  1/3/2016   Disc      $15.00    1/3/2016
102  1/2/2016   Skis      $350.00   1/5/2016
Demo: High Watermark with LastUpdatedDate
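A minimal PySpark sketch of the high-watermark pattern shown in the demo, assuming source and target are already registered as tables named source_sales and target_sales (both names are placeholders): only rows whose LastUpdated is newer than the latest value already in the target are pulled and appended.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdc-high-watermark").getOrCreate()

# Current high watermark: the newest LastUpdated value already in the target.
watermark = spark.table("target_sales").agg(F.max("LastUpdated")).collect()[0][0]

# Pull only the rows that changed in the source since that watermark.
changes = spark.table("source_sales").filter(F.col("LastUpdated") > F.lit(watermark))

# Append the captured changes to the target (as in the tables above,
# the corrected Skis row arrives with its new LastUpdated value).
changes.write.mode("append").saveAsTable("target_sales")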
Stage Data from Employee Database
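The staging notebook itself is in the webinar attachments; as a hedged sketch of what this step typically looks like, the code below pulls the employees table from the source MySQL database over JDBC and saves a staged copy. The hostname, credentials, and output table name are placeholders, not the webinar's actual setup.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-employee-db").getOrCreate()

# Read the source table over JDBC; Spark infers the schema from the database.
employees = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://example-host:3306/employees")  # placeholder host
             .option("dbtable", "employees")
             .option("user", "etl_user")          # placeholder credentials
             .option("password", "etl_password")
             .load())

# Persist the staged copy so downstream CDC jobs can query it.
employees.write.mode("overwrite").saveAsTable("staged_employees")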
Update Records in Employee Source Database
UPDATE employees
SET last_name = 'Spark'
WHERE emp_no = 16894
Job to Automate CDC

A scheduled job moves changes from Source to Target; when the source gains a new Tag column, the job picks it up as well.

Source (now including a Tag column)
ID   Date       Product   Tag     Price     LastUpdated
101  1/1/2016   Skates    ice     $80.00    1/1/2016
102  1/2/2016   Skis      snow    $250.00   1/2/2016
103  1/3/2016   Disc      field   $15.00    1/3/2016

Target (before the job runs, without Tag)
ID   Date       Product   Price     LastUpdated
101  1/1/2016   Skates    $80.00    1/1/2016
102  1/2/2016   Skis      $250.00   1/2/2016
103  1/3/2016   Disc      $15.00    1/3/2016

Target (after the job runs, with Tag)
ID   Date       Product   Tag     Price     LastUpdated
101  1/1/2016   Skates    ice     $80.00    1/1/2016
102  1/2/2016   Skis      snow    $250.00   1/2/2016
103  1/3/2016   Disc      field   $15.00    1/3/2016
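A sketch of the load half of such a job, assuming the changed rows have already been staged into a table called staged_changes (all table names here are placeholders, not the webinar's notebooks, and both tables are assumed to share the same columns): target rows whose ID appears in the change set are replaced by the fresh versions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-merge-job").getOrCreate()

changes = spark.table("staged_changes")   # placeholder: rows changed since the last run
target  = spark.table("target_sales")     # placeholder: current target table

# Keep only the target rows whose ID was not changed, then add the new versions.
merged = (target.join(changes.select("ID"), on="ID", how="left_anti")
                .unionByName(changes))

# Write to a new table (or swap it in) rather than overwriting a table being read.
merged.write.mode("overwrite").saveAsTable("target_sales_updated")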
Add a column to the Departments table
ALTER TABLE departments
ADD COLUMN dept_desc VARCHAR(50);

UPDATE departments
SET dept_desc = dept_name;
Job to Automate CDC

Source (departments): dept_no, dept_name, dept_desc
Jobs propagate the schema change to the target.
Target (departments), before: dept_no, dept_name
Target (departments), after: dept_no, dept_name, dept_desc
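Because the pipeline is schema-on-read, the new dept_desc column can show up in the target without rewriting the job. Below is a hedged sketch of the idea (connection details, table names, and the staging path are placeholders): the job re-infers the source schema on every run, and a mergeSchema read reconciles older staged Parquet files that lack the column with newer ones that have it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-departments").getOrCreate()

# Re-reading the source infers its current schema, so dept_desc appears
# automatically after the ALTER TABLE above.
departments = (spark.read.format("jdbc")
               .option("url", "jdbc:mysql://example-host:3306/employees")  # placeholder host
               .option("dbtable", "departments")
               .option("user", "etl_user")
               .option("password", "etl_password")
               .load())
departments.printSchema()  # dept_no, dept_name, and now dept_desc

# For staged Parquet written before and after the schema change, mergeSchema
# combines the old and new column sets at read time.
staged = spark.read.option("mergeSchema", "true").parquet("/mnt/staging/departments/")  # placeholder path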
Notebooks
To access the notebooks, please reference the attachments in the Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read webinar.
• Stage Data From Employee Database:
• Notebook that starts the process
• Defines the ETL process
• Change Schema in Employee Source Database
• Update Records in Employee Source Database
• Validate Departments
Resources
• Just-in-Time Data Warehousing Solution Brief
• Building a Turbo-fast Data Warehousing Platform with Databricks
• Spark DataFrames: Simple and Fast Analysis of Structured Data
• Transitioning from Traditional DW to Spark in OR Predictive Modeling
• Advertising Technology Sample Notebook (Part 1)
More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: free hosted Apache Spark. Join the waitlist for the beta release!
Thanks!