Modern ETL Pipelines with Change Data Capture
Thiago Rigo and David Mariassy, GetYourGuide
#UnifiedDataAnalytics #SparkAISummit
Who are we?
● 5 years of experience in Business Intelligence and Data Engineering roles from the Berlin e-commerce scene. Data Engineer, Data Platform.
● Software engineer for the past 7 years, last 3 focused on data engineering. Senior Data Engineer, Data Platform.
Agenda
1 Intro to GetYourGuide
2 GYG’s Legacy ETL Pipeline
3 Rivulus ETL Pipeline
4 Conclusion
5 Questions
Intro to GetYourGuide
We make it simple to book and enjoy incredible experiences
Europe’s largest marketplace for travel experiences
● 50k+ products in 150+ countries
● 25M+ tickets sold
● $650M+ in VC funding
● 600+ strong global team
● 150+ traveler nationalities
GYG’s Legacy ETL Pipeline
Where we started
● Breaking schema changes upstream
● Requires special knowledge
● Long recovery times
● Difficult to test
● Bad SLAs
What we wanted
● Automatic handling of schema changes
● Familiar tooling (Scala, SQL)
● Maximum parallelism
● Built for testability
● Better SLAs
Rivulus ETL Pipeline
Overview
Extraction Layer
The pipeline
Debezium
● Open source distributed platform for change data capture
● Can read several databases
○ MySQL, Postgres, Cassandra, Oracle, SQL Server, and MongoDB
● Runs as a connector within Kafka Connect
● Streams the changes from the database's event log into Kafka
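To make the shape of the extracted data concrete, the sketch below flattens Debezium-style change events into plain rows. It assumes the standard Debezium envelope fields (`before`, `after`, `op`, `ts_ms`). The function name `flatten_event`, the bookkeeping columns `_deleted` and `_ts_ms`, and the sample customer rows are hypothetical and do not come from the talk.

```python
from typing import Any, Dict

# Debezium operation codes: c = create, u = update, d = delete, r = snapshot read.
UPSERT_OPS = {"c", "u", "r"}


def flatten_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one change-event payload into a flat row with bookkeeping columns."""
    op = event["op"]
    if op in UPSERT_OPS:
        row = dict(event["after"])   # row state after the change
        row["_deleted"] = False
    elif op == "d":
        row = dict(event["before"])  # a delete only carries the old row state
        row["_deleted"] = True
    else:
        raise ValueError(f"unexpected op code: {op!r}")
    row["_ts_ms"] = event["ts_ms"]   # timestamp carried by the envelope (ms since epoch)
    return row


# Hypothetical events for a single customer row.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ana"}, "ts_ms": 1000},
    {"op": "u", "before": {"id": 1, "name": "Ana"}, "after": {"id": 1, "name": "Anna"}, "ts_ms": 2000},
    {"op": "d", "before": {"id": 1, "name": "Anna"}, "after": None, "ts_ms": 3000},
]
rows = [flatten_event(e) for e in events]
```

Keeping deletes as flagged rows rather than dropping them is one possible design choice; it lets a later compaction step decide what the replica should contain.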
Schema Service
● Scala library
● Keeps track of all schema changes applied to the tables
● Holds PK, timestamp and partition columns
● Prevents breaking changes from being introduced
○ Type changes
● Upcast types
● Schema Service works on column level
Goal: Automatic handling of schema changes
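The talk does not show how the Schema Service decides whether a type change is safe. As a rough illustration of column-level upcasting, the sketch below accepts numeric widening and rejects everything else. The widening order, the exception name and all function names are assumptions made for this example; the real service is a Scala library whose rules are not described in the slides.

```python
from typing import Dict

# Assumed widening order for numeric types; a column may only move to the right.
NUMERIC_WIDENING = ["int", "long", "float", "double"]


class BreakingSchemaChange(Exception):
    """Raised when a column's type change cannot be handled by upcasting."""


def reconcile_type(old: str, new: str) -> str:
    """Return the type to keep for a column, upcasting when the change is a widening."""
    if old == new:
        return old
    if old in NUMERIC_WIDENING and new in NUMERIC_WIDENING:
        # Keep whichever type comes later in the widening order.
        return max(old, new, key=NUMERIC_WIDENING.index)
    raise BreakingSchemaChange(f"{old} -> {new}")


def reconcile_schema(stored: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, str]:
    """Merge an incoming table schema into the stored one, column by column."""
    merged = dict(stored)
    for column, new_type in incoming.items():
        if column in merged:
            merged[column] = reconcile_type(merged[column], new_type)
        else:
            merged[column] = new_type  # treat a new column as an additive change
    return merged


merged = reconcile_schema(
    {"id": "int", "score": "int"},
    {"id": "long", "score": "int", "note": "string"},
)
```

Working per column, as the slide says the service does, means one incompatible column can be reported precisely without rejecting the whole table schema.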
Avro Converter
● Regular Scala application
● Runs as part of Airflow DAG
● Reads raw Avro files from S3
● Communicates with Schema Service to handle schema changes automatically
● Writes out Parquet files
Goal: Automatic handling of schema changes
Upsert
● Spark application
● Runs as part of Airflow DAG
● Reads in new Parquet files
● Communicates with Schema Service to get PK, timestamp and partition columns
● Compacts the data based on table’s PK
● Creates Hive table which contains a replica of source DB
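The compaction step can be pictured as "keep the newest version of each primary key". The real job is a Spark application; the plain-Python sketch below only illustrates that logic on in-memory rows. The function name `compact`, the column names and the sample data are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def compact(rows: Iterable[Dict[str, Any]], pk: Tuple[str, ...], ts_col: str) -> List[Dict[str, Any]]:
    """Keep only the most recent version of each row, identified by its PK columns."""
    latest: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[c] for c in pk)
        current = latest.get(key)
        # A later timestamp wins; on a tie the row seen last wins.
        if current is None or row[ts_col] >= current[ts_col]:
            latest[key] = row
    return list(latest.values())


# Existing replica plus one newly extracted version of row 1.
existing = [
    {"id": 1, "name": "Ana", "updated_at": 10},
    {"id": 2, "name": "Bo", "updated_at": 10},
]
new = [{"id": 1, "name": "Anna", "updated_at": 20}]
replica = compact(existing + new, pk=("id",), ts_col="updated_at")
```

Here the PK and timestamp columns are passed in as arguments, mirroring the slide's point that the job fetches them from the Schema Service instead of hard-coding them per table.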
Transformation Layer
The performance penalty of managing transformation dependencies inefficiently
The gradual forsaking of performance on the altar of dependency management
Humble beginnings
● Small set of transformations.
● Small team / single engineer.
● Simple one-to-one type dependencies.
● Defining an optimal dependency model by hand is possible.
Complexity on the horizon
● Growing set of transformations.
● Growing team.
● One-to-many / many-to-many type dependencies.
● Defining a dependency model by hand becomes cumbersome and error-prone.
The hard choice between performance and correctness
● As optimal dependency models become ever more difficult to maintain and expand manually without making errors, teams decide to optimise for correctness over performance.
● This results in crude dependency models with a lot of sequential execution in places where parallelization would be possible.
The performance bottleneck strikes back
● Sequential execution results in long execution and long recovery times. In other words, poor SLAs.
● 💣🔥
Rivulus SQL for automated dependency inference
Goal: Maximum parallelism
Main components
● SQL transformations
○ A collection of Rivulus SQL files that make use of a set of custom template variables.
● Executor app
○ Spark app that executes a single transformation at a time.
● DGB (Dependency Graph Builder)
○ Parses all files in the SQL library and builds a dependency graph of the transformations by interpolating Rivulus SQL template vars.
● Airflow
○ Executes the transformations on Databricks in the order specified by the DGB.
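One way to see how a dependency graph yields maximum parallelism is to group transformations into levels, where everything in a level depends only on earlier levels. The sketch below is an assumption about how such an ordering could be derived, not the DGB's actual algorithm. The function name `execution_levels` and the `fact_booking` entry are hypothetical; the other names appear in the talk.

```python
from typing import Dict, List, Set


def execution_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group transformations into levels; all members of a level can run in parallel.

    `deps` maps each transformation to the transformations it depends on.
    Every dependency must itself be a key of `deps`.
    """
    unknown = {d for ds in deps.values() for d in ds} - set(deps)
    if unknown:
        raise KeyError(f"undefined transformations: {sorted(unknown)}")

    remaining = {name: set(ds) for name, ds in deps.items()}
    levels: List[List[str]] = []
    while remaining:
        # A transformation is ready once all of its dependencies have run.
        ready = sorted(name for name, ds in remaining.items() if not ds)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        levels.append(ready)
        for name in ready:
            del remaining[name]
        for ds in remaining.values():
            ds.difference_update(ready)
    return levels


deps = {
    "dim_tour": set(),
    "dim_nps_feedback_stage": set(),
    "fact_nps_feedback": {"dim_nps_feedback_stage"},
    "fact_booking": {"dim_tour"},  # hypothetical transformation
}
levels = execution_levels(deps)
```

With this input both dimension tables land in the first level and both fact tables in the second, so four transformations need only two sequential steps. A scheduler such as Airflow can go further by starting each task as soon as its own dependencies finish instead of waiting for a whole level.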
Rivulus SQL syntax
● {% reference:target 'dim_tour' %}
○ Declares a dependency between this transformation and the dim_tour transformation that must be defined in the same SQL library.
● {% reference:source 'gyg__customer' %}
○ Declares a dependency between this transformation and a raw data source (gyg.customer) that is loaded to Hive by an extraction job.
● {% load 'file.sql' %}
○ Loads a reusable subquery defined in file.sql into this transformation.
Goal: Familiar tooling (Scala, SQL)
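The reference tags are regular enough that dependencies can be pulled out with a pattern match. The sketch below is a minimal illustration, assuming tags are written with straight single quotes as in the talk's SQL example. The regular expression and the function names are hypothetical. Rendering a source name by turning the first `__` into `.` follows the `gyg__customer` / `gyg.customer` pairing on the slide, while rendering a target as its bare name is an assumption.

```python
import re
from typing import Dict, List

# Matches {% reference:target 'name' %} and {% reference:source 'name' %}.
REFERENCE_TAG = re.compile(r"\{%\s*reference:(target|source)\s+'([^']+)'\s*%\}")


def extract_dependencies(sql: str) -> Dict[str, List[str]]:
    """Collect the source and transformation dependencies declared in one SQL file."""
    deps: Dict[str, List[str]] = {
        "source_dependencies": [],
        "transformation_dependencies": [],
    }
    for kind, name in REFERENCE_TAG.findall(sql):
        key = "source_dependencies" if kind == "source" else "transformation_dependencies"
        if name not in deps[key]:
            deps[key].append(name)
    return deps


def render(sql: str) -> str:
    """Replace each reference tag with a table name so the SQL can be executed."""
    def to_table(match):
        kind, name = match.groups()
        # Assumed naming rule: sources map 'db__table' to 'db.table'.
        return name.replace("__", ".", 1) if kind == "source" else name
    return REFERENCE_TAG.sub(to_table, sql)


sql = (
    "FROM {% reference:source 'gyg__nps_feedback' %} AS nf\n"
    "LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs"
)
deps = extract_dependencies(sql)
rendered = render(sql)
```

Because the dependency information lives in the SQL itself, authors only write queries and never maintain a separate dependency list by hand.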
Example
Rivulus SQL (input to the DGB at build time):
SELECT
    nps_feedback_id
  , nps_feedback_stage_id
  , booking_id
  , score
  , feedback
  , update_timestamp
  , source
FROM {% reference:source 'gyg__nps_feedback' %} AS nf
LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
  ON nfs.nps_feedback_stage_name = nf.stage
DGB output (passed to Airflow at build time):
{
  "fact_nps_feedback": {
    "source_dependencies": [
      "gyg__nps_feedback"
    ],
    "transformation_dependencies": [
      "dim_nps_feedback_stage"
    ]
  }
}
Runtime: Airflow triggers the Executor app invocations on DB.
A word on testing
● Maximum parallelism enhances testability
● Separation of config from code
○ Configurable input and output paths
Goal: Built for testability
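Configurable input and output paths make it possible to run a job against small fixtures instead of production data. The sketch below illustrates the idea with a toy job and a temporary directory. `JobConfig`, `run_job`, `run_against_fixture` and the score filter are all hypothetical and stand in for whatever the real Executor app does.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List


@dataclass(frozen=True)
class JobConfig:
    """All environment-specific settings live here, outside the job logic."""
    input_path: Path
    output_path: Path


def run_job(config: JobConfig) -> None:
    """Toy transformation: keep feedback rows with a score of 9 or higher."""
    lines = config.input_path.read_text().splitlines()
    rows = [json.loads(line) for line in lines if line]
    kept = [row for row in rows if row["score"] >= 9]
    config.output_path.write_text("\n".join(json.dumps(row) for row in kept))


def run_against_fixture(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run the job end to end inside a temporary directory and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        config = JobConfig(Path(tmp) / "in.jsonl", Path(tmp) / "out.jsonl")
        config.input_path.write_text("\n".join(json.dumps(row) for row in rows))
        run_job(config)
        out_lines = config.output_path.read_text().splitlines()
        return [json.loads(line) for line in out_lines if line]


result = run_against_fixture([{"id": 1, "score": 10}, {"id": 2, "score": 3}])
```

The same `run_job` function would run in production with a config that points at the real storage locations, so the test exercises the exact code path that ships.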
Conclusion
Results
● Eliminated vulnerability to upstream schema changes
● Democratized our ETL by migrating all business logic to SQL
● Minimized recovery time by maximizing parallelism
● Designed for E2E testability
● Cut processing time by 70% (further reductions are possible)
Next Steps
● Intra-day micro-batches
● Database Replication as a Service
● Rivulus SQL is GYG’s standard tool for writing transformations
● Delta for Upsert
Questions?
We’re hiring!
https://careers.getyourguide.com
