Data Stream Processing for Beginners with
Apache Kafka and Change Data Capture
-Abhijit Kumar
https://au.linkedin.com/in/abhijitkumar1
Agenda
• Intro to Data Stream Processing
• What is Change Data Capture
• CDC Use Cases
• How to capture change data
• CDC with Kafka and Kafka Connect
• Intro to Debezium
• Demo
About Me
• 12+ years of work experience in software
development and architecture
• Currently working as a Data Architect at
Deltatre
• Previously worked at EY, Cisco, Dell and SAP
• Moved to Sydney from India 6 months ago
One interesting fact about me:
Back in India I worked for 3 startups, and all three
had successful exits (the startups were acquired by
Cisco, Dell and SAP)
https://au.linkedin.com/in/abhijitkumar1
Email: abhijitk.connect@gmail.com
Data Stream Processing
• Big data technology
• Processing of data in motion
• Computing on data as soon as it is produced
• Continuous streams: sensor events, user activity on a website,
financial trades, etc
• Traditionally, data is first stored in data stores and only processed later.
• Getting a stream of data out of a traditional RDBMS is a challenge.
What is CDC
• CDC identifies and captures changes made to a database.
• It records the insert, update, and delete activity that
is applied to a table
• Earlier techniques: table differencing, change-value selection,
and database triggers.
• These were inefficient and put substantial overhead on source servers
• Log-based CDC is the approach adopted now
• It uses a background process to scan the database transaction logs
CDC Use Cases
• Data Replication
• Microservice Architecture
• Others: Caching, Alerting, Anomaly Detection
CDC Use Case: Data
Replication
• Replicate data to other DBs and keep content in sync
• Send changes to Data Processing System
• Sharing DB with other consumers/teams
CDC Use Case: Microservice
Architecture
• Share data between services without coupling
• Each microservice keeps its own optimised view of the data
coming from the source database.
CDC: Other Use Cases
• Update caches with changes
• Keep data in sync between caches
• Using Elasticsearch or Solr as data sink to enable full
text search on database
• Alert and anomaly detection
How to do CDC: Legacy
Approach
• Parallel writes: the application updates different DBs at
the same time.
• Polling for changes (identifying new, deleted and
updated rows in the source table)
• Triggers (performance, versioning and
maintenance issues)
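To make the polling approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` table, its `updated_at` column (an integer logical timestamp) and the `poll_changes` helper are all assumptions made for illustration. It also shows one weakness of polling: a deleted row simply disappears, so the poller never sees the delete.

```python
import sqlite3

def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Hypothetical source table with a logical "updated_at" counter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", [(1, "Ann", 1), (2, "Bob", 2)]
)

first_batch, mark = poll_changes(conn, 0)       # picks up both inserts

conn.execute("UPDATE customers SET name = 'Anne', updated_at = 3 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

second_batch, mark = poll_changes(conn, mark)   # sees the update, not the delete
```

The second poll returns only the updated row. Detecting the delete would need extra machinery such as soft-delete flags or a full table comparison, which is part of why polling is listed as a legacy approach.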
Preferred way for CDC
Monitor the DB continuously and identify the changes:
• Read the database logs
• No inconsistencies due to failures
• Both upstream and downstream applications are unaware of the
CDC process.
Database logs for CDC
• The DB maintains a log of changes.
• Logs are used for transaction recovery, replication, etc.
• MySQL: binlog, Postgres: write-ahead log (WAL), MongoDB: oplog
• This ordered sequence of changes is turned into a
stream of events for CDC.
Kafka for CDC
• Kafka message key = table primary key
• Kafka guarantees ordering (per partition)
• Pull-based consumption
• Supports log compaction
• Horizontal scalability
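The points about keys, per-partition ordering and compaction can be illustrated with a small pure-Python simulation. This is only a sketch: `NUM_PARTITIONS`, `partition_for` and `compact` are hypothetical names, the CRC32 hash is a stand-in for Kafka's own partitioner, and real compaction also handles delete markers (tombstones), which are not modelled here.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for the illustration

def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (stand-in for Kafka's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def compact(log):
    """Keep only the most recent record per key, in offset order of the survivors."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]

# Changes to two rows, keyed by primary key.
log = [("42", "v1"), ("7", "v1"), ("42", "v2"), ("42", "v3")]
compacted = compact(log)
```

Because every change to row 42 carries the same key, all of them land in the same partition and stay in order. After compaction only the latest state of each row remains, so a new consumer can rebuild the current table contents without replaying the full history.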
Kafka Connect
• Tool for streaming data between Apache Kafka and other
data systems.
• Framework for source and sink connectors
• Tracks offsets: replay in case of failure
• Rich ecosystem of connectors
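The "tracks offsets, replay on failure" idea can be sketched without Kafka at all. The `run_source` function below is a hypothetical simulation, not Kafka Connect code: it emits records in batches, commits its position after each full batch, and can be told to crash partway through. Restarting from the last committed offset re-delivers part of the interrupted batch, which is the at-least-once behaviour downstream consumers should expect.

```python
def run_source(change_log, committed_offset, batch_size, crash_after=None):
    """Emit records from committed_offset onward, committing after each full batch.

    Returns (emitted records, last committed offset). If crash_after is set,
    the run stops once that many records were emitted, before the commit.
    """
    emitted = []
    position = committed_offset
    while position < len(change_log):
        batch = change_log[position:position + batch_size]
        for record in batch:
            if crash_after is not None and len(emitted) == crash_after:
                return emitted, committed_offset  # simulated crash, no commit
            emitted.append(record)
        position += len(batch)
        committed_offset = position  # offset is stored only after the batch
    return emitted, committed_offset

change_log = ["e0", "e1", "e2", "e3", "e4"]

first_run, offset = run_source(change_log, 0, batch_size=2, crash_after=3)
second_run, offset = run_source(change_log, offset, batch_size=2)
```

In this run `e2` is emitted twice, once before the crash and once after the restart. Nothing is lost, but consumers need to tolerate duplicates.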
CDC Message Format
• Key (primary key of the table) and value (data)
• Payload: before and after state, plus source information
• Messages can be serialized as JSON or Avro
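As an illustration of this message shape, the sketch below models change events as Python dicts and applies them to an in-memory replica. The layout is simplified (no schema section), the table and field names are made up, and the one-letter `op` codes (`c` create, `u` update, `d` delete, `r` snapshot read) follow the convention Debezium uses. Check them against the connector version you run.

```python
def apply_change(replica, event):
    """Apply one change event to a dict replica keyed by primary key."""
    key = event["key"]["id"]
    payload = event["value"]
    if payload["op"] == "d":
        replica.pop(key, None)           # delete: "after" is None
    else:                                # create, update or snapshot read
        replica[key] = payload["after"]
    return replica

SOURCE = {"db": "inventory", "table": "customers"}  # hypothetical source info

events = [
    {"key": {"id": 1001},
     "value": {"op": "c", "before": None,
               "after": {"id": 1001, "email": "ann@example.com"},
               "source": SOURCE}},
    {"key": {"id": 1002},
     "value": {"op": "c", "before": None,
               "after": {"id": 1002, "email": "bob@example.com"},
               "source": SOURCE}},
    {"key": {"id": 1001},
     "value": {"op": "u",
               "before": {"id": 1001, "email": "ann@example.com"},
               "after": {"id": 1001, "email": "anne@example.com"},
               "source": SOURCE}},
    {"key": {"id": 1002},
     "value": {"op": "d",
               "before": {"id": 1002, "email": "bob@example.com"},
               "after": None,
               "source": SOURCE}},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Having both the before and after state in every event is what lets a consumer handle deletes and audit what changed, not just learn the new value.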
Debezium Connectors
• Supports: MySQL, Postgres, MongoDB, Oracle
• Provides a common event format (all connectors use the
same format)
• Provides monitoring support via JMX
• Filtering and snapshot modes
Demo
Use Docker images to start the following:
• Start ZooKeeper
• Start Kafka
• Start MySQL (with preloaded data)
• Open a MySQL terminal
• Start the Kafka Connect service
• Register and start the Debezium MySQL connector
• Watch the Kafka topic
• Modify records in MySQL and view the captured data changes in the Kafka topic
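The "register the connector" step is a POST to the Kafka Connect REST API. The sketch below builds that request in Python without sending it. Every value is a placeholder in the style of the Debezium tutorial (host names, credentials, server id, topic names, and the `localhost:8083` address are all assumptions), and the property names follow older Debezium 1.x releases. Newer releases renamed some of them, for example `database.server.name` became `topic.prefix`, so check the documentation for your version.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect address

# Placeholder connector definition; adjust every value to your environment.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "dbhistory.inventory",
    },
}

def build_request(url, body):
    """Build the POST request that registers a connector with Kafka Connect."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )

request = build_request(CONNECT_URL, connector)
# urllib.request.urlopen(request)  # uncomment once Kafka Connect is running
```

Once the connector is registered, change events for each captured table should start appearing on a Kafka topic, which is what the last two demo steps observe.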
What to do with CDC events
• CDC data can be transformed with a stream
processing application
• Kafka Streams applications for Java and Scala developers
• KSQL can be used by non-developers
• Kafka Connect to sink the data
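A typical transformation is to filter the stream to one table and flatten each change event into a plain row for the sink. The sketch below does this in plain Python as a stand-in for a Kafka Streams or KSQL job. The `unwrap` function, the `deleted` marker field and the sample events are all hypothetical.

```python
def unwrap(events, table):
    """Yield flat rows for one table, dropping the change-event envelope."""
    for event in events:
        if event["source"]["table"] != table:
            continue                                  # filter: other tables
        if event["op"] == "d":
            yield {**event["before"], "deleted": True}
        else:
            yield {**event["after"], "deleted": False}

stream = [
    {"op": "c", "before": None,
     "after": {"id": 1, "total": 10}, "source": {"table": "orders"}},
    {"op": "u", "before": {"id": 7, "name": "Ann"},
     "after": {"id": 7, "name": "Anne"}, "source": {"table": "customers"}},
    {"op": "d", "before": {"id": 1, "total": 10},
     "after": None, "source": {"table": "orders"}},
]

order_rows = list(unwrap(stream, "orders"))
```

The same map-and-filter logic translates directly to a stream processing topology, with the flattened rows written to an output topic that a sink connector reads.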
Do it yourself
Docker Images
• https://hub.docker.com/u/debezium/
• https://github.com/debezium/docker-images
• https://github.com/confluentinc/cp-docker-images
• https://docs.confluent.io/current/connect/managing/connectors.html
–Abhijit Kumar
“Thank You”