Data Stream Processing for Beginners with Kafka and CDC
1. Data Stream Processing for Beginners with
Apache Kafka and Change Data Capture
-Abhijit Kumar
https://au.linkedin.com/in/abhijitkumar1
2. Agenda
• Intro to Data Stream Processing
• What is Change Data Capture
• CDC Use Cases
• How to capture change data
• CDC with Kafka and Kafka Connect
• Intro to Debezium
• Demo
3. About Me
• 12+ years of work experience in software
development and architecture
• Currently working as a Data Architect at
Deltatre
• Previously worked at EY, Cisco, Dell and SAP
• Moved from India to Sydney 6 months ago
One interesting fact about me:
Back in India I worked for 3 startups, and all three
had successful exits (acquired by
Cisco, Dell and SAP)
https://au.linkedin.com/in/abhijitkumar1
Email: abhijitk.connect@gmail.com
4. Data Stream Processing
• Big data technology
• Processing of data in motion
• Computing on data as soon as it is produced
• Continuous streams: sensor events, user activity on a website,
financial trades, etc
• Data is processed as it arrives; it is written to data stores only for processing later.
• Getting a stream of data out of a traditional RDBMS is a challenge.
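To make "computing on data as soon as it is produced" concrete, here is a minimal sketch in plain Python. The `running_average` function and the sensor readings are hypothetical and stand in for a real stream. The point is only that a result is emitted per event instead of after the whole data set has been stored.

```python
from typing import Iterable, Iterator


def running_average(events: Iterable[float]) -> Iterator[float]:
    """Yield the average of all values seen so far, one result per event."""
    count = 0
    total = 0.0
    for value in events:  # each event is handled as soon as it arrives
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings standing in for a continuous stream.
readings = [20.0, 22.0, 24.0]
averages = list(running_average(readings))
```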
5. What is CDC
• CDC means identifying and capturing changes made to a database.
• Change data capture records the insert, update, and delete activity that
is applied to the data
• Earlier techniques: table differencing, change-value selection,
and database triggers.
• These were inefficient and imposed substantial overhead on source servers
• Log-based CDC is the approach adopted now
• It uses a background process to scan database transaction logs
7. CDC Use Case: Data
Replication
• Replicate data to other DBs and keep content in sync
• Send changes to a data processing system
• Share the DB's data with other consumers/teams
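As a rough illustration of keeping a copy in sync, the sketch below applies change events, in order, to an in-memory replica. The event shape (`op`, `key`, `row`) is a simplified assumption made for this example and is not the format of any particular tool.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    key = event["key"]
    if event["op"] == "delete":
        replica.pop(key, None)
    else:  # "insert" and "update" both leave the row in its new state
        replica[key] = event["row"]


# Hypothetical change events, in the order they were committed at the source.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ann"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob"}},
    {"op": "update", "key": 2, "row": {"name": "Bobby"}},
    {"op": "delete", "key": 1, "row": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Because the events are applied in commit order, the replica ends in the same state as the source table.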
8. CDC Use Case: Microservice
Architecture
• Share data between services without coupling
• Each microservice keeps optimised views of the data
coming from the source database.
9. Other CDC Use Cases
• Update caches with changes
• Data sync between caches
• Use Elasticsearch or Solr as a data sink to enable full-text
search on database content
• Alerting and anomaly detection
10. How to do CDC: Legacy
Approach
• Parallel writes: the application updates different DBs at
the same time.
• Polling for changes (identifying new, deleted and
updated rows in the source table)
• Triggers (performance issues, versioning issues,
maintenance issues)
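The polling approach can be sketched with Python's built-in `sqlite3` module. The `customers` table and its `version` column are hypothetical; the sketch assumes the application bumps `version` on every write. It also shows one weakness of polling: a deleted row simply disappears, so the poll has nothing to report for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", 1), (2, "Bob", 2)],
)


def poll_changes(conn, last_version):
    """Return rows changed since last_version and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, version FROM customers "
        "WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_version
    return rows, new_mark


first_batch, mark = poll_changes(conn, 0)

# An update bumps the version, so the next poll sees it...
conn.execute("UPDATE customers SET name = 'Bobby', version = 3 WHERE id = 2")
# ...but a delete removes the row, so polling has nothing to report for it.
conn.execute("DELETE FROM customers WHERE id = 1")

second_batch, mark = poll_changes(conn, mark)
```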
11. Preferred way for CDC
Monitoring the DB continuously and identifying the changes:
• Reading the database logs
• No inconsistencies after a failure, because changes are read from the log of committed transactions
• Both upstream and downstream applications are unaware of the CDC
process.
12. Database logs for CDC
• The DB maintains a log of changes.
• Logs are used for transaction recovery, replication, etc.
• MySQL: binlog, Postgres: write-ahead log, MongoDB:
oplog
• This ordered sequence of changes is turned into
stream events for CDC.
14. Kafka Connect
• Tool for streaming data between Apache Kafka and other
data systems.
• Framework for source and sink connectors
• Tracks offsets: replay from the last committed position after a failure
• Rich ecosystem of connectors
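The offset-tracking idea can be illustrated without Kafka at all. The sketch below is a toy model, not Kafka Connect's API: `run_connector`, `change_log` and `fail_at` are hypothetical names. It copies entries from a log to a sink, remembers how far it got, and resumes from that position after a simulated crash.

```python
change_log = ["insert#1", "update#1", "insert#2", "delete#1"]  # hypothetical log


def run_connector(log, committed_offset, sink, fail_at=None):
    """Copy log entries to sink starting at committed_offset.

    Returns the new committed offset. If fail_at is set, the run stops
    before processing that position, simulating a crash.
    """
    offset = committed_offset
    while offset < len(log):
        if fail_at is not None and offset == fail_at:
            return offset  # crash: everything before this position is committed
        sink.append(log[offset])
        offset += 1  # advance the offset only after the entry is delivered
    return offset


sink = []
offset = run_connector(change_log, 0, sink, fail_at=2)  # stops after 2 entries
offset = run_connector(change_log, offset, sink)        # restart resumes at 2
```

In this toy model nothing is delivered twice. A real connector commits offsets periodically, so some events may be delivered again after a restart and downstream consumers should tolerate duplicates.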
15. CDC Message Format
• Key (primary key of the table) and Value (data)
• Payload: before and after state, plus source information
• Messages can be serialized as JSON or Avro
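A hedged sketch of reading such a message in Python follows. The `before`, `after` and `source` fields match the bullets above; the `op` field and the concrete table and column names are assumptions modelled on Debezium's event envelope. Depending on converter settings, a real JSON message may also wrap this structure in an outer schema/payload object, which is omitted here.

```python
import json

# Hypothetical change event for an update to a `customers` row.
raw_key = '{"id": 1004}'
raw_value = json.dumps({
    "before": {"id": 1004, "first_name": "Anne"},
    "after": {"id": 1004, "first_name": "Anne Marie"},
    "source": {"db": "inventory", "table": "customers"},
    "op": "u",  # assumed operation code for "update"
})


def describe(raw_key: str, raw_value: str) -> str:
    """Summarise which columns changed for which row."""
    key = json.loads(raw_key)
    value = json.loads(raw_value)
    changed = sorted(
        col for col in value["after"]
        if value["before"].get(col) != value["after"][col]
    )
    src = value["source"]
    return f'{src["db"]}.{src["table"]} key={key["id"]} changed={changed}'


summary = describe(raw_key, raw_value)
```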
16. Debezium Connectors
• Supports: MySQL, Postgres, MongoDB, Oracle
• Provides Common event format (all connectors have
same format)
• Provides monitoring support via JMX
• Filtering and snapshot modes
17. Demo
Use Docker images to run the following steps:
• Start ZooKeeper
• Start Kafka
• Start MySQL (preloaded data)
• Open a MySQL terminal
• Start the Kafka Connect service
• Register and start the Debezium MySQL connector
• Watch the Kafka topic
• Modify records in MySQL and view the captured data changes in the Kafka topic
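The "register the connector" step is done through Kafka Connect's REST interface by POSTing a JSON document to its `/connectors` endpoint. The sketch below only builds that request. Host names, credentials, the connector name and the topic names are placeholders, not values from the demo, and the configuration property names follow older Debezium releases; newer releases renamed several of them, so check the documentation for the version in use.

```python
import json
import urllib.request


def build_connector_request(connect_url: str, name: str, config: dict):
    """Build the POST request that registers a connector with Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values are placeholders. Property names are version dependent.
mysql_config = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory",
}

request = build_connector_request(
    "http://localhost:8083", "inventory-connector", mysql_config
)
# To register against a running Kafka Connect service, uncomment:
# urllib.request.urlopen(request)
```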
18. What to do with CDC events
• CDC data can be transformed with a stream processing
application
• Kafka Streams applications for Java and Scala developers
• KSQL can be used by non-developers
• Kafka Connect to sink the data
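One common transformation is to flatten a before/after change event into a plain row before sinking it. The slide names Kafka Streams and KSQL for this; the sketch below uses plain Python instead, purely to show the logic. The `flatten` function, the `_table` field and the sample events are hypothetical.

```python
def flatten(event: dict):
    """Reduce a before/after change event to a simple row-level record.

    Returns None for deletes, where there is no new row state.
    """
    if event["after"] is None:
        return None
    record = dict(event["after"])  # copy so the original event is untouched
    record["_table"] = event["source"]["table"]
    return record


events = [
    {"before": None, "after": {"id": 1, "name": "Ann"},
     "source": {"table": "customers"}},
    {"before": {"id": 1, "name": "Ann"}, "after": None,
     "source": {"table": "customers"}},
]

flattened = [r for r in map(flatten, events) if r is not None]
```

Dropping deletes is a design choice made for this sketch; a sink that must remove rows would need to keep them, for example as a marker record carrying the key.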
19. Do it yourself
Docker Images
• https://hub.docker.com/u/debezium/
• https://github.com/debezium/docker-images
• https://github.com/confluentinc/cp-docker-images
• https://docs.confluent.io/current/connect/managing/connectors.html