2. About me and where I work
https://medium.com/jobteaser-dev-team
@knil_sama
Clément Demonchy
JobTeaser: Preparing the new generation to reach its full potential, embrace the future with optimism and make its mark in the world.
Data engineering with: Python, AWS, Kubernetes, Kafka, and anything that works.
We are hiring!
3. The need
Application data is used to:
● Recommend offers to students
● Measure KPIs for the company
● Enrich the user experience
13. Changes on MySQL
1. Create a user with the following rights:
SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
2. Update the MySQL configuration
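A sketch of the two steps above. The user name, host, and password are placeholders; the binlog settings in the comments are the prerequisites Debezium's MySQL connector documents, assumed here rather than stated on the slide:

```sql
-- Create a dedicated CDC user for Debezium (name/host/password are examples)
CREATE USER 'debezium'@'%' IDENTIFIED BY 'choose-a-password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'debezium'@'%';

-- Typical configuration changes (my.cnf / RDS parameter group):
-- server-id        = 184054      -- any id unique in the replication topology
-- log_bin          = mysql-bin
-- binlog_format    = ROW
-- binlog_row_image = FULL
```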
14. RDS-specific changes
Adjust timeouts to avoid connections dropping or hanging.
The expire_logs_days setting does not work with RDS.
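On RDS, binlog retention is controlled through an RDS-specific stored procedure rather than expire_logs_days; a sketch (the 24-hour value is illustrative):

```sql
-- Keep binlogs long enough for the connector to catch up after a failure
CALL mysql.rds_set_configuration('binlog retention hours', 24);

-- Verify the current setting
CALL mysql.rds_show_configuration;
```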
15. Changes on Kafka
● Increase worker memory => for the initial snapshot surge (database)
● Run connectors in distributed mode => fully stateless
● Set auto.create.topics.enable to true => large number of tables, and new tables appear over time
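A sketch of the corresponding settings (values are illustrative; KAFKA_HEAP_OPTS is the usual way to size a Connect worker's JVM heap):

```properties
# Kafka broker: one topic per captured table is created on demand
auto.create.topics.enable=true

# Kafka Connect worker JVM heap, set via environment variable:
# KAFKA_HEAP_OPTS="-Xms2G -Xmx4G"
```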
16. Debezium Snapshot
snapshot.mode:
● With data: initial, when_needed or never
● Without data: schema_only, schema_only_recovery
On RDS you need a global lock during the snapshot; otherwise use snapshot.locking.mode: minimal, extended or none.
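A minimal MySQL connector registration showing where these options sit (names, hosts and credentials are placeholders):

```json
{
  "name": "mysql-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.user": "debezium",
    "database.password": "choose-a-password",
    "database.server.name": "app",
    "snapshot.mode": "initial",
    "snapshot.locking.mode": "minimal"
  }
}
```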
18. Anonymization strategies with Debezium
● Tables: table.whitelist / table.blacklist
● Columns: column.whitelist / column.blacklist
● Content: column.mask.with.length.chars
Always prefer explicit whitelisting, because you can't prevent columns from being renamed or added.
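A sketch of these properties in a connector config. In column.mask.with.length.chars, "length" is replaced by the number of mask characters; table and column names are examples:

```properties
# Capture only explicitly whitelisted tables
table.whitelist=app.users,app.applications

# Drop sensitive columns entirely
column.blacklist=app.users.password_hash

# Replace a column's content with 10 '*' characters
column.mask.with.10.chars=app.users.email
```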
19. In practice
In our case the pre-existing database was:
● ~40 GB
● >100 tables
The initial snapshot took 40 minutes to complete.
21. Issues with Debezium (part 1)
Big row issue: we had rows larger than the default maximum message size!
In theory you only need to increase max.request.size, but that alone is not enough — the limit must be raised:
● On the Kafka Connect workers
● On the Kafka brokers
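A sketch of matching settings on both sides (property names are the standard Kafka ones; the 10 MB value is illustrative):

```properties
# Kafka Connect worker: raise the embedded producer's maximum request size
producer.max.request.size=10485760

# Kafka broker: accept and replicate messages up to the same size
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
```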
22. 22
Some DCL command
Make the connector crash
So we had to set in connector
Issues with Debezium (part 2)
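The slide does not name the option. Debezium does expose a setting to tolerate statements its binlog DDL parser cannot handle, which is a plausible candidate — this is an assumption, not confirmed by the slide:

```properties
# Assumption: skip statements the DDL parser cannot process
# instead of failing the task
database.history.skip.unparseable.ddl=true
```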
23. Connector stream recovery
If MySQL goes down, the connector will fail and you have to restart the task:
1) Identify the failing task
2) Restart it
Easy case: the binlog and consumer offsets still exist.
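The two steps above map directly onto the Kafka Connect REST API (`GET /connectors/<name>/status`, `POST /connectors/<name>/tasks/<id>/restart`). A minimal sketch, assuming Connect listens on its default port 8083:

```python
import json
from urllib import request

CONNECT_URL = "http://localhost:8083"  # assumption: default Connect REST port


def failed_task_ids(status: dict) -> list:
    """Return the ids of tasks in FAILED state from a connector status payload."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def restart_failed_tasks(connector: str) -> None:
    """Fetch the connector's status and restart each FAILED task."""
    with request.urlopen(f"{CONNECT_URL}/connectors/{connector}/status") as resp:
        status = json.load(resp)
    for task_id in failed_task_ids(status):
        req = request.Request(
            f"{CONNECT_URL}/connectors/{connector}/tasks/{task_id}/restart",
            method="POST",
        )
        request.urlopen(req)
```

In production you would run something like `restart_failed_tasks("mysql-cdc")` on a schedule, or wire the same check into alerting.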
24. Staying alive
Basic monitoring:
● Prometheus
● Grafana
● Alerting on Slack
Be less strict: log errors instead of crashing.
Watch out for the DEBUG level on some loggers, or you will flood the worker logs.
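Kafka Connect's error-handling options are one way to get the "log instead of crash" behavior — an assumption, since the slide does not name the settings used:

```properties
# Tolerate record-level errors instead of failing the task, and log them
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true
```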
26. JDBC Connect to the rescue
A connector created by Confluent:
● Streams Debezium events back to PostgreSQL
● Handles creates, updates and schema changes
● But … deleted records are not removed from the target database
(a PR was merged recently and the new default is to delete them)
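A sketch of a JDBC sink registration with deletes enabled (property names are the Confluent JDBC sink connector's; connection details and topic names are placeholders — delete.enabled requires pk.mode=record_key):

```json
{
  "name": "postgres-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://postgres.example.com:5432/app",
    "topics.regex": "app\\..*",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "delete.enabled": "true",
    "auto.create": "true",
    "auto.evolve": "true"
  }
}
```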
27. Single Message Transformations (SMTs)
Debezium provides an option on the JDBC connector that adds a "__deleted" flag column on every table.
Other useful SMTs: RemoveNulls, MultiTimestampConverter
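The "__deleted" flag comes from Debezium's record-unwrapping SMT; a sketch, assuming delete.handling.mode=rewrite is the option the slide refers to (that is the documented way to get the __deleted field):

```properties
# Flatten Debezium's change-event envelope before the JDBC sink
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# Keep deleted rows but mark them with a "__deleted" column
transforms.unwrap.delete.handling.mode=rewrite
```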