This document discusses challenges with keeping a metadata repository current using event-driven updates from data sources. It describes how using Apache Kafka and the Debezium connector to capture changes from database "outbox" tables that mirror system catalog metadata tables allows pushing metadata deltas to the repository in real-time. This overcomes limitations of log-based and query-based CDC approaches when applied directly to database system tables.
Keep your Metadata Repository Current with Event-Driven Updates using CDC and Kafka Connect
1. Keep your Metadata Repository Current with Event-Driven Updates using CDC and Kafka Connect
Barbara Eckman, Ph.D.
Senior Principal Software Architect
Comcast
2. What Makes Your Life Hard?
Powerful business insights come from models integrating data that spans silos.
• Where is the data?
• What does it mean?
• Who produced it?
• Who touched it during its journey?
• What is its quality?
3. Industry-Leading Metadata Management
• A unified metadata repository
– Heterogeneity, extensibility, scalability, accessibility
– Emerging industry standard around K-V store, free-text search, graph DB
• Comcast has had it since 2017!
– Apache Atlas with many Comcast-contributed extensions
Keynote
4. Current State
• Portal: human metadata input, search
• Metadata management and storage
• Metadata, lineage sources: on-prem Hadoop, AWS public cloud, RDBMS systems
• Automated capture
– Event-driven pushes
– RDBMS connector: nightly pulls
5. Legacy Architecture for Metadata Capture
• RDBMS
• Java connector: pull bulk metadata on regular schedule, identify deltas
• Apache Kafka: metadata
• Legend: Comcast code, OS code
6. Desired State
• Portal: human metadata input, search
• Metadata management and storage
• Metadata, lineage sources: on-prem Hadoop, AWS public cloud, RDBMS systems
• Automated capture: event-driven pushes
• Event-driven push: HOW???
7. Obvious Fix that didn’t work:
• Use Kafka Connect log-based CDC for pushing metadata deltas!
Binlogs don’t capture changes in metadata except as DDL statements
8. Another Obvious Fix that didn’t quite work:
• Use Kafka Connect query-based CDC for querying system tables for metadata deltas!
Queries on system tables give current state only. Deletes are not captured.
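A minimal sketch of why polling misses deletes, using an in-memory SQLite table as a stand-in for a system table. The table, its `id` column, and the `poll` helper are assumptions for illustration only; real catalog views typically have no incrementing column at all, which makes the problem worse, not better.

```python
import sqlite3

# Stand-in for a system catalog view: it only ever exposes current state.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE columns (id INTEGER PRIMARY KEY, table_name TEXT, column_name TEXT)"
)
conn.executemany(
    "INSERT INTO columns (id, table_name, column_name) VALUES (?, ?, ?)",
    [(1, "accounts", "account_id"), (2, "accounts", "balance")],
)


def poll(conn, last_seen_id):
    """Incrementing-style poll: return rows whose id is greater than last_seen_id."""
    rows = conn.execute(
        "SELECT id, table_name, column_name FROM columns WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last


first_poll, last_seen = poll(conn, 0)

# Between polls, one column is dropped and another is added.
conn.execute("DELETE FROM columns WHERE id = 2")
conn.execute(
    "INSERT INTO columns (id, table_name, column_name) VALUES (3, 'accounts', 'opened_on')"
)

second_poll, last_seen = poll(conn, last_seen)
print(first_poll)   # both original columns
print(second_poll)  # only the new column; nothing reports that 'balance' is gone
```

The second poll sees the added column but returns no record for the dropped one, so a downstream repository would keep describing a column that no longer exists.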
9. Less Obvious Fix that DID work:
• Outbox Architectural Pattern
• Maintain copies of system catalog metadata tables as outboxes
Binlogs DO capture changes in metadata outboxes, just like regular data tables
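Because the outbox is an ordinary table, a log-based source connector can be pointed at it like any other. The sketch below builds a request body for the Kafka Connect REST API (`POST /connectors`) that limits the Debezium MySQL connector to outbox tables. Property names follow Debezium 2.x and differ in older releases; the host names, credentials, server id, topic prefix, and connector name are all placeholder assumptions, not values from this deployment.

```python
import json


def debezium_outbox_config(host, user, password, tables):
    """Build a Kafka Connect request body for a Debezium MySQL source (hypothetical values)."""
    config = {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": host,
        "database.port": "3306",
        "database.user": user,
        "database.password": password,
        # Any id that is unique among the MySQL server's replication clients.
        "database.server.id": "5400",
        # Topics are named <topic.prefix>.<database>.<table>.
        "topic.prefix": "metadata",
        # Capture only the outbox copies, not regular data tables.
        "table.include.list": ",".join(tables),
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.metadata",
    }
    return {"name": "metadata-outbox-source", "config": config}


body = debezium_outbox_config(
    "mysql.example.internal", "cdc_user", "change-me", ["outbox.info_columns"]
)
print(json.dumps(body, indent=2))
```

With these assumed values, changes to `outbox.info_columns` would land on a topic named `metadata.outbox.info_columns`.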
10. Creating Outboxes from System Catalogs
create table outbox.info_columns as
select
TABLE_CATALOG,
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
DATA_TYPE,
IS_NULLABLE,
CHARACTER_MAXIMUM_LENGTH
from information_schema.columns;
• MERGE updates from information_schema.columns to outbox.info_columns on a regular schedule (e.g. 5 minutes, 30 minutes, 1 hour)
• Why not use a trigger to activate the merge?
– No triggers on system tables
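A sketch of the scheduled merge, written as three plain SQL statements so it does not depend on a particular database's MERGE syntax. It runs against in-memory SQLite; `catalog_columns` and `outbox_info_columns` are stand-in names for information_schema.columns and outbox.info_columns, and only one attribute (`data_type`) is compared to keep it short. The point is that only rows that really changed are touched, so the log carries one change event per real metadata change, deletes included.

```python
import sqlite3


def sync_outbox(conn):
    """Bring the outbox in line with the catalog, touching only changed rows."""
    cur = conn.cursor()
    # 1. Deletes: outbox rows whose column no longer exists in the catalog.
    cur.execute("""
        DELETE FROM outbox_info_columns
        WHERE NOT EXISTS (
            SELECT 1 FROM catalog_columns c
            WHERE c.table_name = outbox_info_columns.table_name
              AND c.column_name = outbox_info_columns.column_name)
    """)
    deleted = cur.rowcount
    # 2. Updates: same key, different attribute value (NULLs not handled here).
    cur.execute("""
        UPDATE outbox_info_columns
        SET data_type = (
            SELECT c.data_type FROM catalog_columns c
            WHERE c.table_name = outbox_info_columns.table_name
              AND c.column_name = outbox_info_columns.column_name)
        WHERE EXISTS (
            SELECT 1 FROM catalog_columns c
            WHERE c.table_name = outbox_info_columns.table_name
              AND c.column_name = outbox_info_columns.column_name
              AND c.data_type <> outbox_info_columns.data_type)
    """)
    updated = cur.rowcount
    # 3. Inserts: catalog rows that have no outbox copy yet.
    cur.execute("""
        INSERT INTO outbox_info_columns (table_name, column_name, data_type)
        SELECT c.table_name, c.column_name, c.data_type
        FROM catalog_columns c
        WHERE NOT EXISTS (
            SELECT 1 FROM outbox_info_columns o
            WHERE o.table_name = c.table_name AND o.column_name = c.column_name)
    """)
    inserted = cur.rowcount
    conn.commit()
    return {"deleted": deleted, "updated": updated, "inserted": inserted}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_columns (table_name TEXT, column_name TEXT, data_type TEXT);
    CREATE TABLE outbox_info_columns (
        table_name TEXT, column_name TEXT, data_type TEXT,
        PRIMARY KEY (table_name, column_name));
    INSERT INTO catalog_columns VALUES
        ('accounts', 'account_id', 'bigint'), ('accounts', 'opened_on', 'date');
    INSERT INTO outbox_info_columns VALUES
        ('accounts', 'account_id', 'int'), ('accounts', 'balance', 'decimal');
""")
counts = sync_outbox(conn)
print(counts)
```

A second run with no catalog changes touches no rows, so it produces no change events.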
11. New Architecture pushes deltas as they happen
• Sources: MySQL, MSSQL
• Debezium log-based CDC source connector
• Apache Kafka: metadata
• Kafka Connect
• Atlas sink connector
• Legend: Comcast code, OS code
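A sketch of how a sink might turn a Debezium change event on an outbox table into a metadata delta. The envelope fields (`op`, `before`, `after`, and the optional `payload` wrapper) are Debezium's; the `to_metadata_delta` function and the shape of the delta it returns are hypothetical and do not describe the actual Atlas sink connector.

```python
# Debezium operation codes: c = create, r = snapshot read, u = update, d = delete.
OPS = {"c": "create", "r": "create", "u": "update", "d": "delete"}


def to_metadata_delta(value):
    """Translate a Debezium change-event value into a hypothetical metadata delta."""
    if value is None:
        # Tombstone record that can follow a delete; nothing to apply.
        return None
    # Unwrap the payload when the converter embeds a schema alongside it.
    payload = value.get("payload", value)
    action = OPS.get(payload["op"])
    if action is None:
        return None
    # A delete carries the old row in "before"; everything else uses "after".
    row = payload["before"] if action == "delete" else payload["after"]
    name = ".".join([row["TABLE_SCHEMA"], row["TABLE_NAME"], row["COLUMN_NAME"]])
    attributes = None
    if action != "delete":
        attributes = {"data_type": row["DATA_TYPE"], "is_nullable": row["IS_NULLABLE"]}
    return {"action": action, "name": name, "attributes": attributes}


delete_event = {
    "payload": {
        "op": "d",
        "before": {
            "TABLE_SCHEMA": "sales",
            "TABLE_NAME": "accounts",
            "COLUMN_NAME": "balance",
            "DATA_TYPE": "decimal",
            "IS_NULLABLE": "YES",
        },
        "after": None,
    }
}
delta = to_metadata_delta(delete_event)
print(delta)
```

The delete arrives as an explicit event, which is exactly what the query-based approach could not provide.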
12. Desired State Is New State!
• Portal: human metadata input, search
• Metadata management and storage
• Metadata, lineage sources: on-prem Hadoop, AWS public cloud, RDBMS systems
• Automated capture: event-driven pushes
16. Less Obvious Fix that kinda worked:
• Use JDBC CDC Connector
• Maintain copies of system catalog metadata tables
• Configure MSSQL CDC on these tables
Query-based CDC that captures changes in metadata outboxes. RDBMS-specific.
EXECUTE sys.sp_cdc_enable_db;
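After the database-level call, SQL Server CDC also has to be enabled per table with `sys.sp_cdc_enable_table`, and changes then appear in a change table whose `__$operation` column encodes the change type (1 = delete, 2 = insert, 3 = update before image, 4 = update after image). The sketch below generates the per-table statement and classifies rows a query-based connector might read from such a change table. Both helper functions, the sample rows, and the `outbox`/`info_columns` names are illustrative assumptions.

```python
def enable_table_cdc_sql(schema, table):
    """Build the T-SQL that enables CDC on one table (identifiers assumed trusted)."""
    return (
        "EXECUTE sys.sp_cdc_enable_table "
        f"@source_schema = N'{schema}', @source_name = N'{table}', @role_name = NULL;"
    )


# Meaning of the __$operation column in a SQL Server CDC change table.
CDC_OPERATIONS = {1: "delete", 2: "insert", 3: "update_before", 4: "update"}


def classify_change_rows(rows):
    """Turn change-table rows into (action, data) pairs, skipping update before-images."""
    deltas = []
    for row in rows:
        action = CDC_OPERATIONS[row["__$operation"]]
        if action == "update_before":
            continue
        # Drop CDC bookkeeping columns, keep the captured metadata columns.
        data = {k: v for k, v in row.items() if not k.startswith("__$")}
        deltas.append((action, data))
    return deltas


print(enable_table_cdc_sql("outbox", "info_columns"))

sample_rows = [
    {"__$operation": 3, "COLUMN_NAME": "account_id", "DATA_TYPE": "int"},
    {"__$operation": 4, "COLUMN_NAME": "account_id", "DATA_TYPE": "bigint"},
    {"__$operation": 1, "COLUMN_NAME": "balance", "DATA_TYPE": "decimal"},
]
deltas = classify_change_rows(sample_rows)
print(deltas)
```

Unlike polling the system tables directly, the change table records the delete as its own row, which is why this variant can capture it.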
Editor's notes
Where is data about X?
Most authoritative copy?
What does this attribute mean?
Fear of joining on the wrong “account number”?
Who produced this data?
Who messed up the data I produced?
Data quality
DON’T TALK ABOUT DETAILS OF THE RDBMS CONNECTOR YET!
Dataset types:
Hadoop
Hive tables
HDFS files
Kafka topics
Kinesis streams
AWS:
S3 buckets, prefixes
Glue tables
DynamoDB
RDBMS
Schemas:
Avro
JSON
Delimited
Parquet
ML Models, Feature Engineering, Feature Sets
Microservices
Trigger on every table to send a message to Kafka when an insert/update/delete is made?!