Keep your Metadata Repository Current with Event-Driven Updates using CDC and Kafka Connect (Barbara Eckman, Comcast) Kafka Summit 2020

The data science techniques and machine learning models that provide the greatest business value and insights require data that spans enterprise silos. To integrate this data, and ensure you’re joining on the right fields, you need a comprehensive, enterprise-wide metadata repository. More importantly, you need it to be always up to date. Nightly updates are simply not good enough when customers and users expect near-real-time responsiveness.
The challenge with keeping a metadata repository up to date lies not with cloud services or distributed storage frameworks, but rather with the relational database management systems (RDBMSs) that dot the enterprise landscape. At Comcast, we’ve found it relatively easy to feed our Apache Atlas metadata repo incrementally from Hadoop and AWS, using event-driven pushes to a dedicated Apache Kafka topic that Atlas listens to. Such pushes are not practical with RDBMSs, however, since the event-driven technique there is the database trigger. Triggers are so invasive and potentially detrimental to performance that your DB admin likely won’t allow one for detecting metadata changes.
Triggers are out. Pulling the complete current state of metadata from an RDBMS at regular intervals and calculating the deltas is too slow and unworkable. And, it turns out that out-of-the-box log-based change data capture (CDC) is also a dead end, because metadata changes are represented in transaction logs as SQL DDL strings, not as atomic insert/update/delete operations as they are for data.
So, how do you keep your metadata repository always up to date with the current state of your RDBMS metadata? Our group solved this challenge by creating an alternate method for CDC on RDBMS metadata based on database system tables. Our query-based CDC serves as a Kafka Connect source for our Apache Atlas sink, providing event-driven, continuous updates to RDBMS metadata in our repository, but does not suffer from the usual limitations/disadvantages of vanilla query-based CDC. If you’re facing a similar challenge, join us at this session to learn more about the obstacles you’ll likely face and how you can overcome them using the method we implemented.

  1. Keep your Metadata Repository Current with Event-Driven Updates using CDC and Kafka Connect
     Barbara Eckman, Ph.D., Senior Principal Software Architect, Comcast
  2. What Makes Your Life Hard?
     Powerful business insights come from models integrating data that spans silos.
     • Where is the data?
     • What does it mean?
     • Who produced it?
     • Who touched it during its journey?
     • What is its quality?
  3. Industry-Leading Metadata Management
     A unified metadata repository:
     • Heterogeneity, extensibility, scalability, accessibility
     • Emerging industry standard around K-V store, free-text search, graph DB
     • Comcast has had it since 2017: Apache Atlas with many Comcast-contributed extensions (see keynote)
  4. Current State (architecture diagram)
     Portal for human metadata input and search; metadata management and storage; metadata and lineage sources: on-prem Hadoop, AWS public cloud, RDBMS systems.
     Automated capture via event-driven pushes, except the RDBMS connector, which still relies on nightly pulls.
  5. Legacy Architecture for Metadata Capture (diagram)
     A Java connector (Comcast code) pulls bulk metadata from the RDBMS on a regular schedule, identifies deltas, and publishes metadata to Apache Kafka (open-source code).
  6. Desired State (diagram)
     Same landscape as the current state, but with automated capture via event-driven pushes from all sources, including RDBMS systems. Event-driven push from an RDBMS: HOW???
  7. Obvious fix that didn't work:
     • Use Kafka Connect log-based CDC for pushing metadata deltas!
     Binlogs don't capture changes in metadata except as DDL statements.
  8. Another obvious fix that didn't quite work:
     • Use Kafka Connect query-based CDC to query system tables for metadata deltas!
     Queries on system tables give current state only; deletes are not captured.
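The limitation on this slide can be seen with a small, self-contained sketch (sqlite3 standing in for a production RDBMS; all table names are illustrative): a query against the system catalog returns only the current state, so a dropped table simply vanishes from the snapshot. There is no delete event to capture; the delete is visible only if you diff successive snapshots yourself.

```python
# Sketch: why query-based CDC on system tables misses deletes.
# sqlite3 stands in for the RDBMS; table names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

def snapshot(conn):
    """Query the system catalog (sqlite_master) for current table names."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return {name for (name,) in rows}

before = snapshot(conn)           # {'accounts', 'orders'}
conn.execute("DROP TABLE orders")
after = snapshot(conn)            # {'accounts'} -- 'orders' is just gone

# The drop left no event behind; it only appears when diffing snapshots:
deleted = before - after
print(deleted)                    # {'orders'}
```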
  9. Less obvious fix that DID work:
     • The Outbox architectural pattern
     • Maintain copies of system catalog metadata tables as outboxes
     Binlogs DO capture changes in metadata outboxes, just like regular data tables.
  10. Creating Outboxes from System Catalogs
      create table outbox.info_columns as
        select TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
               DATA_TYPE, IS_NULLABLE, CHARACTER_MAXIMUM_LENGTH
        from information_schema.columns;
      • MERGE updates from information_schema.columns into outbox.info_columns on a regular schedule (e.g. every 5 minutes, 30 minutes, or 1 hour)
      • Why not use a trigger to activate the merge? No triggers are allowed on system tables.
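The scheduled merge can be sketched as follows (a hedged illustration, not the production code: sqlite3 stands in for MySQL/MSSQL, and since SQLite lacks MERGE, the same effect is emulated by diffing the catalog snapshot against the outbox and applying inserts and deletes). The point is that these inserts and deletes hit an ordinary table, so in a real RDBMS they would land in the binlog where log-based CDC can pick them up.

```python
# Hedged sketch of the scheduled outbox refresh; names are illustrative.
# A real deployment runs a SQL MERGE from information_schema.columns into
# outbox.info_columns; here we emulate it by diffing and applying deltas.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_t (id INTEGER, name TEXT)")
# The outbox: a plain copy of (table_name, column_name) from the catalog.
conn.execute("CREATE TABLE outbox_columns (table_name TEXT, column_name TEXT)")

def catalog_columns(conn):
    """Current (table, column) pairs from the system catalog."""
    rows = set()
    for (tbl,) in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'outbox%'"
    ):
        for col in conn.execute(f"PRAGMA table_info({tbl})"):
            rows.add((tbl, col[1]))   # col[1] is the column name
    return rows

def refresh_outbox(conn):
    """Run on a schedule (e.g. every 5 minutes): merge catalog -> outbox.
    Inserts/deletes on the outbox are regular DML, so in MySQL/MSSQL
    they would appear in the binlog like any data-table change."""
    current = catalog_columns(conn)
    stored = set(conn.execute("SELECT table_name, column_name FROM outbox_columns"))
    for row in current - stored:      # newly added columns
        conn.execute("INSERT INTO outbox_columns VALUES (?, ?)", row)
    for row in stored - current:      # dropped columns
        conn.execute(
            "DELETE FROM outbox_columns WHERE table_name=? AND column_name=?",
            row)
    return sorted(current - stored), sorted(stored - current)

inserted, deleted = refresh_outbox(conn)
print(inserted)   # [('data_t', 'id'), ('data_t', 'name')]

conn.execute("ALTER TABLE data_t ADD COLUMN email TEXT")
inserted, deleted = refresh_outbox(conn)
print(inserted)   # [('data_t', 'email')]
```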
  11. New Architecture (diagram): pushes deltas as they happen
      A Debezium log-based CDC source connector (open-source code) reads metadata changes from MySQL / MSSQL, and a Kafka Connect Atlas sink connector (Comcast code) applies them from Apache Kafka.
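A Debezium MySQL source connector for this pipeline might be configured roughly as below. This is a hedged sketch, not the talk's actual configuration: the hostname, credentials, server name, and topic names are all invented placeholders, and property names follow the Debezium 1.x MySQL connector. Note that the connector watches only the outbox schema, which is the whole trick.

```json
{
  "name": "metadata-outbox-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "metadata-db.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.server.id": "184054",
    "database.server.name": "metadata-db",
    "table.include.list": "outbox.info_columns",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.metadata-db"
  }
}
```

The Atlas sink connector on the other side of the topic is Comcast-internal code, so no configuration for it is shown here.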
  12. Desired State Is New State! (diagram)
      Automated capture via event-driven pushes from all sources, including RDBMS systems.
  13. Acknowledgements
  14. barbara_eckman@cable.comcast.com
  15. Backup Slides
  16. Less obvious fix that kinda worked:
      • Use the JDBC CDC connector
      • Maintain copies of system catalog metadata tables
      • Configure MSSQL CDC on these tables: EXECUTE sys.sp_cdc_enable_db;
      Query-based CDC that captures changes in metadata outboxes, but RDBMS-specific.
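For the MSSQL variant, enabling CDC is a two-step process: once per database, then once per tracked table. The sketch below fills in the table-level call with illustrative schema and table names (matching the outbox example from slide 10); the role name and other options would depend on the deployment.

```sql
-- Step 1: enable CDC for the database (from the slide above).
EXECUTE sys.sp_cdc_enable_db;

-- Step 2: enable CDC per outbox table (names illustrative).
EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'outbox',
    @source_name   = N'info_columns',
    @role_name     = NULL;   -- NULL: no gating role for change-data access
```

As the slide notes, this approach is RDBMS-specific, and (per the editor's note below) it has to be configured separately for each database.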

Editor's notes

  • Where is data about X?
    Most authoritative copy?
    What does this attribute mean?
    Fear of joining on the wrong “account number”?
    Who produced this data?
    Who messed up the data I produced?
    Data quality
  • DON’T TALK ABOUT DETAILS OF THE RDBMS CONNECTOR YET!
    Dataset types:
    Hadoop
    Hive tables
    HDFS files
    Kafka topics
    Kinesis streams
    AWS:
    S3 buckets, prefixes
    Glue tables
    DynamoDB
    RDBMS
    Schemas:
    Avro
    JSON
    Delimited
    Parquet
    ML Models, Feature Engineering, Feature Sets
    Microservices

  • Trigger on every table to send a message to Kafka when an insert/update/delete is made?????!!!!!
  • THEMES:
    Heterogeneity
    Extensibility
    Scale

    Dataset types currently supported:
    RDBMS
    Kafka topics
    Kinesis streams
    Schemas:
    Avro
    JSON
    Delimited
    Parquet
    AWS:
    S3 buckets, prefixes
    Glue tables
    DynamoDB
    Hadoop
    Hive tables
    HDFS files
    ML Models, Feature Engineering, Feature Sets
  • ROBIN MOFFATT
  • Also, it has to be configured separately for each database.
