The data science techniques and machine learning models that provide the greatest business value and insights require data that spans enterprise silos. To integrate this data, and ensure you’re joining on the right fields, you need a comprehensive, enterprise-wide metadata repository. More importantly, you need it to be always up to date. Nightly updates are simply not good enough when customers and users expect near-real-time responsiveness.
The challenge with keeping a metadata repository up to date lies not with cloud services or distributed storage frameworks, but rather with the relational database management systems (RDBMSs) that dot the enterprise landscape. At Comcast, we’ve found it relatively easy to feed our Apache Atlas metadata repo incrementally from Hadoop and AWS, using event-driven pushes to a dedicated Apache Kafka topic that Atlas listens to. Such pushes are not practical with RDBMSs, however, since the event-driven technique there is the database trigger. Triggers are so invasive and potentially detrimental to performance that your DB admin likely won’t allow one for detecting metadata changes.
Triggers are out. Pulling the complete current state of metadata from an RDBMS at regular intervals and calculating the deltas is too slow to be workable. And it turns out that out-of-the-box log-based change data capture (CDC) is also a dead end, because metadata changes are represented in transaction logs as SQL DDL strings, not as atomic insert/update/delete operations as they are for data.
So, how do you keep your metadata repository always up to date with the current state of your RDBMS metadata? Our group solved this challenge by creating an alternate method for CDC on RDBMS metadata based on database system tables. Our query-based CDC serves as a Kafka Connect source for our Apache Atlas sink and provides event-driven, continuous updates to the RDBMS metadata in our repository, without the usual limitations of vanilla query-based CDC. If you’re facing a similar challenge, join us at this session to learn more about the obstacles you’ll likely face and how you can overcome them using the method we implemented.
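To make the general idea of query-based CDC over system tables concrete, here is a minimal sketch of a watermark poll. It is not the implementation presented in the session, which the abstract does not describe. It assumes a hypothetical catalog table named `sys_objects` with a `last_ddl_time` column (some RDBMSs expose a comparable column in their system views), simulated here with an in-memory SQLite database. The function and field names are illustrative only.

```python
import sqlite3


def poll_metadata_changes(conn, watermark):
    """Return (events, new_watermark) for catalog rows changed after watermark.

    Only rows newer than the watermark are read, so each poll avoids
    pulling and diffing the complete metadata state.
    """
    rows = conn.execute(
        "SELECT object_name, object_type, last_ddl_time "
        "FROM sys_objects "  # hypothetical system table, an assumption
        "WHERE last_ddl_time > ? "
        "ORDER BY last_ddl_time, object_name",
        (watermark,),
    ).fetchall()
    # Each row becomes one change event that a source connector could emit.
    events = [
        {"name": name, "type": obj_type, "changed_at": changed_at}
        for name, obj_type, changed_at in rows
    ]
    # Advance the watermark to the newest change seen, if any.
    new_watermark = rows[-1][2] if rows else watermark
    return events, new_watermark


def make_demo_catalog():
    """Build an in-memory stand-in for an RDBMS system table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sys_objects ("
        "object_name TEXT PRIMARY KEY, "
        "object_type TEXT, "
        "last_ddl_time INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sys_objects VALUES (?, ?, ?)",
        [("customers", "TABLE", 100), ("orders", "TABLE", 105)],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_catalog()
    events, watermark = poll_metadata_changes(conn, 0)
    print(events, watermark)
```

In this sketch, a dropped object simply disappears from the table and would not be reported, and two changes sharing one timestamp could be split across polls and missed. A real source would need to handle both cases, and would publish the events to a Kafka topic instead of returning them.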