Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of creation of new entities and new lineage links via asynchronous kafka messaging.
Recently we were presented the challenge of providing integrated data discovery and lineage across our public cloud datasources and on-prem datasources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining.
While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. In-house-built solutions typically underestimate the cost of development and maintenance and often suffer from architecture-by-accretion. Atlas' commitment to extensibility, built-in provision of typed, free-text, and graph search, and REST and asynchronous APIs, position it uniquely in the build-vs-buy sweet spot.
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas
1. Comcast collects, stores, and uses all data in accordance with our privacy
disclosures to users and applicable laws.
An architecture for integrated data discovery and
lineage over on-prem datasources and public cloud
with Apache Atlas
Barbara Eckman, Ph.D.
Principal Architect
2. Our Group’s Mission
Gather and organize metadata and lineage from diverse sources
to make data universally discoverable, integrate-able and accessible
to empower insight-driven decision making
• Dozens of tenants and stakeholders
• Millions of messages/second captured
• Tens of PB of long term data storage
• Thousands of cores of distributed compute
3. Data Discovery and Lineage
• Where can I find data about X?
• How is this data structured?
• Who produced it?
• What does it mean?
• How ”mature” is it?
• What attributes in your data match attributes in mine? (e.g., potential join
fields)
• How has the data changed in its journey from ingest to where I’m viewing it?
• Where are the derivatives of my original data to be found?
4. Challenges of Heterogeneity
• There are many excellent data discovery tools
-OS and commercial
• BUT limited in scope of data set types supported
-Only a certain Big Data ecosystem provider
-Only RDBMS’s, text documents, emails
• We need to add new data set types from multiple providers
• We need to integrate metadata from diverse data sets, both in public cloud
and on-prem
• We need to integrate lineage from diverse loading jobs, both batch and
streaming
5. Outline
• #TBT** to DataWorks Summit 2017
• Reorganization Yields New Requirements (12/2017)
• Challenges encountered and met
• New Integrative Data Discovery and Lineage Architecture
• Next steps
** “Throw Back Tuesday”
9. What are Avro and Atlas?
• A data serialization system
-A JSON-based schema language
-A compact serialized format
• APIs in a bunch of languages
• Benefits:
-Cross-language support for dynamic
data access
-Simple but expressive schema
definition and evolution
-Built-in documentation, defaults
• Data Discovery, Lineage
-Browser UI
-Rest/Java and kafka APIs
-Synchronous and Asynchonous
messaging
-Free-text, typed, graph search
• Integrated Security (Apache Ranger)
• Schema Registry as well as Metadata
Repo
Open Source
Extensible
10. DataWorks Summit 2017: Atlas Metadata Types
Built-in Atlas Types Custom Atlas Entities Custom Atlas Processes
• DataSet
• Process
• Hive tables
• Kafka topics
• Avro Schemas
-Reciprocally linked to
all other dataset
types
• Extensions to Kafka
topic
-sizing parameters
• AWS S3 Object Store
- S3 Bucket, Pseudo-
Directory
• Lineage Processes
-Avro schema
versioning
-Storing data to S3
objects
• Enrichment Processes
on streaming data
-Re-publishing to
kafka topics
12. Reorganization Yields New Requirements
• Integrate on-prem data sources
-Traditional warehousing, RDBMS, Atlas on Hadoop
• Increase signal/noise ratio
• Atlas Connectors for all metadata sources
-At rest
-Streaming
• Lineage capture
-Streaming and batch
• End-user annotations
-Stakeholders, documentation
13. New Data Source Type
Generic RDBMS Atlas typedefs
• Instance
• Database (schema)
• Table
• Column
• Index
• Foreign Key
Used for:
• Informatica Metadata Manager, on top of
Teradata EDW
• Oracle
• Others to come
Comments:
• Back pointers to parent class at every level of
hierarchy
• Owned_refs ( aka “composite”) at every level
• Load only whitelisted databases to increase
signal, reduce noise
14. Connectors for all metadata sources
• One codebase for all sources
• Differ in means of acquiring metadata, but use the same methods to package
data into AtlasEntities for publishing via kafka api
-RDBMS
-Atlas to Atlas (supports different versions)
-Kafka topics
-Avro schemas
-AWS datalake objects
15. Dataset Lineage capture
• Generic lineage process typedef
-Used for both batch and streaming lineage capture
-Attributes include transforms performed, config parameters
-May be subclassed from to add attributes for individual cases
• Unlike metadata, lineage capture is event-triggered
-In AWS, Cloudwatch event on Glue crawler triggers lambda
function
-In on-prem hadoop, Inotify event on hdfs triggers microservice
• Triggered components assemble requisite info and publish to
Atlas lineage connector
Acknowledgements:
Datalake Team
16. End-user annotations: new tag typedefs
Stakeholders
• Individuals
-Data Business Owner
-Data Technical Owner
-Data Steward
-Delivery Manager
-Data Architect
• Teams
-Delivery Team
-Support Team
Documentation
• array<map<string, string>>
-Name
-Description
-URL
Acknowledgements:
Portal team
17. Portal UI
REST/Asynchronous API
Apache Atlas on AWS
Integrative Metadata Store for Search
Duplicatesmetadatafrommetadatasourcessufficienttoenablediscovery
Free-text searchSQL-like search Graph search
RDMBS
Connector
Atlas-to-Atlas
ConnectorLineage Connector
Other
Connectors
DataStream
Connector
New Integrative Data Discovery and Lineage Architecture
Drill downtoindividual metastores’
UIsfordeepexploration
Drill downtoindividual metastores’
UIsfordeepexploration
Metadata
Sources Batch Data
Ingest Jobs
Other
Metadata
repos
ML Pipeline
Models
Feature Eng
Jobs
Streaming
Data
Ingest Jobs
Oracle,
MySQL,
MSSQL, etc
Catalog
AWS S3
Datalake,
Kinesis
Streams
Avro
Schemas
Kafka
Topics
Informatica
MDM
Teradata
On-prem
Atlas for
Hive
Tables
Model
Connector
AWSobject
Connector
On-premPublic Cloud
18. Challenges encountered
Integrating EDW metadata
• Problem:
-vastly larger size objects than previous, even
though integration restricted to whitelist of
top priority databases
• Solution
-Increase config params in java, kafka, atlas
-Use kafka api
-Build a single AtlasEntity object
-Send a single message for each database
-Optimize query to get data from Informatica
(pivot)
Schema on read legacy in on-
prem datalake
• Problem:
-Many, many duplicate hive
tables
-Unknown lineage/semantic
equivalence
• Solution:
-Data-based ML solution to
find dups and equivalences
(POC in progress)
20. • End-to-end metadata repository
-Models are first-class objects, captured
with rich metadata (eg input file schema,
feature set schema, model parameters,
etc)
-Feature engineering jobs are first-class
objects, captured with rich metadata (eg
model, data quality threshold, input file
schema, owner)
-Build metadata capture on models and
feature engineering jobs into the ML
pipeline
Metadata repo for discovery and documentation of models
Data Lake
Meta-Data
Data
Set
A
Feature
Engineering
ML
Model
Prediction
Feature
Set
B
21. Extending avro schema governance to other schema types
•Interactive user app facilitates creation of schemas and enforces
compliance with Comcast conventions
-Each schema is reviewed and approved by at least one human being
•Comcast conventions:
-Non-vacuous doc comments required to document every attribute
-All attributes must have default values
-Unnecessary complexity is discouraged (YAGNI principle)
•Library of commonly used subschemas
-Available via app, use is encouraged by reviewers
22. An architecture for integrated data discovery and lineage over
on-prem datasources and public cloud with Apache Atlas
• Throw Back To DataWorks Summit 2017
• Reorganization Yields New Requirements (12/2017)
-New Atlas data types
-Connectors for all metadata sources
-Lineage capture
-End-user annotations via Tags
• Today’s Architecture
• Challenges encountered and met
• Next steps
•Extra Goodies!
23. Comcast Contributions to Apache Atlas OS Community
Jira Ticket Description
ATLAS-2694 Avro schema typedef and support for Avro schema evolution in
Atlas
ATLAS-2696 Typedef extensions for Kafka in Atlas
ATLAS-2708 AWS S3 data lake typedefs for Atlas
ATLAS-2709 RDBMS typedefs for Atlas
ATLAS-2724 UI enhancement for Avro schemas and other JSON-valued
attributes (coming soon)
https://issues.apache.org/jira/browse/ATLAS-XXXX
24. Suggested Reading
Creating A Data-Driven Enterprise in
Media
Comcast Chapter:
How a Focus on
Customer Experience
Led to a Focus on Data
Science
Both can be reached from: https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read
Data Governance and the
Death of Schema on Read