SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Comcast collects, stores, and uses all data in accordance with our privacy
disclosures to users and applicable laws.
An architecture for integrated data discovery and
lineage over on-prem datasources and public cloud
with Apache Atlas
Barbara Eckman, Ph.D.
Principal Architect
Our Group’s Mission
Gather and organize metadata and lineage from diverse sources
to make data universally discoverable, integrate-able and accessible
to empower insight-driven decision making
• Dozens of tenants and stakeholders
• Millions of messages/second captured
• Tens of PB of long term data storage
• Thousands of cores of distributed compute
Data Discovery and Lineage
• Where can I find data about X?
• How is this data structured?
• Who produced it?
• What does it mean?
• How ”mature” is it?
• What attributes in your data match attributes in mine? (e.g., potential join
fields)
• How has the data changed in its journey from ingest to where I’m viewing it?
• Where are the derivatives of my original data to be found?
Challenges of Heterogeneity
• There are many excellent data discovery tools
-OS and commercial
• BUT limited in scope of data set types supported
-Only a certain Big Data ecosystem provider
-Only RDBMS’s, text documents, emails
• We need to add new data set types from multiple providers
• We need to integrate metadata from diverse data sets, both in public cloud
and on-prem
• We need to integrate lineage from diverse loading jobs, both batch and
streaming
Outline
• #TBT** to DataWorks Summit 2017
• Reorganization Yields New Requirements (12/2017)
• Challenges encountered and met
• New Integrative Data Discovery and Lineage Architecture
• Next steps
** “Throw Back Tuesday”
#TBT to DataWorks Summit 2017
Data Platform
Architecture
(DataWorks
Summit 2017)
METADATA
DataWorks Summit 2017: Key Metadata Technologies
Apache Avro
Atlas.apache.orgAvro.apache.org
What are Avro and Atlas?
• A data serialization system
-A JSON-based schema language
-A compact serialized format
• APIs in a bunch of languages
• Benefits:
-Cross-language support for dynamic
data access
-Simple but expressive schema
definition and evolution
-Built-in documentation, defaults
• Data Discovery, Lineage
-Browser UI
-Rest/Java and kafka APIs
-Synchronous and Asynchonous
messaging
-Free-text, typed, graph search
• Integrated Security (Apache Ranger)
• Schema Registry as well as Metadata
Repo
Open Source
Extensible
DataWorks Summit 2017: Atlas Metadata Types
Built-in Atlas Types Custom Atlas Entities Custom Atlas Processes
• DataSet
• Process
• Hive tables
• Kafka topics
• Avro Schemas
-Reciprocally linked to
all other dataset
types
• Extensions to Kafka
topic
-sizing parameters
• AWS S3 Object Store
- S3 Bucket, Pseudo-
Directory
• Lineage Processes
-Avro schema
versioning
-Storing data to S3
objects
• Enrichment Processes
on streaming data
-Re-publishing to
kafka topics
Reorganization Yields New Requirements
Reorganization Yields New Requirements
• Integrate on-prem data sources
-Traditional warehousing, RDBMS, Atlas on Hadoop
• Increase signal/noise ratio
• Atlas Connectors for all metadata sources
-At rest
-Streaming
• Lineage capture
-Streaming and batch
• End-user annotations
-Stakeholders, documentation
New Data Source Type
Generic RDBMS Atlas typedefs
• Instance
• Database (schema)
• Table
• Column
• Index
• Foreign Key
Used for:
• Informatica Metadata Manager, on top of
Teradata EDW
• Oracle
• Others to come
Comments:
• Back pointers to parent class at every level of
hierarchy
• Owned_refs ( aka “composite”) at every level
• Load only whitelisted databases to increase
signal, reduce noise
Connectors for all metadata sources
• One codebase for all sources
• Differ in means of acquiring metadata, but use the same methods to package
data into AtlasEntities for publishing via kafka api
-RDBMS
-Atlas to Atlas (supports different versions)
-Kafka topics
-Avro schemas
-AWS datalake objects
Dataset Lineage capture
• Generic lineage process typedef
-Used for both batch and streaming lineage capture
-Attributes include transforms performed, config parameters
-May be subclassed from to add attributes for individual cases
• Unlike metadata, lineage capture is event-triggered
-In AWS, Cloudwatch event on Glue crawler triggers lambda
function
-In on-prem hadoop, Inotify event on hdfs triggers microservice
• Triggered components assemble requisite info and publish to
Atlas lineage connector
Acknowledgements:
Datalake Team
End-user annotations: new tag typedefs
Stakeholders
• Individuals
-Data Business Owner
-Data Technical Owner
-Data Steward
-Delivery Manager
-Data Architect
• Teams
-Delivery Team
-Support Team
Documentation
• array<map<string, string>>
-Name
-Description
-URL
Acknowledgements:
Portal team
Portal UI
REST/Asynchronous API
Apache Atlas on AWS
Integrative Metadata Store for Search
Duplicatesmetadatafrommetadatasourcessufficienttoenablediscovery
Free-text searchSQL-like search Graph search
RDMBS
Connector
Atlas-to-Atlas
ConnectorLineage Connector
Other
Connectors
DataStream
Connector
New Integrative Data Discovery and Lineage Architecture
Drill downtoindividual metastores’
UIsfordeepexploration
Drill downtoindividual metastores’
UIsfordeepexploration
Metadata
Sources Batch Data
Ingest Jobs
Other
Metadata
repos
ML Pipeline
Models
Feature Eng
Jobs
Streaming
Data
Ingest Jobs
Oracle,
MySQL,
MSSQL, etc
Catalog
AWS S3
Datalake,
Kinesis
Streams
Avro
Schemas
Kafka
Topics
Informatica
MDM
Teradata
On-prem
Atlas for
Hive
Tables
Model
Connector
AWSobject
Connector
On-premPublic Cloud
Challenges encountered
Integrating EDW metadata
• Problem:
-vastly larger size objects than previous, even
though integration restricted to whitelist of
top priority databases
• Solution
-Increase config params in java, kafka, atlas
-Use kafka api
-Build a single AtlasEntity object
-Send a single message for each database
-Optimize query to get data from Informatica
(pivot)
Schema on read legacy in on-
prem datalake
• Problem:
-Many, many duplicate hive
tables
-Unknown lineage/semantic
equivalence
• Solution:
-Data-based ML solution to
find dups and equivalences
(POC in progress)
Next Steps
• End-to-end metadata repository
-Models are first-class objects, captured
with rich metadata (eg input file schema,
feature set schema, model parameters,
etc)
-Feature engineering jobs are first-class
objects, captured with rich metadata (eg
model, data quality threshold, input file
schema, owner)
-Build metadata capture on models and
feature engineering jobs into the ML
pipeline
Metadata repo for discovery and documentation of models
Data Lake
Meta-Data
Data
Set
A
Feature
Engineering
ML
Model
Prediction
Feature
Set
B
Extending avro schema governance to other schema types
•Interactive user app facilitates creation of schemas and enforces
compliance with Comcast conventions
-Each schema is reviewed and approved by at least one human being
•Comcast conventions:
-Non-vacuous doc comments required to document every attribute
-All attributes must have default values
-Unnecessary complexity is discouraged (YAGNI principle)
•Library of commonly used subschemas
-Available via app, use is encouraged by reviewers
An architecture for integrated data discovery and lineage over
on-prem datasources and public cloud with Apache Atlas
• Throw Back To DataWorks Summit 2017
• Reorganization Yields New Requirements (12/2017)
-New Atlas data types
-Connectors for all metadata sources
-Lineage capture
-End-user annotations via Tags
• Today’s Architecture
• Challenges encountered and met
• Next steps
•Extra Goodies!
Comcast Contributions to Apache Atlas OS Community
Jira Ticket Description
ATLAS-2694 Avro schema typedef and support for Avro schema evolution in
Atlas
ATLAS-2696 Typedef extensions for Kafka in Atlas
ATLAS-2708 AWS S3 data lake typedefs for Atlas
ATLAS-2709 RDBMS typedefs for Atlas
ATLAS-2724 UI enhancement for Avro schemas and other JSON-valued
attributes (coming soon)
https://issues.apache.org/jira/browse/ATLAS-XXXX
Suggested Reading
Creating A Data-Driven Enterprise in
Media
Comcast Chapter:
How a Focus on
Customer Experience
Led to a Focus on Data
Science
Both can be reached from: https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read
Data Governance and the
Death of Schema on Read
My collaborators
SonalRob
Teja Sean
Vadim Vaks
Principal Solutions
Architect
Gabe
Barbara_Eckman@Cable.Comcast.com

Weitere ähnliche Inhalte

Was ist angesagt?

Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
Amazon Web Services
 

Was ist angesagt? (20)

Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
 
Heterogenous Migration with DMS & SCT
Heterogenous Migration with DMS & SCTHeterogenous Migration with DMS & SCT
Heterogenous Migration with DMS & SCT
 
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for Analytics
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...
How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...
How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...
 
AWS Elastic Beanstalk
AWS Elastic BeanstalkAWS Elastic Beanstalk
AWS Elastic Beanstalk
 
IoT & Azure (EventHub)
IoT & Azure (EventHub)IoT & Azure (EventHub)
IoT & Azure (EventHub)
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Data Ingestion in Big Data and IoT platforms
Data Ingestion in Big Data and IoT platformsData Ingestion in Big Data and IoT platforms
Data Ingestion in Big Data and IoT platforms
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 

Ähnlich wie An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas

AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2
Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
DataWorks Summit
 

Ähnlich wie An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas (20)

A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Apache drill
Apache drillApache drill
Apache drill
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 

Mehr von DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas

  • 1. Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws. An architecture for integrated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas Barbara Eckman, Ph.D. Principal Architect
  • 2. Our Group’s Mission Gather and organize metadata and lineage from diverse sources to make data universally discoverable, integrate-able and accessible to empower insight-driven decision making • Dozens of tenants and stakeholders • Millions of messages/second captured • Tens of PB of long term data storage • Thousands of cores of distributed compute
  • 3. Data Discovery and Lineage • Where can I find data about X? • How is this data structured? • Who produced it? • What does it mean? • How ”mature” is it? • What attributes in your data match attributes in mine? (e.g., potential join fields) • How has the data changed in its journey from ingest to where I’m viewing it? • Where are the derivatives of my original data to be found?
  • 4. Challenges of Heterogeneity • There are many excellent data discovery tools -OS and commercial • BUT limited in scope of data set types supported -Only a certain Big Data ecosystem provider -Only RDBMS’s, text documents, emails • We need to add new data set types from multiple providers • We need to integrate metadata from diverse data sets, both in public cloud and on-prem • We need to integrate lineage from diverse loading jobs, both batch and streaming
  • 5. Outline • #TBT** to DataWorks Summit 2017 • Reorganization Yields New Requirements (12/2017) • Challenges encountered and met • New Integrative Data Discovery and Lineage Architecture • Next steps ** “Throw Back Tuesday”
  • 6. #TBT to DataWorks Summit 2017
  • 8. DataWorks Summit 2017: Key Metadata Technologies Apache Avro Atlas.apache.orgAvro.apache.org
  • 9. What are Avro and Atlas? • A data serialization system -A JSON-based schema language -A compact serialized format • APIs in a bunch of languages • Benefits: -Cross-language support for dynamic data access -Simple but expressive schema definition and evolution -Built-in documentation, defaults • Data Discovery, Lineage -Browser UI -Rest/Java and kafka APIs -Synchronous and Asynchonous messaging -Free-text, typed, graph search • Integrated Security (Apache Ranger) • Schema Registry as well as Metadata Repo Open Source Extensible
  • 10. DataWorks Summit 2017: Atlas Metadata Types Built-in Atlas Types Custom Atlas Entities Custom Atlas Processes • DataSet • Process • Hive tables • Kafka topics • Avro Schemas -Reciprocally linked to all other dataset types • Extensions to Kafka topic -sizing parameters • AWS S3 Object Store - S3 Bucket, Pseudo- Directory • Lineage Processes -Avro schema versioning -Storing data to S3 objects • Enrichment Processes on streaming data -Re-publishing to kafka topics
  • 12. Reorganization Yields New Requirements • Integrate on-prem data sources -Traditional warehousing, RDBMS, Atlas on Hadoop • Increase signal/noise ratio • Atlas Connectors for all metadata sources -At rest -Streaming • Lineage capture -Streaming and batch • End-user annotations -Stakeholders, documentation
  • 13. New Data Source Type Generic RDBMS Atlas typedefs • Instance • Database (schema) • Table • Column • Index • Foreign Key Used for: • Informatica Metadata Manager, on top of Teradata EDW • Oracle • Others to come Comments: • Back pointers to parent class at every level of hierarchy • Owned_refs ( aka “composite”) at every level • Load only whitelisted databases to increase signal, reduce noise
  • 14. Connectors for all metadata sources • One codebase for all sources • Differ in means of acquiring metadata, but use the same methods to package data into AtlasEntities for publishing via kafka api -RDBMS -Atlas to Atlas (supports different versions) -Kafka topics -Avro schemas -AWS datalake objects
  • 15. Dataset Lineage capture • Generic lineage process typedef -Used for both batch and streaming lineage capture -Attributes include transforms performed, config parameters -May be subclassed from to add attributes for individual cases • Unlike metadata, lineage capture is event-triggered -In AWS, Cloudwatch event on Glue crawler triggers lambda function -In on-prem hadoop, Inotify event on hdfs triggers microservice • Triggered components assemble requisite info and publish to Atlas lineage connector Acknowledgements: Datalake Team
  • 16. End-user annotations: new tag typedefs Stakeholders • Individuals -Data Business Owner -Data Technical Owner -Data Steward -Delivery Manager -Data Architect • Teams -Delivery Team -Support Team Documentation • array<map<string, string>> -Name -Description -URL Acknowledgements: Portal team
  • 17. Portal UI REST/Asynchronous API Apache Atlas on AWS Integrative Metadata Store for Search Duplicatesmetadatafrommetadatasourcessufficienttoenablediscovery Free-text searchSQL-like search Graph search RDMBS Connector Atlas-to-Atlas ConnectorLineage Connector Other Connectors DataStream Connector New Integrative Data Discovery and Lineage Architecture Drill downtoindividual metastores’ UIsfordeepexploration Drill downtoindividual metastores’ UIsfordeepexploration Metadata Sources Batch Data Ingest Jobs Other Metadata repos ML Pipeline Models Feature Eng Jobs Streaming Data Ingest Jobs Oracle, MySQL, MSSQL, etc Catalog AWS S3 Datalake, Kinesis Streams Avro Schemas Kafka Topics Informatica MDM Teradata On-prem Atlas for Hive Tables Model Connector AWSobject Connector On-premPublic Cloud
  • 18. Challenges encountered Integrating EDW metadata • Problem: -vastly larger size objects than previous, even though integration restricted to whitelist of top priority databases • Solution -Increase config params in java, kafka, atlas -Use kafka api -Build a single AtlasEntity object -Send a single message for each database -Optimize query to get data from Informatica (pivot) Schema on read legacy in on- prem datalake • Problem: -Many, many duplicate hive tables -Unknown lineage/semantic equivalence • Solution: -Data-based ML solution to find dups and equivalences (POC in progress)
  • 20. • End-to-end metadata repository -Models are first-class objects, captured with rich metadata (eg input file schema, feature set schema, model parameters, etc) -Feature engineering jobs are first-class objects, captured with rich metadata (eg model, data quality threshold, input file schema, owner) -Build metadata capture on models and feature engineering jobs into the ML pipeline Metadata repo for discovery and documentation of models Data Lake Meta-Data Data Set A Feature Engineering ML Model Prediction Feature Set B
  • 21. Extending avro schema governance to other schema types •Interactive user app facilitates creation of schemas and enforces compliance with Comcast conventions -Each schema is reviewed and approved by at least one human being •Comcast conventions: -Non-vacuous doc comments required to document every attribute -All attributes must have default values -Unnecessary complexity is discouraged (YAGNI principle) •Library of commonly used subschemas -Available via app, use is encouraged by reviewers
  • 22. An architecture for integrated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas • Throw Back To DataWorks Summit 2017 • Reorganization Yields New Requirements (12/2017) -New Atlas data types -Connectors for all metadata sources -Lineage capture -End-user annotations via Tags • Today’s Architecture • Challenges encountered and met • Next steps •Extra Goodies!
  • 23. Comcast Contributions to Apache Atlas OS Community Jira Ticket Description ATLAS-2694 Avro schema typedef and support for Avro schema evolution in Atlas ATLAS-2696 Typedef extensions for Kafka in Atlas ATLAS-2708 AWS S3 data lake typedefs for Atlas ATLAS-2709 RDBMS typedefs for Atlas ATLAS-2724 UI enhancement for Avro schemas and other JSON-valued attributes (coming soon) https://issues.apache.org/jira/browse/ATLAS-XXXX
  • 24. Suggested Reading Creating A Data-Driven Enterprise in Media Comcast Chapter: How a Focus on Customer Experience Led to a Focus on Data Science Both can be reached from: https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read Data Governance and the Death of Schema on Read
  • 25. My collaborators SonalRob Teja Sean Vadim Vaks Principal Solutions Architect Gabe