Mandy Chessell CBE FREng CEng FBCS
Distinguished Engineer, Master Inventor
Analytics Chief Data Office
 mandy_chessell@uk.ibm.com
18th April 2018
Good analytics needs good data and
that needs good metadata
Apache Atlas as an open innovation platform for metadata management and governance
Agenda
 Why is metadata so important today?
 What is the challenge?
 Building an open ecosystem
 Apache Atlas and the specifics
 ODPi Data Governance PMC
 Progress report and call to action
The perils of reusing data …
[Diagram: Open Data Site, Employee Directory, Data Lake and Callie Quartile (Data Scientist), with steps numbered 1, 2 and 3.]
Callie Quartile uses (1) open data from the local government registrar and (2) data from the employee directory to (3) create a birthday card service for the company.
The perils of reusing data …
[Diagram: Open Data Site, Employee Directory, Data Lake and Callie Quartile (Data Scientist), with steps numbered 1, 2 and 3. Speech bubbles: "Happy Birthday" and "But it's not my birthday".]
Unfortunately, the obvious date in the registrar record was the registration-of-birth date, not the date of birth. The date of birth was not published in the open data.
Callie needed better information about the open data to realise she had the wrong data.
Metadata should bring as much information about the data sets to Callie’s data science as is known collectively by the organization.
Employee Directory
Columns: Name, Band, Job Title
Data Set Name: Employee Directory
Description: Core attributes describing all employees of OCO pharmaceuticals created from a daily extract from Kenexa.
Owner: Penny Payer
Status:
Last accessed: 6th May 2016
Records: 3488
Last Update: 1st May 2016
Contents: Structure …, Contents …, Lineage …
Column: Band
Classification Ranges:
Confidentiality: Public, Confidential, Sensitive
Confidence: Authoritative
Retention: Indefinitely
Tabs: Description, Characteristics, Lineage
Description: Position reference number for non-exempt employees. The value ranges from 01 to 06 where 01 is the most senior and 06 is the most junior.
Type: String
Classification: Public
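A catalog entry like the one above can be sketched as a small data structure. This is a minimal illustration, not the Apache Atlas type system: the class and method names (`ColumnEntry`, `DataSetEntry`, `describe`) are hypothetical, and only the values come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnEntry:
    """Metadata for one column (hypothetical structure)."""
    name: str
    description: str
    data_type: str
    classification: str


@dataclass
class DataSetEntry:
    """Metadata for one data set (hypothetical structure)."""
    name: str
    description: str
    owner: str
    records: int
    last_update: str
    columns: List[ColumnEntry] = field(default_factory=list)

    def describe(self, column_name: str) -> str:
        """Return the description of a column, or a notice that none is recorded."""
        for column in self.columns:
            if column.name == column_name:
                return column.description
        return f"No metadata recorded for column '{column_name}'"


# Values copied from the slide; the structure itself is an assumption.
employee_directory = DataSetEntry(
    name="Employee Directory",
    description="Core attributes describing all employees of OCO pharmaceuticals "
                "created from a daily extract from Kenexa.",
    owner="Penny Payer",
    records=3488,
    last_update="1st May 2016",
    columns=[
        ColumnEntry(
            name="Band",
            description="Position reference number for non-exempt employees. "
                        "The value ranges from 01 to 06 where 01 is the most senior "
                        "and 06 is the most junior.",
            data_type="String",
            classification="Public",
        )
    ],
)

print(employee_directory.describe("Band"))
```

With a description like this attached to the registrar data, Callie could have seen that the date column held the registration date rather than the date of birth.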
Different personas need different services
Callie Quartile, Data Scientist
• Find data
• Understand data
• Manage analytics models
Jules Keeper, Chief Data Officer
• Build data strategy
• Define governance program
• Monitor progress
Different personas need different services
Faith Broker, HR and Privacy Officer
• Locate personal data
• Ensure protection of personal data
• Understand employee needs
Gary Geeke, IT
• Maintain “safe” IT Infrastructure
• Build and deploy “good” APIs and services
• Locate and resolve issues fast
Different personas need different services
Tanya Tidie, Clinical Trials Administrator
• Maintain accurate patient records
• Catalog clinical trials data
• Demonstrate good data management practices
Ivor Padlock, Chief Security Officer
• Understand risks to organization
• Set up protection
• Monitor for suspicious activity
Scope of metadata for a data driven organization
[Diagram: the areas of metadata are Glossary, Collaboration, Governance, Models and Reference Data, Metadata Discovery, Lineage, Data Assets, and Base Types, Systems and Infrastructure.]
Curation
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
[Speech bubbles: "I know" and "I wonder what this means"]
Scared to share
Faith Broker
Business Team
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
Faith Broker has been doing some simple analysis
on the HR data of the company. She wants to share
this data with Callie Quartile to do some detailed
work. However, she does not want Callie to see the
sensitive personal information in the record.
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 XXXXX XXX 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 XXXXX XXX 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 XXXXX XXX 27 Code St Harlem NY 1 3
Callie Quartile
Data Scientist
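The masking shown above can be sketched as follows. This is a minimal illustration under stated assumptions: the field names and the choice of which fields count as sensitive are hypothetical, since the slide only shows that two values in each record are replaced by X characters.

```python
from typing import Dict, Set


def mask_value(value: str) -> str:
    """Replace every non-space character with 'X', keeping the layout."""
    return "".join(" " if ch == " " else "X" for ch in value)


def mask_record(record: Dict[str, str], sensitive_fields: Set[str]) -> Dict[str, str]:
    """Return a copy of the record with the sensitive fields masked."""
    return {
        name: mask_value(value) if name in sensitive_fields else value
        for name, value in record.items()
    }


# Hypothetical field names; the values come from the sample record on the slide.
record = {
    "name": "Callie Quartile",
    "job_title": "Data Scientist",
    "salary": "56944",
    "pay_code": "045",
    "address": "27 Code St Harlem NY",
}

shared = mask_record(record, sensitive_fields={"salary", "pay_code"})
print(shared)
```

The original record is left unchanged, so Faith keeps the full data while Callie receives only the masked copy.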
Using glossary function for semantic processing
[Diagram: structural metadata for a data store (EMPLOYEE RECORD with columns EMPNAME, EMPNO, JOBCODE and SALARY) and business metadata (the glossary terms Employee, Work Location, Annual Salary, Job Title, Employee Id, Employee Name, Hourly Pay Rate, Manager, Compensation Plan and Sensitive Data, connected by HAS-A and IS-A relationships).]
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
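One way to read the glossary diagram is that a column inherits its handling rules from the glossary term it is assigned to. The sketch below illustrates that idea only. The column-to-term assignments and the specific IS-A edges are assumptions, because the extracted slide text does not preserve which terms each relationship connects.

```python
from typing import Dict, List, Set

# Column -> glossary term assignments (assumed for illustration).
SEMANTIC_ASSIGNMENT: Dict[str, str] = {
    "EMPNAME": "Employee Name",
    "EMPNO": "Employee Id",
    "JOBCODE": "Job Title",
    "SALARY": "Annual Salary",
}

# Term -> broader terms via IS-A (assumed edges).
IS_A: Dict[str, List[str]] = {
    "Annual Salary": ["Compensation Plan"],
    "Hourly Pay Rate": ["Compensation Plan"],
    "Compensation Plan": ["Sensitive Data"],
}


def broader_terms(term: str) -> Set[str]:
    """Collect every term reachable from `term` by following IS-A links."""
    found: Set[str] = set()
    pending = [term]
    while pending:
        current = pending.pop()
        for parent in IS_A.get(current, []):
            if parent not in found:
                found.add(parent)
                pending.append(parent)
    return found


def is_sensitive(column: str) -> bool:
    """A column is sensitive if its assigned term IS-A 'Sensitive Data'."""
    term = SEMANTIC_ASSIGNMENT.get(column)
    return term is not None and "Sensitive Data" in broader_terms(term)


sensitive_columns = [c for c in SEMANTIC_ASSIGNMENT if is_sensitive(c)]
print(sensitive_columns)
```

Because the rule is attached to the glossary term rather than to the column, any other data store whose columns are assigned to the same terms would be treated the same way without extra configuration.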
Why do we need metadata?
 Metadata enables data to be used outside of the application that created it.
• Analytics and decision making
• New business applications
• Reporting and compliance
 Metadata describes the format and content of data allowing people to judge which data set
to use for a new project
• Structure
• Meaning
• Origin
• Valid values and quality
• Usage and ownership
• Regulations and classifications that apply
• <more>
 Metadata describes the business context and classification of data allowing automated
governance processes to operate.
Today’s reality
 Many data platforms do not have metadata support
 Proprietary tools support a range of data sources and governance actions
• No single vendor supports everything you need, yet each assumes all tools come from its suite
• Each tool starts “empty” requiring effort to populate metadata
• Each tool operates as if it is the only tool
• No integration/interoperability of metadata repositories from different vendors
 Expensive efforts to create an enterprise data catalogue
Manual metadata capture
Automatic metadata capture
What needs to change?
Open and Unified Metadata
A new manifesto for metadata and governance
 Metadata management must be automated
 Metadata management must become ubiquitous
 Metadata must become open and remotely accessible
 Metadata should be used to drive the governance of data
The discovery, maintenance and use of metadata have to be an integral part of all tools that access, change and move information.
Open metadata management ecosystem
 Peer-to-peer network of repositories
 Metadata stored and managed close to its source
 Each repository/tool brings unique value.
 Open, extensible metadata structures for metadata exchange and federation, extending coverage of the types of resources that need to be described.
 Open source infrastructure sharing cost of development and maintenance between vendors
 Support for open standards where available
[Diagram: peer repositories holding Collaboration Space Metadata, Analytics Platform Metadata, Application Metadata, Cloud SaaS platform Metadata and Hadoop Platform Metadata.]
Apache Atlas
http://atlas.apache.org/
 Apache Atlas has just graduated to become a top-level project.
 It began as an incubator open source project on 5th May 2015 to deliver an
open source governance capability focused primarily on the Hadoop platform.
 Apache Atlas is designed to localize operational governance to the operating
data platform such as Hadoop.
 At its heart is a type-agnostic metadata store that can be accessed through RESTful interfaces.
We see Apache Atlas as the reference implementation for open metadata and governance, for vendors to pick up and use, or to test their integration against.
Being open source allows all vendors to enrich and enhance the standard.
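As an illustration of the RESTful access mentioned above, the sketch below builds a request against the Atlas v2 basic search endpoint (`/api/atlas/v2/search/basic`). The host, port, credentials, type name and query text are placeholder assumptions for a local test server, and the helper function name is hypothetical. The request is only constructed here, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_basic_search_request(base_url, type_name, query, user, password, limit=10):
    """Build (but do not send) a GET request for the Atlas v2 basic search."""
    params = urllib.parse.urlencode(
        {"typeName": type_name, "query": query, "limit": limit}
    )
    url = f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{params}"
    # HTTP basic authentication; a real deployment may use another scheme.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


# Placeholder values for a local test server (assumptions, not from the slides).
request = build_basic_search_request(
    "http://localhost:21000", "hive_table", "employee", "admin", "admin"
)
print(request.full_url)

# Sending the request needs a running Atlas server:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```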
Apache Atlas today
Updates to Apache Atlas
 Automation
• Capture of metadata from data platforms, data movement engines and data protection engines.
• Exception management and stewardship
 Business Value
• Specialized services for key data roles such as CDO, Data Scientist, Developer, DevOps Operator, Asset Owner, Applications
 Connectivity
• Metadata Highway offering open metadata exchange, linking and federation between heterogeneous metadata repositories.
Taking guidance from existing metadata standards
 Well-defined
 Complementary
 Integrating
 Decoupled
https://www.w3.org/TR/vocab-dcat/
Instance representations in the graph
Open metadata meta-types, types and instances
[Diagram: UML-style type model]
«entity» Referenceable
«entity» Asset
«entity» DataStore
«entity» DataSet
«entity» GlossaryTerm
Entity attributes shown: createTime : date, modifiedTime : date
«relationship» DataContentForDataSet, with ends dataContent (*) and supportedDataSets (*)
«relationship» SemanticAssignment, with ends assignedElements (*) and meaning (*), and attributes description : string, expression : string, status : TermAssignmentStatus, confidence : int, steward : string, source : string
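The split between types and instances in the model above can be sketched in a few lines. This is an illustrative sketch, not the Atlas or open metadata implementation: the Python class names are hypothetical, and the inheritance chain (DataSet and GlossaryTerm deriving from Referenceable via Asset) and the entity types at each relationship end are assumptions, since the extracted slide text does not show the connecting lines.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class EntityType:
    """An entity type definition; supertype links are assumed for illustration."""
    name: str
    supertype: Optional["EntityType"] = None

    def is_a(self, other: "EntityType") -> bool:
        """True if this type is `other` or inherits from it."""
        current: Optional["EntityType"] = self
        while current is not None:
            if current.name == other.name:
                return True
            current = current.supertype
        return False


@dataclass(frozen=True)
class RelationshipType:
    """A relationship type definition with two named, typed ends."""
    name: str
    end1_name: str
    end1_type: EntityType
    end2_name: str
    end2_type: EntityType


@dataclass
class Entity:
    """An instance of an entity type."""
    type: EntityType
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Relationship:
    """An instance of a relationship type linking two entity instances."""
    type: RelationshipType
    end1: Entity
    end2: Entity
    properties: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject instances whose ends do not match the type definition.
        if not self.end1.type.is_a(self.type.end1_type):
            raise TypeError(f"{self.type.end1_name} must be a {self.type.end1_type.name}")
        if not self.end2.type.is_a(self.type.end2_type):
            raise TypeError(f"{self.type.end2_name} must be a {self.type.end2_type.name}")


REFERENCEABLE = EntityType("Referenceable")
ASSET = EntityType("Asset", REFERENCEABLE)
DATA_SET = EntityType("DataSet", ASSET)
GLOSSARY_TERM = EntityType("GlossaryTerm", REFERENCEABLE)

SEMANTIC_ASSIGNMENT = RelationshipType(
    "SemanticAssignment", "assignedElements", REFERENCEABLE, "meaning", GLOSSARY_TERM
)

employee_directory = Entity(DATA_SET, {"name": "Employee Directory"})
employee_term = Entity(GLOSSARY_TERM, {"name": "Employee"})
link = Relationship(
    SEMANTIC_ASSIGNMENT, employee_directory, employee_term, {"confidence": 100}
)
print(link.type.name, "->", link.end2.properties["name"])
```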
Open metadata type model summary
[Diagram: the type model is divided into eight areas numbered 0 to 7: Glossary, Collaboration, Governance, Models and Reference Data, Metadata Discovery, Lineage, Data Assets, and Base Types, Systems and Infrastructure.]
Open metadata type model summary
[Diagram: detailed contents of areas 0 to 7. Labels shown:]
• Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
• Governance Actions and Processes
• Business Objects and Relationships, Taxonomies and Ontologies
• Business Attributes
• Organization
• Teaming Metadata (people profiles, communities, projects, notebooks, …)
• Models and Schemas
• Physical Asset Descriptions (Data stores, APIs, models and components)
• Asset Collections (Sets, Typed Sets, Type Organized Sets)
• Information Views
• Rights Management
• Reference Data
• Feedback Metadata (tags, comments, ratings, …)
• Classification Schemes
• Classification
• Strategy
• Subject Area Definition
• Campaigns and Projects
• Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
• Information Process Instrumentation (design lineage)
• O-DEF
• O-BDL
• Connectors
• Basic Types, Infrastructure and Systems
Connecting labels: Augmentation, Mapping, Implementation, Rollout, Instrument, Association, Access
More detail here …
https://cwiki.apache.org/confluence/display/ATLAS/Building+out+the+Open+Metadata+Typesystem
Metadata and governance digital platform
[Diagram: Open Metadata and Governance connecting a Reporting Platform, ETL Platform, Analytics Platform, Virtualization Platform, Governance Platform and Data Platform.]
Types of tools that may integrate with an open metadata repository
 BI and visualization tools
• locating data assets and related information about them; defining reports and publishing their metadata; viewing lineage
 Data Science tool
• wanting to find out about data assets available and manage user lineage of transformations and analytics models; may also manage metadata for analytics models
 API developer tool
• wanting to understand proper data structures and data meaning to use for APIs, plus additional governance requirements that need to be implemented by the API because of the data it exchanges.
 Counter-fraud tools
• ad hoc analysis of logs and error reports, setting up rules
 Curator/owner tool
• for managing the curation of assets, providing access, verifying use of assets, reviewing discovery results and exceptions, approving change requests.
 Glossary tool
• for subject matter experts and information architects to share expertise about a particular subject area; may also define structures and related reference data
 Enterprise architect tools
• defining the data landscape and related systems.
 DevOps tools
• conformance to policies and standards in development
• metadata capture at deployment
• validation of deployment platform requirements
 Data integration engine
• locating appropriate data and component assets, log design lineage, log operational lineage
 Information Virtualisation tools
• locate appropriate data assets, build views and publish them, add design lineage, log operational lineage
 Governance tools
• setting up and monitoring governance program, data quality, …
 Stewardship tools
• reviewing assigned exceptions, making data changes and requesting approval
 Information security tools
• setting up data access policies and enforcement
 Auditor tools
• view compliance reports and validate policies and policy implementations
Open Metadata Access Services
Project Management
Community Profile
Asset Catalog
Stewardship Action
Information View
Governance Program
Information Process
Subject Area
Connected Asset Discovery
Governance Engine
Information Protection
Developer
Data Platform
Asset Owner
Information Landscape
Data Science
DevOps
Asset Consumer
Information Infrastructure
OMAS service instance
Both call API and notifications
Inside the server
[Diagram: Open Metadata and Governance (OMAG) Server]
• Open Metadata Access Services (OMAS)
• Open Metadata Repository Services (OMRS)
• Connectors: OMRS Topic Connector, OMRS Cohort Registry Store Connector, OMRS Archive Connector, OMRS AuditLog Connector, OMRS Event Mapper Connector, OMRS Repository Connector
• Interfaces: OMAS REST APIs and Topics, OMAG Administration REST APIs, OMRS Repository REST APIs
• Server Configuration
Inside the server
[Diagram: Open Metadata and Governance (OMAG) Server, showing the services inside it]
• Open Metadata Access Services (OMAS)
• Administration
• Enterprise Repository Services
• Local Repository Services
• Cohort Services
• Connectors: OMRS Topic Connector, OMRS Cohort Registry Store Connector, OMRS Archive Connector, OMRS AuditLog Connector, OMRS Event Mapper Connector, OMRS Repository Connector
• Interfaces: OMAS REST APIs and Topics, OMAG Administration REST APIs, OMRS Repository REST APIs
• Server Configuration
Integration patterns
https://cwiki.apache.org/confluence/display/ATLAS/Integrating+into+the+Open+Metadata+and+Governance+Ecosystem
IBM Information Governance Catalog
Apache Atlas
Caller Pattern
 A metadata tool can access the consumer-specific APIs to work with metadata.
 The Access Layer handles the calls to metadata repositories connected to the metadata highway.
Native Pattern
 Native implementation of the open metadata and governance APIs
 Apache Atlas is a native implementation of the open metadata and governance APIs.
Adapter Pattern
 Simple components plug into a repository proxy to connect in an existing metadata repository.
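The adapter pattern can be illustrated with a small sketch. All names here (`RepositoryConnector`, `LegacyCatalog`, `LegacyCatalogConnector`, `RepositoryProxy`) are hypothetical stand-ins rather than the actual open metadata interfaces. The point is only that the proxy talks to one common interface, and a small plug-in component translates that interface into calls on the existing repository.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class RepositoryConnector(ABC):
    """Hypothetical plug-in interface that the repository proxy calls."""

    @abstractmethod
    def get_entity(self, guid: str) -> Optional[Dict[str, str]]:
        """Return the entity with this identifier, or None if unknown."""


class LegacyCatalog:
    """Stand-in for an existing metadata repository with its own API."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[str, str]] = {
            "42": ("Employee Directory", "Penny Payer")
        }

    def lookup(self, key: str) -> Optional[Tuple[str, str]]:
        return self._rows.get(key)


class LegacyCatalogConnector(RepositoryConnector):
    """Adapter: translates the common interface into legacy catalog calls."""

    def __init__(self, catalog: LegacyCatalog) -> None:
        self._catalog = catalog

    def get_entity(self, guid: str) -> Optional[Dict[str, str]]:
        row = self._catalog.lookup(guid)
        if row is None:
            return None
        name, owner = row
        return {"guid": guid, "name": name, "owner": owner}


class RepositoryProxy:
    """Exposes the common interface and delegates to whichever connector is plugged in."""

    def __init__(self, connector: RepositoryConnector) -> None:
        self._connector = connector

    def get_entity(self, guid: str) -> Optional[Dict[str, str]]:
        return self._connector.get_entity(guid)


proxy = RepositoryProxy(LegacyCatalogConnector(LegacyCatalog()))
print(proxy.get_entity("42"))
```

Supporting another existing repository would then mean writing another connector, leaving the proxy and its callers unchanged.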
Plug-in Pattern
 Open Connector Framework (OCF)
• Connectors to data, analytics etc.
 Open Discovery Framework (ODF)
• Metadata discovery services
 Governance Action Framework (GAF)
• Stewardship services for triage and remediation of exceptions
IBM Unified Governance
Simple cohort
Cohort A
Chief Data Office
Data Lake
Systems of Record
Multiple Cohorts
[Diagram: Cohort A and Cohort B. Labelled members: Chief Data Office, Data Lake, Systems of Record, Mobile Apps, Marketing, and a second Data Lake and Systems of Record.]
First server
Establishing contact
Federated queries
Caching metadata for availability and performance
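The cohort slides above are diagrams only, so the following is a loose sketch of the general idea rather than the actual cohort protocol. The assumption is that repositories register with a cohort, a query is sent to every member, and the last answer from each member is cached so that it can still be served if that member is unreachable. The class names and asset names are hypothetical; the repository names come from the "Simple cohort" slide.

```python
from typing import Dict, Iterable, List, Tuple


class Repository:
    """A metadata repository that can be searched by name (hypothetical)."""

    def __init__(self, name: str, assets: Iterable[str]) -> None:
        self.name = name
        self.assets = list(assets)
        self.available = True

    def find(self, text: str) -> List[str]:
        if not self.available:
            raise ConnectionError(f"{self.name} is not reachable")
        return [a for a in self.assets if text.lower() in a.lower()]


class Cohort:
    """A group of registered repositories that are queried together."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._members: Dict[str, Repository] = {}
        self._cache: Dict[Tuple[str, str], List[str]] = {}

    def register(self, repository: Repository) -> None:
        self._members[repository.name] = repository

    def federated_find(self, text: str) -> Dict[str, List[str]]:
        """Query every member; fall back to cached results if one is down."""
        results: Dict[str, List[str]] = {}
        for name, repository in self._members.items():
            try:
                found = repository.find(text)
                self._cache[(name, text)] = found
            except ConnectionError:
                found = self._cache.get((name, text), [])
            results[name] = found
        return results


cohort_a = Cohort("Cohort A")
data_lake = Repository("Data Lake", ["Employee Directory", "Clinical Trials"])
systems_of_record = Repository("Systems of Record", ["Employee Payroll"])
cohort_a.register(data_lake)
cohort_a.register(systems_of_record)

first = cohort_a.federated_find("employee")
data_lake.available = False  # simulate an outage
second = cohort_a.federated_find("employee")
print(first == second)
```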
ODPi - co-creation with practitioners
• Compliance assistance and certification
for vendors
• Subject matter experts sharing best
practices and co-creating content packs
https://github.com/odpi/data-governance
How ODPi Helps
Data Governance Professionals
• Your governance program is based on established practices and definitions
• Allows a broader range of tools in your organization
• Automated governance processes protect and manage your data
Vendors
• Your metadata offerings will deliver value faster as they tap into metadata collected by other vendors’ tools.
• ODPi packages extend your metadata systems’ and tools’ capabilities
• Conformance tests minimize your effort in being compliant with key standards and regulations.
• Customers have increased confidence in your tools and services due to ODPi certification.
Summary
 Big data is creating new opportunities and requirements that need new types of systems. Data Lakes are just one part of this story.
 Metadata is critical to make the best use of this data for the widest range of
scenarios.
 Most organizations use tools and platforms from many vendors.
 Open standards have had limited take-up
 Can we use open source to create a digital platform that allows vendors to take
advantage of metadata from a broader ecosystem?
• Open Metadata and Governance defines the standards
• Apache Atlas provides the reference implementation
• ODPi helps to build the ecosystem
Call to action – how can you help?
 Direct contribution to the Apache Atlas and/or ODPi Data Governance projects.
• There are many features that still need to be developed.
 Encouraging your vendors/partners and projects internal to your organization
to embrace the Open Metadata and Governance standards to grow the
ecosystem of data and processing that is assured by metadata and governance
capability.
https://cwiki.apache.org/confluence/display/ATLAS/Atlas+Projects
Questions?

Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Inside open metadata—the deep dive

  • 1. Mandy Chessell CBE FREng CEng FBCS; Distinguished Engineer, Master Inventor; Analytics Chief Data Office; mandy_chessell@uk.ibm.com; 18th April 2018. Good analytics needs good data and that needs good metadata.
  • 2. Apache Atlas as an open innovation platform for metadata management and governance. Agenda:
      • Why is metadata so important today?
      • What is the challenge?
      • Building an open ecosystem
      • Apache Atlas and the specifics
      • ODPi Data Governance PMC
      • Progress report and call to action
  • 3. The perils of reusing data … (Diagram: Open Data Site, Employee Directory, Data Lake.) Callie Quartile, Data Scientist, uses (1) open data from the local government registrar and (2) data from the employee directory to (3) create a birthday card service for the company.
  • 4. The perils of reusing data … "Happy Birthday." "But it's not my birthday." Unfortunately, the obvious date in the registrar record was the date the birth was registered, not the date of birth. Date of birth was not published in the open data. Callie needed better information about the open data to realise she had the wrong data.
  • 5. Metadata should bring as much information about the data sets to Callie’s data science as is known collectively by the organization. Example catalog entry:
      • Data Set Name: Employee Directory (columns: Name, Band, Job Title)
      • Description: Core attributes describing all employees of OCO pharmaceuticals created from a daily extract from Kenexa.
      • Owner: Penny Payer
      • Status:
      • Last accessed: 6th May 2016
      • Records: 3488
      • Last Update: 1st May 2016
      • Contents: Structure …, Contents …, Lineage …
      • Classification Ranges: Confidentiality: Public, Confidential, Sensitive; Confidence: Authoritative; Retention: Indefinitely
      • Column: Band (tabs: Characteristics, Lineage, Description)
          • Description: Position reference number for non-exempt employees. The value ranges from 01 to 06 where 01 is the most senior and 06 is the most junior.
          • Type: String
          • Classification: Public
  • 6. Different personas need different services
      • Callie Quartile, Data Scientist: find data; understand data; manage analytics models
      • Jules Keeper, Chief Data Officer: build data strategy; define governance program; monitor progress
  • 7. Different personas need different services
      • Faith Broker, HR and Privacy Officer: locate personal data; ensure protection of personal data; understand employee needs
      • Gary Geeke, IT: maintain “safe” IT infrastructure; build and deploy “good” APIs and services; locate and resolve issues fast
  • 8. Different personas need different services
      • Tanya Tidie, Clinical Trials Administrator: maintain accurate patient records; catalog clinical trials data; demonstrate good data management practices
      • Ivor Padlock, Chief Security Officer: understand risks to organization; set up protection; monitor for suspicious activity
  • 9. Scope of metadata for a data driven organization: Glossary; Collaboration; Governance; Models and Reference Data; Metadata Discovery; Lineage; Data Assets; Base Types, Systems and Infrastructure
  • 10. Curation. "I wonder what this means." "I know." Sample records:
      • 00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
      • 00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
      • 00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
  • 11. Scared to share. Faith Broker (Business Team) has been doing some simple analysis on the HR data of the company. She wants to share this data with Callie Quartile (Data Scientist) so that Callie can do some detailed work. However, she does not want Callie to see the sensitive personal information in the record.
      • Records as Faith sees them:
          • 00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
          • 00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
          • 00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
      • Records as shared with Callie, with the sensitive values masked:
          • 00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 XXXXX XXX 27 Code St Harlem NY 1 3
          • 00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 XXXXX XXX 27 Code St Harlem NY 1 3
          • 00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 XXXXX XXX 27 Code St Harlem NY 1 3
  • 12. Using glossary function for semantic processing
      • Structural metadata for a data store: EMPLOYEE RECORD with columns EMPNAME, EMPNO, JOBCODE, SALARY
      • Business metadata: glossary terms Employee, Work Location, Annual Salary, Job Title, Employee Id, Employee Name, Hourly Pay Rate, Manager, Compensation Plan, linked by HAS-A and IS-A relationships, with an IS-A link to the Sensitive classification
      • Data: 00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
  • 13. Why do we need metadata?
      • Metadata enables data to be used outside of the application that created it.
          • Analytics and decision making
          • New business applications
          • Reporting and compliance
      • Metadata describes the format and content of data, allowing people to judge which data set to use for a new project.
          • Structure
          • Meaning
          • Origin
          • Valid values and quality
          • Usage and ownership
          • Regulations and classifications that apply
          • <more>
      • Metadata describes the business context and classification of data, allowing automated governance processes to operate.
  • 14. Today’s reality
      • Many data platforms do not have metadata support.
      • Proprietary tools support a range of data sources and governance actions.
          • No vendor supports everything you need, yet each assumes all tools come from its own suite.
          • Each tool starts “empty”, requiring effort to populate metadata.
          • Each tool operates as if it is the only tool.
          • No integration or interoperability of metadata repositories from different vendors.
      • Expensive efforts to create an enterprise data catalogue.
  • 15. Today’s reality
  • 16. Manual metadata capture
  • 17. Automatic metadata capture
  • 18. What needs to change? Open and Unified Metadata
  • 19. A new manifesto for metadata and governance
      • Metadata management must be automated
      • Metadata management must become ubiquitous
      • Metadata must become open and remotely accessible
      • Metadata should be used to drive the governance of data
      The discovery, maintenance and use of metadata have to be an integral part of all tools that access, change and move information.
  • 20. Open metadata management ecosystem
      • Peer-to-peer network of repositories
      • Metadata stored and managed close to its source
      • Each repository/tool brings unique value
      • Open, extensible metadata structures for metadata exchange and federation, extending coverage of the types of resources that need to be described
      • Open source infrastructure sharing the cost of development and maintenance between vendors
      • Support for open standards where available
      (Diagram: metadata repositories for a Collaboration Space, an Analytics Platform, an Application, a Cloud SaaS platform and a Hadoop Platform.)
  • 21. Apache Atlas http://atlas.apache.org/
      • Apache Atlas has just graduated to become a top-level project.
      • It began as an incubator open source project on 5th May 2015 to deliver an open source governance capability focused primarily on the Hadoop platform.
      • Apache Atlas is designed to localize operational governance to the operating data platform, such as Hadoop.
      • At its heart is a type-agnostic metadata store that can be accessed through RESTful interfaces.
      We see Apache Atlas as the reference implementation for open metadata and governance, for vendors to pick up and use, or to test their integration against. Being open source allows all vendors to enrich and enhance the standard.
  • 22. Apache Atlas today
  • 23. Updates to Apache Atlas
      • Automation
          • Capture of metadata from data platforms, data movement engines and data protection engines.
          • Exception management and stewardship
      • Business Value
          • Specialized services for key data roles such as CDO, Data Scientist, Developer, DevOps Operator, Asset Owner, Applications
      • Connectivity
          • Metadata Highway offering open metadata exchange, linking and federation between heterogeneous metadata repositories.
  • 24. Taking guidance from existing metadata standards: well-defined; complementary; integrating; decoupled. https://www.w3.org/TR/vocab-dcat/
  • 25. Instance representations in the graph
  • 26. Open metadata meta-types, types and instances
      • «entity» Referenceable
      • «entity» Asset
      • «entity» DataSet
      • «entity» DataStore (createTime : date, modifiedTime : date)
      • «entity» GlossaryTerm
      • «relationship» DataContentForDataSet, many-to-many (* to *), with ends dataContent and supportedDataSets
      • «relationship» SemanticAssignment, many-to-many (* to *), with ends assignedElements and meaning (description : string, expression : string, status : TermAssignmentStatus, confidence : int, steward : string, source : string)
  • 27. Open metadata type model summary. Diagram of eight areas numbered 0 to 7: Glossary; Collaboration; Governance; Models and Reference Data; Metadata Discovery; Lineage; Data Assets; and, as area 0, Base Types, Systems and Infrastructure.
  • 28. Open metadata type model summary. Detailed diagram of areas 0 to 7, with these labels: Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics); Governance Actions and Processes; Augmentation; Mapping; Implementation; Business Objects and Relationships, Taxonomies and Ontologies; Business Attributes; Organization; Teaming Metadata (people profiles, communities, projects, notebooks, …); Models and Schemas; Physical Asset Descriptions (Data stores, APIs, models and components); Asset Collections (Sets, Typed Sets, Type Organized Sets); Information Views; Rights Management; Reference Data; Feedback Metadata (tags, comments, ratings, …); Classification Schemes; Classification; Strategy; Subject Area Definition; Campaigns and Projects; Rollout; Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …); Augmentation; Instrument; Association; Information Process Instrumentation (design lineage); O-DEF; O-BDL; Connectors; Basic Types, Infrastructure and Systems; Access.
  • 29. More detail here … https://cwiki.apache.org/confluence/display/ATLAS/Building+out+the+Open+Metadata+Typesystem
  • 30. Metadata and governance digital platform. Open Metadata and Governance connects: Reporting Platform; ETL Platform; Analytics Platform; Virtualization Platform; Governance Platform; Data Platform.
  • 31. Types of tools that may integrate with an open metadata repository
      • BI and visualization tools: locating data assets and related information about them; defining reports and publishing their metadata; viewing lineage
      • Data science tools: finding out about available data assets and managing user lineage of transformations and analytics models; may also manage metadata for analytics models
      • API developer tools: understanding the proper data structures and data meaning to use for APIs, plus additional governance requirements that the API must implement because of the data it exchanges
      • Counter-fraud tools: ad hoc analysis of logs and error reports; setting up rules
      • Curator/owner tools: managing the curation of assets, providing access, verifying use of assets, reviewing discovery results and exceptions, approving change requests
      • Glossary tools: for subject matter experts and information architects to share expertise about a particular subject area; may also define structures and related reference data
      • Enterprise architect tools: defining the data landscape and related systems
      • DevOps tools: conformance to policies and standards in development; metadata capture at deployment; validation of deployment platform requirements
      • Data integration engines: locating appropriate data and component assets; logging design lineage; logging operational lineage
      • Information virtualisation tools: locating appropriate data assets; building views and publishing them; adding design lineage; logging operational lineage
      • Governance tools: setting up and monitoring the governance program, data quality, …
      • Stewardship tools: reviewing assigned exceptions, making data changes and requesting approval
      • Information security tools: setting up data access policies and enforcement
      • Auditor tools: viewing compliance reports and validating policies and policy implementations
  • 32. Open Metadata Access Services: Project Management; Community Profile; Asset Catalog; Stewardship Action; Information View; Governance Program; Information Process; Subject Area; Connected Asset; Discovery; Governance Engine; Information Protection; Developer; Data Platform; Asset Owner; Information Landscape; Data Science; DevOps; Asset Consumer; Information Infrastructure
  • 33. OMAS service instance: offers both a call API and notifications
  • 34. Inside the server: Open Metadata and Governance (OMAG) Server
      • Interfaces: OMAS REST APIs and Topics; OMAG Administration REST APIs; OMRS Repository REST APIs
      • Components: Open Metadata Access Services (OMAS); Open Metadata Repository Services (OMRS); Server Configuration
      • Connectors: OMRS Topic Connector; OMRS Cohort Registry Store Connector; OMRS Archive Connector; OMRS AuditLog Connector; OMRS Event Mapper Connector; OMRS Repository Connector
  • 35. Inside the server: Open Metadata and Governance (OMAG) Server, same view as the previous slide, with the Open Metadata Repository Services expanded into Administration, Enterprise Repository Services, Local Repository Services and Cohort Services
  • 36. Integration patterns https://cwiki.apache.org/confluence/display/ATLAS/Integrating+into+the+Open+Metadata+and+Governance+Ecosystem (Diagram: IBM Information Governance Catalog and Apache Atlas.)
  • 37. Caller Pattern
      • A metadata tool can access the consumer-specific APIs to work with metadata.
      • The Access Layer handles the calls to metadata repositories connected to the metadata highway.
  • 38. Native Pattern
      • Native implementation of the open metadata and governance APIs
      • Apache Atlas is a native implementation of the open metadata and governance APIs.
  • 39. Adapter Pattern
      • Simple components plug into a repository proxy to connect an existing metadata repository.
  • 40. Plug-in Pattern
      • Open Connector Framework (OCF): connectors to data, analytics, etc.
      • Open Discovery Framework (ODF): metadata discovery services
      • Governance Action Framework (GAF): stewardship services for triage and remediation of exceptions
  • 41. IBM Unified Governance
  • 42. Simple cohort. Cohort A: Chief Data Office; Data Lake; Systems of Record
  • 43. Multiple Cohorts. Cohort A and Cohort B, spanning: Chief Data Office; Data Lake; Systems of Record; Mobile Apps; Data Lake; Systems of Record; Marketing
  • 44. First server
  • 45. Establishing contact
  • 46. Federated queries
  • 47. Caching metadata for availability and performance
  • 48. ODPi: co-creation with practitioners
      • Compliance assistance and certification for vendors
      • Subject matter experts sharing best practices and co-creating content packs
      https://github.com/odpi/data-governance
  • 49. How ODPi Helps
      • Data Governance Professionals
          • Your governance program is based on established practices and definitions.
          • Allows a broader range of tools in your organization.
          • Automated governance processes protect and manage your data.
      • Vendors
          • Your metadata offerings will deliver value faster as they tap into metadata collected by other vendors’ tools.
          • ODPi packages extend your metadata system’s and tools’ capabilities.
          • Conformance tests minimize your effort in being compliant with key standards and regulations.
          • Customers have increased confidence in your tools and services due to ODPi certification.
  • 50. Summary
      • Big data is creating new opportunities and requirements that need new types of systems. Data Lakes are just one part of this story.
      • Metadata is critical to make the best use of this data for the widest range of scenarios.
      • Most organizations use tools and platforms from many vendors.
      • Open standards have had limited take-up.
      • Can we use open source to create a digital platform that allows vendors to take advantage of metadata from a broader ecosystem?
          • Open Metadata and Governance defines the standards
          • Apache Atlas provides the reference implementation
          • ODPi helps to build the ecosystem
  • 51. Call to action: how can you help?
      • Direct contribution to the Apache Atlas and/or ODPi Data Governance projects.
          • There are many features that still need to be developed.
      • Encouraging your vendors/partners and projects internal to your organization to embrace the Open Metadata and Governance standards to grow the ecosystem of data and processing that is assured by metadata and governance capability.
  • 52. https://cwiki.apache.org/confluence/display/ATLAS/Atlas+Projects
  • 53. Questions?
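
The type model on slide 26 can be illustrated with a minimal Python sketch. It is not the Apache Atlas type system or its API; it only mirrors the entity and relationship names shown on the slide. The inheritance chain (Asset and GlossaryTerm extending Referenceable, DataSet extending Asset), the `qualified_name` attribute, the members of `TermAssignmentStatus`, the `meanings_of` helper and the decision to attach the six listed attributes to SemanticAssignment are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TermAssignmentStatus(Enum):
    # Hypothetical members: the slide names the type but not its values.
    PROPOSED = "proposed"
    VALIDATED = "validated"


@dataclass(frozen=True)
class Referenceable:
    qualified_name: str  # assumed identifying attribute, not shown on the slide


@dataclass(frozen=True)
class Asset(Referenceable):
    """Assumed to extend Referenceable."""


@dataclass(frozen=True)
class DataSet(Asset):
    """Assumed to extend Asset."""


@dataclass(frozen=True)
class GlossaryTerm(Referenceable):
    """Assumed to extend Referenceable."""


@dataclass(frozen=True)
class SemanticAssignment:
    """One link of the many-to-many relationship between elements and terms."""
    assigned_element: Referenceable
    meaning: GlossaryTerm
    status: TermAssignmentStatus
    confidence: int
    steward: str
    description: str = ""
    expression: str = ""
    source: str = ""


def meanings_of(element: Referenceable,
                assignments: List[SemanticAssignment]) -> List[GlossaryTerm]:
    """Return every glossary term assigned to the given element."""
    return [a.meaning for a in assignments if a.assigned_element == element]


directory = DataSet("hr.employee_directory")       # illustrative names only
salary_term = GlossaryTerm("glossary.annual_salary")
link = SemanticAssignment(directory, salary_term,
                          TermAssignmentStatus.VALIDATED, 90, "Penny Payer")
print(meanings_of(directory, [link]))
```

Keeping the relationship as its own object, rather than as a field on the entity, is what lets it carry attributes such as steward and confidence.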

Editor's notes

  1. Business metadata describes the data that the business needs, what it means and how it should be classified and protected. Structural metadata describes how the data is actually stored and labelled in the data store. The linkage between the business and technical metadata allows our technology to switch between these two perspectives. For example, a request for data expressed in business terminology can be translated into a query for data from a data store. An integration engine copying data into a sandbox can discover which fields the business classifies as sensitive and then mask these values dynamically.
  2. AUTOMATED: Metadata is created by the application at the same time as the data is created, in a standard manner that is easily consumable by all with the necessary permissions. For example, the device that took a picture, the name of the picture, the settings the picture was taken at and its location geotag are all recorded automatically when the data is created.
  3. The maintenance of metadata must be automated to scale to the sheer volumes and variety of data involved in modern business. Metadata management must become ubiquitous in cloud platforms and large data platforms, such as Apache Hadoop, so that the processing engines on these platforms can rely on its availability and build capability around it. Metadata access must become open and remotely accessible so that tools from different vendors can work with metadata located on different platforms. This implies unique identifiers for metadata elements, some level of standardization in the types and formats for metadata, and standard interfaces for manipulating metadata. Metadata should be used to drive the governance of data and create a business-friendly logical interface to the data landscape. Wherever possible, discovery and maintenance of metadata have to be an integral part of all tools that access, change and move information.
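
Note 1 above describes two uses of the link between business and structural metadata: translating a business-term request into column names, and masking fields that the business classifies as sensitive. The following Python sketch illustrates both under stated assumptions. The column names come from slide 12, but the column-to-term mapping, the choice of "Annual Salary" as the sensitive term, the sample record values and the function names are hypothetical; this is not how Apache Atlas or any integration engine implements masking.

```python
from typing import Dict, List, Set

# Assumed linkage between structural metadata (columns) and glossary terms.
COLUMN_TO_TERM: Dict[str, str] = {
    "EMPNAME": "Employee Name",
    "EMPNO": "Employee Id",
    "JOBCODE": "Job Title",
    "SALARY": "Annual Salary",
}

# Assumed classification: which glossary terms the business marks Sensitive.
SENSITIVE_TERMS: Set[str] = {"Annual Salary"}


def columns_for_terms(terms: List[str]) -> List[str]:
    """Translate a request expressed in business terms into column names."""
    wanted = set(terms)
    return [col for col, term in COLUMN_TO_TERM.items() if term in wanted]


def mask_record(record: Dict[str, str]) -> Dict[str, str]:
    """Replace the value of every sensitive column with X characters."""
    masked = {}
    for column, value in record.items():
        term = COLUMN_TO_TERM.get(column)
        if term in SENSITIVE_TERMS:
            masked[column] = "X" * len(value)
        else:
            masked[column] = value
    return masked


# Hypothetical record loosely based on the sample data in the slides.
record = {"EMPNAME": "Lemmie Stage", "EMPNO": "3809890",
          "JOBCODE": "DataStage Expert", "SALARY": "45324"}

print(columns_for_terms(["Employee Name", "Annual Salary"]))
print(mask_record(record))
```

Because the masking decision is driven by the glossary classification rather than by column names, reclassifying a term changes the behaviour for every column linked to it without touching the copying code.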