Understanding Metadata: Why it’s essential to your big
data solution and how to manage it well
Tuesday, June 21, 2016
Ben Sharma | Vikram Sreekanti
Speakers
Ben Sharma, Co-Founder & CEO – Zaloni
Ben Sharma is a passionate technologist and thought leader in big data, analytics and
enterprise infrastructure solutions. Having previously worked in technology leadership at
NetApp, Fujitsu and others, Ben's expertise ranges from business development to
production deployment in a wide array of technologies including Hadoop, HBase,
databases, virtualization and storage. Ben is co-author of Architecting Data Lakes and Java
in Telecommunications.
Vikram Sreekanti, Software Engineer – AMPLab, UC Berkeley
Vikram Sreekanti is a software engineer working on research in the AMPLab at UC
Berkeley. A graduate of Berkeley's computer science department, he will begin
his Ph.D. in Fall 2016, working with Joe Hellerstein.
Metadata matters in a big data world
In today’s data environment, with both structured and unstructured data,
metadata matters more than ever
•  Metadata allows you to keep track of what data is in the data lake,
its source, its format and its lineage
•  Metadata allows for better change management through impact analysis
•  The result is data visibility, reliability and reduced time to insight
for your analytics
Zaloni Proprietary3
Data architecture modernization: Traditional vs. New
[Figure: two architectures side by side.
Traditional: Sources, ETL, EDW; outputs: Reporting, BI, Extracts.
New: Various Sources, Streaming, Unstructured Data; Data Lake with Derived (Transformed) data and a Discovery Sandbox; EDW; outputs: Data Science, Data Discovery, Reporting, BI, Extracts.]
Data lake reference architecture
[Figure: reference architecture diagram. Labels recoverable from the slide:
Source System: File Data, DB Data, ETL Extracts, Streaming; OLTP or ODS, Enterprise Data Warehouse, Logs (or other unstructured data), Cloud Services
Data Lake zones: Transient Loading Zone, Raw Data, Trusted Data, Refined Data, Discovery Sandbox, Consumption Zone
Zone annotations: Original unaltered data attributes, Tokenized Data, Reference Data, Master Data, Data Validation, Data Cleansing, Integrate to common format, Aggregations, Data Wrangling, Data Discovery, Exploratory Analytics, APIs
Across the lake: Metadata, Data Quality, Data Catalog, Security
Consumers: Business Analysts, Researchers, Data Scientists]
Metadata improves data visibility and reliability
•  Reduced time to insight for analytics
•  Modern data architecture will require a holistic approach to metadata
Type of Metadata | Description | Example
Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
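The three metadata types in the table can be pictured as three small records attached to one catalog entry. The sketch below is only an illustration: the class names, field names and the sample `orders` entity are hypothetical and are not taken from any Zaloni product.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    """Form and structure of a data set."""
    data_format: str            # e.g. "text", "JSON", "Avro"
    fields: dict[str, str]      # field name -> field type


@dataclass
class OperationalMetadata:
    """Lineage, quality, profile and provenance of the data."""
    source_location: str
    target_location: str
    size_bytes: int
    record_count: int
    derived_from: list[str] = field(default_factory=list)  # upstream entities


@dataclass
class BusinessMetadata:
    """What the data means to the user."""
    business_name: str
    description: str
    tags: list[str] = field(default_factory=list)
    masked_fields: list[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    """One entity in the data lake with all three kinds of metadata."""
    entity: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Hypothetical example entity; paths and values are made up for illustration.
orders = CatalogEntry(
    entity="orders",
    technical=TechnicalMetadata("JSON", {"order_id": "string", "amount": "number"}),
    operational=OperationalMetadata(
        source_location="sftp://edge-node/orders.json",
        target_location="/lake/raw/orders",
        size_bytes=2048,
        record_count=100,
        derived_from=["web_logs"],
    ),
    business=BusinessMetadata(
        business_name="Customer Orders",
        description="Orders extracted from web logs",
        tags=["sales"],
        masked_fields=["customer_email"],
    ),
)
```

Keeping the three records separate mirrors the table: each type can be filled in by a different party (ingestion tooling, pipeline runs, business users) without touching the others.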
Automated metadata registration
Considerations:
•  Integration with Enterprise Metadata Management Solutions
•  Automated process for new metadata to be registered in the Data Lake
•  Data follows the registered metadata
[Figure: registration flow from START to END. Labels: metadata file; Edge-node to Cluster (SFTP); Hadoop Cluster; add tags (origin info, timestamp, etc.); operational metadata file; API check-in; copy to repository; retrieve metadata; Enterprise Metadata Repositories]
Data lineage example in Bedrock for impact analysis
Metadata enhancing data quality and reliability
Data profiling speeds up data discovery and time to insight
Business users can quickly answer questions such as:
•  How many records does an entity have? What is its total size?
•  What does the activity look like for a specific entity (streaming,
updated monthly, untouched from a year ago)?
•  Is this entity a subset of another entity?
•  Does this entity likely contain duplicates?
•  Does this data apply to my target customers/market?
•  What is the min/max of a particular column?
•  Is this data reliable/does it have enough valid values?
Data profiling example in Mica
Capture profiling metrics for every entity
•  Automatically collect profiling metrics at the:
§  Entity level (e.g., size of data set)
§  Field level (e.g., values, frequency of the field)
•  Visually display metrics with metadata
•  Allow data quality check rules to be created
based on profiling information 
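As a rough illustration of entity-level and field-level profiling, the sketch below computes a record count, an exact-duplicate count, and per-field valid-value ratio, min/max and most frequent values. The function name `profile_entity` and the sample records are hypothetical; this is not how Mica computes its metrics, only a minimal example of the idea.

```python
from collections import Counter


def profile_entity(records, field_names):
    """Compute simple profiling metrics for a list of dict records."""
    record_count = len(records)

    # Entity level: count records that are exact copies of an earlier record.
    as_tuples = [tuple(sorted(r.items())) for r in records]
    duplicate_count = record_count - len(set(as_tuples))

    # Field level: valid-value ratio, min/max and most frequent values.
    fields = {}
    for name in field_names:
        present = [r.get(name) for r in records if r.get(name) is not None]
        fields[name] = {
            "valid_ratio": len(present) / record_count if record_count else 0.0,
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "top_values": Counter(present).most_common(3),
        }

    return {
        "record_count": record_count,
        "duplicate_count": duplicate_count,
        "fields": fields,
    }


# Made-up sample data: one duplicate row and one missing amount.
records = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 3, "amount": None},
]
profile = profile_entity(records, ["customer_id", "amount"])
```

A data quality rule can then be phrased against these numbers, for example "reject the load if `valid_ratio` for `amount` drops below some threshold".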
Data catalog example in Mica
Data lifecycle management powered by Metadata
•  Logical data lake that can include all tiers of storage:
§  Files, HDFS, Object store in on-premises and cloud environments
•  Data lifecycle management across tiered storage environments
§  Hot -> Warm -> Cold on an entity level based on policies/SLAs
§  Across on-premises and cloud environments
§  Take advantage of various storage technologies
§  Provide data management features to automate scheduling and orchestration
of data movement between heterogeneous storage environments
•  Elastic and on-demand compute for various analytical workloads
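A policy-driven Hot -> Warm -> Cold decision per entity could look roughly like the sketch below. It assumes the catalog records a last-access date and a current tier for each entity; the thresholds, the `POLICY` dictionary and the entity names are hypothetical, and real policies or SLAs may use other signals.

```python
from datetime import date

# Hypothetical policy: idle days before an entity moves to a colder tier.
POLICY = {"warm_after_days": 30, "cold_after_days": 365}


def target_tier(last_accessed, today, policy=POLICY):
    """Pick the storage tier an entity should live in, based on idle time."""
    idle_days = (today - last_accessed).days
    if idle_days >= policy["cold_after_days"]:
        return "cold"
    if idle_days >= policy["warm_after_days"]:
        return "warm"
    return "hot"


def plan_moves(catalog, today, policy=POLICY):
    """Return (entity, current tier, target tier) for entities that should move."""
    moves = []
    for entity, meta in catalog.items():
        target = target_tier(meta["last_accessed"], today, policy)
        if target != meta["tier"]:
            moves.append((entity, meta["tier"], target))
    return moves


# Made-up catalog metadata for three entities.
today = date(2016, 6, 21)
catalog = {
    "web_logs": {"tier": "hot", "last_accessed": date(2016, 6, 20)},
    "orders_2015": {"tier": "hot", "last_accessed": date(2016, 4, 1)},
    "archive_2013": {"tier": "warm", "last_accessed": date(2014, 1, 15)},
}
moves = plan_moves(catalog, today)
```

The output of `plan_moves` is only a plan; scheduling and executing the actual data movement between storage environments is a separate step.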
Example: Metadata management in Financial Services
[Figure: Source Systems (RDBMS/Mid Tier, Mainframe COBOL, Flat files, SAS files) feed Data Acquisition Automation, with stages labeled Data Ingestion, Layout Standardization, Data Quality and Validation, and Operational Metadata Generation. A Metadata Management solution is connected to the metadata repositories by "Extract/Read metadata" and "Register/update metadata".]
•  Automated Data Acquisition Framework providing timeliness of data
•  Capture Metadata in all phases: Ingestion, Transformation
•  Integration with Enterprise Metadata Management
•  Integrated Data Quality Analysis
DON’T GO IN THE LAKE WITHOUT US
Grounding Big Data
Vikram Sreekanti
UC Berkeley
REMEMBERING THE PAST
Data Warehouse
Single Source of Truth
Enterprise Information Architecture
Golden Master
…
Truth
Truth
Big data took us to a new world
There were changes in volume, velocity and variety,
which were challenging.
The real challenge now is the meaning and value of data,
which depend critically on context.
WHAT IS DIFFERENT?
Shift in technology
Data representations
Shift in behavior
Data-driven organizations
Data in products
Started with the Internet.
Now, the Internet of Things
Data in marketing
By 2017: marketing spends more on tech than IT does.
By 2020: 90% of tech budget controlled outside of IT.
GARTNER GROUP
MANY USE CASES
MANY CONSTITUENCIES
MANY INCENTIVES
MANY CONTEXTS
Shift in technology
Data representations
Raw data in the data lake
Simplifies capture
Encourages exploration
What does it mean?
It depends on the context.
A LITTLE SCENARIO
HDFS
BITS
All the web logs from last year
VIEWS, MODELS, CODE
A script to extract orders. To be used for Market Basket analysis.
VIEWS, MODELS, CODE
A Hive table of orders. To be used for Market Basket analysis.
BITS
All the web logs from last year
VIEWS, MODELS, CODE
Code to extract abandoned user sessions
VIEWS, MODELS, CODE
A retargeting model
A Hive table of orders
A retargeting model
VIEWS, MODELS, CODE
MANY SCRIPTS
MANY MODELS
MANY APPLICATIONS
MANY CONTEXTS
A broader context for big data
ground
THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT
Application Context
Views, models, code
Behavioral Context
Data lineage & usage
Historical Context
In and over time
APPLICATION CONTEXT
Metadata
Models for interpreting the data for use
§ Data structures
§ Semantic structures
§ Statistical structures
Theme: An unopinionated model of context
HISTORICAL CONTEXT
Versions
Web logs; Code to extract user/movie rentals; Recommender for movie licensing
Trends over time
How does a movie with these features fare over time?
Point in time
A promising new movie is similar to older hot movies at time of release!
BEHAVIORAL CONTEXT
Why Dora?!
Lineage & Usage
Data Science Recommenders
“You should compare with book sales from last year.”
Curation Tips
“Logistics staff checks weather data the 1st Monday of every month.”
Proactive Impact Analysis
“The Twitter analysis script changed. You should check the boss’ dashboard!”
THE BIG CONTEXT
A NEW WORLD NEEDS NEW SERVICES
WHAT ARE WE BUILDING?
Grounding philosophy
§ Start useful, stay useful.
§ Stay general.
§ Design for scale.
[Figure: layered architecture diagram, shown in several build steps.
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
Example services named on the slide: Pachyderm, Chronos]
COMMON GROUND
Versions
Models
Usage
An unopinionated context model
The metamodel: Model Graphs
[Figure: model graphs for two kinds of data.
JSON DOCUMENT: a Root with members (e.g. member k1: string, member k2: number), a nested Object 2 with its own members (member k11: string, member k12), and arrays of elements (element 1, element 2, element 3).
RELATIONAL SCHEMA: Schema 1 containing Table 1 (Column 1 ... Column c) through Table t (Column 1 ... Column d), linked by a foreign key.]
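One way to picture a model graph for a JSON document is to walk the parsed document and emit one node per object, array and value, with an edge from each parent to its children. The slides do not specify Ground's actual metamodel, so the sketch below is only an assumption-laden illustration: the function name `json_to_model_graph`, the node-id scheme and the sample document are all hypothetical.

```python
import json


def json_to_model_graph(document, root_name="Root"):
    """Turn a parsed JSON value into (nodes, edges).

    nodes maps a node id to its kind; edges is a list of (parent, child) ids.
    """
    nodes, edges = {}, []

    def visit(value, node_id):
        if isinstance(value, dict):
            nodes[node_id] = "object"
            for key, child in value.items():
                child_id = f"{node_id}.{key}"
                edges.append((node_id, child_id))
                visit(child, child_id)
        elif isinstance(value, list):
            nodes[node_id] = "array"
            for index, child in enumerate(value):
                child_id = f"{node_id}[{index}]"
                edges.append((node_id, child_id))
                visit(child, child_id)
        elif isinstance(value, bool):
            # Checked before numbers because bool is a subclass of int.
            nodes[node_id] = "boolean"
        elif isinstance(value, (int, float)):
            nodes[node_id] = "number"
        elif isinstance(value, str):
            nodes[node_id] = "string"
        else:
            nodes[node_id] = "null"

    visit(document, root_name)
    return nodes, edges


# Made-up document loosely shaped like the one on the slide.
doc = json.loads('{"k1": "a", "k2": {"k11": "b", "k12": [1, 2, 3]}}')
nodes, edges = json_to_model_graph(doc)
```

A relational schema fits the same shape: schema, tables and columns become nodes, containment and foreign keys become edges.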
The versioning model: Version Graphs
[Figure: a version graph whose nodes are labeled with hash identifiers.]
Directed Acyclic Graphs
(partial orders)
In this order
In no particular order
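The phrase "partial orders" can be made concrete with a small version graph: some versions are ordered (one is an ancestor of the other), while versions on parallel branches are in no particular order. The sketch below uses Python's standard `graphlib` module; the version ids `v1` to `v4` and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical version graph: version id -> set of parent version ids.
# v2 and v3 both branch from v1; v4 merges them.
parents = {
    "v1": set(),
    "v2": {"v1"},
    "v3": {"v1"},
    "v4": {"v2", "v3"},
}


def ancestors(version, parents):
    """All versions reachable by following parent links from `version`."""
    seen = set()
    stack = list(parents[version])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_before(a, b, parents):
    """True if version a is an ancestor of version b ("in this order")."""
    return a in ancestors(b, parents)


def comparable(a, b, parents):
    """False when two versions are "in no particular order"."""
    return is_before(a, b, parents) or is_before(b, a, parents)


# One valid linearization; the relative position of v2 and v3 is not fixed.
order = list(TopologicalSorter(parents).static_order())
```

Here `v1` precedes everything and `v4` follows everything, but `v2` and `v3` are not comparable, which is exactly what distinguishes a partial order from a simple version list.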
The usage model: Usage Graphs (Lineage)
Everything can participate in usage
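Since data, code and models can all be nodes in a usage graph, impact analysis reduces to a walk over everything downstream of a changed node. The sketch below reuses the web-log scenario from earlier in the talk; the node names, the `lineage` dictionary and the function `impacted_by` are hypothetical illustrations, not Ground's API.

```python
from collections import deque

# Hypothetical usage graph: an edge A -> B means B was derived from A.
lineage = {
    "web_logs": ["extract_orders_script", "abandoned_sessions_code"],
    "extract_orders_script": ["orders_hive_table"],
    "abandoned_sessions_code": ["retargeting_model"],
    "orders_hive_table": [],
    "retargeting_model": [],
}


def impacted_by(changed, lineage):
    """Breadth-first walk returning everything downstream of `changed`."""
    impacted = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


downstream_of_logs = impacted_by("web_logs", lineage)
downstream_of_code = impacted_by("abandoned_sessions_code", lineage)
```

A change to the session-extraction code flags only the retargeting model, while a change to the raw web logs flags every script, table and model built on them.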
COMMON GROUND: The model
Models: Model Graphs
Versions: Version Graphs
Usage: Usage Graphs (Lineage)
INITIAL FOCUS AREAS
[Figure: the layered architecture diagram (Aboveground API to applications, Common Ground context model, Underground API to services) repeated across several build steps. The labels set apart under "Initial focus areas" are Parsing & Featurization, Model Serving and Reproducibility on the aboveground side, and Versioned Storage on the underground side.]
Learn more at:
http://www.ground-context.org
@vsreekanti

Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 

Understanding Metadata: Why it's essential to your big data solution and how to manage it well

  • 1. Understanding Metadata: Why it’s essential to your big data solution and how to manage it well Tuesday, June 21, 2016 Ben Sharma | Vikram Sreekanti
  • 2. Speakers. Ben Sharma, Co-Founder & CEO, Zaloni: Ben Sharma is a passionate technologist and thought leader in big data, analytics and enterprise infrastructure solutions. Having previously worked in technology leadership at NetApp, Fujitsu and others, Ben's expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization and storage. Ben is co-author of Architecting Data Lakes and Java in Telecommunications. Vikram Sreekanti, Software Engineer, AMPLab, UC Berkeley: Vikram Sreekanti is a software engineer working on research in the AMPLab at UC Berkeley. A graduate of Berkeley's computer science department, he will begin his Ph.D. in Fall 2016, working with Joe Hellerstein.
  • 3. Metadata matters in a big data world. In today’s data environment, with both structured and unstructured data, metadata is increasingly important.
      • Metadata allows you to keep track of what data is in the data lake, its source, its format and its lineage
      • Metadata allows for better change management through impact analysis
      • The result is data visibility, reliability and reduced time to insight for your analytics
      Zaloni Proprietary
  • 4. Data architecture modernization: Traditional vs. New. [Diagram labels: Data Lake; Sources; ETL; EDW; Derived (Transformed); Discovery Sandbox; EDW; Streaming; Unstructured Data; Various Sources; Reporting, BI, Extracts; Data Science; Data Discovery; Reporting, BI, Extracts]
  • 5. Data lake reference architecture. [Diagram labels: Consumption Zone; Source System; File Data; DB Data; ETL Extracts; Streaming; Transient Loading Zone; Raw Data; Refined Data; Trusted Data; Discovery Sandbox; Original unaltered data attributes; Tokenized Data; APIs; Reference Data; Master Data; Data Wrangling; Data Discovery; Exploratory Analytics; Metadata; Data Quality; Data Catalog; Security; Data Lake; Integrate to common format; Data Validation; Data Cleansing; Aggregations; OLTP or ODS; Enterprise Data Warehouse; Logs (or other unstructured data); Cloud Services; Business Analysts; Researchers; Data Scientists]
  • 6. Metadata improves data visibility and reliability
      • Reduced time to insight for analytics
      • Modern data architecture will require a holistic approach to metadata
      Type of Metadata | Description | Example
      Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
      Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
      Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
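The three metadata types on slide 6 can be pictured as one record per data set. The following is only a minimal sketch: the class names, field names and the sample `orders` entity are hypothetical and are not taken from any Zaloni product.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    data_format: str              # e.g. "text", "JSON", "Avro"
    fields: dict[str, str]        # field name -> field type


@dataclass
class OperationalMetadata:
    source: str                   # where the data came from
    target: str                   # where it landed in the lake
    size_bytes: int
    record_count: int
    lineage: list[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    business_name: str
    description: str
    tags: list[str] = field(default_factory=list)
    masking_rules: dict[str, str] = field(default_factory=dict)


@dataclass
class EntityMetadata:
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Hypothetical example entity; every value here is illustrative.
orders = EntityMetadata(
    technical=TechnicalMetadata("Avro", {"order_id": "string", "amount": "double"}),
    operational=OperationalMetadata(
        source="sftp://edge-node/orders",
        target="/lake/raw/orders",
        size_bytes=52_000,
        record_count=1200,
        lineage=["orders_extract"],
    ),
    business=BusinessMetadata(
        business_name="Customer Orders",
        description="One row per confirmed order",
        tags=["sales"],
        masking_rules={"order_id": "tokenize"},
    ),
)
```

Keeping the three groups as separate types mirrors the table: each group tends to be produced by a different process (schema detection, ingestion jobs, data stewards).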
  • 7. Automated metadata registration
      Considerations:
      • Integration with Enterprise Metadata Management Solutions
      • Automated process for new metadata to be registered in the Data Lake
      • Data follows the registered metadata
      [Diagram labels: API check-in; copy to repository; retrieve metadata; Enterprise Metadata Repositories; END; START; metadata file; Hadoop Cluster; Edge-node to Cluster (SFTP); add tags: origin info, timestamp, etc.; Metadata; operational metadata file]
  • 8. Data lineage example in Bedrock for impact analysis
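Impact analysis over lineage amounts to asking "what is downstream of this item?". The sketch below shows one way to answer that with a breadth-first walk over a lineage graph. It is not how Bedrock implements lineage; the graph contents and the `downstream` helper are assumptions made for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the items in its list.
LINEAGE = {
    "web_logs": ["orders_script", "abandoned_sessions_code"],
    "orders_script": ["orders_hive_table"],
    "orders_hive_table": ["market_basket_analysis"],
    "abandoned_sessions_code": ["retargeting_model"],
}


def downstream(lineage, start):
    """Return every item affected by a change to `start`, in breadth-first order."""
    affected = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in visited:
                visited.add(child)
                affected.append(child)
                queue.append(child)
    return affected


impacted = downstream(LINEAGE, "web_logs")
```

A change to `web_logs` reaches every script, table and model in this toy graph, which is the kind of answer an impact-analysis view is meant to surface before the change ships.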
  • 9. Metadata enhancing data quality and reliability
  • 10. Data profiling speeds up data discovery and time to insight
      Business users can quickly answer questions such as:
      • How many records does an entity have? What is its total size?
      • What does the activity look like for a specific entity (streaming, updated monthly, untouched from a year ago)?
      • Is this entity a subset of another entity?
      • Does this entity likely contain duplicates?
      • Does this data apply to my target customers/market?
      • What is the min/max of a particular column?
      • Is this data reliable/does it have enough valid values?
  • 11. Data profiling example in Mica: Capture profiling metrics for every entity
      • Automatically collect profiling metrics at the:
          § Entity level (e.g., size of data set)
          § Field level (e.g., values, frequency of the field)
      • Visually display metrics with metadata
      • Allow data quality check rules to be created based on profiling information
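Several of the profiling questions on slides 10 and 11 (record count, likely duplicates, min/max, share of valid values) can be computed in a few lines. This is a minimal sketch over an in-memory list of records, not Mica's implementation; the `profile_entity` function and the sample rows are hypothetical, and `None` is assumed to mean "missing".

```python
from collections import Counter


def profile_entity(rows):
    """Compute entity-level and field-level metrics for a list of dict records."""
    field_names = sorted({name for row in rows for name in row})
    fields = {}
    for name in field_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]   # None counts as missing
        fields[name] = {
            "valid_ratio": len(present) / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "most_common": Counter(present).most_common(1)[0][0] if present else None,
        }
    # Rows whose full content repeats an earlier row count as duplicates.
    unique_rows = {tuple(sorted(row.items())) for row in rows}
    return {
        "record_count": len(rows),
        "duplicate_rows": len(rows) - len(unique_rows),
        "fields": fields,
    }


# Illustrative sample data only.
sample_rows = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": None},
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 3, "amount": 75.5},
]
profile = profile_entity(sample_rows)
```

Metrics like `valid_ratio` are a natural input for the data quality check rules mentioned on slide 11, for example flagging a field whose ratio drops below a threshold.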
  • 12. Data catalog example in Mica
  • 13. Data lifecycle management powered by Metadata
      • Logical data lake that can include all tiers of storage:
          § Files, HDFS, Object store in on-premise and cloud environments
      • Data lifecycle management across tiered storage environments
          § Hot -> Warm -> Cold on an entity level based on policies/SLAs
          § Across on-premise and cloud environments
          § Take advantage of various storage technologies
          § Provide data management features to automate scheduling and orchestration of data movement between heterogeneous storage environments
      • Elastic and on-demand compute for various analytical workloads
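The Hot -> Warm -> Cold movement on slide 13 is driven by per-entity policies. A minimal sketch of such a policy check follows, assuming the policy is expressed as age thresholds on the last access date; the `TierPolicy` class, the thresholds and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TierPolicy:
    warm_after_days: int   # move from hot to warm after this many idle days
    cold_after_days: int   # move to cold after this many idle days


def target_tier(last_accessed, today, policy):
    """Pick the storage tier for an entity from its last access date."""
    idle_days = (today - last_accessed).days
    if idle_days >= policy.cold_after_days:
        return "cold"
    if idle_days >= policy.warm_after_days:
        return "warm"
    return "hot"


# Illustrative policy: warm after 30 idle days, cold after 180.
policy = TierPolicy(warm_after_days=30, cold_after_days=180)
today = date(2016, 6, 21)
recent = target_tier(date(2016, 6, 1), today, policy)
stale = target_tier(date(2016, 4, 1), today, policy)
archived = target_tier(date(2015, 6, 21), today, policy)
```

The last access date is itself operational metadata, which is why lifecycle management depends on metadata being captured consistently.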
  • 14. Example: Metadata management in Financial Services
      • Automated Data Acquisition Framework providing timeliness of data
      • Capture Metadata in all phases: Ingestion, Transformation
      • Integration with Enterprise Metadata Management
      • Integrated Data Quality Analysis
      [Diagram labels: Register/update metadata; Source Systems: RDBMS/Mid Tier, Mainframe COBOL, Flat files, SAS files; Metadata repositories; Metadata Management solution; Extract/Read metadata; Data Ingestion; Data Quality and Validation; Layout Standardization; Operational Metadata Generation; Data Acquisition Automation]
  • 15. DON’T GO IN THE LAKE WITHOUT US
  • 16. Grounding Big Data Vikram Sreekanti UC Berkeley
  • 17. REMEMBERING THE PAST Data Warehouse Single Source of Truth Enterprise Information Architecture Golden Master … Truth Truth
  • 18. Big data took us to a new world
  • 19. Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging.
  • 20. Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging. The real challenge now is the meaning and value of data, which depend critically on context.
  • 21. WHAT IS DIFFERENT? Shift in technology Data representations Shift in behavior Data-driven organizations
  • 23. Data in products Started with the Internet. Now, the Internet of Things
  • 24. Data in marketing. By 2017: marketing spends more on tech than IT does. By 2020: 90% of tech budget controlled outside of IT. (Source: Gartner Group)
  • 25. MANY USE CASES MANY CONSTITUENCIES MANY INCENTIVES MANY CONTEXTS
  • 26. WHAT IS DIFFERENT? Shift in technology Data representations Shift in behavior Data-driven organizations
  • 27. Shift in technology Data representations
  • 28. Raw data in the data lake Simplifies capture Encourages exploration What does it mean? It depends on the context.
  • 30. BITS All the web logs from last year
  • 31. VIEWS, MODELS, CODE A script to extract orders. To be used for Market Basket analysis.
  • 32. VIEWS, MODELS, CODE A Hive table of orders. To be used for Market Basket analysis.
  • 33. BITS All the web logs from last year
  • 34. VIEWS, MODELS, CODE Code to extract abandoned user sessions
  • 35. VIEWS, MODELS, CODE A retargeting model
  • 36. VIEWS, MODELS, CODE A Hive table of orders. A retargeting model.
  • 37.
  • 38. MANY SCRIPTS MANY MODELS MANY APPLICATIONS MANY CONTEXTS
  • 39. A broader context for big data ground
  • 40. THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT Application Context Views, models, code Behavioral Context Data lineage & usage Historical Context In and over time
  • 41. APPLICATION CONTEXT Metadata Models for interpreting the data for use § Data structures § Semantic structures § Statistical structures Theme: An unopinionated model of context
  • 42. HISTORICAL CONTEXT: Versions. Web logs; code to extract user/movie rentals; recommender for movie licensing. Trends over time: How does a movie with these features fare over time? Point in time: A promising new movie is similar to older hot movies at time of release!
  • 44. BEHAVIORAL CONTEXT: Lineage & Usage. Data Science Recommenders: “You should compare with book sales from last year.” Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.” Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
  • 45. THE BIG CONTEXT A NEW WORLD NEEDS NEW SERVICES
  • 46. WHAT ARE WE BUILDING? Grounding philosophy § Start useful, stay useful. § Stay general. § Design for scale.
  • 47. ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES CONTEXT MODEL COMMON GROUND Parsing & Featurization Catalog & Discovery Wrangling Analytics & Vis Reference Data Data Quality Reproducibility Model Serving Scavenging and Ingestion Search & Query Scheduling & Workflow Versioned Storage ID & Auth
  • 48. Scavenging and Ingestion Search & Query Scheduling & Workflow Versioned Storage ID & Auth COMMON GROUND CONTEXT MODEL Pachyderm Chronos Parsing & Featurization Catalog & Discovery Wrangling Analytics & Vis Reference Data Data Quality Reproducibility Model Serving ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES CONTEXT MODEL COMMON GROUND
  • 51. member k1 member k1: string member k2 Object 2 member k1 member k2: number member k11: string member k12 element 1 element 2 element 3 element 1 element 2 element 3 Root RELATIONAL SCHEMA JSON DOCUMENT Schema 1 Table 1 Column 1 Column c Table t Column 1 Column d foreign key Models Versions Usage Versions Usage Models
  • 56. USAGE GRAPHS Everything can participate in usage Models Versions Usage Models Versions Usage Models Versions Usage Versions Usage Models
  • 57. COMMON GROUND: The model. Versions, Models, Usage. Model Graphs; Version Graphs; Usage Graphs: Lineage
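Slide 57 describes a context model built from three pieces: models, versions, and usage (lineage). The sketch below is one loose way to picture how they fit together. It is not Ground's API or data model; the `ContextStore` class, its method names and the version-id format are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    versions: dict = field(default_factory=dict)  # item name -> version ids, oldest first
    models: dict = field(default_factory=dict)    # version id -> model (schema) description
    usage: list = field(default_factory=list)     # (source version, derived version) edges

    def add_version(self, item, model):
        """Register a new version of an item together with its model."""
        number = len(self.versions.get(item, [])) + 1
        version_id = f"{item}@v{number}"
        self.versions.setdefault(item, []).append(version_id)
        self.models[version_id] = model
        return version_id

    def record_usage(self, source_version, derived_version):
        """Record that one version was used to produce another (lineage)."""
        self.usage.append((source_version, derived_version))

    def derived_from(self, version_id):
        """List versions produced directly from the given version."""
        return [dst for src, dst in self.usage if src == version_id]


# Illustrative walk-through with hypothetical items.
store = ContextStore()
logs_v1 = store.add_version("web_logs", {"format": "text"})
orders_v1 = store.add_version("orders", {"columns": ["order_id", "amount"]})
store.record_usage(logs_v1, orders_v1)
logs_v2 = store.add_version("web_logs", {"format": "JSON"})
```

Because usage edges point at specific versions rather than item names, the store can tell that `orders@v1` was built from the first version of the logs and not the second.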
  • 59. ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES CONTEXT MODEL COMMON GROUND Parsing & Featurization Catalog & Discovery Wrangling Analytics & Vis Reference Data Data Quality Reproducibility Model Serving Scavenging and Ingestion Search & Query Scheduling & Workflow Versioned Storage ID & Auth INITIAL FOCUS AREAS
  • 60. ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES CONTEXT MODEL COMMON GROUND Catalog & Discovery Wrangling Analytics & Vis Reference Data Data Quality Scavenging and Ingestion Search & Query Scheduling & Workflow Versioned Storage ID & Auth INITIAL FOCUS AREAS Parsing & Featurization Model Serving Reproducibility
  • 61. ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES CONTEXT MODEL COMMON GROUND Parsing & Featurization Catalog & Discovery Wrangling Analytics & Vis Reference Data Data Quality Reproducibility Model Serving Scavenging and Ingestion Search & Query Scheduling & Workflow ID & Auth INITIAL FOCUS AREAS Versioned Storage
  • 62. ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES CONTEXT MODEL COMMON GROUND Parsing & Featurization Catalog & Discovery Wrangling Analytics & Vis Reference Data Data Quality Reproducibility Model Serving Scavenging and Ingestion Search & Query Scheduling & Workflow Versioned Storage ID & Auth ABOVEGROUND API TO APPLICATIONS UNDERGROUND API TO SERVICES