This document discusses the importance of metadata for big data solutions and data lakes. It begins with introductions of the two speakers, Ben Sharma and Vikram Sreekanti. It then discusses how metadata lets you track data in the data lake and improves change management and data visibility. The document presents considerations for metadata, such as integration with enterprise solutions and automated registration, and provides examples of using metadata for data lineage, quality, and cataloging. Finally, it discusses using metadata across storage tiers for data lifecycle management and for providing elastic compute resources.
Understanding Metadata: Why it's essential to your big data solution and how to manage it well
1. Understanding Metadata: Why it's essential to your big data solution and how to manage it well
Tuesday, June 21, 2016
Ben Sharma | Vikram Sreekanti
2. Speakers
Ben Sharma, Co-Founder & CEO – Zaloni
Ben Sharma is a passionate technologist and thought leader in big data, analytics and
enterprise infrastructure solutions. Having previously worked in technology leadership at
NetApp, Fujitsu and others, Ben's expertise ranges from business development to
production deployment in a wide array of technologies including Hadoop, HBase,
databases, virtualization and storage. Ben is co-author of Architecting Data Lakes and Java
in Telecommunications.
Vikram Sreekanti, Software Engineer – AMPLab, UC Berkeley
Vikram Sreekanti is a software engineer working on research in the AMPLab at UC
Berkeley. A graduate of Berkeley's computer science department, he will begin
his Ph.D. in Fall 2016, working with Joe Hellerstein.
3. Metadata matters in a big data world
In today's data environment, with structured and unstructured data, the importance of metadata is increased.
• Metadata allows you to keep track of what data is in the data lake, its source, its format and its lineage
• Metadata allows for better change management through impact analysis
• The result is data visibility, reliability and reduced time to insight for your analytics
Zaloni Proprietary
4. Data architecture modernization
[Diagram: traditional vs. new data architecture. Traditional: sources → ETL → EDW → reporting/BI and extracts. New: various sources, streaming and unstructured data feed a data lake containing derived (transformed) data, a discovery sandbox and the EDW, serving data science, data discovery, reporting/BI and extracts.]
5. Data lake reference architecture
[Diagram: source systems (file data, DB data, ETL extracts, streaming, logs or other unstructured data, OLTP or ODS, reference data, master data) feed a transient loading zone, then the data lake's raw, refined and trusted data zones. Raw data keeps the original unaltered data attributes alongside tokenized data; refined data is integrated to a common format with data validation and data cleansing; trusted data carries aggregations. A discovery sandbox supports data wrangling, data discovery and exploratory analytics. A consumption zone exposes APIs, the enterprise data warehouse and cloud services to business analysts, researchers and data scientists. Metadata, data quality, data catalog and security span the entire lake.]
6. Metadata improves data visibility and reliability
• Reduced time to insight for analytics
• A modern data architecture will require a holistic approach to metadata

Type of Metadata | Description | Example
Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
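As a rough illustration of the three metadata types in the table above, a single data set's metadata record might be modeled as follows. The field names and values are illustrative assumptions, not any vendor's actual schema:

```python
# Illustrative metadata record combining the three categories of
# metadata described above. Field names are hypothetical.
dataset_metadata = {
    "technical": {
        "format": "JSON",                      # type of data
        "fields": {"user_id": "string",        # structure: fields and types
                   "amount": "number",
                   "ts": "timestamp"},
    },
    "operational": {
        "source": "sftp://edge-node/landing/payments/",  # source location
        "target": "hdfs:///lake/raw/payments/",          # target location
        "size_bytes": 10_485_760,
        "record_count": 52_000,
        "lineage": ["payments_raw"],           # upstream entities
    },
    "business": {
        "name": "Daily card payments",
        "tags": ["finance", "pii"],
        "masking_rules": ["mask:card_number"],
    },
}

def metadata_types(record):
    """Return which of the three metadata categories a record covers."""
    return sorted(k for k in ("technical", "operational", "business")
                  if k in record)

print(metadata_types(dataset_metadata))
# → ['business', 'operational', 'technical']
```

A catalog that stores records in this shape can answer both technical questions (what fields exist?) and business questions (which data sets are tagged "pii"?) from one place.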
7. Automated metadata registration
Considerations:
• Integration with enterprise metadata management solutions
• Automated process for new metadata to be registered in the data lake
• Data follows the registered metadata
[Flow diagram: START → a metadata file lands on the edge node and moves to the Hadoop cluster via SFTP → metadata is retrieved from enterprise metadata repositories over an API → tags, origin info, timestamp, etc. are added → check-in and copy to the repository produce an operational metadata file → END.]
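A minimal sketch of the registration flow described above. The function name, the JSON file layout, and the dict standing in for the enterprise repository are all assumptions for illustration; a real pipeline would call the repository's API:

```python
import json
import time

def register_metadata(metadata_file_text, repository):
    """Sketch of automated metadata registration: parse an incoming
    metadata file, enrich it with tags, origin info and a timestamp,
    then check it in to a repository keyed by entity name."""
    record = json.loads(metadata_file_text)          # retrieve metadata
    record["tags"] = record.get("tags", []) + ["auto-registered"]
    record["origin"] = "edge-node-sftp"              # origin info
    record["registered_at"] = time.time()            # timestamp
    repository[record["entity"]] = record            # check-in / copy
    return record

repo = {}  # stand-in for an enterprise metadata repository
incoming = '{"entity": "payments", "format": "avro"}'
registered = register_metadata(incoming, repo)
print("payments" in repo)  # → True
```

The key point from the slide survives even in this toy version: the data follows the registered metadata, so nothing lands in the lake without a catalog entry.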
10. Data profiling speeds up data discovery and time to insight
Business users can quickly answer questions such as:
• How many records does an entity have? What is its total size?
• What does the activity look like for a specific entity (streaming, updated monthly, untouched from a year ago)?
• Is this entity a subset of another entity?
• Does this entity likely contain duplicates?
• Does this data apply to my target customers/market?
• What is the min/max of a particular column?
• Is this data reliable/does it have enough valid values?
11. Data profiling example in Mica
Capture profiling metrics for every entity
• Automatically collect profiling metrics at the:
§ Entity level (e.g., size of data set)
§ Field level (e.g., values, frequency of the field)
• Visually display metrics with metadata
• Allow data quality check rules to be created
based on profiling information
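The entity- and field-level metrics described above can be sketched as follows. This assumes records arrive as flat dicts; a production profiler (like the one in Mica) would run over files in the lake instead:

```python
def profile_entity(records):
    """Sketch of profiling metrics: entity-level record count, plus
    per-field null rate and min/max for numeric fields."""
    profile = {"record_count": len(records), "fields": {}}
    fields = set().union(*(r.keys() for r in records)) if records else set()
    for f in sorted(fields):
        values = [r.get(f) for r in records]
        non_null = [v for v in values if v is not None]
        stats = {"null_rate": 1 - len(non_null) / len(values)}
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            stats["min"], stats["max"] = min(non_null), max(non_null)
        profile["fields"][f] = stats
    return profile

rows = [{"amount": 10, "city": "Raleigh"},
        {"amount": 25, "city": None},
        {"amount": 3, "city": "Durham"}]
p = profile_entity(rows)
print(p["record_count"], p["fields"]["amount"]["max"])  # → 3 25
```

Metrics like `null_rate` are exactly what a data quality check rule could be built on, as the slide suggests: flag the entity when the rate crosses a threshold.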
13. Data lifecycle management powered by metadata
• Logical data lake that can include all tiers of storage:
§ Files, HDFS, object store in on-premise and cloud environments
• Data lifecycle management across tiered storage environments
§ Hot -> Warm -> Cold on an entity level based on policies/SLAs
§ Across on-premise and cloud environments
§ Take advantage of various storage technologies
§ Provide data management features to automate scheduling and orchestration of data movement between heterogeneous storage environments
• Elastic and on-demand compute for various analytical workloads
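The hot -> warm -> cold policy above can be sketched as a small tiering rule. The age thresholds are hypothetical; in practice the policies/SLAs and the last-access times would come from the operational metadata:

```python
from datetime import datetime, timedelta

# Hypothetical age thresholds for hot -> warm -> cold tiering.
TIER_POLICY = [("hot", timedelta(days=30)),
               ("warm", timedelta(days=180))]

def target_tier(last_accessed, now=None):
    """Pick a storage tier for an entity from its last-access time,
    a sketch of policy-driven lifecycle management."""
    now = now or datetime.utcnow()
    age = now - last_accessed
    for tier, limit in TIER_POLICY:
        if age <= limit:
            return tier
    return "cold"

now = datetime(2016, 6, 21)
print(target_tier(datetime(2016, 6, 1), now))   # → hot  (20 days old)
print(target_tier(datetime(2015, 1, 1), now))   # → cold
```

An orchestrator would then schedule the actual movement, e.g. from HDFS to an object store, for every entity whose current tier differs from its target tier.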
14. Example: Metadata management in financial services
[Diagram: source systems (RDBMS/mid tier, mainframe COBOL, flat files, SAS files) feed a data acquisition automation pipeline: data ingestion, data quality and validation, layout standardization, and operational metadata generation. A metadata management solution extracts/reads metadata from, and registers/updates metadata in, the enterprise metadata repositories.]
• Automated data acquisition framework providing timeliness of data
• Capture metadata in all phases: ingestion, transformation
• Integration with enterprise metadata management
• Integrated data quality analysis
19. Big data took us to a new world
There were changes in volume, velocity and variety, which were challenging.
20. Big data took us to a new world
There were changes in volume, velocity and variety, which were challenging.
The real challenge now is the meaning and value of data, which depend critically on context.
21. WHAT IS DIFFERENT?
• Shift in technology: data representations
• Shift in behavior: data-driven organizations
40. THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT
• Application context: views, models, code
• Behavioral context: data lineage & usage
• Historical context: in and over time
41. APPLICATION CONTEXT
Metadata: models for interpreting the data for use
§ Data structures
§ Semantic structures
§ Statistical structures
Theme: an unopinionated model of context
42. HISTORICAL CONTEXT
• Versions: web logs, code to extract user/movie rentals, a recommender for movie licensing
• Trends over time: how does a movie with these features fare over time?
• Point in time: a promising new movie is similar to older hot movies at time of release!
44. BEHAVIORAL CONTEXT
Lineage & usage
• Data science recommenders: "You should compare with book sales from last year."
• Curation tips: "Logistics staff checks weather data the 1st Monday of every month."
• Proactive impact analysis: "The Twitter analysis script changed. You should check the boss' dashboard!"
46. WHAT ARE WE BUILDING?
Grounding philosophy
§ Start useful, stay useful.
§ Stay general.
§ Design for scale.
47. COMMON GROUND
[Diagram: the Common Ground context model sits between an aboveground API to applications and an underground API to services. Aboveground applications: parsing & featurization, catalog & discovery, wrangling, analytics & vis, reference data, data quality, reproducibility, model serving. Underground services: scavenging and ingestion, search & query, scheduling & workflow, versioned storage, ID & auth.]
48. COMMON GROUND
[Same diagram as slide 47, with example open-source systems slotted into the underground services: Pachyderm (versioned storage) and Chronos (scheduling & workflow).]
51. [Diagram: two example context models. A JSON document: a root array of elements, each an object with members such as k1 (string) and k2 (number), and nested members k11 (string) and k12. A relational schema: Schema 1 with Table 1 (Column 1 … Column c) and Table t (Column 1 … Column d) linked by a foreign key. Both structures carry models, versions and usage as context.]
59. COMMON GROUND: INITIAL FOCUS AREAS
[Slides 59–62 repeat the architecture diagram from slide 47, progressively highlighting the project's initial focus areas among the aboveground applications (parsing & featurization, reproducibility, model serving) and underground services (versioned storage).]