1 © Hortonworks Inc. 2011–2018. All rights reserved
Balancing data democratization with
comprehensive information governance:
building data citizenship across your data lakes
Sanjeev Mohan, Research Analyst, Data Management Strategies, Gartner
Srikanth Venkat, Senior Director, Product Management, Hortonworks Inc.
Your Presenters
Srikanth Venkat
Senior Director of Product Management,
Hortonworks Inc.
Security & Governance portfolio products & services
Apache Ranger, Apache Atlas, Apache Knox, Platform Security, and
Hortonworks DataPlane Service – Data Steward Studio (DSS)
@srikvenk
https://www.linkedin.com/in/srikanthvenkat/
Sanjeev Mohan
Research Analyst,
Gartner
Data Management and Analytics
Big Data, Data Governance, Apache Spark, IoT, ML/AI
@sanjmo
https://www.linkedin.com/in/sanjeev-mohan-498119
Business Goals for Governed Data Lake
• Fast track analytics to provide business agility
• Promote collaboration across enterprise roles (knowledge workers, data
scientists, data engineers, analysts, data stewards)
• Provide users with trusted, understandable data to extract business
value
• Scale cost-effectively with data volume, optimizing existing resources
(infrastructure)
[Figure: Structured Data, Semi-structured Data, and Unstructured Data feeding a Data Lake that yields Analyzed Data. Other labels: Variety, Volume, Velocity, Veracity, Value; In-memory technologies, Fast Access Database, Big Data Query; Real Time Use, Analytical Use Models.]
4 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
But Compliance Also Needs to Be Handled
Industry | Compliance | Date Effective | Region
General | General Data Protection Regulation (GDPR) | 25 May 2018 | E.U.
Financial Services | Basel* (Technically Known as BCBS 239) | In effect | E.U.
Financial Markets | Markets in Financial Instruments Directive (MiFID II) | January 2018 | E.U.
Financial Services | SEC Rule 17a-4 | In effect | U.S.
Retail/Banking | Payment Card Industry Data Security Standard (PCI-DSS) | In effect | U.S.
Health | Health Insurance Portability and Accountability Act (HIPAA) | In effect | U.S.
General | Privacy Amendment (Notifiable Data Breaches) Bill 2016 | February 2018 | AU
* Basel Committee on Banking Supervision's Regulation No. 239
Governance Has Become a Top Priority for High Quality Analytics
Governance has gone from last to top concern in less than 24 months
Source: Big Data Maturity Survey (March 2018)
What challenges are you experiencing with big data?
Data Governance Framework for High Quality Analytics
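[Figure: governance framework spanning the pipeline from Ingestion to Consumption, with three layers: Data Privacy, Security and Access Management; Data Discovery and Curation; Data Management. Capability labels: Lineage, Encryption, Audit, Profiling, Physical, Classification, Prepare, Access, Catalog, Metadata, MDM, Archive, Quality.]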
Metadata Is the Foundation for Analytics
Types of Metadata
• Technical (Definitional): schemas, data types, data models, configurations, functions
• Business (Descriptive): metadata mapped to business relationships; multiple data sources to the LOB
• Social (Descriptive): metadata about party data relationships; user-generated content; tribal knowledge
• Operational (Descriptive): output from processes, ETL or actions on data; data lineage; data provenance (reproducibility); performance
Importance of Metadata for High Quality Analytics
• Metadata is used to locate, integrate, access, share, link,
govern and analyze data associated with information assets
• Metadata answers larger questions about data:
• Data Lineage – lifecycle of data in the pipeline
• Relationships – discovering links in data from disparate sources
• Map business functions and consumption to data
• Optimize business processes and IT infrastructure
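The lineage and relationship questions above can be pictured as a walk over a graph of datasets. The following is a minimal sketch only: the dataset names and the `upstream_sources` helper are hypothetical and are not taken from any tool named in this deck, which would expose lineage through its own interfaces.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_enriched"],
    "sales_enriched": ["sales_raw", "customer_master"],
    "sales_raw": ["pos_feed"],
    "customer_master": ["crm_export"],
}


def upstream_sources(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


if __name__ == "__main__":
    # Lists everything the dashboard ultimately depends on.
    print(sorted(upstream_sources("sales_dashboard", LINEAGE)))
```

Reversing the edges would answer the opposite question: which downstream assets are affected when a source changes.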
Data Cataloging for Self Service and Automated Analytics
Capabilities of a Data Catalog Solution
• Communicate shared semantic meaning
• Curate inventory of information assets
• Collaborate for accountability and governance
Facilitate, Broker, Enable, Share, Orchestrate
Key Capabilities of a Modern Data Preparation Tool
• User Collaboration and Operationalization
• Data Catalog and Basic Metadata Management
• Data Transformation
• Data Enrichment
• Data Ingestion and Profiling
• Data Structuring and Modeling
• Basic Data Quality and Security

• Data Source Access/Connectivity
• Machine Learning
• Multiple Deployment Options
• Domain/Vertical Solution Accelerators
• Integration with Data Integration, Analytics/BI, Data Science, MDM and Information Stewardship Solutions
Unified Data Governance Reference Architecture
[Figure: reference architecture. Sources: RDBMS/EDW, logs/emails, social media, IoT sensors, files/CSV. Zones: Raw/Landing/Secure Zone; Enriched/Discovery Zone (Data Transformation); Consumption Zone. Storage: HDFS/S3/DBMS, Hive/S3/DBMS, HBase/S3/DBMS. Consumption: self-service dashboards, advanced analytics, downstream applications, operational analytics, compliance analytics. Roles: data scientists, BI analysts, developers, data stewards, data analysts. Governance labels: profile, classify, tokenize, masking, encrypt, data access, data at rest, data in motion, API governance, AD/LDAP/Kerberos, SSO/ACL, RBAC/ABAC, index, search, catalog, metadata, lineage, auditing, data wrangling, data quality, MDM, self-service data prep.]
S3 = Amazon Simple Storage Service Hive = Apache Hive HBase = Apache HBase
Using Open Source Tools for Governed Data Lake:
Hortonworks Approach
Data Management in a Data Lake POV – Example Responsibilities
• Maintain data definitions and tiers
• Provide data stewardship
• Specify data quality rules
• Define data protection standards
• Own and act as SME for data
• Specify requirements for any governance
or management of any semi-structured or
unstructured data
• Enable data lineage capabilities
• Architect solution for data quality rules
and standards to be applied and enforced
• Maintain data management tools to
ensure governance, quality, metadata,
data security, privacy, and chain of
custody
Business Technical (IT)
Hortonworks Governed Data Lake Blueprint
[Figure: blueprint with numbered steps 1 to 11 (the step descriptions are not recoverable from the extracted text). Platforms: Hortonworks DataFlow (HDF™) for data in motion; Hortonworks Data Platform (HDP®) for data at rest. Services: Hortonworks DataPlane Service (DPS) Core; Hortonworks Data Steward Studio (data profile and asset collection, business metadata/tags, catalog, audit); Hortonworks Data Lifecycle Manager. Security and governance labels: AuthN, SSO, AuthZ policy engine, entitlements and audits, masking/filtering, tokenization, key management (KMS), TDE, audits, policies, metadata and lineage (lineage, metadata, enterprise catalog, governance), directory servers (LDAP/AD/Linux), audits and policy metadata, incremental synchronization, replication/DR, backup/restore, auto-tiering. Sources: RDBMS/EDW, files, streams and feeds, batch, CDC, CSV, Semi-JSON, unstructured, IoT, API, streaming. Roles: admin, infra admin, data steward, data analyst/data scientist, BI/data science tools. Legend: metadata flow, data flow, encryption at rest, in-transit encryption.]
Data Lake Design Best Practices
Governed Data Lake: Trusted Data from All Sources in a Single Place
• Store both structured and unstructured data, in both raw and “prepared” forms
• Data sourcing and derivation should tie to the use case roadmap
• Capture data from the right sources, at the right frequency, and at the right quality
• Data can come from internal and external (partner) sources, including freely available public data sources
• E.g., government data sources, social media, weather
• Govern and document the data pipelines that are built – avoid the data swamp
• Enrich with metadata to promote collaboration and crowdsourcing
• Just enough data protection
• A data lake will almost always contain some “sensitive” data: personal data such as PII, PCI, PHI
• Keep rational security and privacy controls in place
• Support the goal of making “all” data available to “all” teams responsibly, for BI/analytics or for data science
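As one hedged illustration of "just enough data protection", the sketch below masks a card number and replaces an email address with a stable token before a record is shared. The field names, the `protect_record` helper, and the salted-hash stand-in for tokenization are assumptions made for this example; they are not how Apache Ranger or any other product in this deck implements masking.

```python
import hashlib
import re


def mask_card_number(card_number):
    """Keep only the last four digits of a card number and star out the rest."""
    digits = re.sub(r"\D", "", card_number)
    hidden = max(len(digits) - 4, 0)
    return "*" * hidden + digits[-4:]


def tokenize(value, salt):
    """Assumption: a salted SHA-256 digest stands in for a real tokenization service.

    Equal inputs give equal tokens, so joins on the protected column still work.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def protect_record(record, salt):
    """Return a copy of a record with its sensitive fields protected."""
    protected = dict(record)
    protected["card_number"] = mask_card_number(record["card_number"])
    protected["email"] = tokenize(record["email"], salt)
    return protected


if __name__ == "__main__":
    # Hypothetical record; field names are illustrative only.
    row = {"name": "A. Customer", "email": "a@example.com",
           "card_number": "4111 1111 1111 1111"}
    print(protect_record(row, salt="demo-salt"))
```

The non-sensitive fields pass through unchanged, which is the point of the bullet: protect what needs protecting and leave the rest usable for analytics.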
Recommendations
• Begin the data governance journey at the PoC stage; don’t make it an afterthought
• Invest in comprehensive data governance tools
• Start with the use case driving the greatest business value and demand, then add other use cases over time and across initiatives
• Collaborate on improving data quality
Check Out These Sessions:
• DISCOVER with Data Steward Studio: Understanding and unlocking the value of data in hybrid enterprise data lake environments
When: Tuesday June 19, 4:00 PM - 4:40 PM
Where: Meeting Room 230C
• What Is New In Apache Atlas 1.0?
When: Wednesday June 20, 11:00 AM - 11:40 AM
Where: Grand Ballroom 220B
• Overview of New Features in Apache Ranger
When: Wednesday June 20, 2:00 PM - 2:40 PM
Where: Executive Ballroom 210B/F
• GDPR Crash Course
When: Wednesday June 20, 3:00 PM - 6:00 PM
Where: Meeting Room 212C/D
• Birds of a Feather: Security & Governance
When: Wednesday June 20, 5:40 PM - 6:50 PM
Where: Executive Ballroom 210B/F
• GDPR-Focused Partner Community Showcase for Apache Ranger and Apache Atlas
When: Thursday June 21, 9:30 AM - 10:10 AM
Where: Meeting Room 230A
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
Guido Schmutz
 

Was ist angesagt? (20)

Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Data Audit Approach To Developing An Enterprise Data Strategy
Data Audit Approach To Developing An Enterprise Data StrategyData Audit Approach To Developing An Enterprise Data Strategy
Data Audit Approach To Developing An Enterprise Data Strategy
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Data Migration to Azure
Data Migration to AzureData Migration to Azure
Data Migration to Azure
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
Observability For Modern Applications
Observability For Modern ApplicationsObservability For Modern Applications
Observability For Modern Applications
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 

Ähnlich wie Balancing data democratization with comprehensive information governance: building data citizenship across your data lakes

CWIN17 India / Bigdata architecture yashowardhan sowale
CWIN17 India / Bigdata architecture  yashowardhan sowaleCWIN17 India / Bigdata architecture  yashowardhan sowale
CWIN17 India / Bigdata architecture yashowardhan sowale
Capgemini
 
Rev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRev_3 Components of a Data Warehouse
Rev_3 Components of a Data Warehouse
Ryan Andhavarapu
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
Aggregage
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Nathan Bijnens
 

Ähnlich wie Balancing data democratization with comprehensive information governance: building data citizenship across your data lakes (20)

Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BIAugmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
 
CWIN17 India / Bigdata architecture yashowardhan sowale
CWIN17 India / Bigdata architecture  yashowardhan sowaleCWIN17 India / Bigdata architecture  yashowardhan sowale
CWIN17 India / Bigdata architecture yashowardhan sowale
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
 
Got data?… now what? An introduction to modern data platforms
Got data?… now what?  An introduction to modern data platformsGot data?… now what?  An introduction to modern data platforms
Got data?… now what? An introduction to modern data platforms
 
Hortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataHortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your data
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, Sydney
 
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraFrom Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
 
Delivering Analytics at The Speed of Transactions with Data Fabric
Delivering Analytics at The Speed of Transactions with Data FabricDelivering Analytics at The Speed of Transactions with Data Fabric
Delivering Analytics at The Speed of Transactions with Data Fabric
 
Rev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRev_3 Components of a Data Warehouse
Rev_3 Components of a Data Warehouse
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
 
The Power of Data
The Power of DataThe Power of Data
The Power of Data
 
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in thereOvum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
 
What's New in Pentaho 7.0?
What's New in Pentaho 7.0?What's New in Pentaho 7.0?
What's New in Pentaho 7.0?
 
Replacing Tape Backup with Cloud-Enabled Solutions by Index Engines
Replacing Tape Backup with Cloud-Enabled Solutions by Index EnginesReplacing Tape Backup with Cloud-Enabled Solutions by Index Engines
Replacing Tape Backup with Cloud-Enabled Solutions by Index Engines
 
Active Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with AlationActive Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with Alation
 
Intro of Key Features of SoftCAAT BI SQL Software
Intro of Key Features of SoftCAAT BI SQL SoftwareIntro of Key Features of SoftCAAT BI SQL Software
Intro of Key Features of SoftCAAT BI SQL Software
 

Mehr von DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Balancing data democratization with comprehensive information governance: building data citizenship across your data lakes

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Balancing data democratization with comprehensive information governance: building data citizenship across your data lakes Sanjeev Mohan Srikanth Venkat Research Analyst, Data Management Strategies Senior Director, Product Management Gartner Hortonworks Inc.
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Your Presenters…. Srikanth Venkat Senior Director of Product Management, Hortonworks Inc. Security & Governance portfolio products & services Apache Ranger, Apache Atlas, Apache Knox, Platform Security, & Hortonworks DataPlane Service – Data Steward Studio(DSS) @srikvenk https://www.linkedin.com/in/srikanthvenkat/ Sanjeev Mohan Research Analyst, Gartner Data Management and Analytics Big Data, Data Governance, Apache Spark, IoT, ML/AI @sanjmo https://www.linkedin.com/in/sanjeev-mohan-498119
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Business Goals for Governed Data Lake • Fast track analytics to provide business agility • Promote collaboration across enterprise roles (knowledge workers, data scientists, data engineers, analysts, data stewards) • Provide users with trusted, understandable data to extract business value • Scale with data volume cost-effectively optimizing existing resources (infra) Structured Data Unstructured Data Data Lake Analyzed Data VARIETY VOLUME VELOCITY VERACITY VALUE (Semi-structured Data) In-memory technologies Fast Access Database Big Data Query Real Time Use Analytical Use Models
  • 4. 4 © 208 Gartner, Inc. and/or its affiliates. All rights reserved. But Compliance Also Need to Be Handled … Industry Compliance Date Effective Region General General Data Protection Regulation (GDPR) 25 May 2018 E.U. Financial Services Basel* (Technically Known as BCBS 239) In effect E.U. Financial Markets Markets in Financial Instruments Directive (MiFID II) January 2018 U.S. Financial Services SEC Rule 17a-4 In effect U.S. Retail/Banking Payment Card Industry Data Security Standard (PCI-DSS) In effect U.S. Health Health Insurance Portability and Accountability Act (HIPAA) In effect U.S. General Privacy Amendment (Notifiable Data Breaches) Bill 2016 February 2018 AU * Basel Committee on Banking Supervision's Regulation No. 239
  • 5. Governance Has Become a Top Priority for High-Quality Analytics. Governance has gone from last to top concern in less than 24 months. Survey question: What challenges are you experiencing with big data? Source: Big Data Maturity Survey (March 2018)
  • 6. Data Governance Framework for High-Quality Analytics [Figure: framework diagram. Labels: Data Privacy, Security and Access Management; Data Discovery and Curation; Data Management; Lineage, Encryption, Audit, Profiling, Physical, Classification, Prepare, Access, Catalog, Metadata, MDM, Archive, Quality, Ingestion, Consumption]
  • 7. Metadata Is The Foundation For Analytics
  • 8. Types of Metadata
      • Technical (Definitional): schemas; data types; data models; configurations; functions
      • Business (Descriptive): metadata mapped to business relationships; multiple data sources to the LOB
      • Social (Descriptive): metadata about party data relationships; user-generated content; tribal knowledge
      • Operational (Descriptive): output from processes; ETL or actions on data; data lineage; data provenance (reproducibility); performance
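As a rough illustration of the four metadata types, a catalog entry could carry one bucket per type. This is only a sketch: the class name `CatalogEntry`, its fields, and the sample table `sales.orders` are assumptions made for the example, not part of any product shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset described by the four metadata types on the slide."""
    name: str
    technical: dict = field(default_factory=dict)    # schemas, data types
    business: dict = field(default_factory=dict)     # business terms, owning LOB
    social: dict = field(default_factory=dict)       # user comments, tribal knowledge
    operational: dict = field(default_factory=dict)  # lineage, job runs, performance

    def missing_types(self) -> list:
        """Return the metadata types that have not been filled in yet."""
        kinds = ("technical", "business", "social", "operational")
        return [k for k in kinds if not getattr(self, k)]


# Hypothetical asset with only technical and business metadata captured so far.
entry = CatalogEntry(
    name="sales.orders",
    technical={"columns": {"order_id": "bigint", "email": "string"}},
    business={"owner_lob": "Retail", "glossary_term": "Customer Order"},
)
print(entry.missing_types())
```

A check like `missing_types` is one simple way a steward could spot assets that still lack social or operational context.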
  • 9. Importance of Metadata for High-Quality Analytics
      • Metadata is used to locate, integrate, access, share, link, govern and analyze data associated with information assets
      • Metadata answers larger questions about data:
          • Data Lineage – lifecycle of data in the pipeline
          • Relationships – discovering links in data from disparate sources
          • Map business functions and consumption to data
          • Optimize business processes and IT infrastructure
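The data lineage question ("where did this dataset come from?") can be sketched as a graph walk. The example below is a minimal sketch that assumes lineage is stored as a plain dictionary from each dataset to its direct sources; the dataset names are hypothetical, and a real catalog would expose lineage through its own API instead.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "dashboard.revenue": ["enriched.orders"],
    "enriched.orders": ["raw.orders", "raw.customers"],
    "raw.orders": [],
    "raw.customers": [],
}


def upstream(dataset: str, lineage: dict) -> list:
    """Return every ancestor of `dataset`, nearest first (breadth-first walk)."""
    seen, order = set(), []
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a dataset reachable by two paths is reported once
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


print(upstream("dashboard.revenue", LINEAGE))
```

Walking the same edges in the opposite direction would give an impact analysis: which downstream assets are affected when a raw source changes.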
  • 10. Data Cataloging for Self-Service and Automated Analytics. Capabilities of a Data Catalog Solution: • Communicate shared semantic meaning • Curate inventory of information assets • Collaborate for accountability and governance. Facilitate, Broker, Enable, Share, Orchestrate
  • 11. Key Capabilities of a Modern Data Preparation Tool: User Collaboration and Operationalization; Data Catalog and Basic Metadata Management; Data Transformation; Data Enrichment; Data Ingestion and Profiling; Data Structuring and Modeling; Basic Data Quality and Security • Data Source Access/Connectivity • Machine Learning • Multiple Deployment Options • Domain/Vertical Solution Accelerators • Integration with Data Integration, Analytics/BI, Data Science, MDM and Information Stewardship Solutions
  • 12. Unified Data Governance Reference Architecture [Figure: reference architecture diagram. Zones: Raw/Landing/Secure Zone; Enriched/Discovery Zone (Data Transformation); Consumption Zone. Storage: HDFS/S3/DBMS; Hive/S3/DBMS; HBase/S3/DBMS. Sources: RDBMS/EDW, Logs/Emails, Social Media, IoT Sensor, File CSV, Source API. Consumers: Self-Service Dashboards, Advanced Analytics, Operational Analytics, Compliance Analytics, Downstream Applications; Data Scientists, BI Analysts, Data Analysts, Developer, Data Steward. Governance: AD/LDAP/Kerberos, SSO/ACL, RBAC, ABAC; Data Access, Data at Rest, Data in Motion; Encrypt, Profile, Classify, Tokenize, Masking; Index, Search, Catalog, Metadata, Lineage, Auditing; Data Wrangling, Data Quality, MDM, Self-Service Data Prep.] S3 = Amazon Simple Storage Service; Hive = Apache Hive; HBase = Apache HBase
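The classify, mask, and tokenize steps named in the architecture can be illustrated with a small sketch. Everything here is an assumption made for the example: the regex-based email check stands in for a real profiler, and the salted SHA-256 hash stands in for a real tokenization service (which would normally keep a protected, reversible mapping).

```python
import hashlib
import re

# Simple email pattern, used only to illustrate classification.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def classify(column_values: list) -> str:
    """Tag a column as 'PII' if every non-empty value looks like an email."""
    values = [v for v in column_values if v]
    if values and all(EMAIL.fullmatch(v) for v in values):
        return "PII"
    return "PUBLIC"


def mask(value: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a stable, non-reversible token (salted SHA-256 stand-in)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


emails = ["ann@example.com", "bob@example.org"]  # hypothetical column sample
print(classify(emails))
print(mask(emails[0]))
print(tokenize(emails[0]) == tokenize(emails[0]))  # same input, same token
```

Masking keeps values readable enough for analysis, while tokenization keeps them joinable across tables without exposing the original value.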
  • 13. Using Open Source Tools for Governed Data Lake: Hortonworks Approach
  • 14. Data Management in a Data Lake POV – Example Responsibilities (column headings: Business; Technical (IT)) • Maintain data definitions and tiers • Provide data stewardship • Specify data quality rules • Define data protection standards • Own and act as SME for data • Specify requirements for any governance or management of any semi-structured or unstructured data • Enable data lineage capabilities • Architect solution for data quality rules and standards to be applied and enforced • Maintain data management tools to ensure governance, quality, metadata, data security, privacy, and chain of custody
  • 15. Hortonworks Governed Data Lake Blueprint [Figure: blueprint diagram with numbered callouts 1 to 11. Platforms: Hortonworks Data Platform (HDP®), data-at-rest; Hortonworks DataFlow (HDF™), data-in-motion; Hortonworks Data Lifecycle Manager; Hortonworks DataPlane Service (DPS) Core; Hortonworks Data Steward Studio. Sources: RDBMS/EDW, Files, Streams & Feeds, Batch, CDC, CSV, Semi-JSON, Unstructured, IoT, API, Streaming. Security and governance: AuthN, SSO; AuthZ Policy Engine, Entitlements, & Audits; Masking/Filtering; Tokenization; Key Management (KMS); TDE; Audits (Lineage, Metadata, Enterprise Catalog, Governance); Metadata & Lineage; Policies; Audits & policy metadata; Directory Servers LDAP/AD/Linux; Incremental Synchronization. Data Steward Studio: Data Profile & asset collection, Business Metadata/tags, Catalog, Audit. Data Lifecycle Manager: Replication/DR, Backup/Restore, Auto-tiering. Roles: Data Analyst/Data Scientist (BI/Data Science tools), Data Steward, Admin, Infra Admin. Legend: Metadata Flow, Data Flow, Encryption at Rest, In-transit Encryption]
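The policy engine, masking, and audit boxes in the blueprint suggest tag-based authorization: a decision is made from the classification tags on an asset and the role of the requester, and every decision is recorded. The following is a minimal sketch of that idea only. The `Policy` class, the `decide` function, the roles, and the first-match-wins rule are all assumptions for illustration and do not reproduce the Apache Ranger policy model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tag: str     # classification tag on the asset, e.g. "PII"
    role: str    # role the policy applies to
    action: str  # "allow", "mask", or "deny"


# Hypothetical policies; the first matching policy wins.
POLICIES = [
    Policy(tag="PII", role="data_steward", action="allow"),
    Policy(tag="PII", role="analyst", action="mask"),
]

AUDIT_LOG = []


def decide(user: str, role: str, asset_tags: set) -> str:
    """Return the action for this request and record it in the audit log."""
    # Untagged assets are open; tagged assets are denied unless a policy matches.
    action = "allow" if not asset_tags else "deny"
    for policy in POLICIES:
        if policy.tag in asset_tags and policy.role == role:
            action = policy.action
            break
    AUDIT_LOG.append(
        {"user": user, "role": role, "tags": sorted(asset_tags), "action": action}
    )
    return action


print(decide("ann", "analyst", {"PII"}))       # analyst gets masked values
print(decide("bob", "data_steward", {"PII"}))  # steward gets full access
print(len(AUDIT_LOG))                          # every decision is audited
```

Keeping the default for tagged assets at "deny" means a newly classified column stays protected until someone writes a policy for it.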
  • 16. Data Lake Design Best Practices. Governed Data Lake: Trusted Data from All Sources in a Single Place
      • Store both structured and unstructured data, in both raw and “prepared” forms
      • Data sourcing and derivation should tie to the use case roadmap
      • Capture data from the right sources, at the right frequency, and at the right quality
      • Data can come from internal and external (partner) sources, including freely available public data sources
          • E.g. government data sources, social media, weather, etc.
      • Govern and document the data pipelines that are built – avoid the data swamp
      • Enrich with metadata to promote collaboration and crowdsourcing
      • Just enough data protection
          • A data lake will almost always contain some “sensitive” data – personal data such as PII, PCI, PHI, etc.
          • Rational security and privacy controls in place
      • Support the goal of making “all” data available to “all” teams responsibly for BI/Analytics or for Data Science
  • 17. Recommendations • Begin the data governance journey at the PoC stage. Don’t make it an afterthought • Invest in comprehensive data governance tools • Start with the use case driving the greatest business value and demand, and add other use cases over time and across initiatives • Collaborate on improving data quality
  • 18. Check Out These Sessions:
      • DISCOVER with Data Steward Studio: Understanding and unlocking the value of data in hybrid enterprise data lake environments. When: Tuesday June 19, 4:00 PM - 4:40 PM. Where: Meeting Room 230C
      • What Is New In Apache Atlas 1.0? When: Wednesday June 20, 11:00 AM - 11:40 AM. Where: Grand Ballroom 220B
      • Overview of New Features in Apache Ranger. When: Wednesday June 20, 2:00 PM - 2:40 PM. Where: Executive Ballroom 210B/F
      • GDPR Crash Course. When: Wednesday June 20, 3:00 PM - 6:00 PM. Where: Meeting Room 212C/D
      • Birds of a Feather: Security & Governance. When: Wednesday June 20, 5:40 PM - 6:50 PM. Where: Executive Ballroom 210B/F
      • GDPR-Focused Partner Community Showcase for Apache Ranger and Apache Atlas. When: Thursday June 21, 9:30 AM - 10:10 AM. Where: Meeting Room 230A
  • 19. Questions?