1. 1 © Hortonworks Inc. 2011–2018. All rights reserved
Balancing data democratization with
comprehensive information governance:
building data citizenship across your data lakes
Sanjeev Mohan, Research Analyst, Data Management Strategies, Gartner
Srikanth Venkat, Senior Director, Product Management, Hortonworks Inc.
2. 2 © Hortonworks Inc. 2011–2018. All rights reserved
Your Presenters
Srikanth Venkat
Senior Director of Product Management,
Hortonworks Inc.
Security & Governance portfolio products & services
Apache Ranger, Apache Atlas, Apache Knox, Platform Security, &
Hortonworks DataPlane Service – Data Steward Studio (DSS)
@srikvenk
https://www.linkedin.com/in/srikanthvenkat/
Sanjeev Mohan
Research Analyst,
Gartner
Data Management and Analytics
Big Data, Data Governance, Apache Spark, IoT, ML/AI
@sanjmo
https://www.linkedin.com/in/sanjeev-mohan-498119
3. 3 © Hortonworks Inc. 2011–2018. All rights reserved
Business Goals for Governed Data Lake
• Fast track analytics to provide business agility
• Promote collaboration across enterprise roles (knowledge workers, data scientists, data engineers, analysts, data stewards)
• Provide users with trusted, understandable data to extract business value
• Scale cost-effectively with data volume by optimizing existing resources (infrastructure)
[Figure: structured, semi-structured, and unstructured data feed the data lake, which produces analyzed data. Dimensions shown: variety, volume, velocity, veracity, value. Other labels: in-memory technologies, fast-access database, big data query, real-time use, analytical use models.]
4. 4 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
But Compliance Also Needs to Be Handled …
Industry | Compliance | Date Effective | Region
General | General Data Protection Regulation (GDPR) | 25 May 2018 | E.U.
Financial Services | Basel* (technically known as BCBS 239) | In effect | E.U.
Financial Markets | Markets in Financial Instruments Directive (MiFID II) | January 2018 | E.U.
Financial Services | SEC Rule 17a-4 | In effect | U.S.
Retail/Banking | Payment Card Industry Data Security Standard (PCI-DSS) | In effect | U.S.
Health | Health Insurance Portability and Accountability Act (HIPAA) | In effect | U.S.
General | Privacy Amendment (Notifiable Data Breaches) Bill 2016 | February 2018 | AU
* Basel Committee on Banking Supervision's Regulation No. 239
5. 5 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Governance Has Become a Top Priority for High-Quality Analytics
Governance has gone from last to top concern in less than 24 months
Source: Big Data Maturity Survey (March 2018)
What challenges are you experiencing with big data?
6. 6 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Data Governance Framework for High Quality Analytics
[Figure: governance framework spanning the pipeline from ingestion to consumption. Layers shown: Data Privacy, Security and Access Management; Data Discovery and Curation; Data Management. Capabilities shown: Lineage, Encryption, Audit, Profiling, Physical, Classification, Prepare, Access, Catalog, Metadata, MDM, Archive, Quality.]
7. 7 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Metadata Is the Foundation for Analytics
8. 8 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Types of Metadata
Technical (Definitional)
• Schemas
• Data types
• Data models
• Configurations
• Functions
Business (Descriptive)
• Metadata mapped to business relationships
• Multiple data sources to the LOB
Social (Descriptive)
• Metadata about party data relationships
• User-generated content
• Tribal knowledge
Operational (Descriptive)
• Output from processes, ETL, or actions on data
• Data lineage
• Data provenance (reproducibility)
• Performance
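As a rough illustration of the four metadata types listed above, the sketch below groups them in one catalog record. The class name `AssetMetadata`, its fields, and the sample values are hypothetical and chosen only for this example; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssetMetadata:
    """Hypothetical catalog record grouping the four metadata types."""
    name: str
    technical: Dict[str, str] = field(default_factory=dict)    # column name -> data type
    business: Dict[str, str] = field(default_factory=dict)     # e.g. owning LOB, glossary term
    social: List[str] = field(default_factory=list)            # user comments, tribal knowledge
    operational: Dict[str, str] = field(default_factory=dict)  # e.g. last ETL job, upstream source

    def missing_types(self) -> List[str]:
        """Return the metadata types that are still empty for this asset."""
        kinds = ("technical", "business", "social", "operational")
        return [kind for kind in kinds if not getattr(self, kind)]


# Sample asset (invented values): no social metadata has been captured yet.
orders = AssetMetadata(
    name="sales.orders",
    technical={"order_id": "bigint", "email": "string"},
    business={"lob": "Retail", "term": "Customer order"},
    operational={"last_job": "nightly_etl", "source": "crm.orders_raw"},
)
print(orders.missing_types())
```

A check like `missing_types()` is one simple way a data steward could spot assets that have been ingested but not yet described by their users.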
9. 9 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Importance of Metadata for High Quality Analytics
• Metadata is used to locate, integrate, access, share, link,
govern and analyze data associated with information assets
• Metadata answers larger questions about data:
• Data Lineage – lifecycle of data in the pipeline
• Relationships – discovering links in data from disparate sources
• Map business functions and consumption to data
• Optimize business processes and IT infrastructure
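To make the lineage and relationship questions above concrete, here is a minimal sketch that stores lineage as a mapping from each dataset to its direct sources and walks it to find everything a report depends on. The dataset names and the `upstream` helper are assumptions made for illustration only.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each dataset maps to the datasets it was derived from.
LINEAGE: Dict[str, List[str]] = {
    "reports.revenue": ["curated.orders", "curated.customers"],
    "curated.orders": ["raw.orders"],
    "curated.customers": ["raw.crm_export"],
    "raw.orders": [],
    "raw.crm_export": [],
}


def upstream(dataset: str, lineage: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen: Set[str] = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


print(sorted(upstream("reports.revenue", LINEAGE)))
```

The same traversal run in the opposite direction (sources to consumers) would answer impact questions, such as which reports are affected when a raw feed changes.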
10. 10 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Data Cataloging for Self Service and Automated Analytics
Capabilities of a Data Catalog Solution
• Communicate shared semantic meaning
• Curate inventory of information assets
• Collaborate for accountability and governance
Facilitate, Broker, Enable, Share, Orchestrate
11. 11 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Key Capabilities of a Modern Data Preparation Tool
• User Collaboration and Operationalization
• Data Catalog and Basic Metadata Management
• Data Transformation
• Data Enrichment
• Data Ingestion and Profiling
• Data Structuring and Modeling
• Basic Data Quality and Security
• Data Source Access/Connectivity
• Machine Learning
• Multiple Deployment Options
• Domain/Vertical Solution Accelerators
• Integration with Data Integration, Analytics/BI, Data Science, MDM and Information Stewardship Solutions
12. 12 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Unified Data Governance Reference Architecture
[Figure: reference architecture. Labels recovered from the diagram:]
• Sources: RDBMS/EDW, Logs/Emails, Social Media, IoT Sensor, File (CSV)
• Zones: Raw/Landing/Secure Zone; Enriched/Discovery Zone (Data Transformation); Consumption Zone
• Storage: HDFS/S3/DBMS, Hive/S3/DBMS, HBase/S3/DBMS
• Data states: Data in Motion, Data at Rest, Data Access
• Protection: Profile, Classify, Tokenize, Masking, Encrypt
• Access control: API Governance, AD/LDAP/Kerberos, SSO/ACL, RBAC, ABAC
• Governance services: Index, Search, Catalog, Metadata, Lineage, Auditing, Data Wrangling, Data Quality, MDM, Self-Service Data Prep.
• Consumption: Self-Service Dashboards, Advanced Analytics, Operational Analytics, Compliance Analytics, Downstream Applications
• Roles: Data Scientists, BI Analysts, Developer, Data Steward, Data Analysts
S3 = Amazon Simple Storage Service Hive = Apache Hive HBase = Apache HBase
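The access layer of the reference architecture lists RBAC and ABAC alongside classification. The sketch below shows one way the two could combine: a role check per classification tag (RBAC) followed by an attribute rule on region (ABAC). The tag names, roles, the region rule, and the `can_read` function are all hypothetical; this is not the policy model of any specific product.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class User:
    name: str
    roles: FrozenSet[str]
    region: str


@dataclass(frozen=True)
class Dataset:
    name: str
    tags: FrozenSet[str]  # classification tags assigned during profiling
    region: str


# Hypothetical RBAC table: roles allowed to read data carrying each tag.
ROLES_ALLOWED_FOR_TAG = {
    "PII": frozenset({"data_steward", "compliance_analyst"}),
    "PUBLIC": frozenset({"data_steward", "compliance_analyst", "bi_analyst", "data_scientist"}),
}


def can_read(user: User, dataset: Dataset) -> bool:
    """Combine a role check per tag (RBAC) with an attribute rule (ABAC)."""
    if not dataset.tags:
        return False  # assumption: unclassified data is denied until it is tagged
    for tag in dataset.tags:
        allowed = ROLES_ALLOWED_FOR_TAG.get(tag, frozenset())  # unknown tags deny
        if not (user.roles & allowed):
            return False
    # ABAC rule (assumed for illustration): PII may only be read within its own region.
    if "PII" in dataset.tags and user.region != dataset.region:
        return False
    return True


steward_eu = User("ana", frozenset({"data_steward"}), "EU")
analyst_eu = User("bob", frozenset({"bi_analyst"}), "EU")
customers = Dataset("curated.customers", frozenset({"PII"}), "EU")
print(can_read(steward_eu, customers), can_read(analyst_eu, customers))
```

Denying untagged data by default is a design choice in this sketch: it forces profiling and classification to happen before data reaches the consumption zone.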
13. 13 © Hortonworks Inc. 2011–2018. All rights reserved
Using Open Source Tools for Governed Data Lake:
Hortonworks Approach
14. 16 © Hortonworks Inc. 2011–2018. All rights reserved
Data Management in a Data Lake POV – Example Responsibilities
Business
• Maintain data definitions and tiers
• Provide data stewardship
• Specify data quality rules
• Define data protection standards
• Own and act as SME for data
• Specify requirements for any governance or management of any semi-structured or unstructured data
Technical (IT)
• Enable data lineage capabilities
• Architect solution for data quality rules and standards to be applied and enforced
• Maintain data management tools to ensure governance, quality, metadata, data security, privacy, and chain of custody
15. 17 © Hortonworks Inc. 2011–2018. All rights reserved
Hortonworks Governed Data Lake Blueprint
[Figure: blueprint diagram with numbered callouts 1 to 11. Labels recovered from the diagram:]
• Sources and ingestion: RDBMS/EDW, Files (CSV, semi-structured JSON, unstructured), Streams & Feeds, IoT, API; Batch, CDC, Streaming
• Platforms: Hortonworks DataFlow (HDF™) for data-in-motion; Hortonworks Data Platform (HDP®) for data-at-rest
• Security: AuthN, SSO; AuthZ Policy Engine, Entitlements, & Audits; Masking/Filtering; Tokenization; Key Management (KMS); TDE
• Governance: Metadata & Lineage (Lineage, Metadata, Enterprise Catalog, Governance); Audits
• Hortonworks DataPlane Service (DPS) Core
• Hortonworks Data Lifecycle Manager: Policies, Incremental Synchronization, Replication/DR, Backup/Restore, Auto-tiering, Audits & policy metadata
• Hortonworks Data Steward Studio: Data Profile & asset collection, Business Metadata/tags, Catalog, Audit
• Directory Servers: LDAP/AD/Linux
• Roles: Data Analyst/Data Scientist (BI/Data Science tools), Data Steward, Infra Admin, Admin
• Legend: Metadata Flow, Data Flow, Encryption at Rest, In-transit Encryption
16. 18 © Hortonworks Inc. 2011–2018. All rights reserved
Data Lake Design Best Practices
Governed Data Lake: Trusted Data from All Sources in a Single Place
• Store both structured and unstructured data, in both raw and “prepared” forms
• Tie data sourcing and derivation to the use case roadmap
• Capture data from the right sources, at the right frequency, and at the right quality
• Data can come from internal and external (partner) sources, including freely available public data sources
  • E.g. government data sources, social media, weather
• Govern and document the data pipelines that are built to avoid a data swamp
• Enrich with metadata to promote collaboration and crowdsourcing
• Just enough data protection
  • A data lake will almost always contain some “sensitive” data: personal data such as PII, PCI, PHI, etc.
  • Put rational security and privacy controls in place
• Support the goal of making “all” data available to “all” teams responsibly for BI/Analytics or for Data Science
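The "just enough data protection" practice can be sketched with two simple controls applied before data is shared: a keyed, deterministic token for identifiers and a partial mask for e-mail addresses. This is an illustrative assumption, not the deck's method: the tokenization here is HMAC-based pseudonymization (one-way), not vault-backed reversible tokenization, and the key, function names, and record are invented for the example.

```python
import hashlib
import hmac

# Assumption: a demo key; in practice the key would come from a key management service.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic keyed token: equal inputs give equal tokens, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return "*" * len(email)  # not an e-mail address: mask everything
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


# Invented sample record for illustration.
record = {"customer_id": "C-1001", "email": "jane.doe@example.com", "city": "Berlin"}
protected = {
    "customer_id": tokenize(record["customer_id"]),  # identifier replaced by a token
    "email": mask_email(record["email"]),            # direct identifier partially hidden
    "city": record["city"],                          # non-sensitive field left as-is
}
print(protected)
```

Because the token is deterministic, analysts can still join and count by customer without seeing the original identifier, which is the balance between access and protection the slide describes.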
17. 19 © 2018 Gartner, Inc. and/or its affiliates. All rights reserved.
Recommendations
• Begin the data governance journey at the PoC stage. Don’t make it an afterthought
• Invest in comprehensive data governance tools
• Start with the use case driving the greatest business value and demand, and add other use cases over time and across initiatives
• Collaborate on improving data quality
18. Check Out These Sessions:
• DISCOVER with Data Steward Studio: Understanding and unlocking the value of data in hybrid enterprise data lake environments
  When: Tuesday June 19, 4:00 PM - 4:40 PM
  Where: Meeting Room 230C
• What Is New In Apache Atlas 1.0?
  When: Wednesday June 20, 11:00 AM - 11:40 AM
  Where: Grand Ballroom 220B
• Overview of New Features in Apache Ranger
  When: Wednesday June 20, 2:00 PM - 2:40 PM
  Where: Executive Ballroom 210B/F
• GDPR Crash Course
  When: Wednesday June 20, 3:00 PM - 6:00 PM
  Where: Meeting Room 212C/D
• Birds of a Feather: Security & Governance
  When: Wednesday June 20, 5:40 PM - 6:50 PM
  Where: Executive Ballroom 210B/F
• GDPR-Focused Partner Community Showcase for Apache Ranger and Apache Atlas
  When: Thursday June 21, 9:30 AM - 10:10 AM
  Where: Meeting Room 230A