1. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Implementing a Data Lake with Enterprise
Grade Data Governance
We do Hadoop.
2. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Your speakers
Andrew Ahn
Governance Product Manager, Hortonworks
Oliver Claude
CMO at Waterline
4. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Enterprise Data Governance Goals
GOAL: Provide a common approach to
data governance across all systems
and data within the organization
• Transparent
Governance standards & protocols must be
clearly defined and available to all
• Reproducible
Recreate the relevant data landscape at a
point in time
• Auditable
All relevant events and assets must be
traceable with appropriate historical lineage
• Consistent
Compliance practices must be consistent
[Diagram: Governance Framework. Systems shown: ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive]
5. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Challenges WITHIN Hadoop
• No comprehensive governance within
the Hadoop stack
• Mostly disjoint as each project defines its own
future and there is no common framework
• Disparate tools, such as HCatalog, Ranger, and
Falcon, provide pieces of the overall solution
• No integration with external governance
frameworks
• Difficult to get right because each project
is autonomous and you need insight into
traditional IT
[Diagram: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
6. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Initiative for Hadoop
[Diagram: Data Governance Initiative, Common Governance Framework. Enterprise systems shown: ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive. Hadoop projects shown: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
TWO Requirements
1. Hadoop must snap in to
the existing frameworks
and be a good citizen
2. Hadoop must also provide
governance within its own
stack of technologies
A group of companies dedicated to meeting
these requirements in the open
7. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Common Data Governance Use Cases
Financial Reporting
Chain of custody, Lineage Narratives
Telco
Device log management, Correlation, Analysis, and Mitigation
Retail
Point of sale analysis, Price optimization
Healthcare
30 day measures reporting
9. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
New Project Proposal: Apache Atlas
Apache Atlas
Proposed open source project
aimed at solving the Hadoop
data governance challenge in
the open.
Key Capabilities
• Data Classification
• Metadata Exchange
• Centralized Auditing
• Search & Lineage (Browse)
• Security & Policy Engine
[Diagram: Apache Atlas. Components shown: Knowledge Store, Audit Store, Models, Type-System, Policy Rules, Taxonomies, Tag Based Policies, Data Lifecycle Management, Real Time Tag Based Access Control, REST API, Services (Search, Lineage, Exchange). Industry taxonomies shown: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
Essential
Timeline
Phase 3
• Collaboration Features
• Self Service
• Steward Delegation
• Profiling & Pattern Analysis
• Visualization
Phase 2
• Advanced audit reporting
• Advanced Policy Engine
• Row / Column Masking
• 3rd party Metadata exchange
1H 2015 GA
• REST API
• Centralized Taxonomy
• Import / export metadata
• Basic Policy Rules Engine
• Real-time access control
• Column Level Tagging
10. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview
Data Classification
• Import or define business-oriented taxonomy annotations for data
• Define, annotate, and automate capture of relationships between data sets and underlying
elements including source, target, and derivation processes
• Export metadata to third-party systems
Centralized Auditing
• Capture security access information for every application, process, and interaction with data
• Capture the operational information for execution, steps, and activities
Search & Lineage (Browse)
• Pre-defined navigation paths to explore the data classification and audit information
• Text-based search features locate relevant data and audit events across the data lake quickly
and accurately
• Browse a visualization of data set lineage, allowing users to drill down into operational, security,
and provenance-related information
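The relationship capture described above (source, target, and derivation process) can be pictured as a small directed graph that is walked backwards to answer "where did this data set come from?". The following is a minimal sketch only: the `LineageGraph` class, its methods, and the data set names are hypothetical illustrations and are not part of any Apache Atlas API.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage store: target -> [(source, process)]."""

    def __init__(self):
        self._inputs = defaultdict(list)

    def record(self, source, target, process):
        """Record that `process` derived `target` from `source`."""
        self._inputs[target].append((source, process))

    def upstream(self, dataset):
        """Return every data set that `dataset` was derived from, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for source, _process in self._inputs.get(current, []):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Illustrative data set and process names (assumptions, not from the deck).
graph = LineageGraph()
graph.record("raw_pos_logs", "cleaned_sales", "etl_cleanse")
graph.record("cleaned_sales", "price_report", "hive_aggregate")
graph.record("store_master", "price_report", "hive_aggregate")
ancestors = graph.upstream("price_report")
```

Storing edges keyed by target keeps the "drill down into provenance" question a simple backwards traversal; a production system would persist these edges rather than hold them in memory.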
Security & Policy Engine
• Rationalize compliance policy at runtime based on data classification schemes
• Advanced definition of policies for preventing data derivation based on classification (i.e.,
re-identification)
11. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
Knowledge Store
Knowledge store categorized with appropriate business-oriented
taxonomy
• Data sets & objects
• Tables / Columns
• Logical context
• Source, destination
Support exchange of metadata between foundation
components and third-party applications/governance tools
Leverages existing Hadoop metastores
[Diagram: Apache Atlas. Components shown: Knowledge Store, Audit Store, Policy Engine, Data Lifecycle Management, Security, REST API, Services (Search, Lineage, Exchange), Models, Type-System, Policy Rules, Taxonomies. Industry taxonomies shown: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
12. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
Data Lifecycle Management
Leverage existing investment in Apache Falcon with a
focus on:
• Provenance
• Multi-cluster replication
• Data set retention/eviction
• Late data handling
• Automation
13. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
Audit Store
Historical repository for all governance events
• Security: Access Grant & Deny
• Operational: Data Provenance & Metrics
• Indexed and Searchable
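To make "historical repository ... indexed and searchable" concrete, here is a minimal sketch of an append-only event log with a per-resource index. Everything in it is an assumption for illustration: the `AuditEvent` fields, the `AuditStore` class, and the category and outcome labels are hypothetical and do not reflect the actual Atlas audit schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One governance event; frozen so a stored event cannot be modified."""
    timestamp: datetime
    user: str
    resource: str
    category: str  # "security" or "operational" (labels assumed for this sketch)
    outcome: str   # e.g. "grant" or "deny" for security events


class AuditStore:
    """Hypothetical append-only store indexed by resource name."""

    def __init__(self):
        self._events = []
        self._by_resource = defaultdict(list)

    def append(self, event):
        self._events.append(event)
        self._by_resource[event.resource].append(event)

    def search(self, resource, outcome=None):
        """Return events for a resource, optionally filtered by outcome."""
        hits = self._by_resource.get(resource, [])
        return [e for e in hits if outcome is None or e.outcome == outcome]


store = AuditStore()
when = datetime(2015, 1, 15, 9, 30, tzinfo=timezone.utc)
store.append(AuditEvent(when, "alice", "sales.customers", "security", "grant"))
store.append(AuditEvent(when, "bob", "sales.customers", "security", "deny"))
denied = store.search("sales.customers", outcome="deny")
```

The store exposes only `append` and `search`, which mirrors the idea of a historical record that is added to and queried but never edited.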
14. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
Security
Integration with HDP Advanced Security investments
to ensure compliance.
Establish global security policies based on data
classification.
Leverages Ranger plug-in architecture for policy
enforcement
15. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
Policy Engine
Runtime rationalization of policy rules with respect to
data asset combinations and time. Fully extensible.
• Metadata based
• Geo based rules
• Time-based rules
• Hive Column Prohibitions
• Preview: Hive Row and Column Masking
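The metadata-based and time-based rules listed above can be illustrated with a small evaluator that decides access from the classification tags on a column rather than from the column name. This is a sketch under stated assumptions: the `Rule` structure, the `is_allowed` function, the tag and group names, and the default-allow behavior for untagged columns are all hypothetical choices, not the Atlas or Ranger policy model.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Rule:
    """Hypothetical tag-based rule with an optional time-of-day window."""
    tag: str                      # classification tag the rule applies to
    allowed_groups: frozenset     # groups permitted to read tagged data
    start: time = time(0, 0)      # window start (inclusive)
    end: time = time(23, 59, 59)  # window end (inclusive)


def is_allowed(rules, column_tags, user_group, at):
    """Deny if any rule matching a tag on the column is not satisfied."""
    for rule in rules:
        if rule.tag in column_tags:
            if user_group not in rule.allowed_groups:
                return False
            if not (rule.start <= at <= rule.end):
                return False
    return True  # untagged columns are allowed in this sketch


# Example: PII may be read only by the compliance group during business hours.
rules = [Rule("PII", frozenset({"compliance"}), time(8, 0), time(18, 0))]
in_hours = is_allowed(rules, {"PII"}, "compliance", time(9, 0))
wrong_group = is_allowed(rules, {"PII"}, "analyst", time(9, 0))
after_hours = is_allowed(rules, {"PII"}, "compliance", time(20, 0))
```

Because the decision depends only on tags, re-classifying a column changes its effective policy without rewriting any rule, which is the appeal of tag-based policies.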
16. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
RESTful interface
• Extensible enterprise classification of data assets,
relationships, and policies organized in a meaningful
way, aligned to business organization.
• Supports exploration via user interface
• Supports extensibility via API and CLI exposure
18. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
Enhanced Audit Store
Historical repository for all governance events
• Immutable file format
• Events Metadata Taggable
• Advanced Reporting
• Security: Access Grant & Deny
• Operational: Data Provenance & Metrics
• Indexed and Searchable
21. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Governance Ready Certification Program
Curated group of vendor partners to provide
rich & complete features
Customers choose features that they want to
deploy – a la carte.
Low switching costs!
HDP at core to provide stability and
interoperability
[Diagram: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
22. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Waterline Data improves speed to value and
compliance
Data
Warehouse Offload
Data Science/
Analytics Sandbox
Data Lake
VALUE
CREATION
COST
SAVINGS
Deliver a
Business-Ready
Data Lake
Accelerate Data
Prep Process
Govern Data in
Hadoop
23. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find, understand and govern data in Hadoop
25. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview
REST API
Business Glossary
Automated Classification (Tagging)
Automated Lineage Discovery
Profiling and Data Quality
Schema Discovery
Change Detection and Audit
• Glossary
• Tags
• Lineage
• Models
26. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Governance Ready Certification Program
[Diagram: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
27. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision
28. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Waterline Data is like Amazon.com for data in
Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
33. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find, understand and govern data in Hadoop
Big Data IT Architect
Deliver a Business-
Ready Data Lake
Data Engineer/Data Scientist
Accelerate Data Prep
Process
CDO/Data Steward
Govern Data in
Hadoop
34. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Deliver a business-ready data lake
“It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a
service to help the business find, understand, and govern data in Hadoop.”
Joe DosSantos, EMC Big Data Practice Leader
36. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Accelerate data prep process
“80% of Big Data analytics is data prep, and 80% of data prep is inventorying data.”
Data Engineering Director, Financial Services
37. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Accelerate data prep process
“Waterline Data fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data,
which in turn can help analytic teams provision the right data for their analyses.”
Tony Baer, Principal Analyst, Ovum
38. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Govern data in Hadoop
“Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by
other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any
data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a
data swamp. And without metadata, every subsequent use of data means analysts start from scratch.”
“Gartner Says Beware of the Data Lake Fallacy” post on the Gartner website
39. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Govern data in Hadoop
“The first step to governing Big Data is to build an inventory.”
Sunil Soares, Managing Partner, Information Asset
40. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practice approach to implement an enterprise
grade data lake
6. Monitor and maintain
5. Open up to users
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
41. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
1. Create and populate landing area
• Create landing directory structure
• Set up ETL processes using Falcon to orchestrate
• Implement ETL jobs using ETL tools (Syncsort, Talend, Informatica, etc.), Hadoop tools
(Sqoop, Flume, etc.), or FTP
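One way to approach the landing directory structure is to derive every landing path from the source system, data set, and load date, so that ingestion jobs and later crawls agree on where files live. The sketch below is illustrative only: the `/data/landing` root, the `landing_path` helper, and the `year=/month=/day=` partition naming are assumptions, not a layout prescribed by the deck.

```python
from datetime import date
from pathlib import PurePosixPath

# Assumed root of the landing area; adjust to the cluster's own conventions.
LANDING_ROOT = PurePosixPath("/data/landing")


def landing_path(source_system, dataset, load_date):
    """Build a date-partitioned landing directory for one data set load."""
    return (
        LANDING_ROOT
        / source_system
        / dataset
        / f"year={load_date.year:04d}"
        / f"month={load_date.month:02d}"
        / f"day={load_date.day:02d}"
    )


# Hypothetical source system and data set names.
path = landing_path("crm", "customers", date(2015, 3, 7))
```

Here `str(path)` is `/data/landing/crm/customers/year=2015/month=03/day=07`. Keeping the source system in the path preserves where each file came from, which helps later lineage and retention decisions.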
Falcon
42. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
2. Build inventory of data
1. Create and populate landing area
• Crawl the cluster
• Profile files
• Automatically discover technical, business, and compliance metadata at a field level
• Create Hive tables as needed
• Import lineage
• Export to Atlas
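Field-level discovery of the kind listed above can be approximated by profiling the values in each column and proposing a tag when most values match a known pattern. The following is a deliberately simple sketch: the `PATTERNS` table, the `profile_fields` function, the tag names, and the 80% threshold are hypothetical, and it does not represent how Waterline Data Inventory or Atlas actually classify data.

```python
import csv
import io
import re

# Illustrative patterns only; real classifiers use far richer evidence.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def profile_fields(csv_text, threshold=0.8):
    """Suggest tags per field when at least `threshold` of non-empty values match."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    tags = {}
    for field in reader.fieldnames or []:
        values = [row[field] for row in rows if row[field]]
        suggested = []
        for tag, pattern in PATTERNS.items():
            if not values:
                continue
            matches = sum(1 for value in values if pattern.match(value))
            if matches / len(values) >= threshold:
                suggested.append(tag)
        tags[field] = suggested
    return tags


sample = (
    "name,contact,ssn\n"
    "Ann,ann@example.com,123-45-6789\n"
    "Bob,bob@example.com,987-65-4321\n"
)
result = profile_fields(sample)
```

Using a match ratio rather than requiring every value to match tolerates a few dirty rows, at the cost of occasional false positives that a data steward would review.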
Falcon
HCatalog
Atlas
43. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
• Import business glossary terms and export new tags and updated definitions
• Synchronize Atlas and Waterline Data Inventory
• Export metadata and lineage from Hadoop to Enterprise repository
Falcon
HCatalog
Atlas
44. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
• Use Waterline Data Inventory to find sensitive data
• Create access privileges in Ranger
• Encrypt or de-identify
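As one possible reading of "de-identify", sensitive fields can be replaced with keyed hashes so that records remain joinable without exposing the raw value. This sketch uses HMAC-SHA256 pseudonymization, which is a technique chosen here for illustration and not one named in the deck; the function names, field names, and demo key are assumptions.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (hex string)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def deidentify_record(record, sensitive_fields, secret_key):
    """Return a copy of `record` with the sensitive fields pseudonymized."""
    return {
        field: pseudonymize(value, secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Demo key only; a real deployment would load the key from a secrets manager.
key = b"demo-key-not-for-production"
record = {"name": "Ann", "ssn": "123-45-6789"}
masked = deidentify_record(record, {"ssn"}, key)
```

The same input and key always produce the same digest, so joins across data sets still work. Whether pseudonymization is sufficient depends on the applicable regulation; encryption or removal of the field may be required instead.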
HCatalog
Ranger
Falcon
Atlas
45. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
5. Open up to users
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
• Create account with Kerberos, LDAP, etc.
• Set up ACLs (leverage Ranger)
• Users can browse securely through Waterline Data Inventory
HCatalog
Ranger
Falcon
Atlas
46. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
6. Monitor and maintain
5. Open up to users
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
• Continue profiling new or changed files and sync with Atlas
• Continue monitoring for sensitive data, use Ranger to protect
• Build a folksonomy and synchronize with business glossary in Atlas and Enterprise Business
Glossary
HCatalog
Ranger
Falcon
Atlas
47. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find, understand and govern data in Hadoop
Discover lineage and business metadata automatically, and manage metadata
CDO/Data Steward
Automate cataloging of data assets at scale, with secure provisioning to business users
Big Data Architect
Find and understand best-suited and most trusted data without having to explore every file manually
Data Engineer/Data Scientist/Business Analyst
49. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Waterline Data & Hortonworks
http://hortonworks.com/partner/waterline-data
Joint tutorial: bit.ly/DataLakeTutorial
Modern Data Architecture Paper: go.waterlinedata.com/hw-mda
50. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SAN JOSE
June 9-11
BRUSSELS
April 15-16
• Deep-dive technical content
• 65+ sessions and 5 tracks
• 1,000 attendees
• Sponsorships Available
• Including Pre and Post event community meetups
and BOFs
• Hadoop training available
• 100+ sessions and 7 tracks
• Deep-dive technical content
• 5,000 attendees
• Sponsorships Available
• Including Pre and Post event community meetups
and BOFs
• Hadoop training available
www.hadoopsummit.org
The Largest Hadoop Community Events in
Europe and North America