With an ever-increasing need to secure and limit access to sensitive data, enterprises need an open source solution. Apache Atlas, the metadata and governance framework for Hadoop, works together with Apache Ranger, the security enforcement framework for Hadoop, to address compliance and security needs. Vimal will discuss the security and compliance requirements and demonstrate how the combination of Atlas and Ranger meets them. The talk focuses on tag-based policy enforcement, an elegant solution for large Hadoop clusters holding a wide variety of data.
1. Security and Compliance with Atlas and Ranger
Vimal Sharma, Hortonworks
2. Agenda
• Apache Atlas
– Introduction
– Architecture
– Cross Component Lineage
• Apache Ranger
– Introduction
– Architecture
• Tag Based Policies
– Use cases and advantages
– Demo
3. Apache Atlas
• Entered the Apache Incubator in May 2015
• Organizations : IBM, Hortonworks, Aetna, Merck
• 3 releases in the last year
• Graduated to a Top Level Project in June 2017
Timeline: 0.7 (July 2016), 0.7.1 (Jan 2017), 0.8 (Mar 2017), TLP (June 2017)
4. Apache Atlas Introduction
Governance and Metadata framework for Hadoop
Model a component and capture metadata
Data Assets - Hive Table, HBase column family
Process – Hive CTAS, Storm Topology
Classification - Tag metadata entities
Built-in support for popular components
Extensible Architecture
Cross Component Lineage
Export/Import of metadata
6. Cross Component Lineage
• Lineage: Relationship between upstream and downstream Data Assets
• Individual Components : Own Metadata store
• Cross Component events are common
• Atlas : Flexibility to model arbitrary components
– Arbitrary lineage can be captured
Diagram: HDFS Path, Spark Process, Kafka Topic
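The lineage idea on this slide can be sketched as a small directed graph. This is only an illustration, not Atlas code: the `LineageGraph` class, the node names, and the direction of the flow (HDFS path into a Spark process into a Kafka topic) are assumptions made for the example.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store (hypothetical, not the Atlas API).

    Edges run from each input asset to the process, and from the
    process to each output asset.
    """

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_process(self, process, inputs, outputs):
        for src in inputs:
            self._downstream[src].add(process)
            self._upstream[process].add(src)
        for dst in outputs:
            self._downstream[process].add(dst)
            self._upstream[dst].add(process)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal collecting every reachable node.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream(self, node):
        return self._walk(node, self._downstream)

    def upstream(self, node):
        return self._walk(node, self._upstream)


graph = LineageGraph()
# Assumed flow: the Spark process reads the HDFS path and writes the topic.
graph.add_process(
    "spark:process",
    inputs=["hdfs:/data/raw"],
    outputs=["kafka:topic"],
)
print(sorted(graph.downstream("hdfs:/data/raw")))
print(sorted(graph.upstream("kafka:topic")))
```

Because the graph is not tied to any single component's metadata store, the same traversal answers both "what depends on this path?" and "where did this topic come from?".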
8. Lineage Use Cases
ETL Pipelines
• Upstream failure analysis
• Alerts to downstream processes
Redundant Processing
• Can metadata classification be used to determine this?
• Avoid expensive processing
Compliance and Security
• Impose security constraints on sensitive data
• Data can span multiple Hadoop components
• One policy to govern them all
9. Apache Ranger Introduction
• Framework to enforce security on Hadoop
• Support for Hive, HBase, YARN and more
• Policies for resources like tables and files
• Specific policies for users/groups
• Audit and policy analytics
• Atlas Integration
• Import and export of policies
11. Ranger Plugins
• Reside in component process space
• Periodically poll Ranger Policy Store
• Keep a cache of current policies
• Copy of policies on disk
• Access request evaluated against list of policies
• User request data sent to Audit store
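The plugin behaviour listed above (poll, cache, disk copy, evaluate, audit) can be sketched roughly as follows. This is a simplified model, not Ranger's plugin code: `Policy`, `PluginSketch`, `flaky_fetch`, the JSON cache file, and the in-memory audit list are all hypothetical stand-ins.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Policy:
    resource: str
    users: tuple
    accesses: tuple


class PluginSketch:
    """Hypothetical plugin: in-memory policy cache with a disk fallback."""

    def __init__(self, fetch_policies, cache_path, audit_sink):
        self._fetch = fetch_policies      # callable standing in for the policy store
        self._cache_path = cache_path     # on-disk copy of the last good policies
        self._audit = audit_sink          # list standing in for the audit store
        self._policies = []

    def refresh(self):
        """One polling cycle; a real plugin would run this periodically."""
        try:
            policies = self._fetch()
            with open(self._cache_path, "w") as fh:
                json.dump([asdict(p) for p in policies], fh)
        except ConnectionError:
            policies = self._load_from_disk()
        self._policies = policies

    def _load_from_disk(self):
        if not os.path.exists(self._cache_path):
            return []
        with open(self._cache_path) as fh:
            return [
                Policy(d["resource"], tuple(d["users"]), tuple(d["accesses"]))
                for d in json.load(fh)
            ]

    def is_allowed(self, user, resource, access):
        # Evaluate the request against the cached list of policies.
        allowed = any(
            p.resource == resource and user in p.users and access in p.accesses
            for p in self._policies
        )
        self._audit.append(
            {"user": user, "resource": resource, "access": access, "allowed": allowed}
        )
        return allowed


_calls = {"count": 0}


def flaky_fetch():
    """Succeeds once, then simulates an unreachable policy store."""
    _calls["count"] += 1
    if _calls["count"] > 1:
        raise ConnectionError("policy store unreachable")
    return [Policy("hive:tax_2015", ("alice",), ("select",))]


audit_log = []
cache_file = os.path.join(tempfile.mkdtemp(), "policies.json")
plugin = PluginSketch(flaky_fetch, cache_file, audit_log)

plugin.refresh()                                   # policies fetched and cached
first = plugin.is_allowed("alice", "hive:tax_2015", "select")
plugin.refresh()                                   # store down, disk copy used
second = plugin.is_allowed("alice", "hive:tax_2015", "select")
```

Evaluating against a local cache keeps the access decision inside the component process, so a temporarily unreachable policy store does not block requests.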
12. Atlas Ranger Integration
• Ranger : Listener on Tag addition/deletion
• Attribute based policies rather than asset based policies
• Advantages
– No need to create/update policies for individual resources
– Resources belonging to multiple components can be tagged
Diagram: Atlas, Tag - PII, Ranger, TagSync, Enforce Policies
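A minimal sketch of the idea, assuming a single PII policy that covers resources in different components. The dictionaries, the `on_tag_added`/`on_tag_removed` handlers, and the group name `compliance` are hypothetical and only mimic the listener behaviour described above; they are not Ranger or Atlas APIs.

```python
# Tags attached to resources, as a tag listener might mirror them from Atlas.
tags_by_resource = {
    "hive:customers.ssn": {"PII"},
    "hbase:profiles:contact": {"PII"},
    "hdfs:/logs/app": set(),
}

# One policy per tag, independent of which component holds the resource.
tag_policies = {"PII": {"allowed_groups": {"compliance"}}}


def on_tag_added(resource, tag):
    """Hypothetical handler for a tag-addition event."""
    tags_by_resource.setdefault(resource, set()).add(tag)


def on_tag_removed(resource, tag):
    """Hypothetical handler for a tag-deletion event."""
    tags_by_resource.get(resource, set()).discard(tag)


def tag_decision(resource, user_groups):
    """Return True/False when a tag policy applies, or None when none does."""
    decision = None
    for tag in tags_by_resource.get(resource, ()):
        policy = tag_policies.get(tag)
        if policy is None:
            continue
        if not (set(user_groups) & policy["allowed_groups"]):
            return False  # any denying tag policy wins in this sketch
        decision = True
    return decision


print(tag_decision("hive:customers.ssn", {"analysts"}))
print(tag_decision("hbase:profiles:contact", {"compliance"}))
print(tag_decision("hdfs:/logs/app", {"analysts"}))
```

Tagging a new resource as PII brings it under the existing policy without creating or editing any policy, which is the advantage the slide lists.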
14. Tag Based Policy Demo
• Define tag EXPIRES_ON in Atlas with attribute expiry_date
• Attach this tag to Hive tables:
– tax_2010 with expiry_date – Dec 2016
– tax_2015 with expiry_date – Dec 2017
• Data access should be refused for the first but allowed for the second
• Inspect Ranger Audit to verify
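The expected outcome of the demo can be reasoned through with a short sketch. Several details here are assumptions: the slide gives only month and year, so the last day of each month is used as the expiry date; the access date of 30 June 2017 is invented for illustration; and `check_access` with its audit list is a stand-in, not the Ranger policy condition itself.

```python
from datetime import date

# EXPIRES_ON tag instances; the day of month is an assumption.
expiry_date_by_table = {
    "tax_2010": date(2016, 12, 31),
    "tax_2015": date(2017, 12, 31),
}

audit = []  # stand-in for the Ranger audit store


def check_access(user, table, today):
    """Deny access once the tagged expiry date has passed."""
    expiry = expiry_date_by_table.get(table)
    allowed = expiry is None or today <= expiry
    audit.append({"user": user, "table": table, "allowed": allowed})
    return allowed


access_day = date(2017, 6, 30)  # hypothetical date of the access attempt
result_2010 = check_access("demo_user", "tax_2010", access_day)
result_2015 = check_access("demo_user", "tax_2015", access_day)
```

Under these assumptions the first request is refused and the second is allowed, and both decisions appear in the audit records, which is what the last step of the demo checks.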
15. Why Tag based policies?
• Data Stewards
– Mine data to determine qualifying tags - PII, GeoLocation
– Attach tag to resource
– No overlap with admin’s responsibilities
• Lineage – crucial to determine candidate tags
• Tag policies remain intact when resources are renamed/deleted
– Tag instances can be removed but tag definition cannot
– Resources may be volatile and so are policies on them
– Migration of tags and policies across clusters
Atlas is a tool to model elements in the Hadoop ecosystem and create instances of those models
Data Assets, e.g. Hive table
Processes : Storm Topology
Store : Metadata, classifications as tags
Built-in support for popular components
Extensible architecture