1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Fine-Grained Security
for Spark and Hive
Carter Shanklin - Director PM
Don Bosco Durai - Security Architect
June 29, 2016
Agenda
● Current security options and challenges
● Apache Ranger Overview
● LLAP Overview
● Use Cases and Demo
● Apache Atlas Integration
Current Options and Challenges
Current Options and Challenges
⬢ Limited to storage-level access control for Spark, Pig and MR
⬢ Column-level access via HiveServer2
⬢ Row-level filtering needs Hive views
– Multiple Hive views need to be created and managed
– Explicit permissions need to be given for each view/user
– Users need to know which view to use
⬢ Masking needs a custom UDF
– Needs to be wrapped using views
Apache Ranger Overview
Apache Ranger
Auditing
• Central audit location for all access requests
• Support multiple destination sources (HDFS, Solr, etc.)
• Real-time visual query interface
Ranger KMS
• Store and manage encryption keys
• Support HDFS TDE
• Integration with HSM
Authorization
• Centralized platform to define, administer and manage security policies consistently
• Enforce policies within each component
Ranger Architecture
[Diagram. Labels shown: Enterprise Users; Legacy Tools and Data Governance; Ranger Administration Portal; Ranger Policy Server; Ranger Audit Server; Integration API; Hadoop components, each with a Ranger Plugin: HDFS, HBase, HiveServer2, Knox, NiFi, Solr, Kafka, YARN, Storm, Atlas; stores: HDFS, RDBMS]
Ranger Audits - Data Access
Ranger Audits - Admin Actions
LLAP Overview
Hive 2.0 and LLAP
⬢ At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.
⬢ Major Improvements:
– Preview: Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
Hive 2 with LLAP: Architecture Overview
Hive 2 with LLAP: Open Interfaces
Ranger Integration with Hive and LLAP
Hive / LLAP Security Capabilities with Ranger
⬢ Ranger Hive plugin provides authorization / access controls.
⬢ Column Masking:
– Inject Hive UDFs that mask characters or hash values.
– Dynamic, per-user.
⬢ Dynamic Row Filtering:
– Query is analyzed and policies applied.
– Dynamic, per-user.
⬢ All operations run as ordinary SQL queries:
– Masking statements convert to clauses in the SQL select clause.
– Filters convert to clauses in the SQL where clause.
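The rewrite described on this slide can be pictured with a small Python sketch. It is only an illustration of where masks and filters end up in the query text, not how Hive or Ranger implement it (they operate on the analyzed query rather than on strings). The function name `rewrite_query`, its parameters and the sample policy values are all assumptions made for this example.

```python
def rewrite_query(table, columns, masks=None, row_filter=None):
    """Build a SELECT where masks land in the select list and the
    row filter lands in the WHERE clause (illustrative only)."""
    masks = masks or {}
    select_items = []
    for col in columns:
        if col in masks:
            # A masked column is wrapped in the masking UDF and keeps its name.
            select_items.append(f"{masks[col]}({col}) AS {col}")
        else:
            select_items.append(col)
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


# Hypothetical per-user policies: mask the card number, restrict to one state.
query = rewrite_query(
    table="users",
    columns=["customer_id", "customer_ccn"],
    masks={"customer_ccn": "mask_show_last_n"},
    row_filter="customer_state = 'CA'",
)
print(query)
```

Because the result is an ordinary SQL statement, the engine can plan and run it like any other query, which is the point the slide makes.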
Native Hive Masking Capabilities
UDF               | Purpose                                      | Example Start | Example Result
mask              | Convert letters to X/x and numbers to n.     | 123 Fake St.  | nnn Xxxx Xx.
mask_first_n      | Mask only the first n characters.            | 433-54-3937   | nnn-54-3937
mask_last_n       | Mask only the last n characters.             | 433-54-3937   | 433-54-nnnn
mask_show_first_n | Mask, showing only the first n characters.   | 555-233-1234  | 555-nnn-nnnn
mask_show_last_n  | Mask, showing only the last n characters.    | 433-54-3937   | nnn-nn-3937
mask_hash         | Produce a consistent hash of the field.      | CA            | 21f241cccaa5cfa33190f56ff1510e37
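To make the table concrete, here is a minimal Python re-implementation of the character-masking behavior it describes. These are not the Hive UDFs themselves. The default of n = 4 is an assumption inferred from the examples in the table, and mask_hash is left out because the slide does not say which hash algorithm is used.

```python
def _mask_char(ch):
    # Uppercase -> X, lowercase -> x, digit -> n, anything else unchanged.
    if ch.isupper():
        return "X"
    if ch.islower():
        return "x"
    if ch.isdigit():
        return "n"
    return ch


def mask(value):
    return "".join(_mask_char(c) for c in value)


def mask_first_n(value, n=4):
    return mask(value[:n]) + value[n:]


def mask_last_n(value, n=4):
    if n <= 0:
        return value
    return value[:-n] + mask(value[-n:])


def mask_show_first_n(value, n=4):
    return value[:n] + mask(value[n:])


def mask_show_last_n(value, n=4):
    if n <= 0:
        return mask(value)
    return mask(value[:-n]) + value[-n:]


print(mask("123 Fake St."))               # nnn Xxxx Xx.
print(mask_first_n("433-54-3937"))        # nnn-54-3937
print(mask_last_n("433-54-3937"))         # 433-54-nnnn
print(mask_show_first_n("555-233-1234"))  # 555-nnn-nnnn
print(mask_show_last_n("433-54-3937"))    # nnn-nn-3937
```

Punctuation such as the hyphens is left untouched, which is why the masked values keep the shape of the originals.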
Delivering Spark Security
Key Features: Spark Column Security with LLAP
⬢ Fine-Grained Column Level Access Control for SparkSQL.
⬢ Fully dynamic policies per user. Doesn’t require views.
⬢ Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.
Example: Per-User Row Filtering by Region in SparkSQL
Use Cases
Demo Setup
⬢ Customer User and Sales data in ORC (metadata in MetaStore)
⬢ Data can be accessed via SparkSQL or HiveServer2
⬢ Marketing needs access to Sales and Users data for analytics
⬢ Fraud Investigation team needs access to data for fraud detection
⬢ Billing team needs access to Sales and Users data for billing
Tables
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
Group     | Users
Fraud     | frank
Marketing | mark
Billing   | bill
Use Case 1: Restricting Column Access
This is a simple use case where certain groups or users don’t have permission to view certain columns.
⬢ Billing group has access to all columns in table Users
⬢ Marketing group can’t access the credit card column in table Users
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
User/Column      | customer_phone | customer_ccn
bill (Billing)   | 😀             | 😀
mark (Marketing) | 😀             | 😡
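The outcome in the table above can be modeled with a short Python sketch. It only mirrors the result of the check (who may read which column); it is not Ranger's policy model or API. The dictionaries and the `authorize` function are hypothetical names introduced for this example.

```python
# Hypothetical policy data: columns of table Users that each group may NOT read.
DENIED_COLUMNS = {
    "Billing": set(),
    "Marketing": {"customer_ccn"},
}
USER_GROUPS = {"bill": "Billing", "mark": "Marketing"}


def authorize(user, requested_columns):
    """Return True if the user may read every requested column,
    otherwise raise PermissionError naming the blocked columns."""
    group = USER_GROUPS[user]
    blocked = sorted(set(requested_columns) & DENIED_COLUMNS[group])
    if blocked:
        raise PermissionError(
            f"{user} ({group}) may not read: {', '.join(blocked)}"
        )
    return True


print(authorize("bill", ["customer_phone", "customer_ccn"]))  # True
try:
    authorize("mark", ["customer_phone", "customer_ccn"])
except PermissionError as err:
    print(err)  # mark (Marketing) may not read: customer_ccn
```

A query from mark that touches only customer_phone passes the same check, matching the first column of the table.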
Ranger Policy - Restrict Columns
Ranger Policy - Restrict Columns - Results
[Result panels: bill from Billing; mark from Marketing]
Ranger Policy - Restrict Columns - Audit Screen
Use Case 2: Column Masking
In this use case, certain groups or users can't see the real value of certain columns.
⬢ Billing group can see the real/raw values for all columns in table Users
⬢ Fraud group can only see masked values of PII and PCI fields from table Users
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
User/Column    | customer_email, customer_phone, customer_ccn
bill (Billing) | 😀
frank (Fraud)  | 😎
Ranger Policies - Mask Fields
Ranger Policy - Column Masking - Results
[Result panels: bill from Billing; frank from Fraud]
Ranger Policy - Column Masking - Audit Screen
Use Case 3: Row Filtering
In this use case, certain groups or users can't see all the rows from certain tables.
⬢ Billing group can see all the rows in the table Users
⬢ Marketing can only see rows/data from their region in the table Users
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
User/Column         | Rows in Users table
bill (Billing)      | 😀
mark (Marketing-CA) | Only CA Users
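A minimal sketch of this behavior, using Python's built-in sqlite3 module as a stand-in for Hive so that it runs anywhere. The per-user filter table `ROW_FILTERS`, the helper `select_users` and the sample rows are assumptions for illustration; in the real setup the filter is defined as a Ranger policy and applied by HiveServer2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (customer_id INTEGER, customer_name TEXT, customer_state TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ann", "CA"), (2, "Bob", "NY"), (3, "Cy", "CA")],
)

# Hypothetical per-user row filters; None means no restriction.
ROW_FILTERS = {"bill": None, "mark": "customer_state = 'CA'"}


def select_users(user):
    """Run the same query for every user, appending that user's filter."""
    sql = "SELECT customer_id, customer_state FROM users"
    row_filter = ROW_FILTERS[user]
    if row_filter:
        sql += f" WHERE {row_filter}"
    sql += " ORDER BY customer_id"
    return conn.execute(sql).fetchall()


bill_rows = select_users("bill")
mark_rows = select_users("mark")
print(bill_rows)  # all three rows
print(mark_rows)  # only the CA rows
```

Both users issue the same query against the same table; no per-region view is needed.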
Ranger Policies - Row Filtering
Ranger Policy - Row Filtering - Results
[Result panels: bill from Billing; mark from Marketing]
Use Case 4: Row Filtering - Cross Table
This is an extension of the previous use cases, where the context information for filtering the rows is in another table.
⬢ Billing group can see all the rows in the table Sales
⬢ Marketing can only see rows/data from their region in the table Sales. However, the Sales table doesn’t have the customer geographic information, so it needs to be derived from the Users table
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
User/Column         | Rows in Sales table
bill (Billing)      | 😀
mark (Marketing-CA) | Only CA Users
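One way to express such a cross-table filter is a subquery on customer_id that looks up the state in Users. The sketch below again uses Python's sqlite3 as a stand-in for Hive. The filter text `CA_FILTER`, the helper `select_sales` and the sample rows are assumptions for illustration, not the policy shown in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (customer_id INTEGER, customer_state TEXT);
    CREATE TABLE sales (customer_id INTEGER, product_id INTEGER);
    INSERT INTO users VALUES (1, 'CA'), (2, 'NY');
    INSERT INTO sales VALUES (1, 100), (2, 200), (1, 300);
    """
)

# Hypothetical row filter for Marketing-CA: the region lives in users, not sales.
CA_FILTER = (
    "customer_id IN "
    "(SELECT customer_id FROM users WHERE customer_state = 'CA')"
)


def select_sales(row_filter=None):
    sql = "SELECT customer_id, product_id FROM sales"
    if row_filter:
        sql += f" WHERE {row_filter}"
    sql += " ORDER BY product_id"
    return conn.execute(sql).fetchall()


all_sales = select_sales()          # what bill (Billing) would see
ca_sales = select_sales(CA_FILTER)  # what mark (Marketing-CA) would see
print(all_sales)
print(ca_sales)
```

The Sales table itself never needs a region column; the filter derives it at query time from Users.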
Ranger Policies - Row Filtering - Cross Table
Apache Atlas Integration
Cross Product Symbiosis
[Diagram linking Apache Atlas, Apache Ranger and LLAP. Labels: Classification/Tagging, Governance, Lineage, Tag Based Policies, Dynamic Custom Policies, Enforcement hooks; storage: HDFS, S3, MetaStore]
* Column Masking and Row Filtering not yet supported by tag based policy
Ranger - Tag Based Policies
Q & A

Weitere ähnliche Inhalte

Was ist angesagt?

Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
DataWorks Summit
 
YARN - Past, Present, & Future
YARN - Past, Present, & FutureYARN - Past, Present, & Future
YARN - Past, Present, & Future
DataWorks Summit
 

Was ist angesagt? (20)

Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and Future
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
 
Apache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in NutshellApache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in Nutshell
 
Why is my Hadoop* job slow?
Why is my Hadoop* job slow?Why is my Hadoop* job slow?
Why is my Hadoop* job slow?
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
 
Hadoop and Spark – Perfect Together
Hadoop and Spark – Perfect TogetherHadoop and Spark – Perfect Together
Hadoop and Spark – Perfect Together
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
YARN - Past, Present, & Future
YARN - Past, Present, & FutureYARN - Past, Present, & Future
YARN - Past, Present, & Future
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
 
Scalable Real-time analytics using Druid
Scalable Real-time analytics using DruidScalable Real-time analytics using Druid
Scalable Real-time analytics using Druid
 

Ähnlich wie Fine-Grained Security for Spark and Hive

Kafka/SMM Crash Course
Kafka/SMM Crash CourseKafka/SMM Crash Course
Kafka/SMM Crash Course
DataWorks Summit
 

Ähnlich wie Fine-Grained Security for Spark and Hive (20)

Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
 
Kafka/SMM Crash Course
Kafka/SMM Crash CourseKafka/SMM Crash Course
Kafka/SMM Crash Course
 
Curing the Kafka Blindness – Streams Messaging Manager
Curing the Kafka Blindness – Streams Messaging ManagerCuring the Kafka Blindness – Streams Messaging Manager
Curing the Kafka Blindness – Streams Messaging Manager
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Managing enterprise users in Hadoop ecosystem
Managing enterprise users in Hadoop ecosystemManaging enterprise users in Hadoop ecosystem
Managing enterprise users in Hadoop ecosystem
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Top Ten Tips for IBM i Security and Compliance
Top Ten Tips for IBM i Security and ComplianceTop Ten Tips for IBM i Security and Compliance
Top Ten Tips for IBM i Security and Compliance
 
What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
Avaya IP Office Customer Call Reporter
Avaya IP Office Customer Call ReporterAvaya IP Office Customer Call Reporter
Avaya IP Office Customer Call Reporter
 
Streaming analytics manager
Streaming analytics managerStreaming analytics manager
Streaming analytics manager
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Honeywell Experion HS
Honeywell Experion HSHoneywell Experion HS
Honeywell Experion HS
 
Paris FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging ManagerParis FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging Manager
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 

Mehr von DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

Mehr von DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Fine-Grained Security for Spark and Hive

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Fine-Grained Security for Spark and Hive Carter Shanklin - Director PM Don Bosco Durai - Security Architect June 29, 2016
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda ● Current security options and challenges ● Apache Ranger Overview ● LLAP Overview ● Use Cases and Demo ● Apache Atlas Integration
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Current Options and Challenges
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Current Options and Challenges ⬢ Limited to storage level access control for Spark, Pig and MR ⬢ Column Level Access via HiveServer2 ⬢ Row Level filtering need Hive Views – Multiple Hive Views needs to be created and managed – Explicit permissions need to be given for each view/user – User need to know which view to use ⬢ Masking needs custom UDF – Needs to be wrapped using Views
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger Overview
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger • Central audit location for all access requests • Support multiple destination sources (HDFS, Solr, etc.) • Real-time visual query interface AuditingAuthorization • Store and manage encryption keys • Support HDFS TDE • Integration with HSM Ranger KMS • Centralized platform to define, administer and manage security policies consistently • Enforce policies within each component
  • 7. © Hortonworks Inc. 2015. All Rights Reserved
  • 8. © Hortonworks Inc. 2015. All Rights Reserved
  • 9. © Hortonworks Inc. 2015. All Rights Reserved Ranger Architecture HDFS Ranger Administration Portal HBase Hive Server2 Ranger Audit Server Ranger Plugin HadoopComponentsEnterprise Users Ranger Plugin Ranger Plugin Legacy Tools and Data Governance HDFS Knox NifI Ranger Plugin Ranger Plugin RDBMS Solr Ranger Plugin Ranger Policy Server Integration API Kafka Ranger Plugin YARN Ranger Plugin Ranger Plugin Storm Ranger Plugin Atlas
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Ranger Audits - Data Access
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Ranger Audits - Admin Actions
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved LLAP Overview
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive 2.0 and LLAP ⬢ At a High Level: – 2000+ features, improvements and bug fixes in Hive since HDP 2.4. – 600+ of these from outside of Hortonworks. ⬢ Major Improvements: – Preview: Hive LLAP: Persistent query servers with intelligent in-memory caching. – ACID GA: Hardened and proven at scale. – Expanded SQL Compliance: More capable integration with BI tools. – Performance: Interactive query, 2x faster ETL. – Security: Row / Column security extending to views, Column level security for Spark.
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive 2 with LLAP: Architecture Overview
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive 2 with LLAP: Open Interfaces
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Ranger Integration with Hive and LLAP
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive / LLAP Security Capabilities with Ranger ⬢ Ranger Hive plugin provides authorization / access controls. ⬢ Column Masking: – Inject Hive UDFs that mask characters or hash values. – Dynamic, per-user. ⬢ Dynamic Row Filtering: – Query is analyzed and policies applied. – Dynamic, per-user. ⬢ All operations run as ordinary SQL queries: – Masking statements convert to clauses in the SQL select clause. – Filters convert to clauses in the SQL where clause.
  • 18. Native Hive Masking Capabilities
    UDF                Purpose                                     Example Start  Example Result
    mask               Convert letters to X/x and numbers to n.    123 Fake St.   nnn Xxxx Xx.
    mask_first_n       Mask only the first n characters.           433-54-3937    nnn-54-3937
    mask_last_n        Mask only the last n characters.            433-54-3937    433-54-nnnn
    mask_show_first_n  Mask, showing only the first n characters.  555-233-1234   555-nnn-nnnn
    mask_show_last_n   Mask, showing only the last n characters.   433-54-3937    nnn-nn-3937
    mask_hash          Produce a consistent hash of the field.     CA             21f241cccaa5cfa33190f56ff1510e37
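To make the table concrete, here is a rough Python re-implementation of the masking behaviour it describes. This is a sketch, not Hive's code. Two things are assumptions: `n` defaults to 4 (a value that reproduces the slide's examples), and `mask_hash` uses MD5 only because the slide's example is 32 hex characters long; the slide does not name the algorithm.

```python
import hashlib


def _mask_char(ch: str) -> str:
    """Upper-case letters -> X, lower-case -> x, digits -> n, the rest unchanged."""
    if ch.isupper():
        return "X"
    if ch.islower():
        return "x"
    if ch.isdigit():
        return "n"
    return ch


def mask(value: str) -> str:
    return "".join(_mask_char(c) for c in value)


def mask_first_n(value: str, n: int = 4) -> str:
    return mask(value[:n]) + value[n:]


def mask_last_n(value: str, n: int = 4) -> str:
    cut = max(len(value) - n, 0)
    return value[:cut] + mask(value[cut:])


def mask_show_first_n(value: str, n: int = 4) -> str:
    return value[:n] + mask(value[n:])


def mask_show_last_n(value: str, n: int = 4) -> str:
    cut = max(len(value) - n, 0)
    return mask(value[:cut]) + value[cut:]


def mask_hash(value: str) -> str:
    # Assumption: MD5 stands in for whatever digest Hive uses.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


print(mask("123 Fake St."))
print(mask_show_last_n("433-54-3937"))
```

Punctuation such as hyphens is left untouched, which is why a masked value keeps its recognizable shape.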
  • 19. Delivering Spark Security
  • 20. Key Features: Spark Column Security with LLAP
    ⬢ Fine-grained column-level access control for SparkSQL.
    ⬢ Fully dynamic policies per user. Doesn't require views.
    ⬢ Uses standard Ranger policies and tools to control access and masking.
    Flow:
      1. SparkSQL gets data locations, known as "splits", from HiveServer2 and plans the query.
      2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
      3. Spark gets a modified query plan based on the dynamic security policy.
      4. Spark reads data from LLAP. Filtering / masking is guaranteed by the LLAP server.
  • 21. Example: Per-User Row Filtering by Region in SparkSQL
  • 22. Use Cases
  • 23. Demo Setup
    ⬢ Customer Users and Sales data are stored in ORC (metadata in the Metastore)
    ⬢ Data can be accessed via SparkSQL or HiveServer2
    ⬢ Marketing needs access to Sales and Users data for analytics
    ⬢ Fraud Investigation team needs access to data for fraud detection
    ⬢ Billing team needs access to Sales and Users data for billing
    Tables:
      Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
      Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
    Groups and users:
      Fraud: frank
      Marketing: mark
      Billing: bill
  • 24. Use Case 1: Restricting Column Access
    This is a simple use case where certain groups or users don't have permission to view certain columns.
    ⬢ Billing group has access to all columns in table Users
    ⬢ Marketing group can't access the credit card column (customer_ccn) in table Users
    Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
    User/Column       customer_phone  customer_ccn
    bill (Billing)    😀              😀
    mark (Marketing)  😀              😡
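The access matrix for this use case can be sketched as a simple allow-list check. This is an illustration only: `ALLOWED_COLUMNS` and `authorize` are hypothetical names, and a real Ranger policy is defined in the Ranger admin UI and enforced by the Hive plugin, not by code like this.

```python
# Columns of the Users table, as listed on the slide.
USERS_COLUMNS = {
    "customer_id", "customer_name", "customer_email", "customer_phone",
    "customer_ccn", "customer_state", "customer_zip",
}

# Hypothetical allow-list per group: Billing sees everything,
# Marketing sees everything except the credit card number.
ALLOWED_COLUMNS = {
    "Billing": set(USERS_COLUMNS),
    "Marketing": USERS_COLUMNS - {"customer_ccn"},
}


def authorize(group, requested_columns):
    """Return True if every requested column is allowed, else raise PermissionError."""
    allowed = ALLOWED_COLUMNS.get(group, set())  # unknown groups get nothing
    denied = sorted(set(requested_columns) - allowed)
    if denied:
        raise PermissionError(f"{group} may not read: {', '.join(denied)}")
    return True


print(authorize("Billing", ["customer_phone", "customer_ccn"]))
try:
    authorize("Marketing", ["customer_phone", "customer_ccn"])
except PermissionError as err:
    print(err)
```

The query is rejected as a whole when it touches a denied column, which matches the slide's framing of access as allowed or refused per column.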
  • 25. Ranger Policy - Restrict Columns
  • 26. Ranger Policy - Restrict Columns - Results: bill from Billing, mark from Marketing
  • 27. Ranger Policy - Restrict Columns - Audit Screen
  • 28. Use Case 2: Column Masking
    In this use case, certain groups or users cannot see the real values of certain columns.
    ⬢ Billing group can see the real/raw values for all columns in table Users
    ⬢ Fraud group can only see masked values of PII and PCI fields from table Users
    Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
    User/Column     customer_email, customer_phone, customer_ccn
    bill (Billing)  😀
    frank (Fraud)   😎
  • 29. Ranger Policies - Mask Fields
  • 30. Ranger Policy - Column Masking - Results: bill from Billing, frank from Fraud
  • 31. Ranger Policy - Column Masking - Audit Screen
  • 32. Use Case 3: Row Filtering
    In this use case, certain groups or users cannot see all the rows of certain tables.
    ⬢ Billing group can see all the rows in the table Users
    ⬢ Marketing can only see rows/data from their region in the table Users
    Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
    User/Column          Rows in Users table
    bill (Billing)       😀
    mark (Marketing, CA) Only CA Users
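A small sketch of the effect of such a row filter, with SQLite standing in for Hive. Everything here is assumed for illustration: the sample rows, the `ROW_FILTERS` mapping, and the predicate `customer_state = 'CA'` (the actual policy is only shown as a screenshot in the deck).

```python
import sqlite3

# Hypothetical per-user row filters; users without an entry see all rows.
ROW_FILTERS = {"mark": "customer_state = 'CA'"}


def build_users_db():
    """Create an in-memory Users table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (customer_id INTEGER, customer_state TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "CA"), (2, "NY"), (3, "CA")],
    )
    return conn


def select_users(conn, user):
    """Run the same query for any user, appending that user's filter if one exists."""
    sql = "SELECT customer_id, customer_state FROM users"
    row_filter = ROW_FILTERS.get(user)
    if row_filter:
        sql += f" WHERE {row_filter}"
    return conn.execute(sql + " ORDER BY customer_id").fetchall()


conn = build_users_db()
print(select_users(conn, "bill"))  # all rows
print(select_users(conn, "mark"))  # only CA rows
```

Both users issue the same query against the same table; only the appended predicate differs, so no per-region view is needed.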
  • 33. Ranger Policies - Row Filtering
  • 34. Ranger Policy - Row Filtering - Results: bill from Billing, mark from Marketing
  • 35. Use Case 4: Row Filtering - Cross Table
    This is an extension of the previous use case, where the context information for filtering the rows is in another table.
    ⬢ Billing group can see all the rows in the table Sales
    ⬢ Marketing can only see rows/data from their region in the table Sales. However, the Sales table doesn't have the customer's geographic information, so it needs to be derived from the Users table
    Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
    Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
    User/Column          Rows in Sales table
    bill (Billing)       😀
    mark (Marketing, CA) Only CA Users
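One way to express a cross-table filter is a subquery on `customer_id`, sketched below with SQLite standing in for Hive. The filter expression, the sample rows, and the names `SALES_ROW_FILTERS` and `select_sales` are assumptions for illustration; the deck shows the real policy only as a screenshot.

```python
import sqlite3

# Hypothetical filter for mark: keep only Sales rows whose customer lives in CA,
# looked up through the Users table because Sales has no state column.
SALES_ROW_FILTERS = {
    "mark": "customer_id IN "
            "(SELECT customer_id FROM users WHERE customer_state = 'CA')",
}


def build_demo_db():
    """Create in-memory Users and Sales tables with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (customer_id INTEGER, customer_state TEXT)")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, product_id INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "CA"), (2, "NY")])
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(1, 100), (2, 200), (1, 300)],
    )
    return conn


def select_sales(conn, user):
    """Query Sales, appending the user's cross-table filter if one exists."""
    sql = "SELECT customer_id, product_id FROM sales"
    row_filter = SALES_ROW_FILTERS.get(user)
    if row_filter:
        sql += f" WHERE {row_filter}"
    return conn.execute(sql + " ORDER BY product_id").fetchall()


conn = build_demo_db()
print(select_sales(conn, "bill"))  # every sale
print(select_sales(conn, "mark"))  # only sales to CA customers
```

The filter is still an ordinary WHERE predicate on Sales; the only difference from the single-table case is that the predicate consults Users.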
  • 36. Ranger Policies - Row Filtering - Cross Table
  • 37. Apache Atlas Integration
  • 38. Cross Product Symbiosis (diagram)
    ⬢ Apache Atlas: Classification/Tagging, Governance, Lineage
    ⬢ Apache Ranger: Tag Based Policies, Dynamic Custom Policies
    ⬢ LLAP: Enforcement hooks
    ⬢ HDFS, S3, Meta Store
    * Column Masking and Row Filtering are not yet supported by tag-based policies
  • 39. Ranger - Tag Based Policies
  • 40. Q & A