SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Securing Spark Applications
Kostas Sakellis
Marcelo Vanzin
What is Security?
• Security has many facets
• This talk will focus on three areas:
– Encryption
– Authentication
– Authorization
Why do I need security?
• Multi-tenancy
• Application isolation
• User identification
• Access control enforcement
• Compliance with government regulations
Before we go further...
• Set up Kerberos
• Use HDFS (or another secure filesystem)
• Use YARN!
• Configure them for security (enable auth, encryption).
Kerberos, HDFS, and YARN provide the security backbone
for Spark.
Encryption
• In a secure cluster, data should not be visible in the clear
• Very important to financial / government institutions
What a Spark app looks like
RM NM NM
AM / Driver Executor
Executor
SparkSubmit
Control RPC
File Download
Shuffle / Cached Blocks
Shuffle
Service
Shuffle
Service
Shuffle Blocks
UI
Shuffle Blocks / Metadata
Data Flow in Spark
Every connection in the previous slide can transmit sensitive
data!
• Input data transmitted via broadcast variables
• Computed data during shuffles
• Data in serialized tasks, files uploaded with the job
How to prevent other users from seeing this data?
Encryption in Spark
• Almost all channels support encryption.
– Exception 1: UI (SPARK-2750)
– Exception 2: local shuffle / cache files (SPARK-5682)
For local files, set up YARN local dirs to point at local
encrypted disk(s) if desired. (SPARK-5682)
Encryption: Current State
Different channel, different method.
• Shuffle protocol uses SASL
• RPC / File download use SSL
SSL can be hard to set up.
• Need certificates readable on every node
• Sharing certificates not as secure
• Hard to have per-user certificate
Encryption: The Goal
SASL everywhere for wire encryption (except UI).
• Minimum configuration (one boolean config)
• Uses built-in JVM libraries
• SPARK-6017
For UI:
• Support for SSL
• Or audit UI to remove sensitive info (e.g. information on
environment page).
Authentication
Who is reading my data?
• Spark uses Kerberos
– the necessary evil
• Ubiquitous among other services
– YARN, HDFS, Hive, HBase etc.
Who’s reading my data?
Kerberos provides secure authentication.
KDC
Application
Hi I’m Bob.
Hello Bob. Here’s your TGT.
Here’s my TGT. I want to talk to HDFS.
Here’s your HDFS ticket.
User
Now with a distributed app...
KDC
Executor
Executor
Executor
Executor
Executor
Executor
Executor
Executor
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Hi I’m Bob.
Something
is wrong.
Kerberos in Hadoop / Spark
KDCs do not allow multiple concurrent logins at the scale
distributed applications need. Hadoop services use
delegation tokens instead.
Driver
NameNode
Executor
DataNode
Delegation Tokens
Like Kerberos tickets, they have a TTL.
• OK for most batch applications.
• Not OK for long running applications
– Streaming
– Spark SQL Thrift Server
Delegation Tokens
Since 1.4, Spark can manage delegation tokens!
• Restricted to HDFS currently
• Requires user’s keytab to be deployed with application
• Still some remaining issues in client deploy mode
Authorization
How can I share my data?
Simplest form of authorization: file permissions.
• Use Unix-style permissions or ACLs to let others read
from and / or write to files and directories
• Simple, but high maintenance. Set permissions /
ownership for new files, mess with umask, etc.
More than just FS semantics...
Authorization becomes more complicated as abstractions
are created.
• Tables, columns, partitions instead of files and
directories
• Semantic gap
• Need a trusted entity to enforce access control
Trusted Service: Hive
Hive has a trusted service (“HiveServer2”) for enforcing
authorization.
• HS2 parses queries and makes sure users have access
to the data they’re requesting / modifying.
HS2 runs as a trusted user with access to the whole
warehouse. Users don’t run code directly in HS2*, so there’s
no danger of code escaping access checks.
Untrusted Apps: Spark
Each Spark app runs as the requesting user, and needs
access to the underlying files.
• Spark itself cannot enforce access control, since it’s
running as the user and is thus untrusted.
• Restricted to file system permission semantics.
How to bridge the two worlds?
Apache Sentry
• Role-based access control to resources
• Integrates with Hive / HS2 to control access to data
• Fine-grained (up to column level) controls
Hive data and HDFS data have different semantics. How to
bridge that?
The Sentry HDFS Plugin
Synchronize HDFS file permissions with higher-level
abstractions.
• Permission to read table = permission to read table’s
files
• Permission to create table = permission to write to
database’s directory
Uses HDFS ACLs for fine-grained user permissions.
Still restricted to FS view of the world!
• Files, directories, etc…
• Cannot provide column-level and row-level access
control.
• Whole table or nothing.
Still, it goes a long way in allowing Spark applications to
work well with Hive data in a shared, secure environment.
But...
Future: RecordService
A distributed, scalable, data access service for unified
authorization in Hadoop.
RecordService
RecordService
• Drop in replacement for InputFormats
• SparkSQL: Integration with Data Sources API
– Predicate pushdown, projection
RecordService
• Assume we had a table tpch.nation
column_name column_type
n_nationkey smallint
n_name string
n_regionkey smallint
n_comment string
import com.cloudera.recordservice.spark._
val context = new org.apache.spark.sql.SQLContext(sc)
val df = context.load("tpch.nation",
"com.cloudera.recordservice.spark")
val results = df.groupBy("n_regionkey")
.count()
.collect()
RecordService
RecordService
• Users can enforce Sentry permissions using views
• Allows column and row level security
> CREATE ROLE restrictedrole;
> GRANT ROLE restrictedrole to GROUP restrictedgroup;
> USE tpch;
> CREATE VIEW nation_names AS
SELECT n_nationkey, n_name
FROM tpch.nation;
> GRANT SELECT ON TABLE tpch.nation_names TO ROLE restrictedrole;
...
val df = context.load("tpch.nation",
"com.cloudera.recordservice.spark")
val results = df.collect()
>> TRecordServiceException(code:INVALID_REQUEST, message:Could not plan
request., detail:AuthorizationException: User 'kostas' does not have
privileges to execute 'SELECT' on: tpch.nation)
RecordService
...
val df = context.load("tpch.nation_names",
"com.cloudera.recordservice.spark")
val results = df.collect()
RecordService
RecordService
• Documentation: http://cloudera.github.io/RecordServiceClient/
• Beta Download:
http://www.cloudera.com/content/cloudera/en/downloads/betas/recordservic
e/0-1-0.html
Takeaways
• Spark can be made secure today!
• Benefits from a lot of existing Hadoop platform work
• Still work to be done
– Ease of use
– Better integration with Sentry / RecordService
References
• Encryption: SPARK-6017, SPARK-5682
• Delegation tokens: SPARK-5342
• Sentry: http://sentry.apache.org/
– HDFS synchronization: SENTRY-432
• RecordService:
http://cloudera.github.io/RecordServiceClient/
Thanks!
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 

Was ist angesagt? (20)

Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 

Andere mochten auch

Системный анализ межкультурной компетентности будущих переводчиков
Системный анализ межкультурной компетентности будущих переводчиковСистемный анализ межкультурной компетентности будущих переводчиков
Системный анализ межкультурной компетентности будущих переводчиков
Scientific and Educational Initiative
 
Experiències TIC en altres països
Experiències TIC en altres païsosExperiències TIC en altres països
Experiències TIC en altres països
Miriam Micó
 
Математическая одаренность и ее развитие
Математическая одаренность и ее развитиеМатематическая одаренность и ее развитие
Математическая одаренность и ее развитие
Scientific and Educational Initiative
 
Объективизация оценки освоения хирургических навыков: структурированный экзам...
Объективизация оценки освоения хирургических навыков: структурированный экзам...Объективизация оценки освоения хирургических навыков: структурированный экзам...
Объективизация оценки освоения хирургических навыков: структурированный экзам...
Scientific and Educational Initiative
 
Flora y fauna de chile
Flora y fauna de chileFlora y fauna de chile
Flora y fauna de chile
profesoraudp
 

Andere mochten auch (20)

Bictacoras de tecnología
Bictacoras de tecnologíaBictacoras de tecnología
Bictacoras de tecnología
 
Системный анализ межкультурной компетентности будущих переводчиков
Системный анализ межкультурной компетентности будущих переводчиковСистемный анализ межкультурной компетентности будущих переводчиков
Системный анализ межкультурной компетентности будущих переводчиков
 
Content Marketing: How to Engage Customers and Build Your Small Business (Man...
Content Marketing: How to Engage Customers and Build Your Small Business (Man...Content Marketing: How to Engage Customers and Build Your Small Business (Man...
Content Marketing: How to Engage Customers and Build Your Small Business (Man...
 
Experiències TIC en altres països
Experiències TIC en altres païsosExperiències TIC en altres països
Experiències TIC en altres països
 
grand vista powerpoint
grand vista powerpointgrand vista powerpoint
grand vista powerpoint
 
Performance Appraisal
Performance AppraisalPerformance Appraisal
Performance Appraisal
 
Selling to seniors & web design for seniors
Selling to seniors & web design for seniorsSelling to seniors & web design for seniors
Selling to seniors & web design for seniors
 
Chapter 4
Chapter 4Chapter 4
Chapter 4
 
Исследование потребительских качеств городской среды
Исследование потребительских качеств городской средыИсследование потребительских качеств городской среды
Исследование потребительских качеств городской среды
 
Trabajo colaborativo smart power
Trabajo colaborativo smart powerTrabajo colaborativo smart power
Trabajo colaborativo smart power
 
Bitácoras de laboratorio
Bitácoras de laboratorioBitácoras de laboratorio
Bitácoras de laboratorio
 
Bel2
Bel2Bel2
Bel2
 
Survey analysis
Survey analysisSurvey analysis
Survey analysis
 
Diez verdades del Régimen de Incorporación Fiscal
Diez verdades del Régimen de Incorporación FiscalDiez verdades del Régimen de Incorporación Fiscal
Diez verdades del Régimen de Incorporación Fiscal
 
Математическая одаренность и ее развитие
Математическая одаренность и ее развитиеМатематическая одаренность и ее развитие
Математическая одаренность и ее развитие
 
Enlace Ciudadano Nro 317 tema: aclaración 316 oferta total de energía primari...
Enlace Ciudadano Nro 317 tema: aclaración 316 oferta total de energía primari...Enlace Ciudadano Nro 317 tema: aclaración 316 oferta total de energía primari...
Enlace Ciudadano Nro 317 tema: aclaración 316 oferta total de energía primari...
 
Объективизация оценки освоения хирургических навыков: структурированный экзам...
Объективизация оценки освоения хирургических навыков: структурированный экзам...Объективизация оценки освоения хирургических навыков: структурированный экзам...
Объективизация оценки освоения хирургических навыков: структурированный экзам...
 
Mar e literatura
Mar e literaturaMar e literatura
Mar e literatura
 
Flora y fauna de chile
Flora y fauna de chileFlora y fauna de chile
Flora y fauna de chile
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
 

Ähnlich wie Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin

Owasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet OverviewOwasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet Overview
owaspindy
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
DataWorks Summit
 

Ähnlich wie Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin (20)

Securing Spark Applications
Securing Spark ApplicationsSecuring Spark Applications
Securing Spark Applications
 
IBM Spectrum Scale Security
IBM Spectrum Scale Security IBM Spectrum Scale Security
IBM Spectrum Scale Security
 
BigData Security - A Point of View
BigData Security - A Point of ViewBigData Security - A Point of View
BigData Security - A Point of View
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Owasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet OverviewOwasp Indy Q2 2012 Cheat Sheet Overview
Owasp Indy Q2 2012 Cheat Sheet Overview
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
From 0 to syncing
From 0 to syncingFrom 0 to syncing
From 0 to syncing
 
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
Hadoop Security, Cloudera - Todd Lipcon and Aaron Myers - Hadoop World 2010
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Hadoop Security: Overview
Hadoop Security: OverviewHadoop Security: Overview
Hadoop Security: Overview
 
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCombat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
Geek Night 15.0 - Touring the Dark-Side of the Internet
Geek Night 15.0 - Touring the Dark-Side of the InternetGeek Night 15.0 - Touring the Dark-Side of the Internet
Geek Night 15.0 - Touring the Dark-Side of the Internet
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Big data security
Big data securityBig data security
Big data security
 
Security in the cloud Workshop HSTC 2014
Security in the cloud Workshop HSTC 2014Security in the cloud Workshop HSTC 2014
Security in the cloud Workshop HSTC 2014
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
 

Mehr von Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 

Kürzlich hochgeladen (20)

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 

Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin

  • 1. Securing Spark Applications Kostas Sakellis Marcelo Vanzin
  • 2. What is Security? • Security has many facets • This talk will focus on three areas: – Encryption – Authentication – Authorization
  • 3. Why do I need security? • Multi-tenancy • Application isolation • User identification • Access control enforcement • Compliance with government regulations
  • 4. Before we go further... • Set up Kerberos • Use HDFS (or another secure filesystem) • Use YARN! • Configure them for security (enable auth, encryption). Kerberos, HDFS, and YARN provide the security backbone for Spark.
  • 5. Encryption • In a secure cluster, data should not be visible in the clear • Very important to financial / government institutions
  • 6. What a Spark app looks like RM NM NM AM / Driver Executor Executor SparkSubmit Control RPC File Download Shuffle / Cached Blocks Shuffle Service Shuffle Service Shuffle Blocks UI Shuffle Blocks / Metadata
  • 7. Data Flow in Spark Every connection in the previous slide can transmit sensitive data! • Input data transmitted via broadcast variables • Computed data during shuffles • Data in serialized tasks, files uploaded with the job How to prevent other users from seeing this data?
  • 8. Encryption in Spark • Almost all channels support encryption. – Exception 1: UI (SPARK-2750) – Exception 2: local shuffle / cache files (SPARK-5682) For local files, set up YARN local dirs to point at local encrypted disk(s) if desired. (SPARK-5682)
  • 9. Encryption: Current State Different channel, different method. • Shuffle protocol uses SASL • RPC / File download use SSL SSL can be hard to set up. • Need certificates readable on every node • Sharing certificates not as secure • Hard to have per-user certificate
  • 10. Encryption: The Goal SASL everywhere for wire encryption (except UI). • Minimum configuration (one boolean config) • Uses built-in JVM libraries • SPARK-6017 For UI: • Support for SSL • Or audit UI to remove sensitive info (e.g. information on environment page).
  • 11. Authentication Who is reading my data? • Spark uses Kerberos – the necessary evil • Ubiquitous among other services – YARN, HDFS, Hive, HBase etc.
  • 12. Who’s reading my data? Kerberos provides secure authentication. KDC Application Hi I’m Bob. Hello Bob. Here’s your TGT. Here’s my TGT. I want to talk to HDFS. Here’s your HDFS ticket. User
  • 13. Now with a distributed app... KDC Executor Executor Executor Executor Executor Executor Executor Executor Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Hi I’m Bob. Something is wrong.
  • 14. Kerberos in Hadoop / Spark KDCs do not allow multiple concurrent logins at the scale distributed applications need. Hadoop services use delegation tokens instead. Driver NameNode Executor DataNode
  • 15. Delegation Tokens Like Kerberos tickets, they have a TTL. • OK for most batch applications. • Not OK for long running applications – Streaming – Spark SQL Thrift Server
  • 16. Delegation Tokens Since 1.4, Spark can manage delegation tokens! • Restricted to HDFS currently • Requires user’s keytab to be deployed with application • Still some remaining issues in client deploy mode
  • 17. Authorization How can I share my data? Simplest form of authorization: file permissions. • Use Unix-style permissions or ACLs to let others read from and / or write to files and directories • Simple, but high maintenance. Set permissions / ownership for new files, mess with umask, etc.
  • 18. More than just FS semantics... Authorization becomes more complicated as abstractions are created. • Tables, columns, partitions instead of files and directories • Semantic gap • Need a trusted entity to enforce access control
  • 19. Trusted Service: Hive Hive has a trusted service (“HiveServer2”) for enforcing authorization. • HS2 parses queries and makes sure users have access to the data they’re requesting / modifying. HS2 runs as a trusted user with access to the whole warehouse. Users don’t run code directly in HS2*, so there’s no danger of code escaping access checks.
  • 20. Untrusted Apps: Spark Each Spark app runs as the requesting user, and needs access to the underlying files. • Spark itself cannot enforce access control, since it’s running as the user and is thus untrusted. • Restricted to file system permission semantics. How to bridge the two worlds?
  • 21. Apache Sentry • Role-based access control to resources • Integrates with Hive / HS2 to control access to data • Fine-grained (up to column level) controls Hive data and HDFS data have different semantics. How to bridge that?
  • 22. The Sentry HDFS Plugin Synchronize HDFS file permissions with higher-level abstractions. • Permission to read table = permission to read table’s files • Permission to create table = permission to write to database’s directory Uses HDFS ACLs for fine-grained user permissions.
  • 23. Still restricted to FS view of the world! • Files, directories, etc… • Cannot provide column-level and row-level access control. • Whole table or nothing. Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment. But...
  • 24. Future: RecordService A distributed, scalable, data access service for unified authorization in Hadoop.
  • 26. RecordService • Drop in replacement for InputFormats • SparkSQL: Integration with Data Sources API – Predicate pushdown, projection
  • 27. RecordService • Assume we had a table tpch.nation column_name column_type n_nationkey smallint n_name string n_regionkey smallint n_comment string
  • 28. import com.cloudera.recordservice.spark._ val context = new org.apache.spark.sql.SQLContext(sc) val df = context.load("tpch.nation", "com.cloudera.recordservice.spark") val results = df.groupBy("n_regionkey") .count() .collect() RecordService
  • 29. RecordService • Users can enforce Sentry permissions using views • Allows column and row level security > CREATE ROLE restrictedrole; > GRANT ROLE restrictedrole to GROUP restrictedgroup; > USE tpch; > CREATE VIEW nation_names AS SELECT n_nationkey, n_name FROM tpch.nation; > GRANT SELECT ON TABLE tpch.nation_names TO ROLE restrictedrole;
  • 30. ... val df = context.load("tpch.nation", "com.cloudera.recordservice.spark") val results = df.collect() >> TRecordServiceException(code:INVALID_REQUEST, message:Could not plan request., detail:AuthorizationException: User 'kostas' does not have privileges to execute 'SELECT' on: tpch.nation) RecordService
  • 31. ... val df = context.load("tpch.nation_names", "com.cloudera.recordservice.spark") val results = df.collect() RecordService
  • 32. RecordService • Documentation: http://cloudera.github.io/RecordServiceClient/ • Beta Download: http://www.cloudera.com/content/cloudera/en/downloads/betas/recordservic e/0-1-0.html
  • 33. Takeaways • Spark can be made secure today! • Benefits from a lot of existing Hadoop platform work • Still work to be done – Ease of use – Better integration with Sentry / RecordService
  • 34. References • Encryption: SPARK-6017, SPARK-5682 • Delegation tokens: SPARK-5342 • Sentry: http://sentry.apache.org/ – HDFS synchronization: SENTRY-432 • RecordService: http://cloudera.github.io/RecordServiceClient/