Don't Let the Spark Burn Your House: Perspectives on Securing Spark

1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Vinay Shukla, Director, Product Management (@neomythos)
Srikanth Venkat, Senior Director, Product Management (@srikvenk)

About us…
Vinay Shukla
Director of Product Management, Data Science
Spark & Zeppelin
Srikanth Venkat
Senior Director of Product Management, Security & Governance
Apache Ranger, Apache Atlas, Apache Knox, HDP Platform Security

Securing Spark in the Hadoop Castle…..
• Perimeter Security: Network Segmentation, Firewalls
• Authentication: LDAP/AD, Kerberos, Apache Knox
• Secure Gateway: Apache Knox
• Authorization & Audit: HDFS ACLs, YARN ACLs, Apache Ranger
• Secure In-Cluster Access: Wire Encryption
• Data At Rest Protection: HDFS Encryption

Challenges in Securing Enterprise Deployments of Spark
• How to deploy Spark securely?
  – AAA: Authentication, Authorization & Audits
  – Network and Perimeter Security
  – Protect data both in motion & at rest
• Make security easy to deploy, administer, manage and govern

Guiding Principles
• Secure the network access
  – Firewalls
  – Use secure gateways and trusted proxies (Apache Knox)
• Provide access only to authorized users
  – LDAP/AD
  – Kerberos
  – Service-level authorization (Apache Knox)
• Secure data sources with coarse- and fine-grained authorization
  – Hive (databases, tables, columns, ...)
  – HDFS (files, folders)
  – Apache Ranger for audits and ABAC authorization
• Data protection at rest and in motion
  – HDFS TDE (data encryption at rest)
  – Wire encryption, SSL (data in motion)
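A minimal sketch of how the "data in motion" principle might be expressed as Spark settings. The property names (spark.authenticate, spark.authenticate.enableSaslEncryption, spark.ssl.enabled) come from upstream Apache Spark configuration and are not named on the slide, so verify them against the Spark version in use. The helper name conf_flags is hypothetical.

```python
# Assumed wire-encryption settings. The property names are taken from
# upstream Apache Spark configuration, not from the slide itself.
WIRE_ENCRYPTION_CONF = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.ssl.enabled": "true",
}


def conf_flags(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(conf_flags(WIRE_ENCRYPTION_CONF)))
```

Keeping the settings in one dictionary makes it easier to review which encryption switches a job is launched with.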

Many ways to interact with Spark
[Figure: ways to reach Spark on YARN. Components shown: Zeppelin with the Spark interpreter and with the Livy interpreter, Spark-Shell, Spark Thrift Server, Livy REST Server, a custom web app, a BI tool and a firewall. Each path ends in its own Spark driver.]

Context: Spark Deployment Modes
• Spark on YARN
– Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
– Spark driver (SparkContext) runs locally on the client (yarn-client)
• Spark Shell & Spark Thrift Server run in yarn-client only
[Figure: YARN-Client vs. YARN-Cluster. In YARN-Client the Spark Driver sits with the Client; in YARN-Cluster the Spark Driver sits with the App Master. Both modes show an Executor.]
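The two modes differ only in where the driver runs, which shows up as a single flag at submit time. A small sketch, assuming the Spark 2.x spelling "--master yarn --deploy-mode client|cluster" (the equivalent of yarn-client and yarn-cluster); the function name and the example class and jar are hypothetical.

```python
def spark_submit_args(deploy_mode, main_class, app_jar):
    """Build a spark-submit argument list for Spark on YARN.

    deploy_mode "client" keeps the driver on the submitting machine;
    "cluster" runs the driver inside the YARN Application Master.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical application class and jar, for illustration only.
    print(" ".join(spark_submit_args("cluster", "com.example.App", "app.jar")))
```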

How Spark on YARN works
[Figure: Jane Doe runs Spark Submit against a Hadoop cluster. Steps numbered 1 to 4 connect Spark Submit, the YARN RM, the Spark AM, the Node Manager with its Executor, and HDFS.]

Authenticate users with AD/LDAP
[Figure: Kerberos flow with steps numbered 1 to 7 between Jane Doe, AD/LDAP, the KDC, the Spark AM, the NameNode (NN) and the Hadoop cluster.]
• kinit (Jane Doe authenticates)
• Get service ticket for Spark
• Use Spark ST, submit Spark Job
• Spark gets Namenode (NN) service ticket
• YARN launches Spark Executors using Jane Doe's identity
• Executor reads from HDFS using Jane Doe's delegation token
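The flow above can be sketched as two command lines: obtain a ticket, then submit. This is an illustration under assumptions, not the deck's own procedure. It uses a keytab-based kinit and spark-submit's --principal and --keytab options (YARN only), which suit long-running jobs; the principal, keytab path, class and jar are hypothetical.

```python
def kinit_args(principal, keytab):
    """Command line that obtains a Kerberos ticket from a keytab."""
    return ["kinit", "-k", "-t", keytab, principal]


def kerberized_submit_args(principal, keytab, main_class, app_jar):
    """spark-submit command line that hands the principal and keytab to
    Spark on YARN so it can log in and renew credentials itself."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--principal", principal,
        "--keytab", keytab,
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical identity and paths, for illustration only.
    principal, keytab = "jane@EXAMPLE.COM", "/etc/security/keytabs/jane.keytab"
    print(" ".join(kinit_args(principal, keytab)))
    print(" ".join(kerberized_submit_args(principal, keytab, "com.example.App", "app.jar")))
```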

Authorization: Secure user access to data sources and queues
[Figure: Jane Doe, behind a firewall, works with the KDC, a YARN cluster labeled A, B, C, HDFS and Ranger.]
• Client gets service ticket for Spark
• Use Spark ST, submit Spark Job
• Get Namenode (NN) service ticket
• Executors read from HDFS
• Ranger: Can Jane launch jobs in this queue? Can Jane read this file?
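The two questions Ranger answers here can be pictured with a toy in-memory model. This is purely illustrative: the Policy class, its fields and the example queue and path names are hypothetical, and real decisions are made by Ranger, YARN and HDFS rather than by application code.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Toy stand-in for queue and file policies (not Ranger's API)."""
    queue_users: dict = field(default_factory=dict)   # queue name -> set of users
    path_readers: dict = field(default_factory=dict)  # path prefix -> set of users

    def can_submit(self, user, queue):
        """Can this user launch jobs in this queue?"""
        return user in self.queue_users.get(queue, set())

    def can_read(self, user, path):
        """Can this user read this file?"""
        return any(
            path.startswith(prefix) and user in users
            for prefix, users in self.path_readers.items()
        )


if __name__ == "__main__":
    policy = Policy(
        queue_users={"A": {"jane"}},
        path_readers={"/data/sales/": {"jane"}},
    )
    print(policy.can_submit("jane", "A"), policy.can_read("jane", "/data/hr/salaries.csv"))
```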

Livy RESTful Access to Spark
• Livy supports only Kerberos/SPNEGO-based authentication; there is no LDAP support
• Livy listens on port 8999 by default and runs in yarn-cluster mode by default
See https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
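A minimal sketch of what a batch submission to Livy might look like on the wire. The POST /batches endpoint and the "file" and "className" fields come from the upstream Livy REST API, not from the slide. The host, jar path and class name are hypothetical, and the request is only built, not sent: a real client on a secured cluster would also need a Kerberos-aware HTTP library to perform the SPNEGO negotiation.

```python
import json
import urllib.request


def livy_batch_request(livy_host, app_file, class_name, port=8999):
    """Build (but do not send) a POST /batches request for Livy."""
    body = json.dumps({"file": app_file, "className": class_name}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{livy_host}:{port}/batches",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and application, for illustration only.
    req = livy_batch_request("livy.example.com", "/apps/app.jar", "com.example.App")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```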

Spark Thrift Server doAs
1. End user > Spark Thrift Server > Spark job runs as the end user
2. Provides coarse-grained (table/file level) access control
3. Only fixed for Spark 1.6; available in HDP 2.6 & 2.5.x
4. Use SparkSQL + LLAP (Ranger integration) for fine-grained access control (row/column) & masking (works with both Spark 1.6 & Spark 2.1)
See https://community.hortonworks.com/articles/101418/user-impersonation-in-apache-spark-16-thrift-serve.html

More ways to interact with Spark
• With Kerberos
• Over SSL
• https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/using-spark-streaming.html#spark-streaming-kerb-job
• https://community.hortonworks.com/content/kbentry/55154/kafka-ssl-kerberos-cheat-sheet-settingsconsole-com.html

Yet more ways to interact with Spark
https://github.com/hortonworks-spark/shc
• With Kerberos
kinit -k -t /tmp/hrt_qa.headless.keytab hrt_qa
/usr/hdp/current/spark-client/bin/spark-submit --class your.application.class --master yarn-client --files /etc/hbase/conf/hbase-site.xml --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ /To/your/application/jar
/usr/hdp/current/spark-client/bin/spark-submit --class your.application.class --master yarn-cluster --files /etc/hbase/conf/hbase-site.xml --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ /To/your/application/jar

Fine-Grained Security:
SparkSQL/Hive LLAP with Ranger

SparkSQL Security: Row Filtering and Column Masking
• Spark SQL + Hive use cases enable users to explore data lakes and democratize data access without sacrificing security
• Spark provides strong authentication via Kerberos and wire encryption via SSL, but as a general-purpose compute engine it has no built-in authorization sub-system (yet)
• Spark also does not currently have any way to define a pluggable module that contains policies for fine-grained authorization
• Use Cases:
  – Co-mingled data in the same table may belong to two different groups, each with their own regulatory requirements.
  – Data may have regional restrictions, time-based availability restrictions, departmental restrictions, etc.

Hive LLAP – Open Interfaces
[Figure: SQL queries arrive over ODBC/JDBC at HiveServer2 (query endpoint), with Spark shown as a further client. Query coordinators hand work to LLAP daemons (query executors) running in a YARN cluster. The daemons share an in-memory cache across all users and read from deep storage: HDFS and compatible stores such as S3, WASB and Isilon.]

Key Features: Spark Column Security with LLAP
• Fine-grained column-level access control for SparkSQL.
• Fully dynamic policies per user, without a proliferation of views and the resulting view management overhead.
• Use standard Ranger infrastructure to control resources and apply row filtering and masking policies.
Flow:
1. SparkSQL gets data locations, known as “splits”, from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on the dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking is guaranteed by the LLAP server.
[Figure: the Spark Client talks to HiveServer2 (authorization, backed by the Hive Metastore for data locations and view definitions), the Ranger Server (dynamic policies) and LLAP (data read, filter pushdown), with the four flow steps marked.]

Dynamic Row Filtering & Column Masking: SparkSQL via Hive LLAP
Source table ww_customers:
Country  National ID  CC No             DOB        MRN         Name           Policy ID
US       232323233    4539067047629850  9/12/1969  8233054331  John Doe       nj23j424
US       333287465    5391304868205600  8/13/1979  3736885376  Jane Doe       cadsd984
Germany  T22000129    4532786256545550  3/5/1963   876452830A  Ernie Schwarz  KK-2345909

Ranger Policy Enforcement: Query rewritten based on dynamic Ranger policies: filter rows by region & apply relevant column masking

User 1: Joe (Location: US, Group: Analyst)
Original Query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Country  National ID  CC No                MRN   Name
US       xxxxx3233    4539 xxxx xxxx xxxx  null  John Doe
US       xxxxx7465    5391 xxxx xxxx xxxx  null  Jane Doe
Users from the US Analyst group see data for US persons, with CC and National ID (SSN) as masked values and MRN nullified

User 2: Ivanna (Location: EU, Group: HR)
Original Query: SELECT country, nationalid, name, mrn FROM ww_customers
Country  National ID  Name           MRN
Germany  T22000129    Ernie Schwarz  876452830A
EU HR Policy Admins can see unmasked values but are restricted by row filtering policies to see data for EU persons only
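The effect of the two policies can be reproduced with a toy model over the slide's sample rows. This is only an illustration of row filtering and column masking, not how Ranger or LLAP implement them. The function names, the policy dictionary layout and the EU_COUNTRIES set are hypothetical.

```python
ROWS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

EU_COUNTRIES = {"Germany"}  # assumption: only the EU country present in the sample


def show_last4(value):
    """Mask everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4_cc(value):
    """Keep the first four card digits and mask the remaining three groups."""
    return value[:4] + " xxxx xxxx xxxx"


def nullify(value):
    """Replace the value with null."""
    return None


POLICIES = {
    "us_analyst": {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": show_first4_cc, "mrn": nullify},
    },
    "eu_hr": {
        "row_filter": lambda row: row["country"] in EU_COUNTRIES,
        "masks": {},
    },
}


def run_query(group, columns):
    """Apply the group's row filter, then project and mask the requested columns."""
    policy = POLICIES[group]
    result = []
    for row in ROWS:
        if not policy["row_filter"](row):
            continue
        masked = {}
        for col in columns:
            mask = policy["masks"].get(col)
            masked[col] = mask(row[col]) if mask else row[col]
        result.append(masked)
    return result


if __name__ == "__main__":
    for row in run_query("us_analyst", ["country", "nationalid", "ccnumber", "mrn", "name"]):
        print(row)
    for row in run_query("eu_hr", ["country", "nationalid", "name", "mrn"]):
        print(row)
```

Both users issue an ordinary SELECT; the difference in what they see comes entirely from the policy attached to their group.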

Key Benefit of SparkSQL + Ranger Integration
• Shared access control policy between SparkSQL and Hive
• Audit: all access via SparkSQL is audited and searchable through Ranger
• Resource management: each user can use a unique queue while accessing the securely shared data
• Minimum transition cost: since this feature offers row/column-level security in SQL, existing Spark 2.1 apps and scripts and all Spark shells (spark-shell, pyspark, sparkR, spark-sql) are supported without any modifications.
• https://hortonworks.com/blog/row-column-level-control-apache-spark/
• https://community.hortonworks.com/articles/101181/rowcolumn-level-security-in-sql-for-apache-spark-2.html

Demo of SparkSQL via Hive LLAP with
Ranger Integration

The Road Ahead for Spark Security
• Spark & Atlas Integration
• Livy & Knox Integration
• Zeppelin SSO Integration
• Zeppelin Ranger Integration
• Password integration with Hadoop Credentials

Thank You!!
Vinay Shukla
@neomythos
Srikanth Venkat
@srikvenk
