Running Spark in Production (DataWorks Summit/Hadoop Summit)
2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Are We?

Vinay Shukla
– Product Management
– Spark for 2+ years, Hadoop for 3+ years
– Recovering Programmer
– Blog at www.vinayshukla.com
– Twitter: @neomythos
– Addicted to Yoga, Hiking, & Coffee

Saisai (Jerry) Shao
– Member of Technical Staff
– Apache Spark Contributor
3.
Spark Security: Secure Production Deployments
4.
Context: Spark Deployment Modes
• Spark on YARN
  – Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
  – Spark driver (SparkContext) runs locally in the client (yarn-client)
• Spark Shell & Spark Thrift Server run in yarn-client mode only

[Diagram: in yarn-client mode the Spark driver runs in the client process, talking to an App Master and executors in the cluster; in yarn-cluster mode the Spark driver runs inside the App Master. Executors run in the cluster in both modes.]
5.
Spark on YARN

[Diagram: (1) the user invokes Spark Submit, (2) which submits the application to the YARN ResourceManager (RM); (3) the RM launches the Spark AM through a Node Manager; (4) the AM requests containers for executors, which read/write HDFS.]
6.
Spark – Security – Four Pillars
Authentication
Authorization
Audit
Encryption
Spark leverages Kerberos on YARN.
This presumes the network is secured, the first step of security.
7.
Kerberos Primer

[Diagram: Client, KDC, NameNode (NN), DataNode (DN), and the client's Kerberos ticket cache.]
1. kinit – log in and get a Ticket Granting Ticket (TGT)
2. Client stores the TGT in its ticket cache
3. Get a NameNode Service Ticket (NN-ST)
4. Client stores the NN-ST in its ticket cache
5. Read/write file given the NN-ST and file name; returns block locations, block IDs, and Block Access Tokens if access is permitted
6. Read/write block given a Block Access Token and block ID
8.
Kerberos Authentication within Spark

[Diagram: John Doe, KDC, Spark AM, NameNode (NN), Hadoop cluster.]
1. kinit – John Doe authenticates to the KDC
2. Get a service ticket (ST) for Spark
3. Use the Spark ST, submit the Spark job
4. Spark gets a NameNode (NN) service ticket
5. YARN launches Spark executors using John Doe's identity
6–7. Executors read from HDFS using John Doe's delegation token
9.
Spark + X (Source of Data)

[Diagram: same flow as the previous slide, with X as the data source.]
1. kinit
2. Get a service ticket (ST) for Spark
3. Use the Spark ST, submit the Spark job
4. Spark gets X's service ticket
5. YARN launches Spark executors using John Doe's identity
6–7. Executors read from X using John Doe's delegation token
10.
Spark – Kerberos – Example

kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster --num-executors 3 --driver-memory 512m \
  --executor-memory 512m --executor-cores 1 \
  lib/spark-examples*.jar 10
11.
Spark – Authorization

[Diagram: John Doe, KDC, YARN cluster, HDFS (files A, B, C), and Ranger.]
1. Client gets a service ticket for Spark from the KDC
2. Use the Spark ST, submit the Spark job (Ranger: can John launch this job?)
3. Get a NameNode (NN) service ticket
4. Executors read from HDFS (Ranger: can John read this file?)
12.
Spark – Communication Channels – YARN-Cluster Mode

[Diagram: Spark Submit talks to the RM; the AM/Driver runs under a Node Manager alongside the shuffle service and executors Ex 1 … Ex N. Channels shown: Control/RPC (driver to executors), shuffle block transfer and shuffle data (executors to shuffle service), read/write data (executors to data source), and FS for broadcast and file download.]
13.
Spark – Channel Encryption – Example

Channel                        | How it is encrypted
Control/RPC                    | spark.authenticate = true; leverages YARN to distribute keys
Shuffle block transfer         | spark.authenticate.enableSaslEncryption = true
Shuffle data (NM -> Executor)  | Leverages YARN-based SSL
Read/write data                | Depends on the data source; for HDFS, RPC encryption (RC4 | 3DES) or SSL for WebHDFS
FS – broadcast, file download  | spark.ssl.enabled = true
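As a sketch, the Spark-side settings above could be collected in spark-defaults.conf. The property names are from the Spark 1.x era shown in this deck; check them against your Spark version before relying on them:

```
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.ssl.enabled                        true
```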
14.
Gotchas with Spark Security

Client -> Spark Thrift Server -> Spark executors: no identity propagation on the 2nd hop
– Forces STS to run as the Hive user to read all data
– Reduces security
– Workaround: use SparkSQL via the shell or programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159

SparkSQL – granular security unavailable
– Ranger integration will solve this problem (TP coming)
– Brings row/column-level security and masking features to SparkSQL

Spark + HBase with Kerberos
– Issue fixed in Spark 1.4 (SPARK-6918)

Spark Streaming + Kafka + Kerberos + SSL
– Issues fixed in HDP 2.4.x

Spark jobs > 72 hours
– Kerberos token not renewed before Spark 1.5
16.
#1 Big or Small Executor?

spark-submit --master yarn \
  --deploy-mode client \
  --num-executors ? \
  --executor-cores ? \
  --executor-memory ? \
  --class MySimpleApp \
  mySimpleApp.jar \
  arg1 arg2
17.
Show the Details

Cluster resources: 72 cores + 288 GB memory (3 nodes, each with 24 cores and 96 GB memory)
18.
Benchmark with Different Combinations

Cores per E# | Memory per E (GB) | E per Node
      1      |          6        |     18
      2      |         12        |      9
      3      |         18        |      6
      6      |         36        |      3
      9      |         54        |      2
     18      |        108        |      1

#E stands for Executor; 18 cores and 108 GB memory usable per node.

[Chart: benchmark results for each combination; the lower the better.]
Except for too-small or too-big executors, the performance is relatively close.
19.
Big or Small Executor?

Avoid too-small or too-big executors:
– A small executor decreases CPU and memory efficiency.
– A big executor introduces heavy GC overhead.
Usually 3–6 cores and 10–40 GB of memory per executor is a preferable choice.
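The sizing above can be sketched as a small calculation. This is a minimal illustration, not from the slides; the helper name is hypothetical, and it assumes executors are packed by whichever resource (cores or memory) runs out first:

```python
# Sketch: how many executors of a given size fit on one node.
def executors_per_node(node_cores, node_mem_gb, exec_cores, exec_mem_gb):
    # An executor needs both its cores and its memory, so the binding
    # constraint is whichever resource is exhausted first.
    return min(node_cores // exec_cores, node_mem_gb // exec_mem_gb)

# Using the slide's usable per-node resources (18 cores, 108 GB)
# and a mid-range executor of 3 cores / 18 GB:
print(executors_per_node(18, 108, 3, 18))  # -> 6
```

With 6 cores / 36 GB per executor the same node fits 3 executors, matching the benchmark table.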
20.
Anything Else?

Executor memory != container memory:
  Container memory = executor memory + overhead memory (10% of executor memory)
Leave some resources for other services and the system.
Enable CPU scheduling if you want to constrain CPU usage.#

#CPU scheduling -
http://hortonworks.com/blog/managing-cpu-resources-in-y
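The container-memory rule above can be written out directly. The helper name is hypothetical, and the flat 10% figure follows the slide; actual Spark/YARN versions differ in the exact overhead formula (e.g. a minimum floor), so treat this as an approximation:

```python
# Sketch of the slide's rule: container memory = executor memory + 10% overhead.
def container_memory_mb(executor_memory_mb, overhead_fraction=0.10):
    return int(executor_memory_mb * (1 + overhead_fraction))

# A 10 GB executor actually needs ~11 GB of YARN container memory:
print(container_memory_mb(10240))  # -> 11264
```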
21.
#2 Task Parallelism?

What is the preferable number of partitions (task parallelism)?

rdd.groupByKey(numPartitions)
rdd.reduceByKey(func, [numPartitions])
rdd.sortByKey([ascending], [numPartitions])
22.
Task Parallelism?

[Diagram: 3 map tasks feeding 2 reduce tasks.]

23.
Task Parallelism?

Task parallelism == number of reducers
Increasing the task parallelism reduces the data to be processed by each task.

[Diagram: the same 3 map tasks, now feeding 3 reduce tasks.]
24.
Big or Small Task?

Small task
– Fine-grained partitions can mitigate data skew to some extent.
– Relatively small data to process per task alleviates disk spill.
– The task count increases, which adds some scheduling overhead.

Big task
– A too-big shuffle block (> 2 GB) will crash the Spark application.
– Easily introduces data skew.
– Frequent GC problems.

Summary
– Make full use of cluster parallelism.
– Choose a task size based on the size of the shuffle input (~128 MB per task).
– Prefer small tasks over big tasks (increase the task parallelism).
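The ~128 MB guideline above translates into a simple way to pick numPartitions. This is a sketch (the helper name is hypothetical, and 128 MB is the slide's suggestion, not a hard rule):

```python
import math

# Sketch: choose a partition count so each task processes ~128 MB of shuffle input.
def preferred_num_partitions(shuffle_input_bytes, target_bytes=128 * 1024 * 1024):
    return max(1, math.ceil(shuffle_input_bytes / target_bytes))

# 10 GB of shuffle input at ~128 MB per task:
print(preferred_num_partitions(10 * 1024**3))  # -> 80
```

The result would then be passed as the numPartitions argument to groupByKey, reduceByKey, or sortByKey.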
25.
Benchmark with Different Task Parallelism

[Chart: benchmark results at different task parallelism settings; the lower the better.]
26.
Conclusion

Different workloads have different processing patterns; knowing the pattern points you in the right direction for tuning.
No single rule fits all situations; tuning a workload based on its monitored behavior saves effort and gets a better result.
The more you know about the framework, the better your tuning results will be.
27.
Thank You
Vinay Shukla, @neomythos
Saisai (Jerry) Shao
Speaker notes:

Spark on YARN is the only mode with Kerberos security integration.

John Doe first authenticates to Kerberos before launching the Spark shell:
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

The first step of security is network security.
The second step of security is authentication.
Most Hadoop ecosystem projects rely on Kerberos for authentication.
Kerberos – the three-headed guard dog: https://en.wikipedia.org/wiki/Cerberus
The client talks to the KDC with the Kerberos library.
Orange line – client-to-KDC communication.
Green line – client-to-HDFS communication; does not talk to Kerberos/KDC.

Controlling HDFS authorization is easy/done.
Controlling Hive row/column-level authorization in Spark is WIP.

For HDFS as a data source, use RPC encryption or SSL with WebHDFS.
For NM shuffle data – use YARN SSL.
Spark supports SSL for FS (broadcast or file download).
Shuffle block transfer supports SASL-based encryption – SSL coming.

All images from Flickr Commons.