SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Running Spark in
Production
Director, Product Management Member, Technical Staff
April 13, 2016
Twitter: @neomythos
Vinay Shukla Saisai (Jerry) Shao
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Are We?
 Product Management
 Spark for 2+ years, Hadoop for 3+ years
 Recovering Programmer
 Blog at www.vinayshukla.com
 Twitter: @neomythos
 Addicted to Yoga, Hiking, & Coffee
 Member of Technical Staff
 Apache Spark Contributor
Vinay Shukla Saisai (Jerry) Shao
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Security : Secure Production Deployments
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Context: Spark Deployment Modes
• Spark on YARN
–Spark driver (SparkContext) in YARN AM(yarn-cluster)
–Spark driver (SparkContext) in local (yarn-client):
• Spark Shell & Spark Thrift Server runs in yarn-client only
Client
Executor
App
MasterSpark Driver
Client
Executor
App Master
Spark Driver
YARN-Client
YARN-Cluster
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark on YARN
Spark Submit
User
Spark
AM
Spark
AM
1
Hadoop Cluster
HDFS
Executor
YARN
RM
YARN
RM
4
2 3
Node
Manager
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark – Security – Four Pillars
 Authentication
 Authorization
 Audit
 Encryption
Spark leverages Kerberos on YARN
This presumes network is secured,
the first step of security
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kerberos Primer
Client
KDC
NN
DN
1. kinit - Login and get Ticket Granting Ticket (TGT)
3. Get NameNode Service Ticket (NN-ST)
2. Client Stores TGT in Ticket
Cache
4. Client Stores NN-ST in Ticket
Cache
5. Read/write file given NN-ST and
file name; returns block locations,
block IDs and Block Access Tokens
if access permitted
6. Read/write block given
Block Access Token and block ID
Client’s
Kerberos
Ticket Cache
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kerberos authentication within Spark
KDC
Use Spark ST, submit Spark Job
Spark gets Namenode (NN)
service ticket
YARN launches Spark
Executors using John
Doe’s identity
Get service ticket for
Spark,
John
Doe
Spark AMSpark AM
NNNN
Executor reads from HDFS using
John Doe’s delegation token
kinit
1
2
3
4
5
6
7
Hadoop Cluster
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark + X (Source of Data)
KDC
Use Spark ST, submit Spark Job
Spark gets X ST
YARN launches Spark
Executors using John
Doe’s identity
Get Service Ticket (ST)
for Spark
John
Doe
Spark AMSpark AM
XX
Executor reads from X using John
Doe’s delegation token
kinit
1
2
3
4
5
6
7
Hadoop Cluster
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark – Kerberos - Example
kinit -kt /etc/security/keytabs/johndoe.keytab
johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-cluster --num-executors 3 --driver-memory 512m
--executor-memory 512m --executor-cores 1 lib/spark-
examples*.jar 10
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS
Spark – Authorization
YARN Cluster
A B C
KDC
Use Spark ST,
submit Spark Job
Get Namenode (NN) service
ticket
Executors read
from HDFS
Client gets service ticket
for Spark
John
Doe
RangerRangerCan John launch this job?
Can John read this file
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark – Communication Channels – YARN-CLUSTER Mode
Spark
Submit
RM
Shuffle
Service
AM
Driver
NM
Ex 1 Ex N
Shuffle Data
Control/RPC
Shuffle
BlockTransfer
Data
Source
Read/Write
Data
FS – Broadcast,
File Download
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark – Channel Encryption - Example
Shuffle Data
Control/RPC
Shuffle
BlockTransfer
Read/Write
Data
FS – Broadcast,
File Download
spark.authenticate.enableSaslEncryption= true
spark.authenticate = true. Leverage YARN to distribute keys
Depends on Data Source, For HDFS RPC (RC4 | 3DES) or SSL for WebHDFS
NM > Ex leverages YARN based SSL
spark.ssl.enabled = true
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Gotchas with Spark Security
 Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd
hop
– Forces STS to run as Hive user to read all data
– Reduces security
– Use SparkSQL via shell or programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159
 SparkSQL – Granular security unavailable
– Ranger integration will solve this problem (TP coming)
– Brings Row/Column level/Masking features to SparkSQL
 Spark + HBase with Kerberos
– Issue fixed in Spark 1.4 (Spark-6918)
 Spark Stream + Kafka + Kerberos + SSL
– Issues fixed in HDP 2.4.x
 Spark jobs > 72 Hours
– Kerberos token not renewed before Spark 1.5
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Performance
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
#1 Big or Small Executor ?
spark-submit --master yarn 
--deploy-mode client 
--num-executors ? 
--executor-cores ? 
--executor-memory ? 
--class MySimpleApp 
mySimpleApp.jar 
arg1 arg2
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Show the Details
 Cluster resource: 72 cores + 288GB Memory
24 cores
96GB Memory
24 cores
96GB Memory
24 cores
96GB Memory
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Benchmark with Different Combinations
Cores
Per E#
Memory
Per E
(GB)
E Per
Node
1 6 18
2 12 9
3 18 6
6 36 3
9 54 2
18 108 1
#E stands for Executor
Except too small or too big executor, the
performance is relatively close
18 cores 108GB memory per node# The lower the better
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Big Or Small Executor ?
 Avoid too small or too big executor.
– Small executor will decrease CPU and memory efficiency.
– Big executor will introduce heavy GC overhead.
 Usually 3 ~ 6 cores and 10 ~ 40 GB of memory per executor is a preferable choice.
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Any Other Thing ?
 Executor memory != Container memory
 Container memory = executor memory + overhead memory (10% of executory memory)
 Leave some resources to other service and system
 Enable CPU scheduling if you want to constrain CPU usage#
#CPU scheduling -
http://hortonworks.com/blog/managing-cpu-resources-in-y
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
#2 Task Parallelism ?
 What is the preferable number of partitions (task parallelism) ?
rdd.groupByKey(numPartitions)
rdd.reduceByKey(func, [numPartitions])
rdd.sortByKey([ascending], [numPartitions])
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Task Parallelism ?
map map map
reduce1 reduce2
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Task Parallelism ?
 Task parallelism == number of reducer
 Increasing the task parallelism will reduce the data to be processed for each task
map map map
reduce1 reduce2 reduce2
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Big or Small Task ?
 Small task
– Fine-grained partition can mitigate data skew problem to some extent.
– With relatively small data to be processed in each task can alleviate disk spill.
– Task number will be increased, this will somehow increase the scheduling overhead.
 Big task
– Too big shuffle block (> 2GB) will crash the Spark application.
– Easily introduce data skew problem.
– Frequent GC problem.
 Summary
– Fully make using of cluster parallelism.
– Consider a preferable task size according to the size of shuffle input (~128MB).
– Prefer small task rather than big task (increase the task parallelism).
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Benchmark With Different Task Parallelism
# The lower the better
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Conclusion
 Different workloads have different processing pattern, knowing the pattern will
give a right direction to tune
 No one rule could fit all situations, based on the monitored behavior to tune
workload can save the effort and get a better result
 The more you know about the framework, the better you will get the tuning
result
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You
Vinay Shukla, @neomythos
Saisai (Jerry) Shao

Weitere ähnliche Inhalte

Was ist angesagt?

Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Databricks
 
Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
DataWorks Summit
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 

Was ist angesagt? (20)

Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Apache Spark on K8S and HDFS Security with Ilan Flonenko
Apache Spark on K8S and HDFS Security with Ilan FlonenkoApache Spark on K8S and HDFS Security with Ilan Flonenko
Apache Spark on K8S and HDFS Security with Ilan Flonenko
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
DNS Security Presentation ISSA
DNS Security Presentation ISSADNS Security Presentation ISSA
DNS Security Presentation ISSA
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Kernel Recipes 2015: Kernel packet capture technologies
Kernel Recipes 2015: Kernel packet capture technologiesKernel Recipes 2015: Kernel packet capture technologies
Kernel Recipes 2015: Kernel packet capture technologies
 

Ähnlich wie Running Spark in Production

Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 

Ähnlich wie Running Spark in Production (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
20140708hcj
20140708hcj20140708hcj
20140708hcj
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 

Mehr von DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

Mehr von DataWorks Summit/Hadoop Summit (20)

State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Running Spark in Production

  • 1. Running Spark in Production Director, Product Management Member, Technical Staff April 13, 2016 Twitter: @neomythos Vinay Shukla Saisai (Jerry) Shao
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Who Are We?  Product Management  Spark for 2+ years, Hadoop for 3+ years  Recovering Programmer  Blog at www.vinayshukla.com  Twitter: @neomythos  Addicted to Yoga, Hiking, & Coffee  Member of Technical Staff  Apache Spark Contributor Vinay Shukla Saisai (Jerry) Shao
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Security : Secure Production Deployments
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Context: Spark Deployment Modes • Spark on YARN –Spark driver (SparkContext) in YARN AM(yarn-cluster) –Spark driver (SparkContext) in local (yarn-client): • Spark Shell & Spark Thrift Server runs in yarn-client only Client Executor App MasterSpark Driver Client Executor App Master Spark Driver YARN-Client YARN-Cluster
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark on YARN Spark Submit User Spark AM Spark AM 1 Hadoop Cluster HDFS Executor YARN RM YARN RM 4 2 3 Node Manager
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark – Security – Four Pillars  Authentication  Authorization  Audit  Encryption Spark leverages Kerberos on YARN This presumes network is secured, the first step of security
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kerberos Primer Client KDC NN DN 1. kinit - Login and get Ticket Granting Ticket (TGT) 3. Get NameNode Service Ticket (NN-ST) 2. Client Stores TGT in Ticket Cache 4. Client Stores NN-ST in Ticket Cache 5. Read/write file given NN-ST and file name; returns block locations, block IDs and Block Access Tokens if access permitted 6. Read/write block given Block Access Token and block ID Client’s Kerberos Ticket Cache
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kerberos authentication within Spark KDC Use Spark ST, submit Spark Job Spark gets Namenode (NN) service ticket YARN launches Spark Executors using John Doe’s identity Get service ticket for Spark, John Doe Spark AMSpark AM NNNN Executor reads from HDFS using John Doe’s delegation token kinit 1 2 3 4 5 6 7 Hadoop Cluster
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark + X (Source of Data) KDC Use Spark ST, submit Spark Job Spark gets X ST YARN launches Spark Executors using John Doe’s identity Get Service Ticket (ST) for Spark John Doe Spark AMSpark AM XX Executor reads from X using John Doe’s delegation token kinit 1 2 3 4 5 6 7 Hadoop Cluster
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark – Kerberos - Example kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark- examples*.jar 10
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDFS Spark – Authorization YARN Cluster A B C KDC Use Spark ST, submit Spark Job Get Namenode (NN) service ticket Executors read from HDFS Client gets service ticket for Spark John Doe RangerRangerCan John launch this job? Can John read this file
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark – Communication Channels – YARN-CLUSTER Mode Spark Submit RM Shuffle Service AM Driver NM Ex 1 Ex N Shuffle Data Control/RPC Shuffle BlockTransfer Data Source Read/Write Data FS – Broadcast, File Download
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark – Channel Encryption - Example Shuffle Data Control/RPC Shuffle BlockTransfer Read/Write Data FS – Broadcast, File Download spark.authenticate.enableSaslEncryption= true spark.authenticate = true. Leverage YARN to distribute keys Depends on Data Source, For HDFS RPC (RC4 | 3DES) or SSL for WebHDFS NM > Ex leverages YARN based SSL spark.ssl.enabled = true
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Gotchas with Spark Security  Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd hop – Forces STS to run as Hive user to read all data – Reduces security – Use SparkSQL via shell or programmatic API – https://issues.apache.org/jira/browse/SPARK-5159  SparkSQL – Granular security unavailable – Ranger integration will solve this problem (TP coming) – Brings Row/Column level/Masking features to SparkSQL  Spark + HBase with Kerberos – Issue fixed in Spark 1.4 (Spark-6918)  Spark Stream + Kafka + Kerberos + SSL – Issues fixed in HDP 2.4.x  Spark jobs > 72 Hours – Kerberos token not renewed before Spark 1.5
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Performance
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved #1 Big or Small Executor ? spark-submit --master yarn --deploy-mode client --num-executors ? --executor-cores ? --executor-memory ? --class MySimpleApp mySimpleApp.jar arg1 arg2
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Show the Details  Cluster resource: 72 cores + 288GB Memory 24 cores 96GB Memory 24 cores 96GB Memory 24 cores 96GB Memory
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Benchmark with Different Combinations Cores Per E# Memory Per E (GB) E Per Node 1 6 18 2 12 9 3 18 6 6 36 3 9 54 2 18 108 1 #E stands for Executor Except too small or too big executor, the performance is relatively close 18 cores 108GB memory per node# The lower the better
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Big Or Small Executor ?  Avoid too small or too big executor. – Small executor will decrease CPU and memory efficiency. – Big executor will introduce heavy GC overhead.  Usually 3 ~ 6 cores and 10 ~ 40 GB of memory per executor is a preferable choice.
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Any Other Thing ?  Executor memory != Container memory  Container memory = executor memory + overhead memory (10% of executory memory)  Leave some resources to other service and system  Enable CPU scheduling if you want to constrain CPU usage# #CPU scheduling - http://hortonworks.com/blog/managing-cpu-resources-in-y
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved #2 Task Parallelism ?  What is the preferable number of partitions (task parallelism) ? rdd.groupByKey(numPartitions) rdd.reduceByKey(func, [numPartitions]) rdd.sortByKey([ascending], [numPartitions])
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Task Parallelism ? map map map reduce1 reduce2
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Task Parallelism ?  Task parallelism == number of reducer  Increasing the task parallelism will reduce the data to be processed for each task map map map reduce1 reduce2 reduce2
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Big or Small Task ?  Small task – Fine-grained partition can mitigate data skew problem to some extent. – With relatively small data to be processed in each task can alleviate disk spill. – Task number will be increased, this will somehow increase the scheduling overhead.  Big task – Too big shuffle block (> 2GB) will crash the Spark application. – Easily introduce data skew problem. – Frequent GC problem.  Summary – Fully make using of cluster parallelism. – Consider a preferable task size according to the size of shuffle input (~128MB). – Prefer small task rather than big task (increase the task parallelism).
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Benchmark With Different Task Parallelism # The lower the better
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Conclusion  Different workloads have different processing pattern, knowing the pattern will give a right direction to tune  No one rule could fit all situations, based on the monitored behavior to tune workload can save the effort and get a better result  The more you know about the framework, the better you will get the tuning result
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank You Vinay Shukla, @neomythos Saisai (Jerry) Shao

Hinweis der Redaktion

  1. Spark on YARN is the only mode with Kerberos security integration
  2. John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  3. The first step of security is network security The second step of security is Authentication Most Hadoop echo system projects rely on Kerberos for Authentication Kerberos – 3 Headed Guard Dog : https://en.wikipedia.org/wiki/Cerberus
  4. Client talks to KDC with Kerberos Library Orange line – Client to KDC communication Green line – Client to HDFS communication, does not talk to Kerberos/KDC
  5. John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  6. John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  7. Controlling HDFS Authorization is easy/Done Controlling Hive row/column level authorization in Spark is WIP
  8. For HDFS as Data Source can use RPC or use SSL with WebHDFS For NM Shuffle Data – Use YARN SSL Spark support SSL for FS (Broadcast or File download) Shuffle Block Transfer supports SASL based encryption – SSL coming
  9. All Images from Flicker Commons