This document provides an overview of techniques for optimizing Apache Spark pipelines. It discusses fundamentals of Spark execution including jobs, stages and tasks. It then provides recommendations for tuning aspects like sizing executors, using DataFrames/Datasets over RDDs, caching frequently used data, joining techniques to avoid shuffling large datasets, and addressing skew. The document aims to help debug and optimize Spark applications.
1. © Cloudera, Inc. All rights reserved.
Building Efficient Pipelines in
Apache Spark
Jeremy Beard | Principal Solutions Architect, Cloudera
May 2017
2.
Introduction
• Jeremy Beard
• Principal Solutions Architect at Cloudera
• Based in NYC
• With Cloudera for 4.5 years
• Previously 6 years of data warehousing in Australia
• jeremy@cloudera.com
3.
New! Cloudera Data Science Workbench
• On-cluster data science
• Amazing UX
• Python
• R
• Scala
• Spark 2
5.
Spark execution breakdown
• Application: the single driver program that orchestrates the jobs/stages/tasks
• Job: one for each time the Spark application emits data
• e.g. write to HDFS, or collect to the driver
• Initiated by an "action" method call
• Stage: one for each part of a job before a shuffle is required
• Task: one for each parallelizable unit of work of a stage
• A single thread assigned to an executor (virtual) core
6.
The driver and the executors
• Together are the JVM processes of the Spark application
• The driver
• Where the application orchestration/scheduling happens
• Where your Spark API calls are run
• The executors
• Where the data is processed
• Where the code you give to Spark API calls is run
7.
Running Spark applications on YARN
• Two modes: client and cluster
• Client mode runs the driver locally
• Driver logs automatically appear on the screen
• Good for development
• Cluster mode runs the driver as a YARN container on the cluster
• Driver logs can be obtained from the Spark UI or YARN logs
• Driver process is resource managed
• Good for production
8.
Debugging your Spark applications
9.
Spark web UI
• Each Spark application hosts a web UI
• The primary pane of glass for debugging and tuning
• Worth learning in depth
• Useful for
• Seeing the progress of jobs/stages/tasks
• Accessing logs
• Observing streaming throughput
• Monitoring memory usage
10.
Logging
• The driver and the executors write to stdout and stderr via log4j
• Use log4j in your code to add to these logs
• log4j properties can be overridden
• Useful for finding full stack traces and for crude logging of code paths
• Retrieve logs from the Spark UI "Executors" tab
• Or if missing, run "yarn logs -applicationId [yarnappid] > [yarnappid].log"
• Note: driver logs in client mode need to be manually saved
11.
Accumulators
• Distributed counters that you can increment in executor code
• Spark automatically aggregates them across all executors
• Results visible in the Spark UI under each stage
• Useful for aggregating fine-grained timings and record counts
12.
Explain plan
• Prints out how Spark will execute that DataFrame/Dataset
• Use DataFrame.explain
• Useful for confirming optimizations like broadcast joins
13.
Printing schemas and data
• DataFrame.printSchema to print the schema to stdout
• Useful to confirm that a derived schema was correctly generated
• DataFrame.show to print data to stdout as a formatted table
• Or DataFrame.limit.show to print a subset
• Useful to confirm that intermediate data is valid
14.
Job descriptions
• SparkContext.setJobDescription to label the job in the Spark UI
• Useful for identifying how the Spark jobs/stages correspond to your code
16.
Sizing the executors
• Size comes from the number of cores and the amount of memory
• Cores are virtual, and correspond to YARN resource requests
• Memory is physical, and YARN will enforce it
• Generally aim for 4 to 6 cores per executor
• Generally keep executor memory under 24-32GB to avoid GC issues
• The driver can be sized too, but usually doesn't need more than the defaults
17.
Advanced executor memory tuning
• Turn off legacy memory management
• spark.memory.useLegacyMode = false
• If executors are being killed by YARN, try increasing the YARN overhead
• spark.yarn.executor.memoryOverhead
• To finely tune the memory usage of the executors, look into
• spark.memory.fraction
• spark.memory.storageFraction
18.
Sizing the number of executors
• Dynamic allocation
• Spark requests more executors as tasks queue up, and releases them as they go idle
• Good choice for optimal cluster utilization
• On by default in CDH if the number of executors is not specified
• Static allocation
• User requests a static number of executors for the lifetime of the application
• Reduces time spent requesting/releasing executors
• Can be very wasteful in bursty workloads, like interactive shells/notebooks
19.
DataFrame/Dataset API
• Use the DataFrame/Dataset API over the RDD API where possible
• Much more efficient execution
• Is where all the future optimizations are being made
• Look for RDDs in your code and see if they could be DataFrames/Datasets instead
20.
Caching
• First use of a cached DataFrame will cache the results into executor memory
• Subsequent uses will read the cached results instead of recalculating
• Look for any DataFrame that is used more than once as a candidate for caching
• DataFrame.cache will mark as cached with default options
• DataFrame.persist will mark as cached with specified options
• Replication (default replication = 1)
• Serialization (default deserialized)
• Spill (default spills to disk)
21.
Scala vs Java vs Python
• Scala and Java Spark APIs have effectively the same performance
• Python Spark API is a mixed story
• Python driver code is not a performance hit
• Python executor code incurs a heavy serialization cost
• Avoid writing custom code if the API can already achieve it
22.
Serialization
• Spark supports Java and Kryo serialization for shuffling data
• Kryo is generally much faster than Java
• Kryo is on by default on CDH
• Java is on by default on upstream Apache Spark
23.
Broadcast joins
• Efficient way to join very large to very small
• Instead of shuffling both, the very small is broadcast to the very large
• No shuffle of the very large DataFrame required
• Very small DataFrame must fit in memory of driver and executors
• Automatically applied if Spark knows the very small DataFrame is <10MB
• If Spark doesn't know, you can hint it with broadcast(DataFrame)
24.
Shuffle partitions
• Spark SQL uses a configuration to specify the number of partitions after a shuffle
• The "magic number" of Spark tuning
• Usually takes trial and error to find the optimal value for an application
• Default is 200
• Rough rule of thumb is 1 per 128MB of shuffled data
• If close to 2000, use 2001 instead to trigger a more efficient implementation
25.
Object instantiation
• Avoid creating heavy objects for each record processed
• Look for large fraction of task time spent on GC in Spark UI Executors tab
• Try to re-use heavy objects across many records
• Use constructor to instantiate once for task
• Or use mapPartitions to instantiate at start of task
• Or use singleton to instantiate once for executor lifetime
26.
Skew
• Where processing is concentrated on a small subset of tasks
• Can lead to very slow applications
• Look for stages where one or a few tasks are much slower than the rest
• Common cause is a join where the join key only has one or a few unique values
• If this is expected, a broadcast join may avoid the skew
27.
More resources
• Spark website
• http://spark.apache.org/docs/latest/tuning.html
• High Performance Spark book
• http://shop.oreilly.com/product/0636920046967.do
• Cloudera blog posts
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/