© Cloudera, Inc. All rights reserved.
Building Efficient Pipelines in
Apache Spark
Jeremy Beard | Principal Solutions Architect, Cloudera
May 2017
Introduction
• Jeremy Beard
• Principal Solutions Architect at Cloudera
• Based in NYC
• With Cloudera for 4.5 years
• Previously 6 years data warehousing in Australia
• jeremy@cloudera.com
New! Cloudera Data Science Workbench
• On-cluster data science
• Amazing UX
• Python
• R
• Scala
• Spark 2
Spark execution fundamentals
Spark execution breakdown
• Application: the single driver program that orchestrates the jobs/stages/tasks
• Job: one for each time the Spark application emits data
  • e.g. write to HDFS, or collect to the driver
  • Initiated by an “action” method call
• Stage: one for each part of a job before a shuffle is required
• Task: one for each parallelizable unit of work of a stage
  • A single thread assigned to an executor (virtual) core
The driver and the executors
• Together are the JVM processes of the Spark application
• The driver
  • Where the application orchestration/scheduling happens
  • Where your Spark API calls are run
• The executors
  • Where the data is processed
  • Where the code you give to Spark API calls is run
Running Spark applications on YARN
• Two modes: client and cluster
• Client mode runs the driver locally
  • Driver logs automatically appear on the screen
  • Good for development
• Cluster mode runs the driver as a YARN container on the cluster
  • Driver logs can be obtained from the Spark UI or YARN logs
  • Driver process is resource managed
  • Good for production
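As a sketch, the two modes are selected with the --deploy-mode flag of spark-submit (the application JAR and class name below are placeholders):

```shell
# Client mode: driver runs on the local machine, logs stream to the console
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyPipeline my-pipeline.jar

# Cluster mode: driver runs as a YARN container on the cluster;
# fetch its logs afterwards with: yarn logs -applicationId <appId>
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyPipeline my-pipeline.jar
```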
Debugging your Spark applications
Spark web UI
• Each Spark application hosts a web UI
• The primary pane of glass for debugging and tuning
• Worth learning in depth
• Useful for
  • Seeing the progress of jobs/stages/tasks
  • Accessing logs
  • Observing streaming throughput
  • Monitoring memory usage
Logging
• The driver and the executors write to stdout and stderr via log4j
• Use log4j in your code to add to these logs
• log4j properties can be overridden
• Useful for finding full stack traces and for crude logging of code paths
• Retrieve logs from the Spark UI ‘Executors’ tab
  • Or if missing, run “yarn logs -applicationId [yarnappid] > [yarnappid].log”
• Note: Driver logs in client mode need to be manually saved
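For example, log levels can be overridden with a custom log4j.properties; the logger name below is a placeholder for your own package:

```properties
# Quieten Spark's own logging, keep application logging verbose
log4j.rootCategory=WARN, console
log4j.logger.com.example.mypipeline=DEBUG
```

One way to ship it is to pass the file with --files and point the JVMs at it via spark.driver.extraJavaOptions / spark.executor.extraJavaOptions (-Dlog4j.configuration=log4j.properties).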
Accumulators
• Distributed counters that you can increment in executor code
• Spark automatically aggregates them across all executors
• Results visible in Spark UI under each stage
• Useful for aggregating fine-grained timings and record counts
Explain plan
• Prints out how Spark will execute a given DataFrame/Dataset
• Use DataFrame.explain
• Useful for confirming optimizations like broadcast joins
Printing schemas and data
• DataFrame.printSchema to print schema to stdout
• Useful to confirm that a derived schema was correctly generated
• DataFrame.show to print data to stdout as a formatted table
• Or DataFrame.limit.show to print a subset
• Useful to confirm that intermediate data is valid
Job descriptions
• SparkContext.setJobDescription to label the job in the Spark UI
• Useful for identifying how the Spark jobs/stages correspond to your code
Tuning your Spark pipelines
Sizing the executors
• Size comes from the number of cores and amount of memory
• Cores are virtual, and correspond to YARN resource requests
• Memory is physical, and YARN will enforce it
• Generally aim for 4 to 6 cores per executor
• Generally keep executor memory under 24-32GB to avoid GC issues
• Driver can be sized too, but usually doesn’t need more than defaults
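A possible starting point within the ranges above, expressed as spark-submit flags (the JAR and class name are placeholders; tune the numbers for your workload):

```shell
spark-submit --master yarn --deploy-mode cluster \
  --executor-cores 5 \
  --executor-memory 20G \
  --class com.example.MyPipeline my-pipeline.jar
```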
Advanced executor memory tuning
• Turn off legacy memory management
  • spark.memory.useLegacyMode = false
• If executors are being killed by YARN, try increasing the YARN memory overhead
  • spark.yarn.executor.memoryOverhead
• To finely tune the memory usage of the executors, look into
  • spark.memory.fraction
  • spark.memory.storageFraction
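As a sketch, these properties can be set in spark-defaults.conf or via --conf; the fraction values shown are the Spark 2.x defaults, and the overhead figure (in MB) is an illustrative example:

```properties
spark.memory.useLegacyMode=false
spark.memory.fraction=0.6
spark.memory.storageFraction=0.5
# Raise if YARN kills executors for exceeding their memory allocation
spark.yarn.executor.memoryOverhead=2048
```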
Sizing the number of executors
• Dynamic allocation
  • Spark requests more executors as tasks queue up, and releases them as the queue drains
  • Good choice for optimal cluster utilization
  • On by default in CDH if the number of executors is not specified
• Static allocation
  • User requests a static number of executors for the lifetime of the application
  • Reduces time spent requesting/releasing executors
  • Can be very wasteful in bursty workloads, like interactive shells/notebooks
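The dynamic allocation behavior above corresponds to properties like these (the min/max bounds are optional illustrative values; static allocation is instead requested with the --num-executors flag):

```properties
# Dynamic allocation requires the external shuffle service on the NodeManagers
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
# Optional bounds to stop a bursty job from taking the whole cluster
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=50
```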
DataFrame/Dataset API
• Use the DataFrame/Dataset API over the RDD API where possible
• Much more efficient execution
• Is where all the future optimizations are being made
• Look for RDDs in your code and see if they could be DataFrames/Datasets instead
Caching
• First use of a cached DataFrame will cache the results into executor memory
• Subsequent uses will read the cached results instead of recalculating
• Look for any DataFrame that is used more than once as a candidate for caching
• DataFrame.cache will mark as cached with default options
• DataFrame.persist will mark as cached with specified options
  • Replication (default replication = 1)
  • Serialization (default deserialized)
  • Spill (default spills to disk)
Scala vs Java vs Python
• Scala and Java Spark APIs have effectively the same performance
• Python Spark API is a mixed story
  • Python driver code is not a performance hit
  • Python executor code incurs a heavy serialization cost
• Avoid writing custom code if the API can already achieve it
Serialization
• Spark supports Java and Kryo serialization for shuffling data
• Kryo is generally much faster than Java
• Kryo is on by default on CDH
• Java is on by default on upstream Apache Spark
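On upstream Apache Spark, Kryo can be enabled explicitly with a property like this (the registered class name is a placeholder for your own types):

```properties
spark.serializer=org.apache.spark.serializer.KryoSerializer
# Optionally register classes so Kryo avoids writing full class names per object
spark.kryo.classesToRegister=com.example.MyRecord
```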
Broadcast joins
• Efficient way to join a very large DataFrame to a very small one
• Instead of shuffling both, the very small is broadcast to the very large
• No shuffle of the very large DataFrame required
• Very small DataFrame must fit in memory of driver and executors
• Automatically applied if Spark knows the very small DataFrame is <10MB
• If Spark doesn’t know, you can hint it with broadcast(DataFrame)
Shuffle partitions
• Spark SQL uses a configuration property (spark.sql.shuffle.partitions) to set the number of partitions after a shuffle
• The ‘magic number’ of Spark tuning
• Usually takes trial and error to find the optimal value for an application
• Default is 200
• Rough rule of thumb is 1 per 128MB of shuffled data
• If close to 2000, use 2001 instead to trigger a more efficient implementation
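The rule of thumb above can be turned into a quick back-of-the-envelope calculation in plain Python; the "close to 2000" window below is an assumed interpretation of the slide's advice:

```python
def suggest_shuffle_partitions(shuffle_bytes):
    """Rough starting point for spark.sql.shuffle.partitions:
    ~1 partition per 128MB of shuffled data, nudged past 2000
    where Spark switches to a more efficient implementation."""
    partitions = max(1, shuffle_bytes // (128 * 1024 * 1024))
    if 1900 <= partitions <= 2000:  # assumed meaning of "close to 2000"
        partitions = 2001
    return partitions

# ~256GB of shuffle data -> 2048 partitions
print(suggest_shuffle_partitions(256 * 1024**3))
```

The resulting value would then be set via spark.conf, e.g. --conf spark.sql.shuffle.partitions=2048.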
Object instantiation
• Avoid creating heavy objects for each record processed
• Look for large fraction of task time spent on GC in Spark UI Executors tab
• Try to re-use heavy objects across many records
  • Use the constructor to instantiate once per task
  • Or use mapPartitions to instantiate at the start of each task
  • Or use a singleton to instantiate once for the executor lifetime
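The mapPartitions variant can be sketched outside Spark with a plain Python function over an iterator of records; the expensive parser here is a made-up stand-in for any heavy object:

```python
class ExpensiveParser:
    """Stand-in for a heavy object (e.g. a compiled parser or a service client)."""
    instances = 0

    def __init__(self):
        ExpensiveParser.instances += 1  # track how many times we paid the cost

    def parse(self, record):
        return record.upper()

def process_partition(records):
    parser = ExpensiveParser()      # instantiated once per partition/task...
    for record in records:
        yield parser.parse(record)  # ...then reused for every record

# In Spark this function would be passed to rdd.mapPartitions(process_partition)
out = list(process_partition(iter(["a", "b", "c"])))
print(out, ExpensiveParser.instances)
```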
Skew
• Where processing is concentrated on a small subset of tasks
• Can lead to very slow applications
• Look for stages where one or a few tasks are much slower than the rest
• Common cause is a join where the join key only has one or a few unique values
• If this is expected, a broadcast join may avoid the skew
More resources
• Spark website
• http://spark.apache.org/docs/latest/tuning.html
• High Performance Spark book
• http://shop.oreilly.com/product/0636920046967.do
• Cloudera blog posts
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Thank you
jeremy@cloudera.com
