Cheat Sheet - Spark Performance Tuning

Some of my personal notes on Apache Spark performance tuning.

1. Spark UI (monitor and inspect jobs).
2. Level of parallelism (clusters will not be fully utilized unless the level of parallelism for each operation is high enough. Spark automatically sets the number of partitions of an input file according to its size, and for distributed shuffles such as groupByKey and reduceByKey it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument to an operation. In general, 2-3 tasks per CPU core in your cluster are recommended. That said, tasks that are too small are not advisable either, as there is some overhead paid to schedule and run a task. As a rule of thumb, tasks should take at least 100 ms to execute. See the first sketch after this list).
3. Reduce working-set size (operations like groupByKey can fail terribly when their working set is huge. The best way to deal with this is to change the level of parallelism).
4. Avoid groupByKey for associative operations (use operations that can combine, such as reduceByKey; sketch below).
5. Multiple disks (give Spark multiple disks for intermediate persistence. This is done via a setting in the ResourceManager).
6. Degree of parallelism (roughly 2 to 3 times the number of cores on worker nodes).
7. Performance of the chosen language (Scala > Java >> Python > R).
8. Higher-level APIs are better (use DataFrames for core processing, MLlib for machine learning, Spark SQL for queries, and GraphX for graph processing).
9. Avoid collecting large RDDs (use take or takeSample instead; sketch below).
10. Use DataFrames (more efficient, and they go through the Catalyst optimizer).
11. Use the "provided" scope in Maven to avoid packaging all the dependencies.
12. Filter first, shuffle next.
13. Cache after hard work.
14. Spark Streaming: enable backpressure (this tells the Kafka input to slow down its ingestion rate when processing time exceeds the batch interval and the scheduling delay keeps increasing; sketch below).
15. If using Kafka, choose the direct Kafka approach.
16. Extend the Catalyst optimizer's code to add or modify rules.
17. Improve shuffle performance:
   a. Enable LZF or Snappy compression (for shuffle).
   b. Enable Kryo serialization.
   c. Keep shuffle data small (use reduceByKey or filter before the shuffle).
   d. No shuffle block can be greater than 2 GB in size, or you get a "size exceeds Integer.MAX_VALUE" exception: Spark uses ByteBuffer for shuffle blocks, and a ByteBuffer is limited to Integer.MAX_VALUE bytes (2 GB). Ideally, each partition should hold roughly 128 MB.
   e. Think about partitioning/bucketing ahead of time.
   f. Do as much as possible with a single shuffle.
18. Use cogroup (instead of rdd.flatMap.join.groupBy).
19. Spend time reading the RDD lineage graph (a handy way is to read RDD.toDebugString(); sketch below).
20. Optimize join performance:
   a. Use salting to avoid skewed keys. A skewed data set is one where the data is not distributed evenly, i.e. one or a few partitions hold a huge amount of data compared to the others (sketch below).
      i. Change the regular key to concat(regular key, ":", random number).
      ii. Once this is done, first do the join operation on the salted keys and then do the operation on the unsalted keys.
   b. Use partitionBy(new HashPartitioner(n)) (sketch below).
21. Use caching (instead of MEMORY_ONLY, use MEMORY_ONLY_SER; it has better GC behavior for larger datasets. Sketch below).
22. Always cache after repartition.
23. A map after partitionBy will lose the partitioner information; use mapValues instead (sketch below).
24. Speculative execution (enable speculative execution to tackle stragglers; sketch below).
25. Coalesce or repartition to avoid massive partitions (smaller partitions work better).
26. Use broadcast variables (sketch below).
27. Use Kryo serialization (more compact and faster than Java serialization. Kryo is only supported in RDD caching and shuffling, not in serialize-to-disk operations like saveAsObjectFile; sketch below).
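
A few Scala sketches for the items marked above. They are illustrative only: every path, dataset, and class name in them is hypothetical, and the later ones assume the SparkContext sc created here. First, item 2: pass the level of parallelism as the second argument to a shuffle operation such as reduceByKey.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("tuning-sketches"))

    // Hypothetical input; Spark derives the initial partition count from the file size.
    val words = sc.textFile("hdfs:///data/words.txt").flatMap(_.split(" "))

    // Second argument = number of partitions for the shuffle.
    // Aim for roughly 2-3 tasks per CPU core in the cluster.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _, 200)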
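
Items 3 and 4: reduceByKey combines values map-side before the shuffle, so far less data moves across the network and no single task has to materialize every value of a hot key.

    // Hypothetical (key, value) pairs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Avoid: ships every value across the network and buffers them all per key.
    val sumsSlow = pairs.groupByKey().mapValues(_.sum)

    // Prefer: combines per partition first, then shuffles only partial sums.
    val sumsFast = pairs.reduceByKey(_ + _)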
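
Items 9 and 19: pull only a bounded preview to the driver, and read the lineage before tuning.

    val big = sc.textFile("hdfs:///data/big.txt") // hypothetical large input

    // collect() would pull the whole RDD into the driver and can OOM it.
    val preview = big.take(20)
    val sample  = big.takeSample(withReplacement = false, num = 100)

    // Print the lineage graph: what gets recomputed, and where the shuffles are.
    println(big.map(_.length).toDebugString)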
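
Item 14: backpressure is a plain configuration switch (spark.streaming.backpressure.enabled); the Kafka rate cap shown alongside it is optional.

    import org.apache.spark.SparkConf

    val streamingConf = new SparkConf()
      .setAppName("streaming-backpressure")
      // Adapt the ingestion rate when batches take longer than the batch interval.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional ceiling while backpressure calibrates (records/sec per Kafka partition).
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")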
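
Item 20a: one way to implement the salting recipe, under the assumption that the other join side is small enough to replicate. The skewed side gets a random salt appended to its key; the other side is replicated once per salt bucket so every salted key still finds its match; after the join the salt is stripped.

    import scala.util.Random

    val SALT = 8 // number of salt buckets; tune to the observed skew

    // Hypothetical sides of a skewed join: "hot" dominates the left side.
    val skewed = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)))
    val small  = sc.parallelize(Seq(("hot", "x"), ("cold", "y")))

    // Salt the skewed side: key -> "key:<random bucket>".
    val salted = skewed.map { case (k, v) => (s"$k:${Random.nextInt(SALT)}", v) }

    // Replicate the other side once per bucket so every salted key has a partner.
    val replicated = small.flatMap { case (k, v) =>
      (0 until SALT).map(i => (s"$k:$i", v))
    }

    // Join on the salted keys, then strip the salt to recover the original key.
    val joined = salted.join(replicated).map { case (saltedKey, v) =>
      (saltedKey.substring(0, saltedKey.lastIndexOf(":")), v)
    }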
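
Items 20b, 22 and 23 together: hash-partition the data, cache the expensive layout, and note that map discards the partitioner while mapValues keeps it.

    import org.apache.spark.HashPartitioner

    val partitioned = pairs                    // pairs from the earlier sketch
      .partitionBy(new HashPartitioner(64))
      .cache()                                 // item 22: cache after the repartition

    // map may rewrite the key, so Spark drops the partitioner:
    assert(partitioned.map { case (k, v) => (k, v + 1) }.partitioner.isEmpty)

    // mapValues cannot touch the key, so the partitioner survives:
    assert(partitioned.mapValues(_ + 1).partitioner.isDefined)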
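
Item 26: a broadcast variable ships a read-only lookup table to each executor once, instead of serializing it into every task closure.

    // Hypothetical small lookup table living on the driver.
    val lookup = sc.broadcast(Map("a" -> "alpha", "b" -> "beta"))

    val decoded = pairs.map { case (k, v) => (lookup.value.getOrElse(k, "?"), v) }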
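
Finally, items 21, 24 and 27 are configuration-level knobs; MyRecord is a hypothetical application class.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    case class MyRecord(id: Long, name: String)

    val tunedConf = new SparkConf()
      // Item 27: Kryo is more compact and faster than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write small IDs instead of full class names.
      .registerKryoClasses(Array(classOf[MyRecord]))
      // Item 24: re-launch suspiciously slow tasks on another executor.
      .set("spark.speculation", "true")

    // Item 21: serialized caching trades some CPU for a smaller, GC-friendlier heap.
    pairs.persist(StorageLevel.MEMORY_ONLY_SER)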
