Spark Internals Drive Design


This document discusses lessons learned from working with Spark's machine learning library (MLlib) for collaborative filtering on a large dataset. It covers four main lessons: 1. Spark uses more memory than expected due to JVM overhead, metadata for shuffles and jobs, and Scala vs. Java. This can be addressed through careful partitioning, serialization with Kryo, and cleaning up long-running jobs. 2. Shuffles between nodes are expensive and can cause out-of-memory errors, so it is best to avoid them by using the driver for collection, broadcast variables, and accumulators. 3. Sending data through the driver has memory limits, so partitions and Akka frame sizes must be configured based


Lessons from Spark
(or: How I learned to stop worrying and love the shuffle)
By: Ilya Ganelin

What:
•  Recommendations (MLlib)
    –  Collaborative Filtering (CF)
    –  65 million × 10 million
    –  2.5 TB, ~12 PFlops
    –  All recs (long tail)
•  You’re mad!
    –  Don’t usually generate all possible recs
    –  6 data nodes (~40 GB RAM)
    –  Shared cluster
    –  Spark stability

Overview
•  Goal:
    –  Understand how Spark internals drive design and configuration
•  Contents:
    –  Background
        •  Partitions
        •  Caching
        •  Serialization
        •  Shuffle
    –  Lessons 1-4

Partitions, Caching, and Serialization
•  Partitions
    –  How data is split on disk
    –  Affects memory usage, shuffle size
    –  Count ~ speed, Count ~ 1/memory
•  Caching
    –  Persist RDDs in distributed memory
    –  Major speedup for repeated operations
•  Serialization
    –  Efficient movement of data
    –  Java vs. Kryo

Shuffle!
•  All-to-all operations
    –  reduceByKey, groupByKey
•  Data movement
    –  Serialization
    –  Akka
•  Memory overhead
    –  Dumps to disk when OOM
    –  Garbage collection
•  EXPENSIVE!
Map Reduce
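
Why "all-to-all" makes the shuffle expensive can be sketched without a cluster. The plain-Python toy below is only an illustration and does not use Spark: the function name shuffle_by_key, the modulo routing rule, and the sample records are all assumptions made for this sketch. It routes each (key, value) record to a reduce-side partition by its key and records which map partition sent data to which reduce partition.

```python
from collections import defaultdict

def shuffle_by_key(map_partitions, num_reducers):
    """Route every (key, value) record to the reducer chosen from its key."""
    reduce_partitions = [defaultdict(list) for _ in range(num_reducers)]
    transfers = set()  # (source partition, destination partition) pairs that moved data
    for src, records in enumerate(map_partitions):
        for key, value in records:
            dst = key % num_reducers  # integer keys keep the routing deterministic
            reduce_partitions[dst][key].append(value)
            transfers.add((src, dst))
    return reduce_partitions, transfers

# Hypothetical input: 3 map-side partitions of (user_id, rating) records.
map_partitions = [
    [(1, 5.0), (2, 3.0), (3, 4.0)],
    [(1, 2.0), (2, 1.0), (4, 5.0)],
    [(2, 4.0), (3, 3.0), (4, 2.0)],
]
reduce_partitions, transfers = shuffle_by_key(map_partitions, num_reducers=2)

# groupByKey-style result: all values for a key end up in one place.
grouped = {k: v for part in reduce_partitions for k, v in part.items()}
# reduceByKey-style result: each key's values folded into a single sum.
summed = {k: sum(v) for k, v in grouped.items()}
```

With 3 input partitions and 2 output partitions, every one of the 3 × 2 source/destination pairs carries data, which is the all-to-all pattern. In a real job each of those transfers also pays for serialization and network movement.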

Lesson 1: Spark is a problem child!
•  Memory
    –  You’re using more than you think
        •  JVM overhead
        •  Spark metadata (shuffles, long-running jobs)
        •  Scala vs. Java
    –  Shuffle & heap
•  Debugging is hard
    –  Distributed logs
    –  Hundreds of tasks

Lesson 1: Discipline
•  Tame the beast (memory)
    –  Partition wisely
    –  Know your data!
        •  Size, Types, Distribution
    –  Kryo Serialization
•  Cleanup
    –  Long-term jobs consume memory indefinitely
    –  Spark context cleanup fails in production environment
    –  Solution: YARN!
        •  Separate spark-submits per batch
        •  Stable Spark-based job that runs for weeks

Lesson 2: Avoid shuffles!
•  Why?
    –  Speed up execution
    –  Increase stability
    –  ????
    –  Profit!
•  How?
    –  Use the driver!
        •  Collect
        •  Broadcast
        •  Accumulators
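
The idea of trading a shuffle for driver-side collect, broadcast, and accumulators can be sketched in plain Python. This is only an illustration under stated assumptions and does not call Spark: the function broadcast_lookup_join, the sample ratings, and the item_names table are hypothetical. The dict stands in for a small dataset that was collected on the driver and broadcast to executors, and the integer counter stands in for an accumulator.

```python
def broadcast_lookup_join(large_partitions, small_table):
    """Join each partition locally against a small dict; no record changes partition."""
    joined = []
    unmatched = 0  # stand-in for an accumulator counting records with no match
    for records in large_partitions:
        out = []
        for key, value in records:
            # In Spark the dict would be read from a broadcast variable on the executor.
            if key in small_table:
                out.append((key, (value, small_table[key])))
            else:
                unmatched += 1
        joined.append(out)
    return joined, unmatched

# Hypothetical data: ratings split across two partitions, plus a small item-name
# table that fits in driver memory (the result of a collect).
ratings = [
    [(101, 5.0), (102, 3.0)],
    [(103, 4.0), (101, 1.0)],
]
item_names = {101: "alpha", 102: "beta"}

joined, unmatched = broadcast_lookup_join(ratings, item_names)
```

Each output partition is built only from the matching input partition plus the shared lookup table, so nothing has to be repartitioned by key. The trade-off is that the small table must fit in memory on the driver and on every executor.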

Lesson 3: Using the driver is hard!
•  Limited memory
    –  Collected RDDs
    –  Metadata
    –  Results (Accumulators)
•  Akka messaging
    –  10e6 × 120 bytes ≈ 1.2 GB; 20 partitions
    –  Read ~60 MB per partition (default frame size is 10 MB)
    –  Solution: Partition & set akka.frameSize - know your data!
•  Big data
    –  Solution: Batch process
        •  Problem: Cleanup and long-term stability
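
The frame-size arithmetic on this slide can be checked with a few lines of Python. The helper names per_partition_mb and min_partitions are made up for this sketch, and it assumes decimal megabytes (1 MB = 10^6 bytes) to match the slide's rounding, so treat the results as estimates and leave headroom in practice.

```python
import math

def per_partition_mb(num_records, bytes_per_record, num_partitions):
    """Approximate size of one partition in decimal megabytes."""
    total_bytes = num_records * bytes_per_record
    return total_bytes / num_partitions / 1e6

def min_partitions(num_records, bytes_per_record, frame_size_mb):
    """Smallest partition count that keeps each partition within the frame size."""
    total_mb = num_records * bytes_per_record / 1e6
    return math.ceil(total_mb / frame_size_mb)

records = 10e6       # as written on the slide, i.e. 10 million records
record_bytes = 120   # bytes per record, from the slide

size_at_20 = per_partition_mb(records, record_bytes, 20)   # MB per partition with 20 partitions
needed_at_10 = min_partitions(records, record_bytes, 10)   # partitions needed at a 10 MB frame size
```

With the slide's numbers, 20 partitions give about 60 MB each, six times the 10 MB default, so either the frame size has to grow or the data has to be split into roughly 120 or more partitions.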

Lesson 4: Speed!
•  Cache, but cache wisely
    –  If you use it twice, cache it
•  Broadcast variables
    –  Visible to all executors
    –  Only serialized once
    –  Blazing-fast lookups!
•  Threading
    –  Thread pool on driver
    –  Fast operations, many tasks
•  75x speedup over MLlib ALS predict()
    –  Start: 1 rec / 1.5 seconds
    –  End: 50 recs / second
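
The "thread pool on driver, fast operations, many tasks" pattern can be sketched with Python's standard library. This is not the author's code and it does not call Spark: ITEM_FACTORS, top_item, and the sample users are hypothetical, with the dict standing in for a broadcast lookup table and each call standing in for one small, fast task submitted from the driver.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for broadcast item factors: item_id -> factor vector.
ITEM_FACTORS = {1: (0.9, 0.1), 2: (0.2, 0.8), 3: (0.5, 0.5)}

def top_item(user_factors):
    """Return the item whose factor vector has the largest dot product with the user's."""
    def score(item_id):
        return sum(u * i for u, i in zip(user_factors, ITEM_FACTORS[item_id]))
    return max(ITEM_FACTORS, key=score)

# Hypothetical user factor vectors.
users = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (0.4, 0.6)}

# Driver-side thread pool: each task is small, so many can be in flight at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(users, pool.map(top_item, users.values())))

# Speedup quoted on the slide: 50 recs/s divided by (1 rec / 1.5 s).
speedup = 50 * 1.5
```

On a real driver the threads would mostly wait on cluster work rather than compute locally, which is why a modest pool can keep many small jobs in flight. The last line reproduces the slide's figure: going from 1 rec per 1.5 seconds to 50 recs per second is a 75x improvement.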

