2. Why Sparklint?
• A successful Spark cluster grows rapidly
• Capacity and capability mismatches arise
• Leads to resource contention
• Tuning process is non-trivial
• The current Spark UI is operations-focused
We wanted to understand application efficiency
3. Sparklint provides:
• Live view of batch & streaming application stats (see the launch sketch below), or
• Event-by-event analysis of historical event logs
• Stats and graphs for:
– Idle time
– Core usage
– Task locality
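To get the live view, Sparklint's listener is attached when the app's SparkConf is built. A minimal sketch in Scala; the listener class name follows the Sparklint README, and the app name is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Attach Sparklint's listener so stats stream to its live UI.
    // Requires the matching sparklint artifact for your Spark/Scala
    // version on the classpath.
    val conf = new SparkConf()
      .setAppName("access-log-stats")  // illustrative name
      .set("spark.extraListeners", "com.groupon.sparklint.SparklintListener")
    val sc = new SparkContext(conf)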
7. Demo…
• Simulated workload analyzing site access logs (sketched in Scala below):
– read text file as JSON
– convert to Record(ip, verb, status, time)
– countByIp, countByStatus, countByVerb
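A rough Scala sketch of that workload; the steps and the Record field names come from the slide, while the field types, the JSON layout of the input, and the use of json4s for parsing are assumptions:

    import org.apache.spark.SparkContext
    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    // One parsed access-log line; field names per the slide.
    case class Record(ip: String, verb: String, status: Int, time: Long)

    def run(sc: SparkContext, path: String): Unit = {
      implicit val formats: Formats = DefaultFormats

      // read text file as JSON, convert to Record
      val records = sc.textFile(path)
        .map(line => parse(line).extract[Record])

      // countByIp / countByStatus / countByVerb
      val byIp     = records.map(_.ip).countByValue()
      val byStatus = records.map(_.status.toString).countByValue()
      val byVerb   = records.map(_.verb).countByValue()
    }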
8. Job took 10m7s to finish
Already a pretty good distribution; low idle time indicates good worker usage and minimal driver-node interaction in the job.
But overall utilization is low, which is reflected in the common occurrence of the IDLE state (unused cores).
9. Job took 15m14s to finish
Core usage increased and the job is more efficient; execution time increased, but the app is not CPU-bound.
10. Job took 9m24s to finish
Core utilization decreased proportionally, trading efficiency for a shorter execution time.
Lots of IDLE state shows we are over-allocating resources.
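The slides don't show which knobs changed between runs; with a static allocation, this trade-off is typically made with settings like these (values illustrative):

    import org.apache.spark.SparkConf

    // Illustrative static sizing; the actual values used in the demo
    // are not on the slides. A smaller allocation cuts IDLE time at
    // the cost of wall-clock time, and vice versa.
    val conf = new SparkConf()
      .set("spark.executor.instances", "8")  // number of executors
      .set("spark.executor.cores", "4")      // cores per executor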
11. Job took 11m34s to finish
Dynamic allocation is only effective at app start due to a long executorIdleTimeout setting.
Core utilization remains low; the config settings are not right for this workload.
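A sketch of the dynamic-allocation settings in play; the values are illustrative, but spark.dynamicAllocation.executorIdleTimeout is the setting named above:

    import org.apache.spark.SparkConf

    // Dynamic allocation requires the external shuffle service. A long
    // executorIdleTimeout keeps idle executors around, so in practice
    // the allocation only adjusts once, at app start.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.executorIdleTimeout", "600s")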
12. Job took 33m5s to finish
Core utilization is up, but execution time is up dramatically due to reclaiming resources before each short-running task.
IDLE state is reduced to a minimum and looks efficient, but execution is much slower due to dynamic-allocation overhead.
Executor churn!
13. Job took 7m34s to finish
Core utilization way up, with lower execution time.
Flat tops show we are becoming CPU-bound.
Parallel execution is clearly visible in overlapping stages.
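The slides don't say how the overlap was achieved; one common way to get overlapping stages is to submit the independent count actions from separate threads. A sketch, reusing the Record RDD from the earlier workload sketch:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.rdd.RDD

    // Independent actions submitted from separate threads let the
    // scheduler run their stages concurrently, which shows up as
    // overlapping stages in the graph.
    def runConcurrently(records: RDD[Record]): Unit = {
      val cached = records.cache()  // share the parsed data across jobs
      val jobs = Seq(
        Future(cached.map(_.ip).countByValue()),
        Future(cached.map(_.status.toString).countByValue()),
        Future(cached.map(_.verb).countByValue())
      )
      Await.result(Future.sequence(jobs), Duration.Inf)
    }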
14. Job took 5m6s to finish
Core utilization decreases, again trading efficiency for a shorter execution time.
15. Thanks to dynamic allocation, utilization is high despite this being a bimodal application.
Data loading and mapping need a large core count to get throughput.
Aggregation and I/O of the results are optimized for the final file size and therefore need fewer cores.
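One way to size that second phase, assuming the counts are kept as an RDD (e.g. built with reduceByKey) rather than collected to the driver; the partition count is illustrative:

    import org.apache.spark.rdd.RDD

    // The result set is small, so write it through a handful of
    // partitions; the remaining executors go idle and dynamic
    // allocation can release them.
    def writeCounts(counts: RDD[(String, Long)], out: String): Unit =
      counts.coalesce(4).saveAsTextFile(out)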
16. Future Features:
• Increased job & stage detail in UI
• History Server event sources
• Inline recommendations
• Auto-tuning
• Streaming stage parameter delegation
17. The Credits:
• Lead developer is Robert Xue
• https://github.com/roboxue
• SDE @ Groupon