SlideShare ist ein Scribd-Unternehmen logo
1 von 56
Downloaden Sie, um offline zu lesen
Making Sense of Performance in Data
Analytics Frameworks
Kay Ousterhout
Joint work with Ryan Rasti, Sylvia Ratnasamy,
Scott Shenker, Byung-Gon Chun
UC Berkeley
About Me
PhD student at UC Berkeley
Thesis work centers around performance of large-scale
distributed systems
Spark PMC member
…	
  
…	
  
Task Task
Task
Task
Spark (or Hadoop/Dryad/etc.) task
…	
  
Task Task
Task
Task
Spark (or Hadoop/Dryad/etc.) task
…	
  
…	
  
Task Task
Task
Task
Task Task
Task
Task
…	
  
…	
  How can we make this job faster?
Cache input data in
memory
Task
Task
…	
  How can we make this job faster?
Cache input data in
memory
Task
Task
…	
  How can we make this job faster?
Cache input data in
memory
Optimize the network
Task
Task
Task
Task
…	
  How can we make this job faster?
Cache input data in
memory
Optimize the network
Task
Task
Task
Task
Task
Task
…	
  How can we make this job faster?
Cache input data in
memory
Optimize the network
Mitigate effect of
stragglers
Task
Task
Task: 30s
Task: 2s
Task: 3s
Task: 2s
Stragglers
Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10],
Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14]
Disk
Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14]
Network
Load balancing: VL2 [SIGCOMM ‘09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13]
Application semantics: Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ‘14], Varys
[SIGCOMM ’14]
Reduce data sent: PeriSCOPE [OSDI ‘12], SUDO [NSDI ’12]
In-network aggregation: Camdoop [NSDI ’12]
Better isolation and fairness: Oktopus [SIGCOMM ’11], EyeQ [NSDI ‘12], FairCloud
[SIGCOMM ’12]
Stragglers
Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10],
Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14]
Disk
Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14]
Network
Load balancing: VL2 [SIGCOMM ‘09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13]
Application semantics: Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ‘14], Varys
[SIGCOMM ’14]
Reduce data sent: PeriSCOPE [OSDI ‘12], SUDO [NSDI ’12]
In-network aggregation: Camdoop [NSDI ’12]
Better isolation and fairness: Oktopus [SIGCOMM ’11], EyeQ [NSDI ‘12], FairCloud
[SIGCOMM ’12]
Missing: what’s most important to
end-to-end performance?
Stragglers
Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10],
Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14]
Disk
Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14]
Network
Load balancing: VL2 [SIGCOMM ‘09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13]
Application semantics: Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ‘14], Varys
[SIGCOMM ’14]
Reduce data sent: PeriSCOPE [OSDI ‘12], SUDO [NSDI ’12]
In-network aggregation: Camdoop [NSDI ’12]
Better isolation and fairness: Oktopus [SIGCOMM ’11], EyeQ [NSDI ‘12], FairCloud
[SIGCOMM ’12]
Widely-accepted mantras:
Network and disk I/O are bottlenecks
Stragglers are a major issue with
unknown causes
(1)  How can we quantify performance bottlenecks?
Blocked time analysis
(2) Do the mantras hold?
Takeaways based on three workloads run
with Spark
This work
Takeaways based on three Spark workloads:
Network optimizations
can reduce job completion time by at most 2%
CPU (not I/O) often the bottleneck
<19% reduction in completion time from optimizing disk
Many straggler causes can be identified and
fixed
Takeaways will not hold
for every single analytics workload
nor for all time	
  
Accepted mantras are often not true
Methodology to avoid performance
misunderstandings in the future
This work:
Outline
•  Methodology: How can we measure Spark bottlenecks?
•  Workloads: What workloads did we use?
•  Results: How well do the mantras hold?
•  Why?: Why do our results differ from past work?
•  Demo: How can you understand your own workload?
Outline
•  Methodology: How can we measure Spark bottlenecks?
•  Workloads: What workloads did we use?
•  Results: How well do the mantras hold?
•  Why?: Why do our results differ from past work?
•  Demo: How can you understand your own workload?
What’s the job’s bottleneck?
What exactly happens in a Spark task?
Task reads shuffle data, generates in-memory output
compute
network
time
(1) Request a few
shuffle blocks
disk
(2) Start processing
local data
(3) Process data
fetched remotely
(4) Continue fetching
remote data
: time to handle
one shuffle block
What’s the bottleneck for this task?
Task reads shuffle data, generates in-memory output
compute
network
time
disk
Bottlenecked on
network and disk
Bottlenecked
on network
Bottlenecked on
CPU
What’s the bottleneck for the job?
time
tasks
compute
network
disk
Task x: may be bottlenecked
on different resources at
different times
Time t: different tasks may be
bottlenecked on different resources
How does network affect the job’s completion
time?
time
tasks
:Time when task is
blocked on the
network
Blocked time analysis: how much faster would the
job complete if tasks never blocked on the network?
Blocked time analysis
tasks
(2) Simulate how job completion
time would change
(1) Measure time
when tasks are
blocked on the
network
(1) Measure time when tasks are blocked on
network
compute
network
disk
: time blocked
on network
: time blocked
on disk
Original task runtime
task runtime if network were infinitely fastBest case
(2) Simulate how job completion time would
change
Task 0
Task 1
Task 2
time
2slots
to: Original job completion time
Task 0
Task 1
Task 2
2slots
Incorrectly computed time: doesn’t
account for task scheduling
: time blocked
on network
(2) Simulate how job completion time would
change
Task 0
Task 1
Task 2
time
2slots
to: Original job completion time
Task 0
Task 1 Task 2
2slots
: time blocked
on network
tn: Job completion time with infinitely fast network
Blocked time analysis: how quickly
could a job have completed if a resource
were infinitely fast?
Outline
•  Methodology: How can we measure Spark bottlenecks?
•  Workloads: What workloads did we use?
•  Results: How well do the mantras hold?
•  Why?: Why do our results differ from past work?
•  Demo: How can you understand your own workload?
Large-scale traces?
Don’t have enough instrumentation for
blocked-time analysis
SQL Workloads run on Spark
TPC-DS (20 machines, 850GB;
60 machines, 2.5TB; 200 machines, 2.5TB)
Big Data Benchmark (5 machines, 60GB)
Databricks (Production; 9 machines, tens of GB)
2 versions of each: in-memory, on-disk
Only 3 workloads
Small cluster sizes
Outline
•  Methodology: How can we measure Spark bottlenecks?
•  Workloads: What workloads did we use?
•  Results: How well do the mantras hold?
•  Why?: Why do our results differ from past work?
•  Demo: How can you understand your own workload?
How much faster could jobs get from optimizing
network performance?
How much faster could jobs get from optimizing
network performance?
5
95
75
25
50
Percentiles
Median improvement: 2%
95%ile improvement: 10%
How much faster could jobs get from optimizing
network performance?
Median improvement at most 2%
5
95
75
25
50
Percentiles
How much faster could jobs get from optimizing
disk performance?
Median improvement at most 19%
How important is CPU?
CPU much more highly utilized than
disk or network!
What about stragglers?
5-10% improvement from eliminating stragglers
Based on simulation
Can explain >60% of stragglers in >75% of jobs
Fixing underlying cause can speed up other tasks too!
2x speedup from fixing one straggler cause
Takeaways based on three Spark workloads:
Network optimizations
can reduce job completion time by at most 2%
CPU (not I/O) often the bottleneck
<19% reduction in completion time from optimizing disk
Many straggler causes can be identified and
fixed
Outline
•  Methodology: How can we measure Spark bottlenecks?
•  Workloads: What workloads did we use?
•  Results: How well do the mantras hold?
•  Why?: Why do our results differ from past work?
•  Demo: How can you understand your own workload?
network
>
Why are our results so different than what’s
stated in prior work?
Are the workloads we measured unusually
network-light?
How can we compare our workloads to large-
scale traces used to motivate prior work?
How much data is transferred per CPU second?
Microsoft ’09-’10: 1.9–6.35 Mb / task second
Google ’04-‘07: 1.34–1.61 Mb / machine second
Why are our results so different than what’s
stated in prior work?
Our workloads are network light
1)  Incomplete metrics
2)  Conflation of CPU and network time
When is the network used?
map
task
map
task
map
task
…
reduce
task
reduce
task
reduce
task
…
Input data
(read
locally)
Output
data
(1) To shuffle
intermediate
data
(2) To
replicate
output data
Some work
focuses only on
the shuffle
How does the data transferred over the network
compare to the input data?
Not realistic to look only at shuffle!
Or to use workloads where all input is shuffled
Shuffled data is only
~1/3 of input data!
Even less output data
Prior work conflates CPU and network time
To send data over network:
(1) Serialize objects into
bytes
(2) Send bytes
(1) and (2) often conflated.
Reducing application data sent reduces both!
When does the network matter?
Network important when:
(1)  Computation optimized
(2)  Serialization time low
(3)  Large amount of data sent
over network
Why are our results so different than what’s
stated in prior work?
Our workloads are network light
1) Incomplete metrics
e.g., looking only at shuffle time
2) Conflation of CPU and network time
Sending data over the network has an associated CPU cost
Limitations
Only three workloads
Industry-standard workloads
Results sanity-checked with larger production traces
Small cluster sizes
Results don’t change when we move between cluster sizes
Limitations aren’t fatal
Only three workloads
Industry-standard workloads
Results sanity-checked with larger production traces
Small cluster sizes
Takeaways don’t change when we move between cluster
sizes
Outline
•  Methodology: How can we measure Spark bottlenecks?
•  Workloads: What workloads did we use?
•  Results: How well do the mantras hold?
•  Why?: Why do our results differ from past work?
•  Demo: How can you understand your own workload?
Demo
Demo
Often can tune parameters to shift the bottleneck
(e.g., change snappy to lzf)
What’s missing from Spark metrics?
Time blocked on reading input data and writing output
data (HADOOP-11873)
Time spent spilling intermediate data to disk
(SPARK-3577)
Network optimizations
can reduce job completion time by at most 2%
CPU (not I/O) often the bottleneck
<19% reduction in completion time from optimizing disk
Many straggler causes can be identified and fixed
All traces and tools publicly available:
tinyurl.com/summit-traces
Takeaway: performance understandability should
be a first-class concern!
(almost) All Instrumentation now part of Spark
I want your workload! keo@eecs.berkeley.edu

Weitere ähnliche Inhalte

Was ist angesagt?

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteSpark Summit
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Spark Summit
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiSpark Summit
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...Spark Summit
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Sigmoid
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanDatabricks
 
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Spark Summit
 

Was ist angesagt? (20)

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
 
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
 

Andere mochten auch

How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...Spark Summit
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining MeetupDan Crankshaw
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsVillu Ruusmann
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
Vasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python ProfilingVasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python ProfilingSergey Arkhipov
 
Denis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceSergey Arkhipov
 
The High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian OzsvaldThe High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian OzsvaldPyData
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationGlobalLogic Ukraine
 
Spark Infrastructure Made Easy
Spark Infrastructure Made EasySpark Infrastructure Made Easy
Spark Infrastructure Made EasyBlueData, Inc.
 
Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLVillu Ruusmann
 
Spark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance TuningSpark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance Tuning晨揚 施
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Python profiling
Python profilingPython profiling
Python profilingdreampuf
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
 
Production Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlibProduction Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlibSpark Summit
 

Andere mochten auch (20)

How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining Meetup
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
Vasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python ProfilingVasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python Profiling
 
Denis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python Performance
 
The High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian OzsvaldThe High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian Ozsvald
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
 
Spark Infrastructure Made Easy
Spark Infrastructure Made EasySpark Infrastructure Made Easy
Spark Infrastructure Made Easy
 
Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
Spark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance TuningSpark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance Tuning
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Python profiling
Python profilingPython profiling
Python profiling
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Production Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlibProduction Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlib
 

Ähnlich wie Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentDatabricks
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...Codemotion
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...Codemotion Tel Aviv
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLHyderabad Scalability Meetup
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
EEDC - Apache Pig
EEDC - Apache PigEEDC - Apache Pig
EEDC - Apache Pigjavicid
 
How Many Slaves (Ukoug)
How Many Slaves (Ukoug)How Many Slaves (Ukoug)
How Many Slaves (Ukoug)Doug Burns
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Tek12: Graphing real-time performance with Graphite
Tek12: Graphing real-time performance with GraphiteTek12: Graphing real-time performance with Graphite
Tek12: Graphing real-time performance with Graphitenanderoo
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Ahsan Javed Awan
 

Ähnlich wie Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley) (20)

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
EEDC - Apache Pig
EEDC - Apache PigEEDC - Apache Pig
EEDC - Apache Pig
 
How Many Slaves (Ukoug)
How Many Slaves (Ukoug)How Many Slaves (Ukoug)
How Many Slaves (Ukoug)
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Tek12: Graphing real-time performance with Graphite
Tek12: Graphing real-time performance with GraphiteTek12: Graphing real-time performance with Graphite
Tek12: Graphing real-time performance with Graphite
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Kürzlich hochgeladen (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)

  • 1. Making Sense of Performance in Data Analytics Frameworks Kay Ousterhout Joint work with Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun UC Berkeley
  • 2. About Me PhD student at UC Berkeley Thesis work centers around performance of large-scale distributed systems Spark PMC member
  • 3. …   …   Task Task Task Task Spark (or Hadoop/Dryad/etc.) task
  • 4. …   Task Task Task Task Spark (or Hadoop/Dryad/etc.) task …  
  • 5. …   Task Task Task Task Task Task Task Task …  
  • 6. …  How can we make this job faster? Cache input data in memory Task Task
  • 7. …  How can we make this job faster? Cache input data in memory Task Task
  • 8. …  How can we make this job faster? Cache input data in memory Optimize the network Task Task Task Task
  • 9. …  How can we make this job faster? Cache input data in memory Optimize the network Task Task Task Task Task Task
  • 10. …  How can we make this job faster? Cache input data in memory Optimize the network Mitigate effect of stragglers Task Task Task: 30s Task: 2s Task: 3s Task: 2s
  • 11. Stragglers Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10], Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14] Disk Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14] Network Load balancing: VL2 [SIGCOMM ‘09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13] Application semantics: Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ‘14], Varys [SIGCOMM ’14] Reduce data sent: PeriSCOPE [OSDI ‘12], SUDO [NSDI ’12] In-network aggregation: Camdoop [NSDI ’12] Better isolation and fairness: Oktopus [SIGCOMM ’11], EyeQ [NSDI ‘12], FairCloud [SIGCOMM ’12]
  • 12. Stragglers Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10], Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14] Disk Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14] Network Load balancing: VL2 [SIGCOMM ‘09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13] Application semantics: Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ‘14], Varys [SIGCOMM ’14] Reduce data sent: PeriSCOPE [OSDI ‘12], SUDO [NSDI ’12] In-network aggregation: Camdoop [NSDI ’12] Better isolation and fairness: Oktopus [SIGCOMM ’11], EyeQ [NSDI ‘12], FairCloud [SIGCOMM ’12] Missing: what’s most important to end-to-end performance?
  • 13. Stragglers Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10], Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14] Disk Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14] Network Load balancing: VL2 [SIGCOMM ‘09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13] Application semantics: Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ‘14], Varys [SIGCOMM ’14] Reduce data sent: PeriSCOPE [OSDI ‘12], SUDO [NSDI ’12] In-network aggregation: Camdoop [NSDI ’12] Better isolation and fairness: Oktopus [SIGCOMM ’11], EyeQ [NSDI ‘12], FairCloud [SIGCOMM ’12] Widely-accepted mantras: Network and disk I/O are bottlenecks Stragglers are a major issue with unknown causes
  • 14. (1)  How can we quantify performance bottlenecks? Blocked time analysis (2) Do the mantras hold? Takeaways based on three workloads run with Spark This work
  • 15. Takeaways based on three Spark workloads: Network optimizations can reduce job completion time by at most 2% CPU (not I/O) often the bottleneck <19% reduction in completion time from optimizing disk Many straggler causes can be identified and fixed
  • 16. Takeaways will not hold for every single analytics workload nor for all time  
  • 17. Accepted mantras are often not true Methodology to avoid performance misunderstandings in the future This work:
  • 18. Outline •  Methodology: How can we measure Spark bottlenecks? •  Workloads: What workloads did we use? •  Results: How well do the mantras hold? •  Why?: Why do our results differ from past work? •  Demo: How can you understand your own workload?
  • 19. Outline •  Methodology: How can we measure Spark bottlenecks? •  Workloads: What workloads did we use? •  Results: How well do the mantras hold? •  Why?: Why do our results differ from past work? •  Demo: How can you understand your own workload?
  • 20. What’s the job’s bottleneck?
  • 21. What exactly happens in a Spark task? Task reads shuffle data, generates in-memory output compute network time (1) Request a few shuffle blocks disk (2) Start processing local data (3) Process data fetched remotely (4) Continue fetching remote data : time to handle one shuffle block
  • 22. What’s the bottleneck for this task? Task reads shuffle data, generates in-memory output compute network time disk Bottlenecked on network and disk Bottlenecked on network Bottlenecked on CPU
  • 23. What’s the bottleneck for the job? time tasks compute network disk Task x: may be bottlenecked on different resources at different times Time t: different tasks may be bottlenecked on different resources
  • 24. How does network affect the job’s completion time? time tasks :Time when task is blocked on the network Blocked time analysis: how much faster would the job complete if tasks never blocked on the network?
  • 25. Blocked time analysis tasks (2) Simulate how job completion time would change (1) Measure time when tasks are blocked on the network
  • 26. (1) Measure time when tasks are blocked on network compute network disk : time blocked on network : time blocked on disk Original task runtime task runtime if network were infinitely fastBest case
  • 27. (2) Simulate how job completion time would change Task 0 Task 1 Task 2 time 2slots to: Original job completion time Task 0 Task 1 Task 2 2slots Incorrectly computed time: doesn’t account for task scheduling : time blocked on network
  • 28. (2) Simulate how job completion time would change Task 0 Task 1 Task 2 time 2slots to: Original job completion time Task 0 Task 1 Task 2 2slots : time blocked on network tn: Job completion time with infinitely fast network
  • 29. Blocked time analysis: how quickly could a job have completed if a resource were infinitely fast?
  • 30. Outline •  Methodology: How can we measure Spark bottlenecks? •  Workloads: What workloads did we use? •  Results: How well do the mantras hold? •  Why?: Why do our results differ from past work? •  Demo: How can you understand your own workload?
  • 31. Large-scale traces? Don’t have enough instrumentation for blocked-time analysis
  • 32. SQL Workloads run on Spark TPC-DS (20 machines, 850GB; 60 machines, 2.5TB; 200 machines, 2.5TB) Big Data Benchmark (5 machines, 60GB) Databricks (Production; 9 machines, tens of GB) 2 versions of each: in-memory, on-disk Only 3 workloads Small cluster sizes
  • 33. Outline •  Methodology: How can we measure Spark bottlenecks? •  Workloads: What workloads did we use? •  Results: How well do the mantras hold? •  Why?: Why do our results differ from past work? •  Demo: How can you understand your own workload?
  • 34. How much faster could jobs get from optimizing network performance?
  • 35. How much faster could jobs get from optimizing network performance? 5 95 75 25 50 Percentiles Median improvement: 2% 95%ile improvement: 10%
  • 36. How much faster could jobs get from optimizing network performance? Median improvement at most 2% 5 95 75 25 50 Percentiles
  • 37. How much faster could jobs get from optimizing disk performance? Median improvement at most 19%
  • 38. How important is CPU? CPU much more highly utilized than disk or network!
  • 39. What about stragglers? 5-10% improvement from eliminating stragglers Based on simulation Can explain >60% of stragglers in >75% of jobs Fixing underlying cause can speed up other tasks too! 2x speedup from fixing one straggler cause
  • 40. Takeaways based on three Spark workloads: Network optimizations can reduce job completion time by at most 2% CPU (not I/O) often the bottleneck <19% reduction in completion time from optimizing disk Many straggler causes can be identified and fixed
  • 41. Outline •  Methodology: How can we measure Spark bottlenecks? •  Workloads: What workloads did we use? •  Results: How well do the mantras hold? •  Why?: Why do our results differ from past work? •  Demo: How can you understand your own workload? network >
  • 42. Why are our results so different than what’s stated in prior work? Are the workloads we measured unusually network-light? How can we compare our workloads to large- scale traces used to motivate prior work?
  • 43. How much data is transferred per CPU second? Microsoft ’09-’10: 1.9–6.35 Mb / task second Google ’04-‘07: 1.34–1.61 Mb / machine second
  • 44. Why are our results so different than what’s stated in prior work? Our workloads are network light 1)  Incomplete metrics 2)  Conflation of CPU and network time
  • 45. When is the network used? map task map task map task … reduce task reduce task reduce task … Input data (read locally) Output data (1) To shuffle intermediate data (2) To replicate output data Some work focuses only on the shuffle
  • 46. How does the data transferred over the network compare to the input data? Not realistic to look only at shuffle! Or to use workloads where all input is shuffled Shuffled data is only ~1/3 of input data! Even less output data
  • 47. Prior work conflates CPU and network time To send data over network: (1) Serialize objects into bytes (2) Send bytes (1) and (2) often conflated. Reducing application data sent reduces both!
  • 48. When does the network matter? Network important when: (1)  Computation optimized (2)  Serialization time low (3)  Large amount of data sent over network
  • 49. Why are our results so different than what’s stated in prior work? Our workloads are network light 1) Incomplete metrics e.g., looking only at shuffle time 2) Conflation of CPU and network time Sending data over the network has an associated CPU cost
  • 50. Limitations Only three workloads Industry-standard workloads Results sanity-checked with larger production traces Small cluster sizes Results don’t change when we move between cluster sizes
  • 51. Limitations aren’t fatal Only three workloads Industry-standard workloads Results sanity-checked with larger production traces Small cluster sizes Takeaways don’t change when we move between cluster sizes
  • 52. Outline •  Methodology: How can we measure Spark bottlenecks? •  Workloads: What workloads did we use? •  Results: How well do the mantras hold? •  Why?: Why do our results differ from past work? •  Demo: How can you understand your own workload?
  • 53. Demo
  • 54. Demo Often can tune parameters to shift the bottleneck (e.g., change snappy to lzf)
  • 55. What’s missing from Spark metrics? Time blocked on reading input data and writing output data (HADOOP-11873) Time spent spilling intermediate data to disk (SPARK-3577)
  • 56. Network optimizations can reduce job completion time by at most 2% CPU (not I/O) often the bottleneck <19% reduction in completion time from optimizing disk Many straggler causes can be identified and fixed All traces and tools publicly available: tinyurl.com/summit-traces Takeaway: performance understandability should be a first-class concern! (almost) All Instrumentation now part of Spark I want your workload! keo@eecs.berkeley.edu