SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Downscaling: The Achilles heel of
Autoscaling Apache Spark Clusters
Prakhar Jain
Venkata Sowrirajan
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Why Autoscaling on cloud?
• How nodes in spark cluster are used?
• Scale up easy, scale down difficult
• Optimizations
Autoscaling on cloud
• Cloud for compute provides elasticity
– Launch nodes when required
– Take them away when you are done
– Pay-as-you-go model. No long term commitments.
• Autoscaling clusters are needed to use this elastic nature of the cloud
– Add nodes to the cluster when required
– Remove nodes from the cluster when the cluster utilization is low
• Use Cloud object stores to store the actual data and just use the
elastic clusters on the cloud for data processing/ML etc
How are nodes used?
Nodes/Instances in a Spark cluster are used for
• Compute
– Executors are launched on these nodes which do the actual processing of Data
• Intermediate temporary data
– Nodes are also used as temporary storage e.g. for storing temporary application
related shuffle/cache data
– Writing temporary data to object store (like s3 etc) deteriorates the overall
performance of the application
Upscale easy, downscale difficult
• Upscaling a cluster on cloud is easy
– When the workload on the cluster is high, simply add more nodes
– Can be achieved using simple Load balancer
• Downscaling nodes are difficult
– No running containers
– No shuffle/cache data stored on disks
– Container fragmentation within cluster nodes
– Some nodes have no containers running but are used for storage and vice versa
Factors affecting node downscaling
Terminology
Any cluster generally comprises of following entities:
• Resource Manager
– Administrator for allocating and managing resources in a cluster. e.g.
YARN/Mesos etc
• Application Driver
– Brain of the application
– Interacts with Resource Scheduler and negotiates for resources
• Ask for executors when needed
• Release executors when not needed
– e.g. Spark/Tez/MR etc
• Executor
– Actual worker responsible for running smallest unit of execution - task
Current resource allocation
strategy
1 2 3
Driver
Problem: Executors fragmentation
Current allocation strategy
allocates on emptier nodes first
Can we improve?
● Packing of executors
Low Usage Medium Usage High Usage
1 32
Job 5Job...Job n Job 1Job 2Job 3Job 4
Priority in which jobs are allocated to nodes in Qubole Model
Jobs are prevented from
being assigned first to
low usage nodes, instead
priority is given to medium
usage nodes.This
ensures that low usage
nodes can be
downscaled.
Low Usage Medium Usage High Usage
1 32
Job 5Job...Job n Job 3Job 4
Terminated
Nodes
Job 1Job 2
Job 1 & 2 allocated to medium usage nodes and these
nodes are moved into high usage category as the
utilization increases due to these new jobs
In the meanwhile, once
the tasks in the low usage
nodes are completed, the
node is freed up for
termination.
Cost Savings
Low Usage Medium Usage High Usage
1 32
Job 15Job n
Downscaled
Nodes
Job...
More jobs (3-14) are allocated to
medium usage nodes and these
nodes are moved into high usage
category as the usage increases
due to these new jobs
Cost Savings
As more tasks complete
more nodes are made
available for downscaling.
Low Usage Medium Usage High Usage
1 32
Job 21
Terminated
Nodes
Job n
As medium usage nodes are
reduced, jobs are allocated to
“Low Usage” nodes and
these nodes are moved into
the “Medium Usage” Nodes
Cost Savings
Low Usage Medium Usage High Usage
1 32
Terminated
Nodes
Job n
As jobs complete these
nodes are moved to
“Medium Usage” and “Low
Usage” nodes.
Cost Savings
Example revisited with new allocation
strategy
1 2 3
Eligible for downscaling
Driver
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
Downscale issues with Min Executors
Driver
1 2 3 4
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
Min executors distribution without
packing
Driver
1 2 3 4
Min executors distribution with packing
Driver
Rotate/refresh executors by killing them
and let resource scheduler do packing to
defragment the cluster
1 2 3 4
Nodes eligible for downscaling
How Shuffle data is produced / consumed?
How Shuffle data is produced / consumed?
Stage-1 (mapper
stage) with 3 tasks
----------
Stage-2 (reducer
stage) with 2 tasks
Can't downscale
executor 3
Since reducer
stage needs
shuffle data
generated by all
mappers, so
corresponding
executors needs
to be UP.
Problem: Executor can't be removed
until it holds any useful shuffle data
External Shuffle Service
• Root cause of problem: Executor which generated shuffle data is also
responsible for serving it. This ties shuffle data with executor
• Solution: Offload the responsibility of serving shuffle data to external
service
External Shuffle Service
This executor can
be removed as it
is idle
External Shuffle Service
• One ESS per node
– Responsible for serving shuffle data generated by any executor on that node
– Once the executor is idle, it can be taken away
• At Qubole:
– Once the node doesn't have any containers and ESS reports no shuffle data =>
node is downscaled
ESS at Qubole
• Also tracks information about presence of shuffle
data on the node
– This information is useful taking decision about node
downscaling
Recap
• Till now we have seen
– How to schedule executors using YARN-executor-
packing scheduling strategy
– How to re-pack min executors
– How to use External shuffle service (ESS) to
downscale executors
• What about shuffle data?
??
Shuffle Cleanup
• Shuffle data is deleted at the end of application by ESS
– In long running Spark applications (ex. interactive notebooks), it keeps on
accumulating
– Results in poor node downscaling
• Can it be deleted before end of application?
– What shuffle files are useful at a point of time?
Issues with long running applications
Master
E
S
S
E
S
S
APP1 - Exec4
APP1 - Exec5
E
S
S
APP1 - Driver
APP1 - Exec1
APP1 - Exec2
App 1 started
on cluster
with 2 initial
executors
APP1 - Exec6
APP1 - Exec8
APP1 - Exec9
APP1 - Exec10
APP1 - Exec3 APP1 - Exec7 APP1 - Exec11
App1 doesn't
need extra
executors
anymore -
downscaling
everything other
than min
executors (say
2)
App 1 asked for more
executors -
2 new workers brought up
multiple new executors added
Assume shuffle
data was generated
by tasks that ran on
this node.
This shuffle data
will be cleaned up
at the end of
application.
Problem: Node can't be taken away
from cluster till the application ends
Shuffle reuse in Spark
Skipped
Shuffle Cleanup
• Dataframe goes out of scope => shuffle data cannot be accessed
anymore
– Delete shuffle files when that dataframe goes out of scope
• Helps us in downscaling by making sure that unnecessary shuffle
data is deleted
– Saw 30-40% downscaling improvements
• SPARK-4287
Disaggregation of Compute and Storage
• To utilize full elasticity of the cloud, we have to
disaggregate the compute and storage
• Move shuffle data somewhere else?
– Requirement: Highly available shared storage service
– Use "Amazon FSx for Lustre" or similar services on other clouds
Downscaling a Node
Spark - Disaggregation of Compute and
Storage
• Mount some NFS endpoint on all the nodes of cluster
• Change shuffle manager in Spark to something which can read/write
shuffle from NFS mount point
– Splash (Opensource Apache 2.0 project) provides shuffle manager
implementation for shared filesystem
– Mappers/Reducers will write/read shuffle data to/from NFS via Splash
• SPARK-25299 (Use remote storage for persisting shuffle data) - In progress.
Summary and Future Work
• Different ways to improve downscaling
– Executor packing strategy and periodic executor refresh
– Use External Shuffle Service
– Faster Shuffle cleanup
– Disaggregate compute and storage
• Future Work: Offload shuffle data only when needed
– By default use local disk to read/write shuffle data
– When node is not used for compute, shift shuffle data to NFS
– Better downscaling without comprising much on performance
Thank You!

Weitere ähnliche Inhalte

Was ist angesagt?

Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaDaria Litvinov
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuFlink Forward
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInGuozhang Wang
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino ProjectMartin Traverso
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2Databricks
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 

Was ist angesagt? (20)

Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and Kafka
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 

Ähnlich wie Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters

Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache SparkDona Mary Philip
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on CloudQubole
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersAnant Corporation
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
4 memory management bb
4   memory management bb4   memory management bb
4 memory management bbShahid Riaz
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Alpine Data
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of Viewaragozin
 

Ähnlich wie Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters (20)

Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on Cloud
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource Managers
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
4 memory management bb
4   memory management bb4   memory management bb
4 memory management bb
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...amitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 

Kürzlich hochgeladen (20)

(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters

  • 1. Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters Prakhar Jain Venkata Sowrirajan #UnifiedDataAnalytics #SparkAISummit
  • 2. Agenda • Why Autoscaling on cloud? • How nodes in spark cluster are used? • Scale up easy, scale down difficult • Optimizations
  • 3. Autoscaling on cloud • Cloud for compute provides elasticity – Launch nodes when required – Take them away when you are done – Pay-as-you-go model. No long term commitments. • Autoscaling clusters are needed to use this elastic nature of the cloud – Add nodes to the cluster when required – Remove nodes from the cluster when the cluster utilization is low • Use Cloud object stores to store the actual data and just use the elastic clusters on the cloud for data processing/ML etc
  • 4. How are nodes used? Nodes/Instances in a Spark cluster are used for • Compute – Executors are launched on these nodes which do the actual processing of Data • Intermediate temporary data – Nodes are also used as temporary storage e.g. for storing temporary application related shuffle/cache data – Writing temporary data to object store (like s3 etc) deteriorates the overall performance of the application
  • 5. Upscale easy, downscale difficult • Upscaling a cluster on cloud is easy – When the workload on the cluster is high, simply add more nodes – Can be achieved using simple Load balancer • Downscaling nodes are difficult – No running containers – No shuffle/cache data stored on disks – Container fragmentation within cluster nodes – Some nodes have no containers running but are used for storage and vice versa
  • 7. Terminology Any cluster generally comprises of following entities: • Resource Manager – Administrator for allocating and managing resources in a cluster. e.g. YARN/Mesos etc • Application Driver – Brain of the application – Interacts with Resource Scheduler and negotiates for resources • Ask for executors when needed • Release executors when not needed – e.g. Spark/Tez/MR etc • Executor – Actual worker responsible for running smallest unit of execution - task
  • 8. Current resource allocation strategy 1 2 3 Driver Problem: Executors fragmentation Current allocation strategy allocates on emptier nodes first
  • 9. Can we improve? ● Packing of executors
  • 10. Low Usage Medium Usage High Usage 1 32 Job 5Job...Job n Job 1Job 2Job 3Job 4 Priority in which jobs are allocated to nodes in Qubole Model Jobs are prevented from being assigned first to low usage nodes, instead priority is given to medium usage nodes.This ensures that low usage nodes can be downscaled.
  • 11. Low Usage Medium Usage High Usage 1 32 Job 5Job...Job n Job 3Job 4 Terminated Nodes Job 1Job 2 Job 1 & 2 allocated to medium usage nodes and these nodes are moved into high usage category as the utilization increases due to these new jobs In the meanwhile, once the tasks in the low usage nodes are completed, the node is freed up for termination. Cost Savings
  • 12. Low Usage Medium Usage High Usage 1 32 Job 15Job n Downscaled Nodes Job... More jobs (3-14) are allocated to medium usage nodes and these nodes are moved into high usage category as the usage increases due to these new jobs Cost Savings As more tasks complete more nodes are made available for downscaling.
  • 13. Low Usage Medium Usage High Usage 1 32 Job 21 Terminated Nodes Job n As medium usage nodes are reduced, jobs are allocated to “Low Usage” nodes and these nodes are moved into the “Medium Usage” Nodes Cost Savings
  • 14. Low Usage Medium Usage High Usage 1 32 Terminated Nodes Job n As jobs complete these nodes are moved to “Medium Usage” and “Low Usage” nodes. Cost Savings
  • 15. Example revisited with new allocation strategy 1 2 3 Eligible for downscaling Driver
  • 17. #UnifiedDataAnalytics #SparkAISummit #UnifiedDataAnalytics #SparkAISummit Min executors distribution without packing Driver 1 2 3 4
  • 18. Min executors distribution with packing Driver Rotate/refresh executors by killing them and let resource scheduler do packing to defragment the cluster 1 2 3 4 Nodes eligible for downscaling
  • 19. How Shuffle data is produced / consumed?
  • 20. How Shuffle data is produced / consumed? Stage-1 (mapper stage) with 3 tasks ---------- Stage-2 (reducer stage) with 2 tasks Can't downscale executor 3 Since reducer stage needs shuffle data generated by all mappers, so corresponding executors needs to be UP. Problem: Executor can't be removed until it holds any useful shuffle data
  • 21. External Shuffle Service • Root cause of problem: Executor which generated shuffle data is also responsible for serving it. This ties shuffle data with executor • Solution: Offload the responsibility of serving shuffle data to external service
  • 22. External Shuffle Service This executor can be removed as it is idle
  • 23. External Shuffle Service • One ESS per node – Responsible for serving shuffle data generated by any executor on that node – Once the executor is idle, it can be taken away • At Qubole: – Once the node doesn't have any containers and ESS reports no shuffle data => node is downscaled
  • 24. ESS at Qubole • Also tracks information about presence of shuffle data on the node – This information is useful taking decision about node downscaling
  • 25. Recap • Till now we have seen – How to schedule executors using YARN-executor- packing scheduling strategy – How to re-pack min executors – How to use External shuffle service (ESS) to downscale executors • What about shuffle data? ??
  • 26. Shuffle Cleanup • Shuffle data is deleted at the end of application by ESS – In long running Spark applications (ex. interactive notebooks), it keeps on accumulating – Results in poor node downscaling • Can it be deleted before end of application? – What shuffle files are useful at a point of time?
  • 27. Issues with long running applications Master E S S E S S APP1 - Exec4 APP1 - Exec5 E S S APP1 - Driver APP1 - Exec1 APP1 - Exec2 App 1 started on cluster with 2 initial executors APP1 - Exec6 APP1 - Exec8 APP1 - Exec9 APP1 - Exec10 APP1 - Exec3 APP1 - Exec7 APP1 - Exec11 App1 doesn't need extra executors anymore - downscaling everything other than min executors (say 2) App 1 asked for more executors - 2 new workers brought up multiple new executors added Assume shuffle data was generated by tasks that ran on this node. This shuffle data will be cleaned up at the end of application. Problem: Node can't be taken away from cluster till the application ends
  • 28. Shuffle reuse in Spark Skipped
  • 29. Shuffle Cleanup • Dataframe goes out of scope => shuffle data cannot be accessed anymore – Delete shuffle files when that dataframe goes out of scope • Helps us in downscaling by making sure that unnecessary shuffle data is deleted – Saw 30-40% downscaling improvements • SPARK-4287
  • 30. Disaggregation of Compute and Storage • To utilize full elasticity of the cloud, we have to disaggregate the compute and storage • Move shuffle data somewhere else? – Requirement: Highly available shared storage service – Use "Amazon FSx for Lustre" or similar services on other clouds
  • 32. Spark - Disaggregation of Compute and Storage • Mount some NFS endpoint on all the nodes of cluster • Change shuffle manager in Spark to something which can read/write shuffle from NFS mount point – Splash (Opensource Apache 2.0 project) provides shuffle manager implementation for shared filesystem – Mappers/Reducers will write/read shuffle data to/from NFS via Splash • SPARK-25299 (Use remote storage for persisting shuffle data) - In progress.
  • 33. Summary and Future Work • Different ways to improve downscaling – Executor packing strategy and periodic executor refresh – Use External Shuffle Service – Faster Shuffle cleanup – Disaggregate compute and storage • Future Work: Offload shuffle data only when needed – By default use local disk to read/write shuffle data – When node is not used for compute, shift shuffle data to NFS – Better downscaling without comprising much on performance