Improving Apache Spark for Dynamic Allocation and Spot Instances


This presentation will explore the new work in Spark 3.1 adding the concept of graceful decommissioning and how we can use this to improve Spark’s performance in both dynamic allocation and spot/preemptable instances. Together we’ll explore how Spark’s dynamic allocation has evolved over time, and why the different changes have been needed. We’ll also look at the multi-company collaboration that resulted in being able to deliver this feature and I’ll end with encouraging pointers on how to get more involved in Spark’s development.

Who am I?
• Holden Karau
• She / her
• Apache Spark PMC
• Contributor to a lot of other projects
• Co-author of High Performance Spark, Learning Spark, and Kubeflow for Machine Learning
• http://bit.ly/holdenSparkVideos
• https://youtube.com/user/holdenkarau

Let us start at the beginning
• Spark achieves resilience through re-computation, which is part of how we go fast
• This poses challenges with removing executors that may contain data
• We "solved" it for YARN/Mesos back in the day
• I drank waaaay too much coffee and came up with an alternative
• But no one really liked it because we didn't need it, so I closed the Google doc and forgot about it

• Don’t worry, we’ll get to the code soon :)

But then….
• The "cloud" became really popular
• Kubernetes became popular
• Everything caught on fire :/

Our Protagonist Remembers
• I started drinking a lot of coffee
• We dusted off that old design and wrote some code
• And then I got hit by a car
• More people wrote more code
• We had a VOTE
• We wrote waaaaay more code
• Everyone lived happily ever after?
Photo by Lukas from Pexels

How did DA work on YARN?
• Scale up is "easy" (add more resources)
• Scale down required a stay-resident program to be on each YARN node to serve any files
• Spark stored its shuffle data as files
• Persisted in-memory data was still lost when scaling down an executor
Photo by Markus Spiske from Pexels
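
For reference, the classic YARN setup described above is usually switched on with two settings: `spark.dynamicAllocation.enabled` and `spark.shuffle.service.enabled` (the latter tells executors to rely on the stay-resident external shuffle service). The sketch below only renders such settings as `spark-submit --conf` flags. The executor bounds, the job name `my_job.py`, and the helper name `to_submit_flags` are illustrative assumptions, not values from the talk.

```python
# Sketch: classic dynamic allocation on YARN, rendered as spark-submit flags.
# The min/max executor values are illustrative, not recommendations.
yarn_da_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Relies on the external shuffle service running on every YARN node.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
}


def to_submit_flags(conf):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


yarn_flags = to_submit_flags(yarn_da_conf)
# my_job.py is a placeholder application name.
print(" ".join(["spark-submit"] + yarn_flags + ["my_job.py"]))
```

Note that nothing here protects cached (persisted) blocks: with this setup only shuffle files outlive a removed executor.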

Why did the cloud impact this?
• If you wanted the ~50% cost saving of spot/preemptible instances, you might lose entire machines
• Yes, Spark can "handle" this, but does so by recomputing data (expensive)
• You can't depend on leaving a program around to serve files when the server is just gone
• So we need to find a way to migrate the data

Ok sure the cloud, but K8s?
• Kubernetes doesn't like the idea of scheduling a stay-resident program on every node
• Also, most people don't like the idea of shared disk here either (across jobs/users)
• So we need to find a way to migrate the data

SPARK-20624
• Yee-haw!
• Ok, but more seriously, how does it work? Great question, let's open up the code
• BlockManagerDecommissioner.scala is where most of the magic happens
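
Before reading the real code, it may help to see the idea in miniature. The toy model below is not Spark's implementation, and every name in it (`Executor`, `decommission`, the example block ids) is hypothetical. It only illustrates the core move: when an executor is told it is going away, it hands its cached RDD blocks and shuffle blocks to surviving peers so they do not have to be recomputed.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Toy stand-in for an executor and the blocks it holds (hypothetical)."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> payload
    decommissioning: bool = False


def decommission(leaving, peers):
    """Move every block off `leaving` onto live peers, round-robin.

    Returns a {block id: peer name} map of where each block ended up.
    """
    leaving.decommissioning = True  # a real scheduler would stop sending tasks
    live = [p for p in peers if not p.decommissioning and p is not leaving]
    if not live:
        return {}  # nowhere to migrate; blocks would have to be recomputed
    placement = {}
    for i, block_id in enumerate(sorted(leaving.blocks)):
        target = live[i % len(live)]
        target.blocks[block_id] = leaving.blocks.pop(block_id)
        placement[block_id] = target.name
    return placement


a = Executor("exec-1", {"rdd_0_0": b"cached", "shuffle_0_1_0": b"map output"})
b = Executor("exec-2")
c = Executor("exec-3")
placement = decommission(a, [a, b, c])
print(placement)  # {'rdd_0_0': 'exec-2', 'shuffle_0_1_0': 'exec-3'}
```

The toy leaves out retries, failures, and capacity limits on the receiving side, and it simply gives up when no live peer exists.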

Collaboration
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Decommissioning-SPIP-td29701.html

https://github.com/apache/spark/pulls?q=is%3Apr+decommission+is%3Aclosed+

Ok what about the car?
Getting hit by a car sucks a lot
Slowed down dev work while I did rehab to be able
to walk & type again
Shout out to everyone who helped me recover
(from my wife, girlfriend, partners, my friends, to
the hospital staff, nursing home, PT, OT,
Ambulance, my employer for giving me time off,
the Spark community for understanding I needed
time off <3)

On a Happy Note: You can try this now
It’s early though so please be careful
• Enable the following:
- spark.decommission.enabled
- spark.storage.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
• Want to get fancy? Optionally enable:
- spark.shuffle.externalStorage.enabled
- And configure a storage backend (spark.shuffle.externalStorage.backend)
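
One way to try the four core settings listed above is to put them in `spark-defaults.conf`. The sketch below just renders them in that file's `key value` format. It is a minimal illustration, and the helper name `render_defaults` is made up. The optional external-storage settings are left out because their values depend on your backend.

```python
# The four decommissioning switches from the slide, all set to "true".
decommission_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
}


def render_defaults(conf):
    """Render settings as spark-defaults.conf lines: key, whitespace, value."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in conf.items())


defaults_text = render_defaults(decommission_conf)
print(defaults_text)
```

The same keys can also be passed one at a time with `spark-submit --conf key=value` if you would rather not touch the defaults file.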

Future work
• Heuristics to migrate data
• Improve container pre-emption selection
• Better heuristics around when to scale up and down containers

TM and © 2021 Apple Inc. All rights reserved.
