TL;DR: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and dozens of Spark job optimisations, including the most spectacular: from 2500 gigs of RAM to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, the amount of memory is increased. If processing goes too slowly, the number of executors is increased. This works for a while, but sooner or later you end up with a whole cluster fully utilized in an inefficient way.
During the presentation, we will share our lessons learned and the performance improvements we made to Spark jobs, including the most spectacular: from 2500 gigs of RAM to 240. We will also answer questions like:
- How do PySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions?
and many more.
11. “If a tree falls in the forest, there are other trees listening.”
12. Preempted executors
● Containers killed by YARN (exceeding soft limits on a queue)
● More executors do not necessarily speed up the processing! (see the sketch below)
● Check the YARN logs.
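One way to keep a greedy job from being preempted (or from starving its neighbours) is to cap dynamic allocation instead of letting it grab the whole queue. A minimal sketch with illustrative values; the configuration keys are standard Spark settings, but the numbers are not from the talk:

import org.apache.spark.sql.SparkSession

// Cap how many executors dynamic allocation may request;
// the external shuffle service is required for dynamic allocation on YARN.
val spark = SparkSession.builder()
  .appName("capped-job")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()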
14. “An organism that is too greedy and takes too much without giving anything in return destroys what it needs for life.”
15. Driver Issues
● Collecting too much data to the driver. Relevant defaults:
spark.driver.memory 1G
spark.driver.memoryOverhead max(spark.driver.memory * 0.10, 384M)
spark.driver.maxResultSize 1G
spark.driver.cores 1
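The simplest fix for driver trouble is not to ship big results to the driver at all. A minimal sketch of the usual alternatives, with a hypothetical dataset path and a SparkSession named spark:

val events = spark.read.parquet("events/")   // large distributed dataset

// Risky: collect() materialises everything on the driver and is bounded
// by spark.driver.maxResultSize and the driver's memory.
// val all = events.collect()

// Safer alternatives:
val sample = events.take(100)                // bounded sample for inspection
events.write.parquet("events_out/")          // keep large results distributed
val it = events.toLocalIterator()            // stream rows partition by partition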
16. Putting all debug information together
● Spark Web UI: available while a job is running; the Spark History Server lets you view the Spark Web UI after the job finishes. What data is processed? What does the execution plan look like? Are there any data skews?
● Application logs: check the application logs for possible errors and exceptions in the job logic (yarn logs --applicationId <applicationId>).
● YARN internal logs: check the logs of the YARN containers processing the job. Do they get preempted? Do they exceed the container memory limit?
17. CSV or JSON processing takes way too long
val samplingRatio = parameters.get("samplingRatio")
  .map(_.toDouble)
  .getOrElse(1.0)
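The snippet above, which appears to come from Spark's internal JSON option parsing, shows why: with no explicit samplingRatio, schema inference scans 100% of the input before the real work starts. A minimal sketch of the two usual fixes; the field names and path are hypothetical:

import org.apache.spark.sql.types._

// Option 1: infer the schema from a 10% sample of the data.
val sampled = spark.read
  .option("samplingRatio", "0.1")
  .json("events/")

// Option 2 (preferred): skip inference entirely with an explicit schema.
val schema = StructType(Seq(
  StructField("userId", StringType),
  StructField("eventTime", TimestampType),
  StructField("payload", StringType)
))
val events = spark.read.schema(schema).json("events/")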
24. [Diagram: Dataset A and Dataset B, each stored as daily partitions of user-activity rows; how should they be joined?]
25. How to join 30 days of A & B?
1. Join each element of B with the last element of A that precedes it.
2. The element of A must occur no more than 6 hours before the joined element of B.
3. Elements in different daily partitions need to be joined.
26. What about a clever coGroup?
Dataset A (events per user): A → [1], B → [1, 4], C → [2, 3]
Dataset B (events per user): A → [3], B → [1, 3], D → [2]
cogroup =
A → ([1], [3])
B → ([1, 4], [1, 3])
C → ([2, 3], [])
D → ([], [2])
● groupBy - group both datasets by user id
● cogroup - get all the events of A and B for a single user
● sort the events and filter the ones that match the business logic (see the sketch below)
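A minimal sketch of this cogroup approach on RDDs; the Event type and the input RDDs (eventsA, eventsB) are hypothetical stand-ins for the real datasets:

import org.apache.spark.rdd.RDD

case class Event(userId: String, ts: Long)   // ts in epoch milliseconds

def joinLastBefore(eventsA: RDD[Event], eventsB: RDD[Event]): RDD[(String, Event, Event)] = {
  val sixHoursMs = 6 * 3600 * 1000L
  val aByUser = eventsA.map(e => (e.userId, e))
  val bByUser = eventsB.map(e => (e.userId, e))

  aByUser.cogroup(bByUser).flatMap { case (user, (as, bs)) =>
    val sortedA = as.toSeq.sortBy(_.ts)
    // For every B event, take the last A event that precedes it,
    // but only if it happened at most 6 hours earlier.
    bs.flatMap { b =>
      sortedA.takeWhile(_.ts <= b.ts)
             .lastOption
             .filter(a => b.ts - a.ts <= sixHoursMs)
             .map(a => (user, a, b))
    }
  }
}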
28. [Diagram: for each day X, union the Dataset A partitions for day X and day X + 1, join the result with the Dataset B partition for day X, then increment X; iterate this transformation over the 30 days.]
29. Story of some job
● Do not infer schema: define the schema when reading JSON data and switch off schema inference.
● Rewrite the join: turn the 30-day join into an iteration of 30 daily transformations (sketched below).
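A minimal sketch of that rewrite, assuming a SparkSession named spark; the partition paths, column names and the exact pair of A partitions to union are illustrative assumptions, not the original job's code:

import java.time.LocalDate
import org.apache.spark.sql.functions.expr

val startDate = LocalDate.parse("2018-01-01")   // hypothetical start of the window

val dailyResults = (0 until 30).map { i =>
  val day = startDate.plusDays(i)
  // Union the adjacent daily partitions of A that can contain matches for B's day.
  val a = spark.read.parquet(s"dataset_a/dt=$day", s"dataset_a/dt=${day.plusDays(1)}")
  val b = spark.read.parquet(s"dataset_b/dt=$day")
  // Keep only A events that precede the B event by at most 6 hours.
  b.join(a, b("userId") === a("userId")
    && a("ts") <= b("ts")
    && a("ts") >= b("ts") - expr("INTERVAL 6 HOURS"))
}

val result = dailyResults.reduce(_ union _)   // 30 small joins instead of one huge one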
30. 10 days join: Before → After
Number of executors: 50 → 20
Execution time: 98 mins 23 sec → 24 mins 22 sec
YARN MB-seconds: 2,768,187,887 → 229,664,298
YARN VCore-seconds: 270,235 → 22,418
31. Key Takeaways
#1 Spark performance may vary a lot.
#2 You never know how good your job is.
#3 Experiment with runtime parameters.
32. “So many questions remain unanswered. Perhaps we are poorer for having lost a possible explanation or richer for having gained a mystery. But aren't both possibilities equally intriguing?”