TL;DR: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and dozens of Spark job optimisations, including the most spectacular: from 2500 gigs of RAM to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, the amount of memory is increased. If processing goes too slowly, the number of executors is increased. This works for a while, but sooner or later you end up with a whole cluster fully utilized in an inefficient way.
During the presentation, we will share our lessons learned and the performance improvements we made to Spark jobs, including the most spectacular: from 2500 gigs of RAM to 240. We will also answer questions like:
- How do PySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions?
and many more.
11. “If a tree falls in the forest, there are other trees listening.”
12. Preempted executors
● Containers killed by YARN (exceeding soft limits on a queue)
● More executors do not necessarily speed up the processing! (see the sketch below)
● Check the YARN logs.
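One way to keep a greedy job from being preempted (or from starving its neighbours) is to cap dynamic allocation instead of letting it grab the whole queue. A minimal sketch with illustrative values; the configuration keys are standard Spark settings, but the numbers are not from the talk:

import org.apache.spark.sql.SparkSession

// Cap how many executors dynamic allocation may request;
// the external shuffle service is required for dynamic allocation on YARN.
val spark = SparkSession.builder()
  .appName("capped-job")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()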
14. “An organism that is too greedy and takes too much without giving anything in return destroys what it needs for life.”
15. Driver Issues
● Collecting too much data to the driver. Relevant defaults:
spark.driver.memory 1G
spark.driver.memoryOverhead max(spark.driver.memory * 0.10, 384M)
spark.driver.maxResultSize 1G
spark.driver.cores 1
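The simplest fix for driver trouble is not to ship big results to the driver at all. A minimal sketch of the usual alternatives, with a hypothetical dataset path and a SparkSession named spark:

val events = spark.read.parquet("events/")   // large distributed dataset

// Risky: collect() materialises everything on the driver and is bounded
// by spark.driver.maxResultSize and the driver's memory.
// val all = events.collect()

// Safer alternatives:
val sample = events.take(100)                // bounded sample for inspection
events.write.parquet("events_out/")          // keep large results distributed
val it = events.toLocalIterator()            // stream rows partition by partition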
16. Putting all debug information together
● Spark Web UI: available while a job is running; the Spark History Server lets you view the Spark Web UI after the job finishes. What data is processed? What does the execution plan look like? Are there any data skews?
● Application logs: check the application logs for possible errors and exceptions in the job logic (yarn logs --applicationId <applicationId>).
● YARN internal logs: check the logs of the YARN containers processing the job. Do they get preempted? Do they exceed the container memory limit?
17. CSV or JSON processing takes way too long
val samplingRatio = parameters.get("samplingRatio")
  .map(_.toDouble)
  .getOrElse(1.0)
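The snippet above, which appears to come from Spark's internal JSON option parsing, shows why: with no explicit samplingRatio, schema inference scans 100% of the input before the real work starts. A minimal sketch of the two usual fixes; the field names and path are hypothetical:

import org.apache.spark.sql.types._

// Option 1: infer the schema from a 10% sample of the data.
val sampled = spark.read
  .option("samplingRatio", "0.1")
  .json("events/")

// Option 2 (preferred): skip inference entirely with an explicit schema.
val schema = StructType(Seq(
  StructField("userId", StringType),
  StructField("eventTime", TimestampType),
  StructField("payload", StringType)
))
val events = spark.read.schema(schema).json("events/")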
24. [Diagram: Dataset A and Dataset B, each stored as daily partitions of user-activity rows; how should they be joined?]
25. How to join 30 days of A & B?
1. Join each element of B with the last element of A that precedes it.
2. The element of A must occur no more than 6 hours before the joined element of B.
3. Elements in different daily partitions need to be joined.
26. What about a clever coGroup?
Dataset A (events per user): A → [1], B → [1, 4], C → [2, 3]
Dataset B (events per user): A → [3], B → [1, 3], D → [2]
cogroup =
A → ([1], [3])
B → ([1, 4], [1, 3])
C → ([2, 3], [])
D → ([], [2])
● groupBy - group both datasets by user id
● cogroup - get all the events of A and B for a single user
● sort the events and filter the ones that match the business logic (see the sketch below)
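A minimal sketch of this cogroup approach on RDDs; the Event type and the input RDDs (eventsA, eventsB) are hypothetical stand-ins for the real datasets:

import org.apache.spark.rdd.RDD

case class Event(userId: String, ts: Long)   // ts in epoch milliseconds

def joinLastBefore(eventsA: RDD[Event], eventsB: RDD[Event]): RDD[(String, Event, Event)] = {
  val sixHoursMs = 6 * 3600 * 1000L
  val aByUser = eventsA.map(e => (e.userId, e))
  val bByUser = eventsB.map(e => (e.userId, e))

  aByUser.cogroup(bByUser).flatMap { case (user, (as, bs)) =>
    val sortedA = as.toSeq.sortBy(_.ts)
    // For every B event, take the last A event that precedes it,
    // but only if it happened at most 6 hours earlier.
    bs.flatMap { b =>
      sortedA.takeWhile(_.ts <= b.ts)
             .lastOption
             .filter(a => b.ts - a.ts <= sixHoursMs)
             .map(a => (user, a, b))
    }
  }
}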
28. [Diagram: for each day X, union the Dataset A partitions for day X and day X + 1, join the result with the Dataset B partition for day X, then increment X; iterate this transformation over the 30 days.]
29. Story of some job
● Do not infer schema: define the schema when reading JSON data and switch off schema inference.
● Rewrite the join: turn the 30-day join into an iteration of 30 daily transformations (sketched below).
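A minimal sketch of that rewrite, assuming a SparkSession named spark; the partition paths, column names and the exact pair of A partitions to union are illustrative assumptions, not the original job's code:

import java.time.LocalDate
import org.apache.spark.sql.functions.expr

val startDate = LocalDate.parse("2018-01-01")   // hypothetical start of the window

val dailyResults = (0 until 30).map { i =>
  val day = startDate.plusDays(i)
  // Union the adjacent daily partitions of A that can contain matches for B's day.
  val a = spark.read.parquet(s"dataset_a/dt=$day", s"dataset_a/dt=${day.plusDays(1)}")
  val b = spark.read.parquet(s"dataset_b/dt=$day")
  // Keep only A events that precede the B event by at most 6 hours.
  b.join(a, b("userId") === a("userId")
    && a("ts") <= b("ts")
    && a("ts") >= b("ts") - expr("INTERVAL 6 HOURS"))
}

val result = dailyResults.reduce(_ union _)   // 30 small joins instead of one huge one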
30. 10 days join: Before → After
Number of executors: 50 → 20
Execution time: 98 mins 23 sec → 24 mins 22 sec
YARN MB-seconds: 2,768,187,887 → 229,664,298
YARN VCore-seconds: 270,235 → 22,418
31. Key Takeaways
#1 Spark performance may vary a lot.
#2 You never know how good your job is.
#3 Experiment with runtime parameters.
32. “So many questions remain unanswered. Perhaps we are poorer for having lost a possible explanation or richer for having gained a mystery. But aren't both possibilities equally intriguing?”