Running a stream in a development environment is relatively easy. However, several topics can cause serious issues in production when they are not addressed properly.
3. About the Speakers
Stefan van Wouw, Sr. Resident Solutions Architect, Databricks
Max Thöne, Resident Solutions Architect, Databricks
4. This talk
Part 1
▪ Introduction: what parts of a stream should be tuned
▪ Input Parameters: optimal mini-batch size
Part 2
▪ State Parameters: limiting the state dimension
▪ Output Parameters: do not be a bully for downstream jobs
▪ Deployment: considerations after deploying to PROD
6. Suppose we have a stream set up like this
Message Source based stream (Structured Streaming):
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json("json", schema).alias("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "...")
  .start()
7. Or a stream like this
File Source based stream (Structured Streaming):
spark
  .readStream
  .format("delta")
  .load("/salesDeltaIn/")
  .withColumn("item_id", col("data.item_id"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTableOut/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "...")
  .start()
8. Maybe even a stream like this (joins)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
spark
  .readStream
  ...
  .join(itemDF, "item_id")
  ...
  .writeStream
  ...
  .start()
9. Maybe even a stream like this (stateful operations)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
spark
  .readStream
  ...
  .groupBy("item_id")
  .count()
  ...
  .writeStream
  ...
  .start()
10. Scale dimensions
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
12. Let's use this example!
1. Main input stream
salesSDF = (
  spark
  .readStream
  .format("delta")
  .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
  .join(spark.table("items"), "item_id")
)
3. Aggregate sales per item category per hour
itemSalesPerHourSDF = (
  itemSalesSDF
  .groupBy(window(..., "1 hour"), "item_category")
  .sum("revenue")
)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
14. Limiting the input dimension
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
▪ Limit n in O(n⨯m)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
15. Why are input parameters important?
▪ Allows you to control the mini-batch size.
▪ Optimal mini-batch size → Optimal cluster usage.
▪ Suboptimal mini-batch size → performance cliff.
▪ Shuffle Spill
▪ Different Query Plan (Sort Merge Join vs Broadcast Join)
16. What input parameters are we talking about?
File Source
▪ Any: maxFilesPerTrigger
▪ Delta Lake: + maxBytesPerTrigger
Message Source
▪ Kafka: maxOffsetsPerTrigger
▪ Kinesis: fetchBufferSize
▪ EventHubs: maxEventsPerTrigger
These parameters:
▪ Control the size of each mini-batch
▪ Are especially important in relation to shuffle partitions
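As a sketch of how these rate limits are applied (option names as in the Spark and Delta docs; the broker address, topic, and numeric values below are illustrative assumptions, not recommendations):

```python
# Rate-limit options per source type, collected in dicts so they are
# easy to inspect. All values are illustrative.
kafka_opts = {
    "kafka.bootstrap.servers": "host:9092",   # hypothetical broker
    "subscribe": "topic",
    "maxOffsetsPerTrigger": "10000",          # cap records per mini-batch
}
delta_opts = {
    "maxFilesPerTrigger": "6",                # cap files per mini-batch
    "maxBytesPerTrigger": "1g",               # Delta-only byte cap
}

# Usage (requires an active SparkSession; shown for illustration):
# spark.readStream.format("kafka").options(**kafka_opts).load()
# spark.readStream.format("delta").options(**delta_opts).load("/salesDeltaIn/")
```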
17. Input Parameters Example: Stream-Static Join
What is a Stream-Static join?
▪ Joining a streaming DataFrame to a static DataFrame
▪ Induces a shuffle step.
1. Main input stream
salesSDF = (
  spark
  .readStream
  .format("delta")
  .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
  .join(spark.table("items"), "item_id")
)
18. Input Parameters: Not tuning maxFilesPerTrigger
What will happen when not setting maxFilesPerTrigger?
▪ For Delta: Default option is 1000 files. Each file is ~200 MB.
▪ For Message and other File based input: Default option is unlimited.
▪ Leads to a massive mini-batch!
▪ When you have shuffle operations → Spill.
19. Input Parameters: Tuning maxFilesPerTrigger
Base it on shuffle partition size
▪ Rule of thumb 1: optimal shuffle partition size is ~100-200 MB
▪ Rule of thumb 2: set shuffle partitions equal to the number of cores (here: 20).
▪ Use the Spark UI to tune maxFilesPerTrigger until you get ~100-200 MB per partition.
▪ Note: size on disk is not a good proxy for size in memory, because the (compressed) file size differs from the decompressed size the data takes up in cluster memory.
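The rule-of-thumb arithmetic can be sketched as a small helper (illustrative only; the in-memory size per file must be estimated from the Spark UI, and the 500 MB figure in the example call is an assumption):

```python
def files_per_trigger(target_mb_per_partition, n_shuffle_partitions,
                      est_mb_in_memory_per_file):
    """Estimate maxFilesPerTrigger so each shuffle partition lands in the
    target size band. Rule-of-thumb arithmetic, not a Spark API."""
    target_batch_mb = target_mb_per_partition * n_shuffle_partitions
    return max(1, round(target_batch_mb / est_mb_in_memory_per_file))

# 20 shuffle partitions x 150 MB target, ~500 MB per file in memory:
# files_per_trigger(150, 20, 500) gives 6, matching the tuned value
# on the next slide.
```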
20. Tuning maxFilesPerTrigger: Result
Significant performance improvement by removing spill
▪ maxFilesPerTrigger tuned to 6 files.
▪ Shuffle partitions tuned to 20.
▪ Processed records/second increased by 30%
21. Sort Merge Join vs Broadcast Hash Join
We are not done yet!
▪ Currently we use a Sort Merge Join.
▪ Our static DF is small enough to broadcast it.
▪ Leads to 70% increased throughput!
▪ Can also increase maxFilesPerTrigger
▪ Because of no more risk of Shuffle Spill (shuffles were removed)
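A sketch of both ways to get the broadcast plan (the Spark lines need an active SparkSession and are therefore shown commented; "items" is the static lookup table from the running example):

```python
# 1) Force a Broadcast Hash Join with an explicit hint:
#
#   from pyspark.sql.functions import broadcast
#   itemSalesSDF = salesSDF.join(broadcast(spark.table("items")), "item_id")
#
# 2) Or raise the auto-broadcast threshold so Spark chooses the broadcast
#    plan itself (the default threshold is 10 MB); 64 MB here is an
#    illustrative value, not a recommendation:
conf = {"spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024)}
# spark.conf.set(*next(iter(conf.items())))  # apply on a real session
```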
23. Input Parameters: Summary
Main takeaways
▪ Set shuffle partitions to # cores (assuming no skew)
▪ Tune maxFilesPerTrigger so you end up with 150-200 MB per shuffle partition
▪ Try to make use of broadcasting whenever possible
26. Limiting the state dimension
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
▪ Limit m in O(n⨯m)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
27. Limiting the state dimension
What we mean by state
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
▪ State Store backed operations
  ▪ Stateful (windowed) aggregations
  ▪ Drop duplicates
  ▪ Stream-Stream Joins
▪ Delta Lake table or external system
  ▪ Stream-Static Join / MERGE
28. Why are state parameters important?
▪ Optimal parameters → Optimal cluster usage
▪ If not controlled, state explosion can occur
▪ Slower stream performance over time
▪ Heavy shuffle spill (Joins/MERGE)
▪ Out of memory errors (State Store backed operations)
29. What parameters are we talking about?
State Store specific
▪ How much history to compare against (watermarking)
▪ What state store backend to use (RocksDB / Default)
State Store agnostic (Stream-Static Join / MERGE)
▪ How much history to compare against (query predicate)
30. State parameters example
▪ Extending the earlier code sample with stateful aggregation
▪ E.g. calculating the number of sales per item category per hour
▪ Two types of state dimension here:
  a. Static side of the stream-static join (items)
  b. State Store backed operation (windowed stateful aggregation)
1. Main input stream
salesSDF = (
  spark
  .readStream
  .format("delta")
  .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
  .join(spark.table("items"), "item_id")
)
3. Aggregate sales per item category per hour
itemSalesPerHourSDF = (
  itemSalesSDF
  .groupBy(window(..., "1 hour"), "item_category")
  .sum("revenue")
)
32. State Parameters: Summary
Main takeaways
▪ Limit state accumulation with appropriate watermark
▪ The more granular the aggregate key / window, the more state
▪ Delta Backed State might provide more flexibility at cost of latency
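The watermark takeaway in PySpark, plus a tiny pure-Python model of the eviction it enables (the "event_time" column name and the 2-hour delay are assumptions for illustration; the Spark lines need a streaming DataFrame and are shown commented):

```python
from datetime import datetime, timedelta

# PySpark form of "limit state accumulation with an appropriate watermark":
#
#   itemSalesPerHourSDF = (
#       itemSalesSDF
#       .withWatermark("event_time", "2 hours")   # bound state retention
#       .groupBy(window("event_time", "1 hour"), "item_category")
#       .sum("revenue")
#   )
#
# Model of what the watermark does to state: a window's state can be
# dropped once (max observed event time - delay) passes the window end.
def windows_kept(window_ends, max_event_time, delay):
    watermark = max_event_time - delay
    return [w for w in window_ends if w > watermark]
```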
34. How output parameters influence the scale dimensions
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
35. Why are output parameters important?
▪ Streaming jobs tend to create many small files
▪ Reading a folder with many small files is slow
▪ Degrading performance for downstream jobs / self-joins
36. What Output parameters are we talking about?
▪ Manually using repartition
▪ Delta Lake: Auto-Optimize
https://docs.databricks.com/delta/optimizations/auto-optimize.html
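A sketch of both options (the Databricks property names are taken from the linked Auto Optimize docs; verify them against your runtime version, and the partition count of 20 is an illustrative assumption):

```python
# 1) Delta Lake Auto Optimize via session properties:
conf = {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}
# for k, v in conf.items():
#     spark.conf.set(k, v)   # apply on a real session
#
# 2) Or manually coalesce before the sink so each mini-batch writes
#    fewer, larger files (needs a streaming DataFrame; shown commented):
#
#   itemSalesPerHourSDF.repartition(20).writeStream...
```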
38. Output Parameters: Summary
Main takeaways
▪ A high number of small files impacts performance
▪ A 10x speed difference can easily be demonstrated
39. How to keep your streams performant after deployment
40. Multiple streams per Spark cluster
▪ Some small streams do not warrant their own cluster
▪ Packing them together in one Spark application might be a good option, but they then share the driver process, which has a performance impact
[Diagram: one Spark application hosting multiple Structured Streaming queries]
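Packing streams into one application can be sketched as follows (the Spark calls need an active SparkSession and source DataFrames, so they are shown commented; the pool names, paths, and query variables are hypothetical):

```python
# Run several streaming queries in one Spark application. Giving each
# query its own fair-scheduler pool keeps one heavy stream from starving
# the others:
#
#   spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stream_a")
#   q1 = df_a.writeStream.option("checkpointLocation", "/chk/a").start("/out/a")
#   spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stream_b")
#   q2 = df_b.writeStream.option("checkpointLocation", "/chk/b").start("/out/b")
#
#   spark.streams.awaitAnyTermination()  # block until any query stops
#
# Each query must have its own checkpoint location; they can never share one:
checkpoints = {"stream_a": "/chk/a", "stream_b": "/chk/b"}
assert len(set(checkpoints.values())) == len(checkpoints)
```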
41. Temporary changes to load (elasticity)
▪ Temporarily scaling up a streaming cluster to handle a backlog
▪ Scaling out only helps while #cores <= #shuffle partitions
42. Permanent changes to load (capacity planning)
▪ A permanent load increase warrants capacity planning
▪ Requires a checkpoint wipe-out, since the number of shuffle partitions is fixed per checkpoint location!
▪ Think of a strategy to recover state (if necessary)
44. Summary
Input Parameters
▪ Limit input size
▪ Tune shuffle partitions / cores (30% faster)
▪ Enforce broadcasting when possible (2x faster)
State Parameters
▪ Limit state accumulation
▪ Limit how far you look back (history)
Output Parameters
▪ Prevent generating many small files (10x faster)
Deployment
▪ Capacity planning is needed due to deployment-bound parameters
▪ Have a strategy for checkpoint reset