Running a stream in a development environment is relatively easy. However, several topics can cause serious issues in production when they are not addressed properly.
3. About the Speakers
Stefan van Wouw, Sr. Resident Solutions Architect, Databricks
Max Thöne, Resident Solutions Architect, Databricks
4. This talk
Part 1
▪ Introduction: what parts of a stream should be tuned
▪ Input Parameters: optimal mini-batch size
Part 2
▪ State Parameters: limiting the state dimension
▪ Output Parameters: do not be a bully for downstream jobs
▪ Deployment: considerations after deploying to PROD
6. Suppose we have a stream set up like this
Message Source based stream (Structured Streaming):
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json("json", schema).alias("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "...")
  .start()
7. Or a stream like this
File Source based stream (Structured Streaming):
spark
  .readStream
  .format("delta")
  .load("/salesDeltaIn/")
  .withColumn("item_id", col("data.item_id"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTableOut/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "...")
  .start()
8. Maybe even a stream like this (joins)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
spark
  .readStream
  ...
  .join(itemDF, "item_id")
  ...
  .writeStream
  ...
  .start()
9. Maybe even a stream like this (stateful operations)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
spark
  .readStream
  ...
  .groupBy("item_id")
  .count()
  ...
  .writeStream
  ...
  .start()
10. Scale dimensions
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
12. Let's use this example!
1. Main input stream
salesSDF = (
  spark
  .readStream
  .format("delta")
  .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
  .join(spark.table("items"), "item_id")
)
3. Aggregate sales per item category per hour
itemSalesPerHourSDF = (
  itemSalesSDF
  .groupBy(window(..., "1 hour"), "item_category")
  .sum("revenue")
)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
14. Limiting the input dimension
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
▪ Limit n in O(n⨯m)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
15. Why are input parameters important?
▪ Allows you to control the mini-batch size.
▪ Optimal mini-batch size → Optimal cluster usage.
▪ Suboptimal mini-batch size → performance cliff.
▪ Shuffle Spill
▪ Different Query Plan (Sort Merge Join vs Broadcast Join)
16. What input parameters are we talking about?
File Source
▪ Any: maxFilesPerTrigger
▪ Delta Lake: + maxBytesPerTrigger
Message Source
▪ Kafka: maxOffsetsPerTrigger
▪ Kinesis: fetchBufferSize
▪ EventHubs: maxEventsPerTrigger
These parameters:
▪ Control the size of each mini-batch
▪ Are especially important in relation to shuffle partitions
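As a sketch of how these rate limits are applied (option names as in the Spark and Delta docs; the broker address, topic, and numeric values below are illustrative assumptions, not recommendations):

```python
# Rate-limit options per source type, collected in dicts so they are
# easy to inspect. All values are illustrative.
kafka_opts = {
    "kafka.bootstrap.servers": "host:9092",   # hypothetical broker
    "subscribe": "topic",
    "maxOffsetsPerTrigger": "10000",          # cap records per mini-batch
}
delta_opts = {
    "maxFilesPerTrigger": "6",                # cap files per mini-batch
    "maxBytesPerTrigger": "1g",               # Delta-only byte cap
}

# Usage (requires an active SparkSession; shown for illustration):
# spark.readStream.format("kafka").options(**kafka_opts).load()
# spark.readStream.format("delta").options(**delta_opts).load("/salesDeltaIn/")
```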
17. Input Parameters Example: Stream-Static Join
What is a Stream-Static join?
▪ Joining a streaming DataFrame to a static DataFrame
▪ Induces a shuffle step.
1. Main input stream
salesSDF = (
  spark
  .readStream
  .format("delta")
  .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
  .join(spark.table("items"), "item_id")
)
18. Input Parameters: Not tuning maxFilesPerTrigger
What will happen when not setting maxFilesPerTrigger?
▪ For Delta: Default option is 1000 files. Each file is ~200 MB.
▪ For Message and other File based input: Default option is unlimited.
▪ Leads to a massive mini-batch!
▪ When you have shuffle operations → Spill.
19. Input Parameters: Tuning maxFilesPerTrigger
Base it on shuffle partition size
▪ Rule of thumb 1: optimal shuffle partition size is ~100-200 MB
▪ Rule of thumb 2: set shuffle partitions equal to the number of cores (here: 20).
▪ Use the Spark UI to tune maxFilesPerTrigger until you get ~100-200 MB per partition.
▪ Note: size on disk is not a good proxy for size in memory, because the (compressed) file size differs from the decompressed size the data takes up in cluster memory.
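The rule-of-thumb arithmetic can be sketched as a small helper (illustrative only; the in-memory size per file must be estimated from the Spark UI, and the 500 MB figure in the example call is an assumption):

```python
def files_per_trigger(target_mb_per_partition, n_shuffle_partitions,
                      est_mb_in_memory_per_file):
    """Estimate maxFilesPerTrigger so each shuffle partition lands in the
    target size band. Rule-of-thumb arithmetic, not a Spark API."""
    target_batch_mb = target_mb_per_partition * n_shuffle_partitions
    return max(1, round(target_batch_mb / est_mb_in_memory_per_file))

# 20 shuffle partitions x 150 MB target, ~500 MB per file in memory:
# files_per_trigger(150, 20, 500) gives 6, matching the tuned value
# on the next slide.
```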
20. Tuning maxFilesPerTrigger: Result
Significant performance improvement by removing spill
▪ maxFilesPerTrigger tuned to 6 files.
▪ Shuffle partitions tuned to 20.
▪ Processed records/second increased by 30%
21. Sort Merge Join vs Broadcast Hash Join
We are not done yet!
▪ Currently we use a Sort Merge Join.
▪ Our static DF is small enough to broadcast it.
▪ Leads to 70% increased throughput!
▪ Can also increase maxFilesPerTrigger
▪ Because of no more risk of Shuffle Spill (shuffles were removed)
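A sketch of both ways to get the broadcast plan (the Spark lines need an active SparkSession and are therefore shown commented; "items" is the static lookup table from the running example):

```python
# 1) Force a Broadcast Hash Join with an explicit hint:
#
#   from pyspark.sql.functions import broadcast
#   itemSalesSDF = salesSDF.join(broadcast(spark.table("items")), "item_id")
#
# 2) Or raise the auto-broadcast threshold so Spark chooses the broadcast
#    plan itself (the default threshold is 10 MB); 64 MB here is an
#    illustrative value, not a recommendation:
conf = {"spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024)}
# spark.conf.set(*next(iter(conf.items())))  # apply on a real session
```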
23. Input Parameters: Summary
Main takeaways
▪ Set shuffle partitions to # cores (assuming no skew)
▪ Tune maxFilesPerTrigger so you end up with 150-200 MB per shuffle partition
▪ Try to make use of broadcasting whenever possible
26. Limiting the state dimension
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
▪ Limit m in O(n⨯m)
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
27. Limiting the state dimension
What we mean by state
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
▪ State Store backed operations
  ▪ Stateful (windowed) aggregations
  ▪ Drop duplicates
  ▪ Stream-Stream Joins
▪ Delta Lake table or external system
  ▪ Stream-Static Join / MERGE
28. Why are state parameters important?
▪ Optimal parameters → Optimal cluster usage
▪ If not controlled, state explosion can occur
▪ Slower stream performance over time
▪ Heavy shuffle spill (Joins/MERGE)
▪ Out of memory errors (State Store backed operations)
29. What parameters are we talking about?
State Store specific
▪ How much history to compare against (watermarking)
▪ What state store backend to use (RocksDB / Default)
State Store agnostic (Stream-Static Join / MERGE)
▪ How much history to compare against (query predicate)
30. State parameters example
▪ Extending the earlier code sample with stateful aggregation
▪ E.g. calculating the number of sales per item category per hour
▪ Two types of state dimension here:
  a. Static side of the stream-static join (items)
  b. State Store backed operation (windowed stateful aggregation)
1. Main input stream
salesSDF = (
  spark
  .readStream
  .format("delta")
  .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
  .join(spark.table("items"), "item_id")
)
3. Aggregate sales per item category per hour
itemSalesPerHourSDF = (
  itemSalesSDF
  .groupBy(window(..., "1 hour"), "item_category")
  .sum("revenue")
)
32. State Parameters: Summary
Main takeaways
▪ Limit state accumulation with appropriate watermark
▪ The more granular the aggregate key / window, the more state
▪ Delta Backed State might provide more flexibility at cost of latency
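The watermark takeaway in PySpark, plus a tiny pure-Python model of the eviction it enables (the "event_time" column name and the 2-hour delay are assumptions for illustration; the Spark lines need a streaming DataFrame and are shown commented):

```python
from datetime import datetime, timedelta

# PySpark form of "limit state accumulation with an appropriate watermark":
#
#   itemSalesPerHourSDF = (
#       itemSalesSDF
#       .withWatermark("event_time", "2 hours")   # bound state retention
#       .groupBy(window("event_time", "1 hour"), "item_category")
#       .sum("revenue")
#   )
#
# Model of what the watermark does to state: a window's state can be
# dropped once (max observed event time - delay) passes the window end.
def windows_kept(window_ends, max_event_time, delay):
    watermark = max_event_time - delay
    return [w for w in window_ends if w > watermark]
```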
34. How output parameters influence the scale dimensions
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
[Diagram: input mini-batch (RECORD 1..N) → Structured Streaming ← state (RECORD 1..M)]
35. Why are output parameters important?
▪ Streaming jobs tend to create many small files
▪ Reading a folder with many small files is slow
▪ Degrading performance for downstream jobs / self-joins
36. What Output parameters are we talking about?
▪ Manually using repartition
▪ Delta Lake: Auto-Optimize
https://docs.databricks.com/delta/optimizations/auto-optimize.html
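A sketch of both options (the Databricks property names are taken from the linked Auto Optimize docs; verify them against your runtime version, and the partition count of 20 is an illustrative assumption):

```python
# 1) Delta Lake Auto Optimize via session properties:
conf = {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}
# for k, v in conf.items():
#     spark.conf.set(k, v)   # apply on a real session
#
# 2) Or manually coalesce before the sink so each mini-batch writes
#    fewer, larger files (needs a streaming DataFrame; shown commented):
#
#   itemSalesPerHourSDF.repartition(20).writeStream...
```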
38. Output Parameters: Summary
Main takeaways
▪ A high number of small files impacts performance
▪ A 10x speed difference can easily be demonstrated
39. How to keep your streams performant after deployment
40. Multiple streams per Spark cluster
▪ Some small streams do not warrant their own cluster
▪ Packing them together in one Spark application might be a good option, but they then share the driver process, which has a performance impact
[Diagram: one Spark application hosting multiple Structured Streaming queries]
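Packing streams into one application can be sketched as follows (the Spark calls need an active SparkSession and source DataFrames, so they are shown commented; the pool names, paths, and query variables are hypothetical):

```python
# Run several streaming queries in one Spark application. Giving each
# query its own fair-scheduler pool keeps one heavy stream from starving
# the others:
#
#   spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stream_a")
#   q1 = df_a.writeStream.option("checkpointLocation", "/chk/a").start("/out/a")
#   spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stream_b")
#   q2 = df_b.writeStream.option("checkpointLocation", "/chk/b").start("/out/b")
#
#   spark.streams.awaitAnyTermination()  # block until any query stops
#
# Each query must have its own checkpoint location; they can never share one:
checkpoints = {"stream_a": "/chk/a", "stream_b": "/chk/b"}
assert len(set(checkpoints.values())) == len(checkpoints)
```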
41. Temporary changes to load (elasticity)
▪ Temporarily scaling up a streaming cluster to handle a backlog
▪ Scaling out only helps while #cores <= #shuffle partitions
42. Permanent changes to load (capacity planning)
▪ A permanent load increase warrants capacity planning
▪ Requires a checkpoint wipe-out, since the number of shuffle partitions is fixed per checkpoint location!
▪ Think of a strategy to recover state (if necessary)
44. Summary
Input Parameters
▪ Limit input size
▪ Tune shuffle partitions / cores (30% faster)
▪ Enforce broadcasting when possible (2x faster)
State Parameters
▪ Limit state accumulation
▪ Limit how far you look back (history)
Output Parameters
▪ Prevent generating many small files (10x faster)
Deployment
▪ Capacity planning is needed due to deployment-bound parameters
▪ Have a strategy for checkpoint reset