Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests terabytes of data a day and is petabytes in size. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion through processing. We want to share some of the lessons we learned the hard way as we reached this scale, specifically with Structured Streaming.
Know thy Lag
While consuming off a Kafka topic that sees sporadic loads, it's very important to monitor the consumer lag. It also makes you respect what a beast backpressure is. (A small monitoring sketch follows the agenda below.)
Reading Data In
Fan Out Pattern using minPartitions to Use Kafka Efficiently
Overload protection using maxOffsetsPerTrigger
More Apache Spark Settings used to optimize Throughput
MicroBatching Best Practices
map() + foreach() vs. mapPartitions() + foreachPartition()
Apache Spark Speculation and its Effects
Calculating Streaming Statistics
Windowing
Importance of the State Store
RocksDB FTW
Broadcast joins
Custom Aggregators
OffHeap Counters using Redis
Pipelining
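As referenced above, here is a minimal sketch of how we can surface per-batch throughput from inside the job using Spark's StreamingQueryListener, so lag trends are visible alongside whatever Kafka-side monitoring is in place. The println sink is a stand-in for a real metrics system, and an existing SparkSession named `spark` is assumed.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Sketch: log per-batch input vs. processing rate so growing lag is easy to spot.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // If processedRowsPerSecond stays below inputRowsPerSecond, lag is growing.
    println(s"batch=${p.batchId} rows=${p.numInputRows} " +
            s"inputRate=${p.inputRowsPerSecond} processedRate=${p.processedRowsPerSecond}")
  }
})
```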
11. Read In
What can we optimize way upstream?
▪ maxOffsetsPerTrigger
  Determine what QPS you want to hit
  Observe your QPS
▪ minPartitions
  ▪ Enables a fan-out processing pattern
  ▪ Maps 1 Kafka partition to multiple sub-partitions
▪ Executor resources
  Keep these constant
  Rinse and repeat until you know your throughput per core
▪ Make sure processingTime <= triggerInterval
  If it's <<< trigger interval, you have headroom to grow in QPS
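Putting the read-side knobs together, here is a minimal sketch. The topic name, broker list, and tuning values are illustrative; maxOffsetsPerTrigger and minPartitions are the standard options of the Spark Kafka source.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the read side. Topic, brokers, and numbers are illustrative.
val spark = SparkSession.builder().appName("profile-ingest").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "profile-events")
  // Overload protection: cap how many offsets one micro-batch may consume.
  .option("maxOffsetsPerTrigger", "1000000")
  // Fan-out: split each Kafka partition into multiple Spark partitions.
  .option("minPartitions", "600")
  .load()
```

On the write side the trigger interval is set with Trigger.ProcessingTime (e.g. "60 seconds"); keep the measured processing time at or below it, and if it sits far below it, there is headroom to raise maxOffsetsPerTrigger.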
13. MicroBatch Hard! Logic Best Practices
map() + foreach()
Pros
▪ Easy to code
Cons
▪ Slow!
▪ No local aggregation; you have to specify an explicit combiner
▪ Too many individual tasks
▪ Hard to get connection management right

mapPartitions() + foreachBatch()
Pros
▪ Explicit connection management
▪ Allows for good batching and re-use
▪ Local aggregations using HashMaps at the partition level
Cons
▪ Needs more upfront memory
▪ OOM until tuning is done
▪ Uglier to visualize
▪ Might need some extra CPU per task
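A sketch of the partition-level pattern, under assumptions: `events` is the streaming DataFrame from the read above (already parsed into an eventType column), and `ProfileStore` with its writeBatch method is a hypothetical stand-in for the real sink client. The point is one connection and one local HashMap per partition rather than per record.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.mutable

// ProfileStore.connect() / writeBatch() / close() are assumed, not a real API.
events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.rdd.foreachPartition { rows: Iterator[Row] =>
      val client = ProfileStore.connect()                     // one connection per partition
      val localCounts = mutable.HashMap.empty[String, Long]   // local aggregation
      rows.foreach { row =>
        val key = row.getAs[String]("eventType")
        localCounts(key) = localCounts.getOrElse(key, 0L) + 1L
      }
      client.writeBatch(localCounts.toMap)                    // one batched write per partition
      client.close()
    }
  }
  .start()
```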
15. Speculate Away!
SparkConf | Value | Description
spark.speculation | true | If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation.multiplier | 5 | How many times slower a task is than the median to be considered for speculation.
spark.speculation.quantile | 0.9 | Fraction of tasks which must be complete before speculation is enabled for a particular stage.
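These are standard Spark scheduler settings; a sketch of applying the values from the table when building the session (they need to be in place before the SparkContext is created):

```scala
import org.apache.spark.sql.SparkSession

// Speculation settings from the table above, applied at session build time.
val sparkSession = SparkSession.builder()
  .appName("profile-streaming")
  .config("spark.speculation", "true")           // re-launch straggling tasks
  .config("spark.speculation.multiplier", "5")   // task must be 5x slower than the median
  .config("spark.speculation.quantile", "0.9")   // only after 90% of the stage has finished
  .getOrCreate()
```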
18. Different Scenarios
To cater to simple scenarios:

Results Table in StateStore (lower cardinality 😎)
Key | Value
<8pm-9pm> purchase | 500
<8pm-9pm> addToCart | 5000
…
<9pm-10pm> addToCart | 70

Results Table in StateStore (very high cardinality! ☠)
Key | Value
<8pm-9pm> product1 | 20
<8pm-9pm> product2 | 30
…
<9pm-10pm> product100002 | 2
…
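Both tables come from the same shape of windowed aggregation; the difference is the cardinality of the grouping key. A sketch, assuming the Kafka payload has been parsed into illustrative eventTime / eventType / productId columns on the `events` DataFrame:

```scala
import org.apache.spark.sql.functions.{col, count, window}

// Low cardinality: one state-store entry per (window, eventType).
val byEventType = events
  .groupBy(window(col("eventTime"), "1 hour"), col("eventType"))
  .agg(count("*").as("events"))

// Very high cardinality: one state-store entry per (window, productId);
// with millions of products this is what fills up the state store.
val byProduct = events
  .groupBy(window(col("eventTime"), "1 hour"), col("productId"))
  .agg(count("*").as("events"))
```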
19. StateStore Issues
• By default, state is stored in memory, i.e. managed by the JVM
• A large number of keys => GC pauses
• GC pauses => higher latencies and increased lag
• Switch to an off-heap state store
• RocksDB is one example
• It can manage far more keys in the StateStore safely
• Implement your own persistent off-heap state store by extending StateStoreProvider; one example is Redis
• Also try to keep shorter windows
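For reference, Spark 3.2+ ships a RocksDB-backed provider that can be selected with a single config, and a custom (e.g., Redis-backed) StateStoreProvider is plugged in through the same key:

```scala
// Spark 3.2+: the RocksDB-backed state store keeps the bulk of the state off the JVM heap.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

// A custom implementation (e.g., one backed by Redis) is wired in the same way:
// point providerClass at your own StateStoreProvider subclass.
```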
20. Skew is Real!
▪ Default partitioning of the DataFrame might not be ideal
▪ Some partitions can have too much data
▪ Processing those can cause OOM / connection failures
▪ Repartition is your friend
▪ It might still not be enough; add some salt!
The 9997/10000 tasks that succeed don't matter; the 3/10000 that fail are what matter.
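One common salting sketch, with illustrative column names and salt range: append a random salt to the hot key so a single key's records spread across many partitions instead of landing in one.

```scala
import org.apache.spark.sql.functions.{col, concat_ws, floor, rand}

// Sketch: salt the key before repartitioning. Column names and the salt range
// (32) are illustrative; tune the range to how skewed the hot keys are.
val salted = events
  .withColumn("salt", floor(rand() * 32))
  .withColumn("saltedKey", concat_ws("_", col("profileId"), col("salt")))
  .repartition(600, col("saltedKey"))   // 600 stands in for targetPartitionCount
```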
21. How to get the magic targetPartitionCount?
▪ When reading/writing Parquet on HDFS, many recommendations are to mimic the HDFS block size (default: 128 MB)
▪ Sample a small portion of your large DF
  ▪ df.head might suffice too, with a large enough sample
▪ Estimate the size of each row and extrapolate
Sample here and sample there!
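A rough sketch of that sampling approach for a batch DataFrame `df`: estimate bytes per row from a small sample, extrapolate the total, and divide by the 128 MB block size. Using the string form of a row as a size proxy and the 0.1% sampling fraction are both approximations.

```scala
// Rough estimate of targetPartitionCount so each output partition is ~128 MB.
val targetBytesPerPartition = 128L * 1024 * 1024

val sample = df.sample(withReplacement = false, fraction = 0.001).collect()
val avgRowBytes = sample
  .map(_.toString.getBytes("UTF-8").length.toLong)
  .sum / math.max(sample.length, 1)

val totalRows = df.count()
val targetPartitionCount =
  math.max(1L, (totalRows * avgRowBytes) / targetBytesPerPartition).toInt

df.repartition(targetPartitionCount).write.parquet("/tmp/output")   // path is illustrative
```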
22. Digging into Redis Pipelining + Spark
From https://redis.io/topics/pipelining
(Diagram from the Redis docs: round trips without pipelining vs. with pipelining.)
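A sketch of combining pipelining with the partition-level pattern, assuming the Jedis client and an RDD of (key, count) pairs named `counts` (host and port are illustrative): queue all commands for a partition on one pipeline and sync once, paying one round trip per partition instead of one per record.

```scala
import redis.clients.jedis.Jedis

// One pipeline per partition => one network round trip per flush.
counts.foreachPartition { records =>
  val jedis = new Jedis("redis-host", 6379)
  val pipeline = jedis.pipelined()
  records.foreach { case (key, delta) =>
    pipeline.hincrBy(key, "count", delta)   // queued locally, not yet sent to Redis
  }
  pipeline.sync()                           // flush every queued command at once
  jedis.close()
}
```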