Performant Streaming in
Production: Part 1
Max Thöne, Resident Solutions Architect
Stefan van Wouw, Sr. Resident Solutions Architect
Notebooks
▪ To explore the demos we have shown, find the link to the notebooks here
About the Speakers
Stefan van Wouw, Sr. Resident Solutions Architect, Databricks
Max Thöne, Resident Solutions Architect, Databricks
This talk
Part 1
Introduction
What parts of a stream should be tuned
Input Parameters
Optimal mini-batch size
Part 2
State Parameters
Limiting the state dimension
Output Parameters
Do not be a bully for downstream jobs
Deployment
Considerations after deploying to PROD
Introduction
Suppose we have a stream set up like this
Message source based stream (Structured Streaming):
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json("json", schema).alias("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "...")
  .start()
Or a stream like this
File source based stream (Structured Streaming):
spark
  .readStream
  .format("delta")
  .load("/salesDeltaIn/")
  .withColumn("item_id", col("data.item_id"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTableOut/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "...")
  .start()
Maybe even a stream like this (joins)
[Diagram: input records 1..N flow through Structured Streaming and are joined against state records 1..M]
spark
  .readStream
  …
  .join(itemDF, "item_id")
  …
  .writeStream
  …
  .start()
Maybe even a stream like this (stateful operations)
[Diagram: input records 1..N flow through Structured Streaming, with state records 1..M maintained across mini-batches]
spark
  .readStream
  …
  .groupBy("item_id")
  .count()
  …
  .writeStream
  …
  .start()
Scale dimensions
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
[Diagram: input records 1..N flow through Structured Streaming and are compared against state records 1..M]
How do we correctly tune this?
Let’s use this example!
1. Main input stream
salesSDF = (
  spark
    .readStream
    .format("delta")
    .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
    .join(spark.table("items"), "item_id")
)
3. Aggregate sales per item category per hour
itemSalesPerHourSDF = (
  itemSalesSDF
    .groupBy(window(..., "1 hour"), "item_category")
    .sum("revenue")
)
Input Parameters
Limiting the input dimension
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
▪ Goal: limit n in O(n⨯m)
Why are input parameters important?
▪ Allows you to control the mini-batch size.
▪ Optimal mini-batch size → optimal cluster usage.
▪ Suboptimal mini-batch size → performance cliff:
  ▪ Shuffle spill
  ▪ Different query plan (Sort Merge Join vs Broadcast Join)
What input parameters are we talking about?
File Source
▪ Any: maxFilesPerTrigger
▪ Delta Lake: also maxBytesPerTrigger
Message Source
▪ Kafka: maxOffsetsPerTrigger
▪ Kinesis: fetchBufferSize
▪ EventHubs: maxEventsPerTrigger
These options control the size of each mini-batch, which is especially important in relation to shuffle partitions (see the sketch below).
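A minimal sketch of how these options are set on the stream reader (the paths, servers, topic, and limit values are placeholders, not recommendations):

# File source (Delta Lake): cap each mini-batch by file count and (soft) byte size
fileStreamSDF = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 6)
    .option("maxBytesPerTrigger", "1g")  # Delta Lake only
    .load("/salesDeltaIn/")
)

# Message source (Kafka): cap each mini-batch by number of offsets
kafkaStreamSDF = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "...")
    .option("subscribe", "topic")
    .option("maxOffsetsPerTrigger", 100000)
    .load()
)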
Input Parameters Example: Stream-Static Join
What is a Stream-Static join?
▪ Joining a streaming df to a static df
▪ Induces a shuffling step.
1. Main input stream
salesSDF = (
  spark
    .readStream
    .format("delta")
    .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
    .join(spark.table("items"), "item_id")
)
Input Parameters: Not tuning maxFilesPerTrigger
What happens when maxFilesPerTrigger is not set?
▪ For Delta: the default is 1000 files, each file being ~200 MB.
▪ For message sources and other file-based input: the default is unlimited.
▪ This leads to a massive mini-batch!
▪ When you have shuffle operations → spill.
Input Parameters: Tuning maxFilesPerTrigger
Base it on shuffle partition size
▪ Rule of thumb 1: Optimal shuffle partition size ~100-200 MB
▪ Rule of thumb 2: Set shuffle partitions equal to # of cores = 20.
▪ Use Spark UI to tune maxFilesPerTrigger until you get ~100-200 MB
per partition.
▪ Note: Size on disk is not a good proxy for size in memory
▪ Reason is that file size is different from the size in cluster memory
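A sketch of the tuned setup from this example (20 cores and 6 files are this example's values, found by iterating with the Spark UI, not universal defaults):

# One shuffle partition per core; this example's cluster has 20 cores
spark.conf.set("spark.sql.shuffle.partitions", 20)

salesSDF = (
    spark.readStream
    .format("delta")
    # tuned until each shuffle partition holds ~100-200 MB
    .option("maxFilesPerTrigger", 6)
    .table("sales")
)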
Tuning maxFilesPerTrigger: Result
Significant performance improvement by removing spill
▪ maxFilesPerTrigger tuned to 6 files
▪ Shuffle partitions tuned to 20
▪ Processed records/second increased by 30%
Sort Merge Join vs Broadcast Hash Join
We are not done yet!
▪ Currently we use a Sort Merge Join.
▪ Our static DF is small enough to broadcast it (see the sketch below).
▪ Broadcasting leads to 70% higher throughput!
▪ We can also increase maxFilesPerTrigger, since the broadcast join removes the shuffle and with it the risk of shuffle spill.
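A sketch of the broadcast variant of the stream-static join (assuming, as in this example, that the items table is small enough to fit in memory):

from pyspark.sql.functions import broadcast

itemSalesSDF = (
    salesSDF
    # The broadcast() hint makes Spark pick a broadcast hash join
    # over a shuffle-based sort-merge join
    .join(broadcast(spark.table("items")), "item_id")
)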
Demo: Input Parameters
▪ Explore the demo notebooks on the first topic: Input Parameters
Input Parameters: Summary
Main takeaways
▪ Set shuffle partitions to # Cores (assuming no skew)
▪ Tune maxFilesPerTrigger so you end up with 150-200 MB / shuffle partition
▪ Try to make use of broadcasting whenever possible
Performant Streaming in
Production: Part 2
Max Thöne, Resident Solutions Architect
Stefan van Wouw, Sr. Resident Solutions Architect
State Parameters
Limiting the state dimension
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
▪ Goal: limit m in O(n⨯m)
Limiting the state dimension
What we mean by state
▪ State Store backed operations
  ▪ Stateful (windowed) aggregations
  ▪ Drop duplicates
  ▪ Stream-Stream joins
▪ Delta Lake table or external system
  ▪ Stream-Static join / MERGE
Why are state parameters important?
▪ Optimal parameters → Optimal cluster usage
▪ If not controlled, state explosion can occur:
  ▪ Slower stream performance over time
  ▪ Heavy shuffle spill (joins/MERGE)
  ▪ Out-of-memory errors (State Store backed operations)
What parameters are we talking about?
State Store specific:
▪ How much history to compare against (watermarking)
▪ Which state store backend to use (RocksDB / default; see the sketch below)
State Store agnostic (Stream-Static Join / MERGE):
▪ How much history to compare against (query predicate)
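For the state store backend, a sketch of switching to RocksDB (the provider class below is the one that ships with open-source Spark 3.2+; at the time of this talk RocksDB was a Databricks-specific option, so the exact class name depends on your platform):

# Keep large state off the JVM heap by using RocksDB instead of the
# default in-memory state store; set this before starting the stream
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)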
State parameters example
▪ Extending the earlier code sample with a stateful aggregation
▪ E.g. calculating the number of sales per item category per hour
▪ Two types of state dimension here:
  a. The static side of the stream-static join (items)
  b. The State Store backed operation (windowed stateful aggregation)
1. Main input stream
salesSDF = (
  spark
    .readStream
    .format("delta")
    .table("sales")
)
2. Join item category lookup
itemSalesSDF = (
  salesSDF
    .join(spark.table("items"), "item_id")
)
3. Aggregate sales per item per hour
itemSalesPerHourSDF = (
  itemSalesSDF
    .groupBy(window(..., "1 hour"), "item_category")
    .sum("revenue")
)
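To bound the State Store side of this example, a watermark would be added before the aggregation. A sketch, assuming the sales stream has an event-time column (the name sales_timestamp is made up for illustration):

from pyspark.sql.functions import window

itemSalesPerHourSDF = (
    itemSalesSDF
    # "sales_timestamp" is an assumed event-time column; state for windows
    # more than 2 hours behind the max observed event time can be dropped
    .withWatermark("sales_timestamp", "2 hours")
    .groupBy(window("sales_timestamp", "1 hour"), "item_category")
    .sum("revenue")
)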
Demo: State Parameters
▪ Explore the demo notebooks on the second topic: State Parameters
State Parameters: Summary
Main takeaways
▪ Limit state accumulation with an appropriate watermark
▪ The more granular the aggregate key / window, the more state
▪ Delta-backed state might provide more flexibility at the cost of latency
Output Parameters
How output parameters influence the scale
dimensions
▪ Input size n: records in the mini-batch
▪ State size m: records to be compared against
Why are output parameters important?
▪ Streaming jobs tend to create many small files
▪ Reading a folder with many small files is slow
▪ This degrades performance for downstream jobs / self-joins
What output parameters are we talking about?
▪ Manually using repartition
▪ Delta Lake: Auto Optimize (see the sketch below)
https://docs.databricks.com/delta/optimizations/auto-optimize.html
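A sketch of both options (the partition count and paths are illustrative; the table properties are the Auto Optimize settings described in the linked docs):

# Option 1: manually repartition so each mini-batch writes fewer, larger files
query = (
    itemSalesSDF
    .repartition(8)  # illustrative target number of files per mini-batch
    .writeStream
    .format("delta")
    .option("path", "/deltaTableOut/")
    .option("checkpointLocation", "...")
    .start()
)

# Option 2 (Databricks): enable Auto Optimize on the target Delta table
spark.sql("""
  ALTER TABLE delta.`/deltaTableOut/`
  SET TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true
  )
""")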
Demo: Output Parameters
▪ Explore the demo notebooks on the third topic: Output Parameters
Output Parameters: Summary
Main takeaways
▪ A high number of files impacts performance
▪ A 10x speed difference can easily be demonstrated
How to keep your streams
performant after deployment
Multiple streams per Spark cluster
▪ Some small streams do not warrant their own cluster
▪ Packing them together in one Spark application might be a good option, but then they share a driver process, which has a performance impact (see the sketch below)
[Diagram: three Structured Streaming queries running inside one Spark application]
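A sketch of the packing approach (stream1SDF and stream2SDF stand in for any two streaming DataFrames):

# Two independent queries started from the same Spark application:
# they share one driver, so scheduling/listener load on one stream
# can slow down the other
query1 = (
    stream1SDF.writeStream
    .format("delta")
    .option("path", "/out1/")
    .option("checkpointLocation", "/chk1/")
    .start()
)
query2 = (
    stream2SDF.writeStream
    .format("delta")
    .option("path", "/out2/")
    .option("checkpointLocation", "/chk2/")
    .start()
)

# Block until any of the queries terminates (e.g. on failure)
spark.streams.awaitAnyTermination()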
Temporary changes to load (elasticity)
▪ Temporarily scaling up a streaming cluster to handle backlog
▪ You can only scale out while #cores <= #shuffle partitions; beyond that, extra cores sit idle
Permanent changes to load (capacity planning)
▪ A permanent load increase warrants capacity planning
▪ This requires a checkpoint wipe-out, since the number of shuffle partitions is fixed per checkpoint location!
▪ Think of a strategy to recover state, if necessary (see the sketch below)
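Roughly what a permanent resize looks like; a sketch (the partition count and checkpoint path are illustrative, and streamingSDF stands in for the query being redeployed):

# 1. Match the new parallelism to the bigger cluster
spark.conf.set("spark.sql.shuffle.partitions", 40)

# 2. Restart with a FRESH checkpoint location: for stateful queries the
#    old checkpoint pins the old shuffle partition count
query = (
    streamingSDF.writeStream
    .format("delta")
    .option("path", "/deltaTableOut/")
    .option("checkpointLocation", "/checkpoints/v2/")  # new location
    .start()
)

# 3. If the query is stateful, plan how to rebuild or backfill its state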
Summary

Input Parameters
▪ Limit input size
▪ Tune shuffle partitions / cores (30% faster)
▪ Enforce broadcasting when possible (2x faster)

State Parameters
▪ Limit state accumulation
▪ Limit how far you look back (history)

Output Parameters
▪ Prevent generating many small files (10x faster)

Deployment
▪ Capacity planning is needed due to deployment-bound parameters
▪ Have a strategy for checkpoint reset
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.