Introduction to Joins in Structured Streaming
Himanshu Gupta
Lead Consultant
Knoldus Software LLP
https://softwareengineeringdaily.com/2016/03/09/apache-spark-usage-python-or-scala/
Agenda
● Quick Recap
● Unsupported Operations
● Join Operations
● Stream-Static Joins
● Stream-Stream Joins
● Support Matrix for Joins in Streaming Queries
● Demo
Quick Recap
● A scalable and fault-tolerant stream processing engine built on
the Spark SQL engine.
● Allows us to express our streaming computation the same
way we would express a batch computation on static data.
● Uses the Dataset/DataFrame API in Scala, Java, Python or R
to express streaming aggregations, event-time windows, etc.
● Leverages Spark SQL engine to optimize computation.
● Ensures end-to-end exactly-once fault-tolerance guarantees
through checkpointing & WALs.
Basic Concept
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Unsupported Operations
Unsupported operations in Structured Streaming are:
● Multiple streaming aggregations (i.e. a chain of aggregations
on a streaming DF/DS) are not yet supported.
● Limit and take first N rows are not supported on streaming
Datasets.
● Distinct operations on streaming Datasets are not supported.
● Sorting operations are supported on streaming Datasets only
after an aggregation and in Complete Output Mode.
● Joins between two streaming Datasets were not supported
before Apache Spark 2.3; since 2.3, Stream-Stream joins are
supported with some restrictions.
Join Operations
● Structured Streaming supports Stream-Static and Stream-Stream joins.
● The result of the streaming join is generated incrementally.
● The result of a join involving a streaming Dataset/DataFrame is
exactly the same as if the join had been run on a static
Dataset/DataFrame containing the same data as the stream.
Stream-Static Joins
● Supported since Apache Spark 2.0.
● They are not stateful, so no state management is required.
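import org.apache.spark.sql.functions.{col, from_json}
// Static side: company reference data from a CSV file with a header row.
val companiesDF =
  spark.read.option("header", "true").csv("src/main/resources/companies.csv")
// Streaming side: stock events from Kafka, parsed from JSON with `schema`.
// `spark`, `bootstrapServer`, `topic` and `schema` are assumed to be defined elsewhere.
val stockStreamDF = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("subscribe", topic).load()
  .select(from_json(col("value").cast("string"), schema).as("value"))
  .select("value.*")
// Inner join on the shared "companyName" column.
val filteredStockStreamDF = stockStreamDF.join(companiesDF, "companyName")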
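Why a Stream-Static join needs no state can be seen in a small model that does not use Spark at all. The sketch below is plain Scala; the `Company` and `Stock` types, the sample rows and the `joinBatch` helper are hypothetical and only illustrate the idea. It also assumes company names are unique on the static side.

```scala
object StreamStaticJoinSketch {
  // Hypothetical record types, loosely modeled on the slide's example.
  final case class Company(companyName: String, sector: String)
  final case class Stock(companyName: String, price: Double)

  // Inner-join one micro-batch against the static side. Nothing is carried
  // over between calls, which is why no streaming state is needed.
  def joinBatch(batch: Seq[Stock], static: Map[String, Company]): Seq[(Stock, Company)] =
    batch.flatMap(s => static.get(s.companyName).map(c => (s, c)))

  // Static lookup table (assumes unique company names).
  val companies: Map[String, Company] =
    Seq(Company("A", "tech"), Company("B", "energy")).map(c => c.companyName -> c).toMap

  // Three micro-batches arriving one after another.
  val batches: Seq[Seq[Stock]] = Seq(
    Seq(Stock("A", 10.0), Stock("C", 5.0)),
    Seq(Stock("B", 7.5)),
    Seq(Stock("A", 11.0))
  )

  // Incremental result: join each micro-batch as it arrives.
  val incremental: Seq[(Stock, Company)] = batches.flatMap(joinBatch(_, companies))
  // Batch result: join all rows at once.
  val allAtOnce: Seq[(Stock, Company)] = joinBatch(batches.flatten, companies)
}
```

Joining batch by batch gives the same rows as joining everything at once, which matches the claim that the streaming result equals the static result on the same data.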
Stream-Stream Joins
● Supported since Apache Spark 2.3.
● Challenge:
– At any point in time, the view of the dataset is incomplete for both sides of the
join, making it much harder to find matches between inputs.
– Any row received from one input stream can match with any future,
yet-to-be-received row from the other input stream.
● Solution:
– For both the input streams, we have to buffer past input as streaming state, so
that we can match every future input with past input and accordingly generate
joined results.
– The engine also automatically handles late, out-of-order data and can limit the
state using watermarks.
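The buffering idea can be sketched as a symmetric join in plain Scala. This is a simplified model under stated assumptions, not Spark's implementation: the `SymmetricJoin` class and its method names are hypothetical, it handles only equality on a key, and it never clears its state.

```scala
import scala.collection.mutable

object SymmetricJoinSketch {
  // Buffers every row seen so far on each side, keyed by the join key.
  final class SymmetricJoin[K, L, R] {
    private val leftState  = mutable.Map.empty[K, Vector[L]]
    private val rightState = mutable.Map.empty[K, Vector[R]]

    // A new left row is buffered, then matched against all buffered right rows.
    def onLeft(key: K, row: L): Seq[(L, R)] = {
      leftState(key) = leftState.getOrElse(key, Vector.empty) :+ row
      rightState.getOrElse(key, Vector.empty).map(r => (row, r))
    }

    // A new right row is buffered, then matched against all buffered left rows.
    def onRight(key: K, row: R): Seq[(L, R)] = {
      rightState(key) = rightState.getOrElse(key, Vector.empty) :+ row
      leftState.getOrElse(key, Vector.empty).map(l => (l, row))
    }

    // Total number of rows held in state on both sides.
    def bufferedRows: Int =
      leftState.values.map(_.size).sum + rightState.values.map(_.size).sum
  }

  // Feeds four rows in arrival order and returns the joined output and state size.
  def demo(): (Seq[(String, Int)], Int) = {
    val join = new SymmetricJoin[String, String, Int]
    val r1 = join.onLeft("A", "stock-A-1")  // nothing buffered on the right yet
    val r2 = join.onRight("A", 100)         // matches the buffered left row
    val r3 = join.onLeft("A", "stock-A-2")  // matches the buffered right row
    val r4 = join.onRight("B", 200)         // no left row with key B
    (r1 ++ r2 ++ r3 ++ r4, join.bufferedRows)
  }
}
```

Every row stays in state after it arrives, so the state only grows. That is the problem watermarks and time constraints are meant to solve.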
Inner Join
● Any kind of columns along with any kind of join conditions are
supported.
● As the stream runs, the size of the streaming state keeps
growing indefinitely, because all past input must be saved:
any new input can match any input from the past.
● To avoid an unbounded state, we have to define additional join
conditions such that indefinitely old inputs cannot match with
future inputs and therefore can be cleared from the state.
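A time-range condition makes old state safe to drop. The plain-Scala sketch below assumes a condition of the form "stock time lies in [company time, company time + maxLagSec]" and a watermark on the stock stream, below which no further stock rows are expected. The `Event` type, `evictCompanies` and the use of seconds as the unit are hypothetical, and this is a model of the idea rather than Spark's actual state-cleanup logic.

```scala
object StateEvictionSketch {
  // Hypothetical buffered row: join key plus event time in seconds.
  final case class Event(key: String, time: Long)

  // A buffered company row can still match a future stock row only if
  // company.time + maxLagSec >= stockWatermark. Older rows can never match
  // again, so they are dropped from the state.
  def evictCompanies(
      buffered: Vector[Event],
      stockWatermark: Long,
      maxLagSec: Long
  ): Vector[Event] =
    buffered.filter(c => c.time + maxLagSec >= stockWatermark)
}
```

With a 20-second range and a stock watermark of 100, a company row at time 90 is kept (90 + 20 >= 100) while rows at times 0 and 50 are dropped.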
Example
Let’s say we want to join a stream of trading company names
with another stream of stocks to pick out the stocks that a stock
broker is interested in.
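import org.apache.spark.sql.functions.expr

val companies = spark.readStream. ...
val stocks = spark.readStream. ...
// Join with event-time constraints: a stock row matches a company row with the
// same name if it arrives at most 20 seconds after the company's trading time.
stocks.join(
  companies,
  expr("""
    companyName = stockName AND stockInputTime >= companyTradingTime AND
    stockInputTime <= companyTradingTime + interval 20 seconds
  """)
)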
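To see which pairs of rows the join condition accepts, the same predicate can be written as an ordinary Scala function. This is only an illustration outside Spark: the `Company` and `Stock` case classes and the `matches` helper are hypothetical, with field names borrowed from the columns in the example.

```scala
import java.time.{Duration, Instant}

object JoinConditionSketch {
  // Hypothetical row types using the column names from the example.
  final case class Company(companyName: String, companyTradingTime: Instant)
  final case class Stock(stockName: String, stockInputTime: Instant)

  // The 20-second range from the join condition.
  val maxLag: Duration = Duration.ofSeconds(20)

  // True when the names are equal and the stock time lies in
  // [companyTradingTime, companyTradingTime + 20 seconds], both ends inclusive.
  def matches(stock: Stock, company: Company): Boolean =
    company.companyName == stock.stockName &&
      !stock.stockInputTime.isBefore(company.companyTradingTime) &&
      !stock.stockInputTime.isAfter(company.companyTradingTime.plus(maxLag))
}
```

A stock row exactly 20 seconds after the trading time still matches; one at 21 seconds, or one that precedes the trading time, does not.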
Outer Join
● Similar to Inner Join, except that for Left & Right Outer Joins,
watermarks and event-time constraints must be specified.
● This is because, to generate the NULL results of an outer join,
the engine must know when an input row is not going to match
anything in the future.
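The role of the watermark in producing NULL results can be sketched in plain Scala. The model assumes a left outer join whose condition requires the company time to be no later than the stock time, so an unmatched stock row can be emitted with an empty right side once the company watermark has moved past its time. The `Stock` type with its `matched` flag and the `emitUnmatched` helper are hypothetical; Spark's own bookkeeping differs.

```scala
object LeftOuterEmitSketch {
  // Hypothetical buffered left-side row; `matched` records whether it has
  // already been joined with at least one right-side row.
  final case class Stock(stockName: String, stockInputTime: Long, matched: Boolean)

  // Splits the buffer at the company watermark. Rows older than the watermark
  // can no longer match: the unmatched ones are emitted with None on the
  // right side, the matched ones are simply dropped. Newer rows stay buffered.
  def emitUnmatched(
      buffered: Vector[Stock],
      companyWatermark: Long
  ): (Vector[(Stock, Option[String])], Vector[Stock]) = {
    val (expired, live) = buffered.partition(_.stockInputTime < companyWatermark)
    val nullResults = expired.filterNot(_.matched).map(s => (s, Option.empty[String]))
    (nullResults, live)
  }
}
```

Without a watermark the engine could never decide that a row is final, so the NULL result would be delayed forever.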
Example
Let’s say we want to keep the information of the stocks which
were not traded for future prospects.
// Apply watermarks on event-time columns
val companiesWithWatermark = companies.withWatermark("companyTradingTime", "10 seconds")
val stocksWithWatermark = stocks.withWatermark("stockInputTime", "20 seconds")
// Join with event-time constraints
stocksWithWatermark.join(
  companiesWithWatermark,
  expr("""
    companyName = stockName AND stockInputTime >= companyTradingTime AND
    stockInputTime <= companyTradingTime + interval 20 seconds
  """),
  joinType = "leftOuter"
)
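Support Matrix for Joins in Streaming Queries

Left Input | Right Input | Join Type   | Supported
Static     | Static      | All Types   | Yes
Stream     | Static      | Inner       | Yes
Stream     | Static      | Left Outer  | Yes
Stream     | Static      | Right Outer | No
Stream     | Static      | Full Outer  | No
Static     | Stream      | Inner       | Yes
Static     | Stream      | Left Outer  | No
Static     | Stream      | Right Outer | Yes
Static     | Stream      | Full Outer  | No
Stream     | Stream      | Inner       | Yes
Stream     | Stream      | Left Outer  | Yes (Conditionally)
Stream     | Stream      | Right Outer | Yes (Conditionally)
Stream     | Stream      | Full Outer  | No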
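For quick checks, the matrix can be encoded as a small lookup function. This is a sketch that mirrors the table as given in these slides. The `Support` type, the `support` function and the join-type strings ("inner", "leftOuter", "rightOuter", "fullOuter") are hypothetical names chosen for the example, not a Spark API.

```scala
object JoinSupportSketch {
  sealed trait Support
  case object Yes extends Support
  case object Conditional extends Support // needs watermarks + time constraints
  case object No extends Support

  // Encodes the support matrix row by row; anything not listed is unsupported.
  def support(leftIsStream: Boolean, rightIsStream: Boolean, joinType: String): Support =
    (leftIsStream, rightIsStream, joinType) match {
      case (false, false, _)                        => Yes         // static-static: all types
      case (_, _, "inner")                          => Yes         // inner works for every combination
      case (true, false, "leftOuter")               => Yes         // stream-static
      case (false, true, "rightOuter")              => Yes         // static-stream
      case (true, true, "leftOuter" | "rightOuter") => Conditional // stream-stream outer
      case _                                        => No
    }
}
```

The pattern is easy to read off: an outer join is supported only when the side whose rows must be preserved is a stream.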
Code/Package -
Spark Structured Streaming Package
References
● http://spark.apache.org/docs/latest/structured-streaming-programming-guid
● https://github.com/apache/spark/tree/master/examples/src/main/scala/org/a
● https://databricks.com/session/a-deep-dive-into-structured-streaming
● https://www.gitbook.com/book/jaceklaskowski/spark-structured-streaming/d
Thank You!
