SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
Arijit Tarafdar
Nan Zhu
Azure HDInsight @ Microsoft
Building Structured
Streaming Connector for
Continuous Applications
Who Are We?
• Nan Zhu
– Software Engineer@Azure HDInsight. Work on Spark
Streaming/Structured Streaming service in Azure. Committee
Member of XGBoost@DMLC and Apache MxNet (incubator).
Spark Contributor.
• Arijit Tarafdar
– Software Engineer@Azure HDInsight. Work on Spark/Spark
Streaming on Azure. Previously worked with other distributed
platforms like DryadLinq and MPI. Also worked on graph coloring
algorithms which was contributed to ADOL-C
(https://projects.coin-or.org/ADOL-C).
Agenda
• What is/Why ContinuousApplication?
– What are the challenges when we already have Spark in
hand?
• Introduction of Azure Event Hubs and Structured
Streaming
• Key Design Considerations in Building Structured
Streaming Connector
• Test Structured Streaming Connector
• Summary
Continuous Application Architecture
and Role of Spark Connectors (1)
• Not only size of data is increasing, but also the
velocity of data
– Sensors, IoT devices, social networks and online
transactions are all generating data that needs to be
monitored constantly and acted upon quickly.
Continuous Application Architecture
and Role of Spark Connectors (2)
Processing
Engine
Real-time
Data Source
Acting
Components
Continuous
Raw Data
Continuous
Insights
Block the fraud
transactions,
broadcast alerts, etc.
Sensors, IoT Devices,
Log Collectors, etc.
Spark Streaming,
Structured Streaming
How to connect
Spark and Real-time
Data Sources?
Building a Connector for
Real-time Data Source and
Structured Streaming
taking Azure Event Hubs as an example
(Comparing with Kafka Connector)
What is Azure Event Hubs?
Platform as a Service
What is Structured Streaming?
• A relatively new component in Spark
– Introduced in Spark 2.0
– Keep evolving from Spark 2.0 – 2.2
• Streaming Analytic Engine for Structured Data
– Sharing the same set of APIs with Spark SQL
Batching Engine
– Running computation over continuously arriving data
and keep updating results
Structured Streaming
Abstraction in Structured Streaming
Input
Batch
X, X+1,
...
Batch
X, X+1,
,…
(Spark 2.1
Implementation)
Source
FS
Source
EventHubs
Source
…
Describe Data Source:
1. getOffset: offset of the last
message to be processed in
next batch
2. getBatch: build DataFrame for
the next batch
Processing
Spark SQL Core
getOffset/get
Batch
User-Defined
Operations
Define Data
Transforming Task with
Spark SQL APIs
(Structured Streaming
Internal)
Output
Sink
FS
Sink
Console
Sink
…
Describe Data Sink
1. addBatch: evaluate
DataFrame and save
data to target system
Still follow Micro-Batch streaming model
Implementation of a “Source”
Description Implementation
getOffset()
Return the last offset of
next batch
EndOffsetOfLastBatch
“+” RateControlFunc(…)
getBatch(startOffset,
endOffset)
Build DataFrame for next
Batch
SS internal pases in
startOffset and endOffset
(results of getOffset)
1. How to represent Offset
2. How to define Rate (To avoid tracking size of each message, rate is
usually defined as # of messages)
Implementation of a “Source”
Description Implementation
getOffset()
Return the last offset of
next batch
EndOffsetOfLastBatch
“+” RateControlFunc(…)
getBatch(startOffset,
endOffset)
Build DataFrame for next
Batch
SS internal pases in
startOffset and endOffset
(results of getOffset)
1. How to represent Offset
2. How to define Rate (To avoid tracking size of each message, rate is
usually defined as # of messages)
Various Forms of Message Offset
• Consecutive Numbers
– 0, 1, 2, 3, …
– Examples: Kafka, MQTT
– Server-side index mapping an integer to the actual offset
of the message in disk space
• Real offset
– 0, sizeOf(msg0), sizeOf(msg0 + msg1), ...
– Examples: Azure Event Hubs
– No server-side index, passing offset as part of message to
user
How it brings difference? - Kafka
Driver
Executor
Executor
Kafka
Batch 0: Mapping the partition of DataFrame
to a offset range in server side:
(endOffsetOfLastBatch:0, # of msgs: 1000)
Batch 1: How many messages are to be
processed in next batch, and where to start?
(endOffsetOfLastBatch: 999, # of msgs: 1000)
How it brings difference? - Event Hubs
Driver
Executor
Executor
Batch 0: How many messages are to be
processed in next batch, and where to start?
(endOffsetOfLastBatch:0, # of Msgs: 1000)
Azure Event
Hubs
Batch 1: How many messages are to be
processed in next batch, and where to start?
(endOffsetOfLastBatch:?, endOffset: 1000)
What’s the offset of 1000th
message???
The answer appeared in Executor side
(when Task receives the message (offset
as part of message))
Build a Channel to Pass Information
from Executor to Driver!!!
Difference of KafkaSource and
EventHubsSource
getOffset EndOffsetOfLastBatch (known
before the batch is finished)
Calculate targetOffset
of next Batch
Collect the ending
offsets of last batch
from executors
getBatch(
startOffset,
endOffset)
StartOffset: EndOffsetOfLastBatch
(passed-in value from SS internal)
StartOffset: the collected values
Azure Event Hubs
More details about two-layered index to work with OffsetSeqLog in SS for offine discussion
or QA
How it brings difference? - Event Hubs
Driver
Executor
Executor
Batch 0: How many messages are to be
processed in next batch, and where to start?
(startOffset:0, # of Msgs: 1000)
Azure Event
Hubs
Batch 1: How many messages are to be
processed in next batch, and where to start?
(startOffset:?, endOffset: 1000)
What’s the offset of 1000th
message???
The answer appeared in Executor side
(when Task receives the message (offset
as part of message))
Build a Channel to Pass Information
from Executor to Driver!!!
HOW?
HDFS-Based Channel
LastOffset_P1_Batch_i
LastOffset_PN_Batch_i
Source
DataFrame
.map()
Transformed
DataFrame
What’s the next step??? Simply
let Driver-side logic read the files?
•APIs like DataFrame.head(x) evaluates only some of the
partitions…Batch 0 generate 3 files, and Batch 1 generates 5
files…
•You have to merge the latest files with the historical results and
commit and then direct the driver-side logic to read
No!!!
HDFS-Based Channel
LastOffset_P1_Batch_i
LastOffset_PN_Batch_i
EventHubsRDD
Tasks
.map()
MapPartitionsRDD
Tasks
•APIs like DataFrame.head(x) evaluates only some of the partitions…Batch 0
generate 3 files, and Batch 1 generates 5 files…
•You have to merge the latest files with the historical results and commit...
•Ensure that all streams’ offset are committed transactionally
•Discard the partially merged/committed results to rerun the batch
Takeaways (1)
• Data Source is not only designed for structured
streaming
• Accommodate Connector Implementation with Data
Source Design
• Message Addressing (Offset) in data source side is
the Key Design Factor to be considered
– Requirement of Additional Information Sharing Channel
between Executors and Driver
• Design to ensure the correctness of the channel
Structured Streaming Unit Testing
Framework
• “Action” based structured streaming test flow –
StartStream, AddData, CheckAnswer,
AdvanceClock, StopStream, etc.
• Kafka source unit tests use Kafka micro-service.
• Unit test framework not usable as is.
Unit Testing Event Hubs Source
• No Kafka type micro-service available for Event
Hubs.
• Two simulated clients – Event Hubs AMQP and
Event Hubs REST clients.
• Replace both clients before first call to the Event
Hubs source.
Issues With Unit Test Framework
• No separation between test setup and test
execution.
• Unreliable update of the Event Hubs clients
before asynchronous call to
StreamingQueryManager.startQuery().
• Task serialization issue with AddData as part of
StreamTest.
What Is The Asynchrony Problem?
StreamingQueryManager.startQuery()
: StreamingQuery
StreamingQueryManager.createQuery()
: StreamingQueryWrapper
StreamingQueryWrapper
.StreamExecution.start()
EventHubsSource.setEventHubsClient()
EventHubsSource
.setEventHubsReceiver()
EventHubs Test Code
Spark Source Code
Thread 1
Thread 2 spawned by Thread 1
One Way To Solve Asynchrony Problem
StreamingQueryManager.startQuery()
: StreamingQuery
StreamingQueryManager.createQuery()
: StreamingQueryWrapper
StreamingQueryWrapper
.StreamExecution.start()
EventHubsSource.setEventHubsClient()
EventHubsSource
.setEventHubsReceiver()
EventHubs Test Code
Spark Source Code
Thread 1
Thread 2 spawned by Thread 1
Solving Asynchrony in Client Substitutions (1)
• Invoke createQuery
val createQueryMethod =
sparkSession.streams.getClass.getDeclaredMethods.filter(m =>
m.getName == "createQuery").head
createQueryMethod.setAccessible(true)
currentStream = createQueryMethod.invoke(…)
• Set as active query
val activeQueriesField =
sparkSession.streams.getClass.getDeclaredFields.filter(f => f.getName ==
"org$apache$spark$sql$streaming$StreamingQueryManager$$activeQueries").head
activeQueriesField.setAccessible(true)
val activeQueries = activeQueriesField.get(sparkSession.streams).
asInstanceOf[mutable.HashMap[UUID, StreamingQuery]]
activeQueries += currentStream.id -> currentStream
Solve Asynchrony in Client Substitutions (2)
• Substitute Event Hubs clients in source
val eventHubsSource = sources.head
eventHubsSource.setEventHubClient(…)
eventHubsSource.setEventHubsReceiver(…)
• Start StreamExecution
currentStream.start()
What Is The Task Serialization Issue?
SimulatedEventHubs
StreamTest
AddEventHubsData
AddData
TestEventHubsReceiver
TestEventHubsReceiver
TestEventHubsReceiver
TestEventHubsReceiver
TestEventHubsReceiver
Partition
Partition
Partition
Partition
Partition
EventHubsRDD
What Is StreamTest Inheritance Model?
SparkFunSuite
StreamTest
PlanTest
QueryTest
FunSuite
Solving Task Serialization Issue
• Separate StreamTest into two traits:
– EventHubsStreamTest
trait EventHubsStreamTest extends QueryTest with BeforeAndAfter with
SharedSQLContext with Timeouts with Serializable { }
– EventHubsAddData
trait EventHubsAddData extends StreamAction with Serializable {}
After All These Changes…
testStream(sourceQuery)(
StartStream(…),
CheckAnswer(…),
AddEventHubsData(…),
AdvanceManualClock(…),
CheckAnswer(…),
StopStream,
StartStream(…),
CheckAnswer(…),
AddEventHubsData(…),
AdvanceManualClock(…),
AdvanceManualClock(…),
CheckAnswer(…))
Summary/Takeaway(2)
• We have a production grade Spark Streaming connector for Structured
Streaming for the widely used Event Hubs as streaming source on Azure
(https://github.com/hdinsight/spark-eventhubs)
• Design, implementation and testing of Structured Streaming Connectors
o Accommodate Connector Implementation with Data Source Design
(message addressing)
o Unit Test with Simulated Service
o Asynchrony – the biggest enemy to a easy test
o Clear Boundary between test setup and test execution
Contributing Back To Community
• Failed Recovery from checkpoint caused by the multi-
threads issue in Spark Streaming scheduler
https://issues.apache.org/jira/browse/SPARK-19280
One Realistic Example of its Impact: You are potentially getting
wrong data when you use Kafka and reduceByWindow and
recover from a failure
• Data loss caused by improper post-batch-completed
processing
https://issues.apache.org/jira/browse/SPARK-18905
• Inconsistent Behavior of Spark Streaming Checkpoint
https://issues.apache.org/jira/browse/SPARK-19233
Thank You!!!
https://github.com/hdinsight/spark-eventhubs
https://azure.microsoft.com/en-us/services/hdinsight/

Weitere ähnliche Inhalte

Was ist angesagt?

What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineDatabricks
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkDatabricks
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksHuafeng Wang
 
Degrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files SyndromeDegrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files SyndromeDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierDatabricks
 

Was ist angesagt? (20)

What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL Engine
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf Jagerman
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming Frameworks
 
Degrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files SyndromeDegrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files Syndrome
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 

Ähnlich wie Building Continuous Application with Structured Streaming and Real-Time Data Source with Arijit Tarafdar and Nan Zhu

Sherlock Homepage - A detective story about running large web services - NDC ...
Sherlock Homepage - A detective story about running large web services - NDC ...Sherlock Homepage - A detective story about running large web services - NDC ...
Sherlock Homepage - A detective story about running large web services - NDC ...Maarten Balliauw
 
Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)Visug
 
Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...Maarten Balliauw
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedMichael Spector
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317Nan Zhu
 
Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...
Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...
Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...MSDEVMTL
 
EEDC 2010. Scaling SaaS Applications
EEDC 2010. Scaling SaaS ApplicationsEEDC 2010. Scaling SaaS Applications
EEDC 2010. Scaling SaaS ApplicationsExpertos en TI
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Maarten Balliauw
 
Scaling asp.net websites to millions of users
Scaling asp.net websites to millions of usersScaling asp.net websites to millions of users
Scaling asp.net websites to millions of usersoazabir
 
## Introducing a reactive Scala-Akka based system in a Java centric company
## Introducing a reactive Scala-Akka based system in a Java centric company## Introducing a reactive Scala-Akka based system in a Java centric company
## Introducing a reactive Scala-Akka based system in a Java centric companyMilan Aleksić
 
What's New in .Net 4.5
What's New in .Net 4.5What's New in .Net 4.5
What's New in .Net 4.5Malam Team
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerNopparat Nopkuat
 

Ähnlich wie Building Continuous Application with Structured Streaming and Real-Time Data Source with Arijit Tarafdar and Nan Zhu (20)

Sherlock Homepage - A detective story about running large web services - NDC ...
Sherlock Homepage - A detective story about running large web services - NDC ...Sherlock Homepage - A detective story about running large web services - NDC ...
Sherlock Homepage - A detective story about running large web services - NDC ...
 
Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)
 
Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...
Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...
Stephane Lapointe, Frank Boucher & Alexandre Brisebois: Les micro-services et...
 
GCF
GCFGCF
GCF
 
EEDC 2010. Scaling SaaS Applications
EEDC 2010. Scaling SaaS ApplicationsEEDC 2010. Scaling SaaS Applications
EEDC 2010. Scaling SaaS Applications
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...
 
Scaling asp.net websites to millions of users
Scaling asp.net websites to millions of usersScaling asp.net websites to millions of users
Scaling asp.net websites to millions of users
 
## Introducing a reactive Scala-Akka based system in a Java centric company
## Introducing a reactive Scala-Akka based system in a Java centric company## Introducing a reactive Scala-Akka based system in a Java centric company
## Introducing a reactive Scala-Akka based system in a Java centric company
 
What's New in .Net 4.5
What's New in .Net 4.5What's New in .Net 4.5
What's New in .Net 4.5
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload Scheduler
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 

Kürzlich hochgeladen (20)

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 

Building Continuous Application with Structured Streaming and Real-Time Data Source with Arijit Tarafdar and Nan Zhu

  • 1. Arijit Tarafdar Nan Zhu Azure HDInsight @ Microsoft Building Structured Streaming Connector for Continuous Applications
  • 2. Who Are We? • Nan Zhu – Software Engineer@Azure HDInsight. Work on Spark Streaming/Structured Streaming service in Azure. Committee Member of XGBoost@DMLC and Apache MxNet (incubator). Spark Contributor. • Arijit Tarafdar – Software Engineer@Azure HDInsight. Work on Spark/Spark Streaming on Azure. Previously worked with other distributed platforms like DryadLinq and MPI. Also worked on graph coloring algorithms which was contributed to ADOL-C (https://projects.coin-or.org/ADOL-C).
  • 3. Agenda • What is/Why ContinuousApplication? – What are the challenges when we already have Spark in hand? • Introduction of Azure Event Hubs and Structured Streaming • Key Design Considerations in Building Structured Streaming Connector • Test Structured Streaming Connector • Summary
  • 4. Continuous Application Architecture and Role of Spark Connectors (1) • Not only size of data is increasing, but also the velocity of data – Sensors, IoT devices, social networks and online transactions are all generating data that needs to be monitored constantly and acted upon quickly.
  • 5. Continuous Application Architecture and Role of Spark Connectors (2) Processing Engine Real-time Data Source Acting Components Continuous Raw Data Continuous Insights Block the fraud transactions, broadcast alerts, etc. Sensors, IoT Devices, Log Collectors, etc. Spark Streaming, Structured Streaming How to connect Spark and Real-time Data Sources?
  • 6. Building a Connector for Real-time Data Source and Structured Streaming taking Azure Event Hubs as an example (Comparing with Kafka Connector)
  • 7. What is Azure Event Hubs? Platform as a Service
  • 8. What is Structured Streaming? • A relatively new component in Spark – Introduced in Spark 2.0 – Keep evolving from Spark 2.0 – 2.2 • Streaming Analytic Engine for Structured Data – Sharing the same set of APIs with Spark SQL Batching Engine – Running computation over continuously arriving data and keep updating results
  • 9. Structured Streaming Abstraction in Structured Streaming Input Batch X, X+1, ... Batch X, X+1, ,… (Spark 2.1 Implementation) Source FS Source EventHubs Source … Describe Data Source: 1. getOffset: offset of the last message to be processed in next batch 2. getBatch: build DataFrame for the next batch Processing Spark SQL Core getOffset/get Batch User-Defined Operations Define Data Transforming Task with Spark SQL APIs (Structured Streaming Internal) Output Sink FS Sink Console Sink … Describe Data Sink 1. addBatch: evaluate DataFrame and save data to target system Still follow Micro-Batch streaming model
  • 10. Implementation of a “Source” Description Implementation getOffset() Return the last offset of next batch EndOffsetOfLastBatch “+” RateControlFunc(…) getBatch(startOffset, endOffset) Build DataFrame for next Batch SS internal pases in startOffset and endOffset (results of getOffset) 1. How to represent Offset 2. How to define Rate (To avoid tracking size of each message, rate is usually defined as # of messages)
  • 11. Implementation of a “Source” Description Implementation getOffset() Return the last offset of next batch EndOffsetOfLastBatch “+” RateControlFunc(…) getBatch(startOffset, endOffset) Build DataFrame for next Batch SS internal pases in startOffset and endOffset (results of getOffset) 1. How to represent Offset 2. How to define Rate (To avoid tracking size of each message, rate is usually defined as # of messages)
  • 12. Various Forms of Message Offset • Consecutive Numbers – 0, 1, 2, 3, … – Examples: Kafka, MQTT – Server-side index mapping an integer to the actual offset of the message in disk space • Real offset – 0, sizeOf(msg0), sizeOf(msg0 + msg1), ... – Examples: Azure Event Hubs – No server-side index, passing offset as part of message to user
  • 13. How it brings difference? - Kafka Driver Executor Executor Kafka Batch 0: Mapping the partition of DataFrame to a offset range in server side: (endOffsetOfLastBatch:0, # of msgs: 1000) Batch 1: How many messages are to be processed in next batch, and where to start? (endOffsetOfLastBatch: 999, # of msgs: 1000)
  • 14. How it brings difference? - Event Hubs Driver Executor Executor Batch 0: How many messages are to be processed in next batch, and where to start? (endOffsetOfLastBatch:0, # of Msgs: 1000) Azure Event Hubs Batch 1: How many messages are to be processed in next batch, and where to start? (endOffsetOfLastBatch:?, endOffset: 1000) What’s the offset of 1000th message??? The answer appeared in Executor side (when Task receives the message (offset as part of message)) Build a Channel to Pass Information from Executor to Driver!!!
  • 15. Difference of KafkaSource and EventHubsSource getOffset EndOffsetOfLastBatch (known before the batch is finished) Calculate targetOffset of next Batch Collect the ending offsets of last batch from executors getBatch( startOffset, endOffset) StartOffset: EndOffsetOfLastBatch (passed-in value from SS internal) StartOffset: the collected values Azure Event Hubs More details about two-layered index to work with OffsetSeqLog in SS for offine discussion or QA
  • 16. How it brings difference? - Event Hubs Driver Executor Executor Batch 0: How many messages are to be processed in next batch, and where to start? (startOffset:0, # of Msgs: 1000) Azure Event Hubs Batch 1: How many messages are to be processed in next batch, and where to start? (startOffset:?, endOffset: 1000) What’s the offset of 1000th message??? The answer appeared in Executor side (when Task receives the message (offset as part of message)) Build a Channel to Pass Information from Executor to Driver!!! HOW?
  • 17. HDFS-Based Channel LastOffset_P1_Batch_i LastOffset_PN_Batch_i Source DataFrame .map() Transformed DataFrame What’s the next step??? Simply let Driver-side logic read the files? •APIs like DataFrame.head(x) evaluates only some of the partitions…Batch 0 generate 3 files, and Batch 1 generates 5 files… •You have to merge the latest files with the historical results and commit and then direct the driver-side logic to read No!!!
  • 18. HDFS-Based Channel LastOffset_P1_Batch_i LastOffset_PN_Batch_i EventHubsRDD Tasks .map() MapPartitionsRDD Tasks •APIs like DataFrame.head(x) evaluates only some of the partitions…Batch 0 generate 3 files, and Batch 1 generates 5 files… •You have to merge the latest files with the historical results and commit... •Ensure that all streams’ offset are committed transactionally •Discard the partially merged/committed results to rerun the batch
  • 19. Takeaways (1) • Data Source is not only designed for structured streaming • Accommodate Connector Implementation with Data Source Design • Message Addressing (Offset) in data source side is the Key Design Factor to be considered – Requirement of Additional Information Sharing Channel between Executors and Driver • Design to ensure the correctness of the channel
  • 20. Structured Streaming Unit Testing Framework • “Action” based structured streaming test flow – StartStream, AddData, CheckAnswer, AdvanceClock, StopStream, etc. • Kafka source unit tests use Kafka micro-service. • Unit test framework not usable as is.
  • 21. Unit Testing Event Hubs Source • No Kafka type micro-service available for Event Hubs. • Two simulated clients – Event Hubs AMQP and Event Hubs REST clients. • Replace both clients before first call to the Event Hubs source.
  • 22. Issues With Unit Test Framework • No separation between test setup and test execution. • Unreliable update of the Event Hubs clients before asynchronous call to StreamingQueryManager.startQuery(). • Task serialization issue with AddData as part of StreamTest.
  • 23. What Is The Asynchrony Problem? StreamingQueryManager.startQuery() : StreamingQuery StreamingQueryManager.createQuery() : StreamingQueryWrapper StreamingQueryWrapper .StreamExecution.start() EventHubsSource.setEventHubsClient() EventHubsSource .setEventHubsReceiver() EventHubs Test Code Spark Source Code Thread 1 Thread 2 spawned by Thread 1
  • 24. One Way To Solve Asynchrony Problem StreamingQueryManager.startQuery() : StreamingQuery StreamingQueryManager.createQuery() : StreamingQueryWrapper StreamingQueryWrapper .StreamExecution.start() EventHubsSource.setEventHubsClient() EventHubsSource .setEventHubsReceiver() EventHubs Test Code Spark Source Code Thread 1 Thread 2 spawned by Thread 1
  • 25. Solving Asynchrony in Client Substitutions (1) • Invoke createQuery val createQueryMethod = sparkSession.streams.getClass.getDeclaredMethods.filter(m => m.getName == "createQuery").head createQueryMethod.setAccessible(true) currentStream = createQueryMethod.invoke(…) • Set as active query val activeQueriesField = sparkSession.streams.getClass.getDeclaredFields.filter(f => f.getName == "org$apache$spark$sql$streaming$StreamingQueryManager$$activeQueries").head activeQueriesField.setAccessible(true) val activeQueries = activeQueriesField.get(sparkSession.streams). asInstanceOf[mutable.HashMap[UUID, StreamingQuery]] activeQueries += currentStream.id -> currentStream
  • 26. Solve Asynchrony in Client Substitutions (2) • Substitute Event Hubs clients in source val eventHubsSource = sources.head eventHubsSource.setEventHubClient(…) eventHubsSource.setEventHubsReceiver(…) • Start StreamExecution currentStream.start()
  • 27. What Is The Task Serialization Issue? SimulatedEventHubs StreamTest AddEventHubsData AddData TestEventHubsReceiver TestEventHubsReceiver TestEventHubsReceiver TestEventHubsReceiver TestEventHubsReceiver Partition Partition Partition Partition Partition EventHubsRDD
  • 28. What Is StreamTest Inheritance Model? SparkFunSuite StreamTest PlanTest QueryTest FunSuite
  • 29. Solving Task Serialization Issue • Separate StreamTest into two traits: – EventHubsStreamTest trait EventHubsStreamTest extends QueryTest with BeforeAndAfter with SharedSQLContext with Timeouts with Serializable { } – EventHubsAddData trait EventHubsAddData extends StreamAction with Serializable {}
  • 30. After All These Changes… testStream(sourceQuery)( StartStream(…), CheckAnswer(…), AddEventHubsData(…), AdvanceManualClock(…), CheckAnswer(…), StopStream, StartStream(…), CheckAnswer(…), AddEventHubsData(…), AdvanceManualClock(…), AdvanceManualClock(…), CheckAnswer(…))
  • 31. Summary/Takeaway(2) • We have a production grade Spark Streaming connector for Structured Streaming for the widely used Event Hubs as streaming source on Azure (https://github.com/hdinsight/spark-eventhubs) • Design, implementation and testing of Structured Streaming Connectors o Accommodate Connector Implementation with Data Source Design (message addressing) o Unit Test with Simulated Service o Asynchrony – the biggest enemy to a easy test o Clear Boundary between test setup and test execution
  • 32. Contributing Back To Community • Failed Recovery from checkpoint caused by the multi- threads issue in Spark Streaming scheduler https://issues.apache.org/jira/browse/SPARK-19280 One Realistic Example of its Impact: You are potentially getting wrong data when you use Kafka and reduceByWindow and recover from a failure • Data loss caused by improper post-batch-completed processing https://issues.apache.org/jira/browse/SPARK-18905 • Inconsistent Behavior of Spark Streaming Checkpoint https://issues.apache.org/jira/browse/SPARK-19233