Structured Streaming
Spark Streaming 2.0
https://hadoopist.wordpress.com
Giri R Varatharajan
https://www.linkedin.com/in/girivaratharajan
What is Structured Streaming in Apache Spark
● Continuous data flow programming model, introduced in Spark 2.0
● Low latency and high throughput
● Exactly-once semantics: no duplicates
● Stateful aggregation over time, events, windows, and records
● A streaming platform built on top of Spark SQL
● Express streaming computations the same way as batch computations on Spark SQL DataFrames
● Alpha release shipped with Spark 2.0
● Supports HDFS and S3 sources now; support for Kafka, Kinesis, and other sources is coming soon
Spark Streaming < 2.0 Behavior
● Micro-batching: streams are called Discretized Streams (DStreams)
● Running aggregations must be specified explicitly with the updateStateByKey method
● Requires careful construction of fault tolerance
(Diagram: Micro Batching)
Streaming Model
● Live data streams keep appending rows to a DataFrame called the unbounded table.
● Incremental aggregates run on the unbounded table.
Spark Streaming 2.0 Behavior + Demo
● Continuous data flow: streams are appended to an unbounded table that exposes the DataFrame API.
● No special method is needed to run aggregates over time, windows, or records.
● See the network socket word count program.
● Results are written in Complete, Append, or Update output mode.
(Diagram: Continuous Data Flow. lines = input table, wordCounts = result table)
Streaming Model
// Socket stream: rows are read as they arrive on the NetCat channel
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
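The socket source yields a DataFrame with a single string column named `value`. The sketch below is one way the word count referred to in the demo could be completed: it splits each line into words, counts them, and prints the full result table to the console. The object name, the application name, and the `local[2]` master are illustrative assumptions, not taken from the slides.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object SocketWordCount {
  // lines = input table (one string column "value"); the returned DataFrame is the result table.
  // The same function works on a static (batch) DataFrame and on a streaming one.
  def wordCounts(lines: DataFrame): DataFrame =
    lines
      .select(explode(split(col("value"), " ")).as("word"))
      .groupBy("word")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SocketWordCount") // illustrative name
      .master("local[2]")         // assumption: local run for a demo
      .getOrCreate()

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Complete mode: the whole result table is printed after every trigger.
    val query = wordCounts(lines).writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

To try it, start a NetCat server with `nc -lk 9999` before launching the program, then type lines of text into the NetCat terminal.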
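Streaming Model
import org.apache.spark.sql.functions.window
import spark.implicits._  // enables the $"column" syntax
// words: DataFrame with columns "timestamp" and "word"; durations are strings such as "10 seconds"
// orderBy on a streaming aggregate is supported only in Complete output mode
val windowedCounts = words
  .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
  .count()
  .orderBy("window")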
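The windowed aggregation can be wrapped in a small function to make its inputs explicit. This is a minimal sketch, assuming `words` has a timestamp column named `timestamp` and a string column named `word`; the object and function names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object WindowedWordCount {
  // windowDuration and slideDuration are interval strings, e.g. "10 seconds" and "5 seconds".
  // With a slide shorter than the window, each event falls into several overlapping windows.
  def windowedCounts(words: DataFrame, windowDuration: String, slideDuration: String): DataFrame =
    words
      .groupBy(window(col("timestamp"), windowDuration, slideDuration), col("word"))
      .count()
      .orderBy("window") // on a streaming DataFrame this requires Complete output mode
}
```

With a 10-second window sliding every 5 seconds, an event stamped 00:00:03 is counted in both the window starting at 23:59:55 and the window starting at 00:00:00.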
Create/Read Streams
SparkSession.readStream
● File source (HDFS, S3; text, Parquet, CSV, JSON, etc.)
● Socket stream (NetCat)
● Kafka, Kinesis, and other input sources are still under development.
● DataStreamReader API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)
Outputting Streams
Dataset.writeStream
Output Sink Types:
● Parquet Sink: writes Parquet files to HDFS or S3
● Console Sink: prints to the terminal
● Memory Sink: in-memory table that can be queried interactively over time
● Foreach Sink
● DataStreamWriter API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)
Output Modes:
● Append Mode (default)
  ○ Only new rows are appended
  ○ Applicable only to non-aggregated queries (select, where, filter, join, etc.)
● Complete Mode
  ○ Outputs the whole result table to the sink
  ○ Applicable only to aggregated queries (groupBy, etc.)
● Update Mode
  ○ Only rows updated since the last trigger are written to the sink
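As an illustration of how a sink and an output mode are combined, the sketch below writes an aggregated stream to the memory sink in Complete mode. It assumes the caller passes a streaming DataFrame that already contains an aggregation; the object name, function name, and table name are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object MemorySinkExample {
  // Starts a query that keeps the full result table in an in-memory table.
  // `aggregated` is assumed to be a streaming DataFrame with an aggregation,
  // because Complete mode is only valid for aggregated queries.
  def startCompleteToMemory(aggregated: DataFrame, tableName: String): StreamingQuery =
    aggregated.writeStream
      .outputMode("complete") // rewrite the whole result table on every trigger
      .format("memory")       // memory sink: results land in a temporary in-memory table
      .queryName(tableName)   // table name used for interactive queries
      .start()
}
```

While the query runs, the table can be inspected from the same session, for example with `spark.sql("SELECT * FROM word_counts").show()` if `word_counts` was the name passed in.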
Checkpointing
● After a failure, recover the progress and state of the previous query and continue where it left off.
● Configure a checkpoint location on the DataStreamWriter returned by writeStream.
● Must be configured for the Parquet (file) sink.
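A checkpoint location is set through the `checkpointLocation` option of the DataStreamWriter. The following is a minimal sketch for a Parquet sink, assuming a non-aggregated streaming DataFrame (the default Append mode applies); the object name and both path parameters are placeholders chosen for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object CheckpointedParquetSink {
  // Writes `stream` as Parquet files under outputPath and records query progress
  // under checkpointPath, so a restarted query can continue where it left off.
  def start(stream: DataFrame, outputPath: String, checkpointPath: String): StreamingQuery =
    stream.writeStream
      .format("parquet")
      .option("path", outputPath)
      .option("checkpointLocation", checkpointPath)
      .start()
}
```

Restarting the same query with the same checkpoint path is what allows recovery; pointing a different query at an existing checkpoint directory should be avoided.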
Operations Not Yet Supported
● Sort, limit (first N rows), and distinct on input streams
● Joins between two streaming Datasets
● Outer joins (full, left, right) between two streaming Datasets
● ds.count() ⇒ use ds.groupBy().count() instead
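The last point can be sketched as follows. Calling `count()` directly on a streaming Dataset tries to return a single number immediately, which a stream cannot provide, whereas `groupBy().count()` returns a DataFrame holding a running count that can be written to a sink. The object and function names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object RunningCount {
  // groupBy() with no columns forms one global group, so count() yields a
  // single-row DataFrame with a "count" column instead of a plain Long.
  def runningCount(ds: DataFrame): DataFrame = ds.groupBy().count()
}
```

On a streaming input, the result would typically be written with Complete output mode so that the single row is refreshed on each trigger.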
Key Takeaways
● Structured Streaming is still experimental, but please try it out.
● Streaming events are gathered and appended to an infinite DataFrame (the unbounded table), and queries run on top of it.
● Development is very similar to development against the static DataFrame/Dataset APIs.
● Execute ad-hoc queries, run aggregates, update databases, track session data, prepare dashboards, etc.
● readStream returns an untyped DataFrame: the schema of a streaming DataFrame is checked only at run time.
● writeStream offers several output modes and output sinks. Always remember which output mode applies to which kind of query.
● Kafka, Kinesis, and MLlib integrations, sessionization, and watermarks are upcoming features under development by the open-source community.
● Structured Streaming is not recommended for production workloads at this point, even for file or socket streaming.
Thank You
Spark code is available on my GitHub:
https://github.com/vgiri2015/Spark2.0-and-greater/tree/master/src/main/scala/structStreaming
Other Spark-related repositories:
https://github.com/vgiri2015/spark-latest-v1
My blogs and learning in Spark:
https://hadoopist.wordpress.com/category/apache-spark/
