1. Streaming Systems – Part 2
Sandeep Malhotra
techBLEND Group Presentation Series, January 3, 2020
Please refer to Part 1 of this presentation series for streaming basics and message queues
2. Stream Processing Challenges
• Processing data as it arrives, with a limited amount of computing
resources
• Uncertain input arrival rates, which make peak load hard to
predict
• Skew between the time data is generated and the time it arrives
• In a distributed system, splitting the input stream into partitions
means each executor sees only a partial view of the
complete stream
Streaming Systems - Part 2 by Sandeep Malhotra 03/01/21
3. Window Aggregation
• A window represents a certain amount of data over a certain time
interval that we can perform computations on
[Figure: events on a time axis, with a time window spanning a fixed duration]
4. Tumbling Window
• A grouping function applied once per period of length x
• Time periods are inherently consecutive and nonoverlapping
• Used when we need to produce aggregates of our data over regular
periods of time, with each period independent from previous periods
[Figure: events on a time axis, grouped into consecutive windows k, k + 1, and k + 2]
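The tumbling-window idea above can be sketched in a few lines of plain Python. This is only an illustration: the `(timestamp, payload)` event shape, the integer timestamps in seconds, and the function name `tumbling_window_counts` are assumptions, not part of the presentation.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per tumbling window of length `window_size`.

    `events` is an iterable of (timestamp, payload) pairs (hypothetical shape).
    Returns a dict mapping each window start to the number of events in it.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Every timestamp belongs to exactly one window:
        # [window_start, window_start + window_size)
        window_start = (timestamp // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)


events = [(1, "a"), (4, "b"), (10, "c"), (12, "d"), (25, "e")]
print(tumbling_window_counts(events, window_size=10))  # {0: 2, 10: 2, 20: 1}
```

Because the windows do not overlap, each event contributes to exactly one aggregate.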
5. Sliding Window
• A grouping function over a time interval x, reported every y
• Time periods overlap when y is smaller than x
[Figure: events on a time axis, covered by overlapping windows k, k + 1, and k + 2]
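A sliding window can be sketched as follows, assuming integer timestamps, a stream that starts at time 0, and window starts aligned to multiples of the slide interval. The function name and event shape are hypothetical.

```python
def sliding_window_counts(events, window_size, slide):
    """Count events in each window of length `window_size`, started every `slide`.

    `events` is an iterable of (timestamp, payload) pairs with integer
    timestamps (hypothetical shape). Windows are [start, start + window_size).
    """
    counts = {}
    for timestamp, _payload in events:
        # Latest window start (a multiple of `slide`) at or before the event
        start = (timestamp // slide) * slide
        # Walk back through every earlier window that still covers the event
        while start > timestamp - window_size and start >= 0:
            counts[start] = counts.get(start, 0) + 1
            start -= slide
    return dict(sorted(counts.items()))


events = [(1, "a"), (4, "b"), (10, "c"), (12, "d")]
print(sliding_window_counts(events, window_size=10, slide=5))  # {0: 2, 5: 2, 10: 2}
```

Unlike the tumbling case, one event can be counted in several windows: the events at times 10 and 12 fall into both the window starting at 5 and the one starting at 10.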
6. Sessions
• Sequences of events terminated by a gap of inactivity greater than
some timeout
[Figure: events on a time axis, grouped into windows k and k + 1]
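Session windows can be illustrated with a small sketch that splits timestamps wherever the gap of inactivity exceeds the timeout. The function name `sessionize` and the sample values are made up for illustration.

```python
def sessionize(timestamps, timeout):
    """Group event timestamps into sessions separated by gaps longer than `timeout`."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if the gap since its last event is small enough
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            # Gap of inactivity exceeded the timeout: start a new session
            sessions.append([ts])
    return sessions


print(sessionize([1, 2, 4, 20, 22, 50], timeout=5))  # [[1, 2, 4], [20, 22], [50]]
```

Note that session length is driven by the data, not fixed in advance as it is for tumbling and sliding windows.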
7. Handling Time
• Event Time
  • The time at which events actually occurred
• Processing Time
  • The time at which events are observed in the system
The skew between event time and processing time:
• Is non-zero
• Depends on the characteristics of the underlying input sources, execution
engine, hardware, etc.
[Figure: processing time plotted against event time, with the processing-time lag and event-time skew marked]
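As a rough illustration of the skew, the lag of each event can be computed as processing time minus event time. The `(event_time, processing_time)` pairs below are invented sample values in seconds.

```python
def processing_lag(events):
    """Return processing_time - event_time for each (event_time, processing_time) pair."""
    return [processing_time - event_time for event_time, processing_time in events]


# Hypothetical observations: (event_time, processing_time)
observed = [(100, 101), (101, 105), (102, 103)]

lags = processing_lag(observed)
print(lags)  # [1, 4, 1]

# Ordering by processing time shows events arriving out of event-time order
arrival_order = [event_time for event_time, _ in sorted(observed, key=lambda e: e[1])]
print(arrival_order)  # [100, 102, 101]
```

The lag varies per event, which is why events can reach the processor in a different order than they were generated.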
8. Windowing by Processing Time
[Figure: windowing by processing time; axes: event time and processing time]
• Window boundaries are well
defined
• Window contents are unrelated to
when the events were generated
9. Windowing by Event Time
[Figure: windowing by event time; axes: event time and processing time]
• Window contents related to when
the events were generated
• No natural upper boundary that
defines when the window ends
• Events can come late
• Events may not arrive at all
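To make the difference between the two windowing choices concrete, the sketch below buckets the same events once by event time and once by processing time. The `(event_time, processing_time, name)` tuples and the helper `window_by` are hypothetical.

```python
from collections import defaultdict


def window_by(events, time_index, size):
    """Group event names into tumbling windows keyed by the chosen time field.

    `time_index` selects which tuple field is used as the timestamp:
    0 for event time, 1 for processing time.
    """
    windows = defaultdict(list)
    for event in events:
        start = (event[time_index] // size) * size
        windows[start].append(event[2])
    return dict(windows)


# Hypothetical events: (event_time, processing_time, name)
events = [(1, 2, "a"), (8, 12, "b"), (9, 10, "c"), (11, 13, "d")]

print(window_by(events, 0, 10))  # by event time:      {0: ['a', 'b', 'c'], 10: ['d']}
print(window_by(events, 1, 10))  # by processing time: {0: ['a'], 10: ['b', 'c', 'd']}
```

Events "b" and "c" were generated in the first window but observed in the second, so the two strategies produce different aggregates from identical input.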
10. Watermarks
[Figure: watermark; axes: event time and processing time]
• The oldest timestamp that we will accept on
the data stream, at any given moment
• Its length (the allowed lateness) is usually much larger than the average
delay we expect in the delivery of the events
• Closes the open boundary left by the
definition of event-time window
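One common way to maintain a watermark, assumed here for illustration rather than taken from the presentation, is to track the largest event time seen so far and subtract a fixed allowed delay. Events older than that threshold are rejected as too late. The class name `WatermarkFilter` is hypothetical.

```python
class WatermarkFilter:
    """Accept or reject events based on a watermark.

    Assumption: watermark = max event time seen so far - allowed_delay.
    """

    def __init__(self, allowed_delay):
        self.allowed_delay = allowed_delay
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None  # no events seen yet, so nothing can be late
        return self.max_event_time - self.allowed_delay

    def accept(self, event_time):
        """Return True if the event is on time, False if it is older than the watermark."""
        watermark = self.watermark
        if watermark is not None and event_time < watermark:
            return False
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


wf = WatermarkFilter(allowed_delay=10)
results = [wf.accept(t) for t in [100, 105, 98, 120, 109, 111]]
print(results)       # [True, True, True, True, False, True]
print(wf.watermark)  # 110
```

The event at time 98 is late but still within the allowed delay, so it is kept. The event at time 109 arrives after the watermark has advanced to 110, so it is dropped.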
11. Watermarks (contd.)
• Outputs are delayed for at least the length of the watermark
• The stream processor needs to store a lot of intermediate data and, as
such, consumes a significant amount of memory that roughly
corresponds to
  • the length of the watermark × the rate of arrival × message size
Too small a watermark => too many events are dropped, which may
produce severely incomplete results.
Too large a watermark => increased latency and resource needs.
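The memory estimate above is simple arithmetic. The sketch below applies it to made-up numbers: a 10-minute watermark, 5,000 events per second, and 200-byte messages are all assumptions chosen only to show the scale.

```python
def watermark_state_bytes(watermark_seconds, events_per_second, bytes_per_event):
    """Rough memory estimate: watermark length x arrival rate x message size."""
    return watermark_seconds * events_per_second * bytes_per_event


# Hypothetical workload figures
estimate = watermark_state_bytes(
    watermark_seconds=600,
    events_per_second=5_000,
    bytes_per_event=200,
)
print(estimate)              # 600000000 bytes
print(estimate / 1_000_000)  # 600.0 (MB)
```

Doubling the watermark length doubles the estimate, which is the resource cost of choosing too large a watermark.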
12. State Management
• Dependencies on previous message(s) and/or external data
• Two ways to maintain state
  • Handle it yourself
  • Use the state management services provided by your framework
• State storage can range from
  • In-memory
    • For simple operations
  • Replicated, queryable, persistent storage
    • Helps answer complicated questions
    • Enables joining different streams of data
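A minimal sketch of the "handle it yourself", in-memory end of that range: a running average per key, where each result depends on previously seen messages. The class name `RunningAverage` and the sensor keys are hypothetical, and nothing here survives a crash.

```python
class RunningAverage:
    """Keeps per-key (count, total) state in memory across messages."""

    def __init__(self):
        self.state = {}  # key -> (count, total)

    def update(self, key, value):
        """Fold one message into the state and return the key's current average."""
        count, total = self.state.get(key, (0, 0.0))
        count, total = count + 1, total + value
        self.state[key] = (count, total)
        return total / count


avg = RunningAverage()
print(avg.update("sensor-1", 10))  # 10.0
print(avg.update("sensor-1", 20))  # 15.0
print(avg.update("sensor-2", 4))   # 4.0
```

A framework-managed state store plays the same role as the `state` dictionary, while adding persistence and replication.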
13. Message Delivery Semantics
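• At most once: a message may be lost, but is never processed twice
• At least once: a message is never lost, but may be processed more than once
• Exactly once: each message affects the result once and only once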
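One widely used way to get an exactly-once effect on top of at-least-once delivery is to make the consumer idempotent by remembering message IDs. The sketch below assumes every message carries a unique ID; the class name and the sample deliveries are invented, and this is not presented as the mechanism of any particular framework.

```python
class IdempotentCounter:
    """Sums message amounts while ignoring redelivered message IDs."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        """Apply the message once; return False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # redelivery after a retry, already applied
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# At-least-once delivery: "m1" and "m2" are each delivered twice
deliveries = [("m1", 5), ("m2", 7), ("m1", 5), ("m3", 1), ("m2", 7)]

counter = IdempotentCounter()
for message_id, amount in deliveries:
    counter.handle(message_id, amount)

print(counter.total)                   # 13
print(sum(a for _, a in deliveries))   # 25, the result without deduplication
```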
14. Fault Tolerance
• Data loss
  • Data can be lost
    • On the network
    • Because the stream processor or your job crashes
  • Two common approaches
    • State-machine
      • The stream manager replicates the streaming job on independent nodes
    • Rollback recovery
      • The stream processor periodically packages the state of our computation into what is called
        a checkpoint
• Loss of resource management
  • Streaming manager
  • Application driver
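Rollback recovery can be sketched with a toy job that sums a list of records, checkpoints its state every few records, and, after a simulated crash, restores the last checkpoint and replays from the saved offset. The function name, the state layout, and the `crash_at` parameter are hypothetical.

```python
import copy


def run_with_checkpoints(records, checkpoint_every, crash_at=None):
    """Sum `records`, checkpointing the state every `checkpoint_every` records.

    If `crash_at` is given, simulate one crash just before processing that
    index, roll back to the last checkpoint, and replay from its offset.
    Returns (final_total, number_of_records_replayed).
    """
    state = {"offset": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    replayed = 0
    crashed = False
    while state["offset"] < len(records):
        i = state["offset"]
        if crash_at is not None and i == crash_at and not crashed:
            crashed = True
            replayed = i - checkpoint["offset"]
            state = copy.deepcopy(checkpoint)  # rollback: discard uncheckpointed work
            continue
        state["total"] += records[i]
        state["offset"] = i + 1
        if state["offset"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)  # package the state as a checkpoint
    return state["total"], replayed


print(run_with_checkpoints([1, 2, 3, 4, 5, 6, 7], checkpoint_every=3, crash_at=5))
# (28, 2)
```

The result matches a crash-free run, at the cost of reprocessing the two records handled since the last checkpoint. This only works if the source can replay records from a given offset.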
15. Approaches to Stream Processing
• Micro-batching
  • Processing is done on a batch of records at fixed intervals, which approximates
    the real-time notion of data processing
  • Higher latency
  • Gives an opportunity for optimization
• One-element-at-a-time
  • Processing is done as soon as a record is received
  • Almost real-time
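The difference between the two approaches can be sketched by grouping records by arrival interval. The `(arrival_time, record)` pairs and the function name `micro_batches` are assumptions for illustration.

```python
def micro_batches(timed_records, interval):
    """Group (arrival_time, record) pairs into one batch per fixed interval."""
    batches = {}
    for arrival_time, record in timed_records:
        batch_id = arrival_time // interval
        batches.setdefault(batch_id, []).append(record)
    return [batches[batch_id] for batch_id in sorted(batches)]


# Hypothetical arrivals: (arrival_time_seconds, record)
timed_records = [(0, "a"), (1, "b"), (2, "c"), (5, "d"), (9, "e")]

batches = micro_batches(timed_records, interval=5)
print(batches)             # [['a', 'b', 'c'], ['d', 'e']]
print(len(batches))        # 2 processing calls with micro-batching
print(len(timed_records))  # 5 processing calls with one-element-at-a-time
```

Fewer, larger calls leave room for optimization across the batch, but record "a" waits until the end of its interval before being processed, which is where the higher latency comes from.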
16. Stream Processing Model
[Figure: an event stream (data source) feeds the stream processing system, which writes an output stream (data sink)]
17. Distributed Stream Processing Architecture
[Figure: application driver, streaming manager, and three stream processors, each connected to a data source/sink]
18. Stream Processing Frameworks
• Samza
• Storm
• Spark Streaming
• Flink
• Kafka Streams
• Kinesis Analytics
19. Spark High Level Architecture
[Figure: Spark driver (inside the Spark application, contains the Spark session), cluster manager, and three Spark executors, each connected to a data source/sink]
20. Spark Stream APIs
• Spark Streaming (DStream) API
  • Computation is done on micro-batches of data collected from the stream at
    fixed time intervals
  • RDD-based
• Structured Streaming API
  • Offers the notion of continuous queries over an unbounded table that is constantly
    updated with fresh records from the stream
  • DataFrame-based
  • SQL query optimization support
Both stream APIs take a functional programming approach: they
declare the transformations and aggregations that operate on data streams,
assuming that those streams are immutable
21. Spark Streaming Model
[Figure: micro-batch pipeline: read (streaming source), process (transform/aggregate), write (streaming sink)]
22. Structured Streaming Sources
• Socket Source
• Rate Source
  • Internal stream generator that produces a sequence of records at a
    configurable frequency
• File Source
  • Multiple formats are supported, such as CSV, JSON, and Parquet
• Kafka Source
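As a hedged sketch of how the rate source might be wired up in PySpark, the helper below reads from the built-in `rate` format and prints each micro-batch to the console. The function name `start_rate_stream` is hypothetical, and the `spark` argument is assumed to be an existing `pyspark.sql.SparkSession`.

```python
def start_rate_stream(spark, rows_per_second=5):
    """Start a console-printing streaming query fed by the rate source.

    `spark` is assumed to be a pyspark.sql.SparkSession. The rate source
    emits rows with a `timestamp` and a `value` column.
    """
    stream = (
        spark.readStream
        .format("rate")                            # built-in test source
        .option("rowsPerSecond", rows_per_second)  # configurable frequency
        .load()
    )
    return (
        stream.writeStream
        .format("console")     # print each micro-batch to stdout
        .outputMode("append")
        .start()
    )
```

With PySpark installed, this could be driven by creating a session with `SparkSession.builder.appName("rate-demo").getOrCreate()` and then calling `start_rate_stream(spark).awaitTermination(10)` to let the query run for about ten seconds. Swapping `"rate"` for another source format, with that source's own options, follows the same read, process, write shape.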