Streaming Systems – Part 2
Sandeep Malhotra
techBLEND Group Presentation Series, January 3, 2020
Please refer to Part 1 of this presentation series for streaming basics and message queues
Stream Processing Challenges
• Processing data as it arrives, with a limited amount of computing
resources
• Uncertainty in input arrival, which makes peak load hard to
predict
• Skew between the time data is generated and the time it arrives
• In a distributed system, splitting up the input stream into partitions
leads to each executor getting to see only a partial view of the
complete stream
Streaming Systems - Part 2 by Sandeep Malhotra 203/01/21
Window Aggregation
• A window represents a certain amount of data over a certain time
interval that we can perform computations on
Figure: events along a time axis, with one time window of a given duration marked.
Tumbling Window
• A grouping function applied once per period of length x
• Time periods are inherently consecutive and nonoverlapping
• Used when we need to produce aggregates of our data over regular
periods of time, with each period independent from previous periods
Figure: events along a time axis, grouped into consecutive windows k, k + 1, and k + 2.
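As a minimal sketch of the tumbling-window idea above, the following pure-Python snippet assigns each event timestamp to exactly one window and counts events per window. The function names, the integer timestamps, and the window size of 10 are assumptions made for illustration; they do not come from the slides or from any framework.

```python
from collections import Counter


def tumbling_window_start(timestamp, size):
    """Return the start of the single tumbling window containing `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_tumbling_window(timestamps, size):
    """Count events per window; windows are consecutive and never overlap."""
    return dict(Counter(tumbling_window_start(t, size) for t in timestamps))


# Hypothetical event timestamps (seconds) and a 10-second window.
counts = count_per_tumbling_window([1, 4, 9, 10, 14, 27], size=10)
print(counts)
```

Because each timestamp maps to one window start, the per-window aggregates are independent of each other.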
Sliding Window
• A grouping function over a time interval x, reported at every interval y
• Time periods overlap when y is shorter than x
Figure: events along a time axis, covered by overlapping windows k, k + 1, and k + 2.
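To make the overlap concrete, here is a small sketch that lists every sliding window containing a given event. It assumes windows of length `size` that start at multiples of `slide` and cover the half-open range [start, end); these conventions and the function name are illustrative assumptions, not something the slides specify.

```python
def sliding_windows_for(timestamp, size, slide):
    """Return all (start, end) windows of length `size`, starting at
    multiples of `slide`, whose range [start, end) contains `timestamp`."""
    windows = []
    start = timestamp - (timestamp % slide)  # latest window start <= timestamp
    while start > timestamp - size:          # window still covers the timestamp
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# Hypothetical values: 10-second windows reported every 5 seconds.
windows = sliding_windows_for(12, size=10, slide=5)
print(windows)
```

With `slide` equal to `size` the function returns a single window, which is the tumbling-window case.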
Sessions
• Sequences of events terminated by a gap of inactivity greater than
some timeout
Figure: events along a time axis, grouped into session windows k and k + 1.
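The session definition above can be sketched as follows: events are sorted by timestamp, and a new session starts whenever the gap to the previous event is greater than the timeout. The function name and the sample values are assumptions for illustration only.

```python
def sessionize(timestamps, timeout):
    """Group timestamps into sessions; a gap greater than `timeout`
    between consecutive events closes the current session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)   # still within the current session
        else:
            sessions.append([t])     # inactivity gap exceeded: new session
    return sessions


# Hypothetical event timestamps (seconds) and a 5-second inactivity timeout.
sessions = sessionize([1, 2, 3, 10, 11, 30], timeout=5)
print(sessions)
```

Unlike tumbling and sliding windows, the window boundaries here depend on the data itself rather than on the clock.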
Handling Time
• Event Time
• time at which events actually occurred
• Processing Time
• time at which events are observed in the system
The skew between event time and processing time is:
• Non-zero
• Dependent on the characteristics of the underlying input sources, execution
engine, hardware, etc.
Figure: plot with Event Time and Processing Time axes, with the processing-time lag and the event-time skew marked.
Windowing by Processing Time
Figure: plot with Event Time and Processing Time axes.
• Window boundaries are well defined
• Window contents are unrelated to when the events were generated
Windowing by Event Time
Figure: plot with Event Time and Processing Time axes.
• Window contents are related to when the events were generated
• No natural upper boundary that defines when the window ends
• Events can arrive late
• Events may not arrive at all
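The difference between the two windowing choices can be illustrated with a small sketch that groups the same events into 10-second tumbling windows, once by event time and once by processing time. The event values, the tuple layout, and the helper name `window_by` are made up for this example.

```python
from collections import defaultdict

# Hypothetical (event_time, processing_time) pairs in seconds.
# The last event was generated at t=5 but only observed at t=21.
EVENTS = [(1, 2), (8, 12), (11, 13), (5, 21)]


def window_by(events, key_index, size):
    """Group events into tumbling windows keyed by the chosen timestamp
    (index 0 = event time, index 1 = processing time)."""
    windows = defaultdict(list)
    for event in events:
        start = event[key_index] - event[key_index] % size
        windows[start].append(event)
    return dict(windows)


by_event_time = window_by(EVENTS, 0, 10)
by_processing_time = window_by(EVENTS, 1, 10)
print(by_event_time)
print(by_processing_time)
```

In the event-time grouping the late event joins window 0, which means that window could not be considered complete until processing time 21. In the processing-time grouping it simply lands in the window that was open when it arrived.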
Watermarks
Figure: plot with Event Time and Processing Time axes.
• The oldest timestamp that we will accept on the data stream at any given moment
• Its length is usually much larger than the average delay we expect in the delivery of events
• Closes the open boundary left by the definition of an event-time window
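A minimal sketch of how a watermark can decide which events are still accepted is shown below. It assumes one common convention, namely that the watermark trails the largest event time seen so far by a fixed allowed delay; the slides do not say how the watermark is computed, and the function name and sample values are hypothetical.

```python
def filter_late_events(event_times, allowed_delay):
    """Split events (given in arrival order) into accepted and dropped.

    Assumption: watermark = max event time seen so far - allowed_delay.
    Events older than the current watermark are dropped.
    """
    max_seen = None
    accepted, dropped = [], []
    for t in event_times:
        if max_seen is None or t > max_seen:
            max_seen = t
        watermark = max_seen - allowed_delay
        if t >= watermark:
            accepted.append(t)
        else:
            dropped.append(t)
    return accepted, dropped


# Hypothetical event times in arrival order, with a 5-second allowed delay.
accepted, dropped = filter_late_events([10, 12, 3, 20, 11, 15], allowed_delay=5)
print(accepted, dropped)
```

Once the watermark passes the end of an event-time window, no further events for that window are accepted, so its result can be emitted.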
Watermarks (contd.)
• Outputs are delayed for at least the length of the watermark
• The stream processor needs to store a lot of intermediate data and therefore
consumes a significant amount of memory, roughly
• the length of the watermark × the rate of arrival × message size
Too small a watermark => too many events are dropped, which may
produce severely incomplete results.
Too large a watermark => increased latency and resource needs
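The memory estimate above is a simple product, which the following sketch works through. The function name and the input figures (a 60-second watermark, 10,000 events per second, 200 bytes per message) are hypothetical numbers chosen only to show the arithmetic.

```python
def estimate_buffer_bytes(watermark_seconds, events_per_second, bytes_per_event):
    """Rough memory needed to hold events until the watermark releases them:
    length of the watermark x rate of arrival x message size."""
    return watermark_seconds * events_per_second * bytes_per_event


# Hypothetical workload: 60 s watermark, 10,000 events/s, 200-byte messages.
estimate = estimate_buffer_bytes(60, 10_000, 200)
print(f"{estimate / 1_000_000:.0f} MB")
```

Doubling the watermark length doubles both the estimate and the minimum output delay, which is the trade-off the two statements above describe.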
State Management
• Dependencies on previous message(s) and/or external data
• Two ways to maintain state
• Handle it yourself
• Use the state management services provided by your framework
• Can range from
• In-memory
• For simple operations
• Replicated queryable persistent storage
• Helps answer complicated questions
• Enables joining together different streams of data
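As a sketch of the "handle it yourself" option with in-memory state, the class below keeps a running count per key, so each result depends on the messages seen before it. The class name and keys are assumptions for illustration; this state lives only in the process and would be lost on a crash, which is why frameworks offer managed and persistent alternatives.

```python
class RunningCount:
    """Keeps a per-key counter in memory across messages."""

    def __init__(self):
        self._state = {}  # key -> number of messages seen so far

    def update(self, key):
        """Record one message for `key` and return its updated count."""
        self._state[key] = self._state.get(key, 0) + 1
        return self._state[key]


counter = RunningCount()
results = [counter.update(key) for key in ["a", "b", "a", "c"]]
print(results)
```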
Message Delivery Semantics
• At most once
• At least once
• Exactly once
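One way to see the difference between these guarantees is a consumer that receives at-least-once delivery, so a message may arrive more than once, and makes its processing idempotent by remembering message IDs. The sketch below assumes each message carries a unique ID; the function name and data are hypothetical, and real systems obtain exactly-once behaviour through mechanisms the slides do not detail.

```python
def process_at_least_once(deliveries):
    """Sum message amounts, applying each message ID only once even if
    the transport redelivers it."""
    seen_ids = set()
    total = 0
    for message_id, amount in deliveries:
        if message_id in seen_ids:
            continue  # duplicate caused by a retry; skip it
        seen_ids.add(message_id)
        total += amount
    return total


# Hypothetical deliveries: message 2 is delivered twice after a retry.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 1)]
total = process_at_least_once(deliveries)
print(total)
```

Without the deduplication step the duplicate would be counted twice. With at-most-once delivery, a lost message would instead be missing from the total.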
Fault Tolerance
• Data loss
• Data can be lost
• on the network
• because the stream processor or your job crashes
• Two common approaches
• state-machine
• the stream manager replicates the streaming job on independent nodes
• rollback recovery
• the stream processor periodically packages the state of our computation into what is called
a checkpoint
• Loss of Resource Management
• Streaming manager
• Application driver
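The rollback-recovery approach can be sketched with a job that periodically saves its state as a checkpoint and, after a crash, restores the last checkpoint and replays the input from the saved offset. The class, its fields, and the replayable list standing in for the input stream are all assumptions made for this illustration.

```python
import copy


class CheckpointedSum:
    """Sums a stream and checkpoints its state every N records."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.state = {"offset": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def process(self, value):
        self.state["total"] += value
        self.state["offset"] += 1
        if self.state["offset"] % self.checkpoint_every == 0:
            self._checkpoint = copy.deepcopy(self.state)  # package the state

    def recover(self):
        """Roll back to the last checkpoint; return the offset to replay from."""
        self.state = copy.deepcopy(self._checkpoint)
        return self.state["offset"]


stream = [1, 2, 3, 4, 5]           # hypothetical replayable input
job = CheckpointedSum(checkpoint_every=2)
for value in stream[:3]:           # a crash happens after the third record
    job.process(value)
resume_from = job.recover()        # state returns to the checkpoint at offset 2
for value in stream[resume_from:]:
    job.process(value)
print(resume_from, job.state["total"])
```

The third record is processed twice, once before the crash and once on replay, but because the state was rolled back first the final total is still correct. This relies on the source being able to replay records from a given offset.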
Approaches to Stream Processing
• Micro-batching
• Processing is done on a batch of records at fixed intervals, which approximates the
real-time notion of data processing
• Higher latency
• Gives an opportunity for optimization
• One-element-at-a-time
• Processing is done as soon as a record is received
• Almost real-time
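The two approaches can be contrasted with a small sketch that groups records into micro-batches by arrival time. It assumes records arrive sorted as (arrival time in milliseconds, value) pairs and that a batch covers one fixed interval; the names and values are illustrative only.

```python
from itertools import groupby


def micro_batches(records, interval_ms):
    """Yield lists of values, one list per fixed interval that has records.
    `records` must be sorted by arrival time."""
    for _, group in groupby(records, key=lambda record: record[0] // interval_ms):
        yield [value for _, value in group]


# Hypothetical records: (arrival time in ms, value).
records = [(100, "a"), (400, "b"), (1200, "c"), (2500, "d"), (2900, "e")]
batches = list(micro_batches(records, interval_ms=1000))
print(batches)
```

In a one-element-at-a-time engine, "a" would be processed at 100 ms. Here it waits until its interval closes, which is the source of the higher latency, while handling "a" and "b" together is where the optimization opportunity comes from.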
Stream Processing Model
Figure: an Event Stream (Data Source) feeds the Stream Processing System, which writes to an Output Stream (Data Sink).
Distributed Stream Processing Architecture
Figure: architecture diagram showing an Application Driver, a Streaming Manager, three Stream Processors, and three Data Source/Sink components.
Stream Processing Frameworks
• Samza
• Storm
• Spark Streaming
• Flink
• Kafka Streams
• Kinesis Analytics
Spark High Level Architecture
Figure: architecture diagram showing the Spark Driver (inside the Spark application, contains the Spark session), a Cluster Manager, three Spark Executors, and three Data Source/Sink components.
Spark Stream APIs
• Spark Streaming (DStream) API
• Computation is done on small batches of data collected from a stream in the form of
microbatches spaced at fixed time intervals
• RDD Based
• Structured Streaming API
• Offers the notion of continuous queries over an unbounded table that is constantly
updated with fresh records from the stream
• DataFrame based
• SQL query optimization support
Both stream APIs take a functional programming approach: they
declare the transformations and aggregations that operate on data streams,
assuming that those streams are immutable
Spark Streaming Model
Figure: for each micro-batch, Read (Streaming Source), then Process (Transform/Aggregate), then Write (Streaming Sink).
Structured Streaming Sources
• Socket Source
• Rate Source
• internal stream generator that produces a sequence of records at a
configurable frequency
• File Source
• Multiple formats are supported, such as CSV, JSON, and Parquet
• Kafka Source
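To illustrate what a rate-style source provides, here is a pure-Python analogue that generates (timestamp, value) records at a configurable frequency. This is not Spark's implementation or API; the generator name, its parameters, and the use of computed rather than wall-clock timestamps are assumptions made so the sketch stays deterministic.

```python
def rate_source(rows_per_second, start_time=0.0, limit=None):
    """Yield (timestamp, value) pairs spaced 1 / rows_per_second apart.
    `value` is a counter starting at 0; stops after `limit` rows if given."""
    value = 0
    while limit is None or value < limit:
        yield (start_time + value / rows_per_second, value)
        value += 1


# Hypothetical configuration: 4 rows per second, first 5 rows only.
rows = list(rate_source(rows_per_second=4, limit=5))
print(rows)
```

A generator like this is useful for testing a pipeline without any external system, which matches the role of the Rate Source in the list above.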
Structured Streaming Sinks
• Reliable Sinks
• File Sink
• Kafka Sink
• Experimentation Sinks
• Memory Sink
• Console Sink
Spark Streaming Hands-on
Thank You !!
