Discretized Streams
Fault-Tolerant Streaming Computation at Scale
Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li,
Timothy Hunter, Scott Shenker, Ion Stoica
Motivation
Many big-data applications need to process large data streams in near-real time:
• Website monitoring
• Fraud detection
• Ad monetization
Such applications require tens to hundreds of nodes and second-scale latencies.
Challenge
• Stream processing systems must recover from failures
and stragglers quickly and efficiently
– More important for streaming systems than batch systems
• Traditional streaming systems don’t achieve these
properties simultaneously
Outline
• Limitations of Traditional Streaming Systems
• Discretized Stream Processing
• Unification with Batch and Interactive Processing
Traditional Streaming Systems
• Continuous operator model
– Each node runs an operator with in-memory mutable state
– For each input record, state is updated and new records are sent out
[Figure: a pipeline of nodes (node 1, node 2, node 3), each holding mutable state and forwarding records to the next]
• Mutable state is lost if node fails
• Various techniques exist to make state fault-tolerant
Fault-tolerance in Traditional Systems
Node Replication [e.g. Borealis, Flux]
• Separate set of “hot failover” nodes processes the same data streams
• Synchronization protocols ensure exact ordering of records in both sets
• On failure, the system switches over to the failover nodes
[Figure: the input streams feed both the primary nodes and the hot failover nodes, kept in lockstep by a sync protocol]
Fast recovery, but 2x hardware cost
Fault-tolerance in Traditional Systems
Upstream Backup [e.g. TimeStream, Storm]
• Each node maintains a backup of the records forwarded since the last checkpoint
• A “cold failover” node is maintained
• On failure, upstream nodes replay the backup records serially to the failover node to recreate the state
[Figure: upstream nodes buffer (back up) forwarded records and replay them to the cold failover node on failure]
Only need 1 standby, but slow recovery
Slow Nodes in Traditional Systems
Node Replication vs. Upstream Backup
[Figure: in both schemes, a slow node stalls the downstream pipeline]
Neither approach handles stragglers
Our Goal
• Scales to hundreds of nodes
• Achieves second-scale latency
• Tolerates node failures and stragglers
• Sub-second fault and straggler recovery
• Minimal overhead beyond base processing
Why is it hard?
Stateful continuous operators tightly integrate
“computation” with “mutable state”
This makes it harder to define clear boundaries where computation and state can be moved around
[Figure: a stateful continuous operator with in-memory mutable state, consuming input records and producing output records]
Dissociate computation from state
Make state immutable and break computation into
small, deterministic, stateless tasks
Defines clear boundaries where state and computation
can be moved around independently
[Figure: a chain of stateless tasks; each task takes the previous state and a new input (state 1 + input 1, state 2 + input 2, ...) and emits the next state]
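To make this concrete, here is a minimal Scala sketch (illustrative only; the State type and step function are not from the talk) of a stateless, deterministic task that maps the previous immutable state plus one input batch to the next state:

// Illustrative only: immutable state plus one input batch in, new immutable state out.
case class State(counts: Map[String, Long])

// A stateless, deterministic task: it can be re-run anywhere, any number of times,
// and always produces the same next state from the same inputs.
def step(prev: State, batch: Seq[String]): State = {
  val updated = batch.foldLeft(prev.counts) { (acc, url) =>
    acc.updated(url, acc.getOrElse(url, 0L) + 1L)
  }
  State(updated)
}

// Processing a stream is then just folding these tasks over successive batches.
val batches = Seq(Seq("a", "b"), Seq("a"))
val finalState = batches.foldLeft(State(Map.empty[String, Long]))(step)
// finalState.counts == Map("a" -> 2, "b" -> 1)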
Batch Processing Systems!
Batch Processing Systems
Batch processing systems like MapReduce divide
– Data into small partitions
– Jobs into small, deterministic, stateless map / reduce tasks
[Figure: an immutable input dataset flows through stateless map tasks into immutable map outputs, and then through stateless reduce tasks into an immutable output dataset]
Parallel Recovery
Failed tasks are re-executed on the other nodes in parallel
[Figure: after a node failure, the lost stateless map and reduce tasks are re-run in parallel on the remaining nodes, regenerating the missing partitions of the immutable output dataset]
Discretized Stream
Processing
Discretized Stream Processing
Run a streaming computation as a series of
small, deterministic batch jobs
Store intermediate state data in cluster memory
Try to make batch sizes as small as possible
to get second-scale latencies
Discretized Stream Processing
[Figure: the input stream is divided into intervals (time = 0 - 1, time = 1 - 2, ...); each interval's input is stored as a replicated, in-memory dataset, and batch operations on it produce the output / state stream as non-replicated, in-memory datasets]
Example: Counting page views
Discretized Stream (DStream) is a sequence of
immutable, partitioned datasets
– Can be created from live data streams or by applying bulk,
parallel transformations on other DStreams
views = readStream("http:...", "1 sec")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce((x,y) => x+y)
(readStream creates a DStream; map and runningReduce are transformations)
[Figure: the views, ones and counts DStreams across intervals t: 0 - 1 and t: 1 - 2, linked by map and reduce operations]
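For reference, a sketch of roughly the same page-view counting pipeline in the released Spark Streaming API (the socket source, host/port and checkpoint path are placeholders, and updateStateByKey stands in for the runningReduce above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageViewCounts").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))    // 1-second batch interval
    ssc.checkpoint("/tmp/pageview-checkpoints")           // required for stateful operations

    val views  = ssc.socketTextStream("localhost", 9999)  // one URL per line (placeholder source)
    val ones   = views.map(url => (url, 1L))
    val counts = ones.updateStateByKey[Long] { (batchOnes, prevCount) =>
      Some(prevCount.getOrElse(0L) + batchOnes.sum)       // running count per URL
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}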
Fine-grained Lineage
• Datasets track fine-grained
operation lineage
• Datasets are periodically
checkpointed asynchronously
to prevent long lineages
[Figure: lineage of the views, ones and counts datasets across intervals t: 0 - 1, 1 - 2 and 2 - 3, linked by map and reduce operations]
Parallel Fault Recovery
• Lineage is used to recompute partitions lost due to failures
• Datasets on different time steps recomputed in parallel
• Partitions within a dataset also recomputed in parallel
[Figure: lost partitions of the views, ones and counts datasets are recomputed in parallel, both across intervals (t: 0 - 1, 1 - 2, 2 - 3) and within each dataset]
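A toy Scala sketch (not Spark's actual internals; the Lineage type is invented for illustration) of why lineage enables parallel recovery: each lost partition depends only on its own recorded ancestry, so lost partitions can be recomputed by independent tasks:

// Toy lineage: a partition is either read from the (replicated) input or
// derived from a parent partition by a deterministic function.
sealed trait Lineage
case class Source(data: Seq[Int]) extends Lineage
case class Mapped(parent: Lineage, f: Int => Int) extends Lineage

// Recomputing one lost partition touches only its own ancestry, so partitions
// lost from different intervals (or within one dataset) are independent and
// can be regenerated by separate tasks running in parallel.
def recompute(l: Lineage): Seq[Int] = l match {
  case Source(data)      => data
  case Mapped(parent, f) => recompute(parent).map(f)
}

val lost = Seq(
  Mapped(Source(Seq(1, 2)), (_: Int) * 2),   // partition lost from interval t: 0 - 1
  Mapped(Source(Seq(3)), (_: Int) + 1)       // partition lost from interval t: 1 - 2
)
val recovered = lost.map(recompute)          // Seq(Seq(2, 4), Seq(4))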
Comparison to Upstream Backup
• Upstream Backup: the lost state is rebuilt by replaying the stream serially to one node
• Discretized Stream Processing: recovery exploits parallelism across time intervals and parallelism within a batch
Faster recovery than upstream backup, without the 2x cost of node replication
[Figure: serial replay onto a single node vs. parallel recomputation of the views, ones and counts partitions across intervals t: 0 - 1, 1 - 2, 2 - 3]
How much faster than Upstream Backup?
Recovery time = time taken to recompute and catch up
– Depends on available resources in the cluster
– Lower system load before failure allows faster recovery
[Figure: recovery time vs. system load; parallel recovery with 5 nodes is faster than upstream backup, and parallel recovery with 10 nodes is faster than with 5 nodes]
Parallel Straggler Recovery
• Straggler mitigation techniques
– Detect slow tasks (e.g. 2X slower than other tasks)
– Speculatively launch more copies of the tasks in parallel
on other machines
• Masks the impact of slow nodes on the progress of
the system
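A small sketch (assumed bookkeeping, not the real scheduler) of the rule described above: compare running tasks against the median time of completed tasks and speculatively launch copies of those that are, say, 2x slower:

case class RunningTask(id: Int, elapsedMs: Long)

// Return the running tasks that deserve a speculative copy on another machine.
def tasksToSpeculate(running: Seq[RunningTask],
                     completedMs: Seq[Long],
                     slowdownFactor: Double = 2.0): Seq[RunningTask] = {
  if (completedMs.isEmpty) Nil
  else {
    val median = completedMs.sorted.apply(completedMs.size / 2)
    running.filter(_.elapsedMs > slowdownFactor * median)
  }
}

// Example: with completed tasks around 500 ms, the 1800 ms task gets a copy;
// whichever copy finishes first wins, since tasks are stateless and outputs immutable.
val candidates = tasksToSpeculate(
  running     = Seq(RunningTask(7, 1800), RunningTask(8, 520)),
  completedMs = Seq(480, 510, 530))
// candidates == Seq(RunningTask(7, 1800))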
Evaluation
Spark Streaming
• Implemented using Spark processing engine*
– Spark allows datasets to be stored in memory, and
automatically recovers them using lineage
• Modifications required to reduce job launching overheads from seconds to milliseconds
[ *Resilient Distributed Datasets - NSDI, 2012 ]
How fast is Spark Streaming?
Can process 60M records/second on
100 nodes at 1 second latency
Tested with 100 4-core EC2 instances and 100 streams of text
[Figure: cluster throughput (GB/s) vs. number of nodes in the cluster, at 1 sec and 2 sec latency, for two workloads: Grep (count the sentences having a keyword) and WordCount over a 30 sec sliding window; throughput scales nearly linearly with cluster size]
How does it compare to others?
Throughput comparable to other commercial
stream processing systems
System            Throughput per core [ records / sec ]
Spark Streaming   160k
Oracle CEP        125k
Esper             100k
StreamBase        30k
Storm             30k
[ Refer to the paper for citations ]
How fast can it recover from faults?
Recovery time improves with more frequent checkpointing and more nodes
[Figure: batch processing time (s) of Word Count over a 30 sec window before and after a failure, with 30s and 10s checkpoint intervals on 20 and 40 nodes; processing time spikes at the failure and recovers fastest with 10s checkpoints and 40 nodes]
How fast can it recover from stragglers?
Batch processing time (s):
                               WordCount   Grep
No straggler                   0.55        0.54
Straggler, no speculation      3.02        2.40
Straggler, with speculation    1.00        0.64
Speculative execution of slow tasks masks the effect of stragglers
Unification with Batch and
Interactive Processing
Unification with Batch and
Interactive Processing
• Discretized Streams create a single programming and execution model for running streaming, batch and interactive jobs
• Combine live data streams with historic data
liveCounts.join(historicCounts).map(...)
• Interactively query live streams
liveCounts.slice("21:00", "21:05").count()
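A hedged sketch of the two patterns above using the released Spark Streaming API (function names here are placeholders; transform joins each live batch against a static historic RDD, and slice pulls the RDDs for a time range for ad-hoc queries):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Combine live data streams with historic data: join every live batch with a static RDD.
def combineWithHistory(liveCounts: DStream[(String, Long)],
                       historicCounts: RDD[(String, Long)]): DStream[(String, Long)] =
  liveCounts.transform { batch =>
    batch.join(historicCounts).map { case (url, (live, past)) => (url, live + past) }
  }

// Interactively query live streams: pull the RDDs for a time range and count the records.
def countBetween(liveCounts: DStream[(String, Long)], from: Time, to: Time): Long =
  liveCounts.slice(from, to).map(_.count()).sum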
App combining live + historic data
Mobile Millennium Project: Real-time estimation of
traffic transit times using live and past GPS observations
[Figure: GPS observations processed per second vs. number of nodes in the cluster]
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive
• Scales linearly with cluster size
Recent Related Work
• Naiad – Full cluster rollback on recovery
• SEEP – Extends continuous operators to enable
parallel recovery, but does not handle stragglers
• TimeStream – Recovery similar to upstream backup
• MillWheel – State stored in BigTable, transactions
per state update can be expensive
Takeaways
• Discretized Streams model streaming computation as a series of batch jobs
– Uses simple techniques to exploit parallelism in streams
– Scales to 100 nodes with 1 second latency
– Recovers from failures and stragglers very fast
• Spark Streaming is open source - spark-project.org
– Used in production by ~ 10 organizations!
• Large scale streaming systems must
handle faults and stragglers
Editor's Notes
1. There are many big data applications that need to process large streams of data and produce results in near real time. For example, a website monitoring system may want to watch websites for load spikes. A fraud detection system may want to monitor bank transactions in real time to detect fraud. An ad agency may want to count clicks in real time. The common property across a lot of such systems is that they require large clusters of tens to hundreds of nodes to handle the streaming load, and they require an end-to-end latency of a few seconds.
  2. And an obvious problem with building large systems is that you have to deal with node failures
3. And not just failures; they must deal with stragglers as well. It's important for streaming systems to recover from failures and stragglers quickly and efficiently. In fact it is more important for streaming systems than for batch systems, because you don't want your fraud detection system to be down for a long time. The problem is that traditional streaming systems don't achieve both these properties together.
  4. In this talk, I am going to first discuss the limitations of traditional streaming systems. Then I am going to elaborate about discretized stream processing. Finally, I am also going to talk about how this unifies stream processing with batch and interactive processing.
5. Traditional stream processing uses the continuous operator model, where every node in the processing pipeline continuously runs an operator with in-memory mutable state. As each input record is received, the mutable state is updated and new records are sent out to downstream nodes. The problem with this model is that the mutable state is lost if the node fails. To deal with this, various techniques have been developed to make this state fault-tolerant. I am going to divide them into two broad classes and explain their limitations.
6. First, systems like Borealis and Flux have used Node Replication, where a separate set of nodes processes the same stream along with the main nodes. These nodes act as hot failover nodes. Synchronization protocols ensure that both sets of nodes process the records in exactly the same order. When a node fails, the system immediately switches over to the failover nodes, and processing continues with little or no disruption. This gives fast recovery, but it comes at the cost of double the hardware. At large scale that may not be desirable: you may not want to run 200 nodes to handle the processing of 100 nodes.
7. The second method is called upstream backup, where each node maintains a backup of all the records it has forwarded to the downstream node, until the downstream node acknowledges that it has finished processing them, checkpointed the state, etc. A cold failover node is maintained, and when a node fails, the system replays the backed-up records serially to the failover node to recreate the lost state. While this only requires one standby node, it is slow to recover because of recomputation. So you can see that either you get fast recovery at high cost, or slow recovery for cheap.
8. Furthermore, let us consider what happens when there are stragglers. Suppose this node decides to slow down. This causes the downstream nodes to slow down as well. And in the case of replication, since the replicas need to be in sync, they effectively slow down as well. So neither approach handles stragglers.
9. So we want to build a streaming system with the following ambitious goals. Besides scaling to hundreds of nodes with second-scale latencies, the system should tolerate faults and stragglers. In particular, it should recover from faults and stragglers within seconds, which is hard for upstream backup to do. And it should use minimal extra hardware beyond normal processing, unlike node replication.
11. So we accepted the challenge and decided to understand why it is hard. That is because the continuous operator model tightly integrates computation with mutable state. This makes it harder to define clear boundaries in the computation and state such that they can be moved around.
12. Instead, what we propose is to dissociate the computation from the state. That is, make the state immutable, and break the computation into smaller, deterministic, stateless tasks. So to do a stateful operation, the system would take the previous state as input to a stateless task and generate the next state as output. And this continues with more tasks. This defines clear boundaries where state and computation can be moved around independently. Now this sounds very abstract, but the interesting thing is that existing systems already do this.
13. Systems like MapReduce divide the data into smaller partitions, and the job into small, deterministic, stateless map and reduce tasks. Let's walk through a MapReduce job. You start from an immutable dataset already divided into partitions. For each partition, the system runs a bunch of stateless map tasks which generate immutable map outputs. Then these are shuffled together, and stateless reduce tasks generate the immutable output dataset.
14. And because these tasks are stateless, they can be moved around easily. For example, suppose a node fails, some map tasks and some reduce tasks are lost, and some partitions of the output were not generated. The system can automatically re-run the map tasks in parallel on the other nodes, and then the reduce tasks, finally generating the necessary partitions. Let's see how we use this parallel recovery property in our stream processing model, which we call Discretized Stream Processing.
15. Let's see how to program against this model. The programming abstraction that we provide is called DStream, short for discretized stream, which is a sequence of immutable, partitioned datasets. These DStreams can either be created from the live data stream, as shown in the program (readStream creates a DStream from the live data stream of page views), or by applying bulk, parallel transformations on other DStreams, as shown by the map and running reduce. This program maintains a running count of the page views. The underlying datasets look like this: as you can see, the counts (the blue datasets) are the state datasets that add up the counts across batches.
16. The system keeps track of fine-grained, partition-level lineage. Now we don't want the lineage to grow too long, as that would cause a lot of recomputation, so we periodically checkpoint some state RDDs to stable storage and cut off the lineage. Note that these datasets are immutable, so checkpointing can be done asynchronously without blocking the computation.
17. This lineage is used to recover lost data partitions in parallel. For example, suppose some nodes die and a few partitions here and there get lost. Since these datasets are in different time intervals and don't depend on each other, they will be recomputed in parallel by running the map-reduce tasks that generated them. Similarly, partitions within the same dataset will also be recomputed in parallel. Now compare this to upstream backup…
18. Upstream backup replays the whole stream serially to one node to recreate the state that was lost. However, in our model, the lineage exposes everything that can be recomputed in parallel. Specifically, it exposes parallelism within a dataset as well as across time, that is, across datasets in different time intervals. This ensures faster recovery than upstream backup without incurring the double cost of node replication.
19. How much faster can we be than upstream backup? For this we did a theoretical analysis. We define the recovery time as the time taken to recompute the lost data and catch up with the live stream. This naturally depends on the available resources in the cluster, which in turn depend on the system load prior to the failure. The graph shows recovery time for different system loads, and naturally lower system load leads to faster recovery. Note that parallel recovery with 5 nodes is significantly faster than upstream backup, and parallel recovery with 10 nodes is faster than with 5 nodes. This is because with 10 nodes there is more parallelism to be used for fault recovery. This is another very nice property of this model: larger clusters allow you to recover faster.
20. The same principles can be used to recover from stragglers as well. Straggler mitigation techniques in batch processing typically do the following. First they detect slow or straggling tasks by some criterion, for example tasks running more than 2 times slower than other tasks in the same stage. These tasks are then speculatively re-executed by running more copies of them in parallel on other machines. Whichever copy finishes first wins. Since these tasks are stateless and their output immutable, running multiple copies of them and letting them race to the finish is semantically equivalent. This essentially masks the impact of slow nodes on the progress of the system.
21. To evaluate this model, we built Spark Streaming. It is built using the Spark processing engine. For those who are not familiar with Spark, it is a very fast batch processing engine that allows datasets to be stored in memory and automatically recovers them using lineage. We had to make significant modifications to Spark to reduce job launching overhead from seconds to milliseconds in order to achieve sub-second end-to-end latencies.
22. Spark Streaming can process 60 million records per second on 100 nodes with an end-to-end latency of 1 second. This was tested using 100 four-core EC2 instances and 100 streams of text. The workload was Grep, in which we count the number of sentences having a keyword. We also tried a stateful workload, WordCount over a 30-second sliding window; the throughput was slightly lower because of increased computation, but still in the ballpark of tens of millions of records per second. Note that in both cases the throughput of the cluster scales linearly with the size of the cluster. Furthermore, if you allow a 2-second latency, the throughput goes up a little more.
  23. If you compare with other commercial streaming systems, Spark Streaming has comparable per-core performance to numbers reported by Oracle CEP, Esper and StreamBase. Note that these 3 are single node systems, not distributed.
24. We tested how fast we can recover from faults. This graph shows the time taken to process batches before and after a failure. Obviously, right after the failure the batch processing time spikes, and it comes back down faster with more frequent checkpointing and with more nodes.
25. We also tested how we perform with stragglers by slowing down a node in the system. The graph shows that without the straggler the processing time for both workloads was about half a second. With the straggler it increased to 2 to 3 seconds. But with speculative execution it came back to levels comparable to the earlier half second. This shows that speculative execution can mask the effect of stragglers. No other system, to our knowledge, has shown this kind of evaluation.
26. Naiad, the paper you are going to hear about next, takes a completely different approach to unifying different processing models without sacrificing latency, which is great. But in terms of fault tolerance, the state is synchronously checkpointed, compared to our asynchronous checkpointing. And in case of failure, all the nodes in the cluster have to roll back to their previous checkpoints, which can be very costly. SEEP attempts to extend continuous operators to enable parallel recovery of their state, but it requires invasive rewriting of the operators. In our case we were simply able…