Stateful distributed stream
processing
Gyula Fóra
gyfora@apache.org
@GyulaFora
This talk
§ Stateful processing by example
§ Definition and challenges
§ State in current open-source systems
§ State in Apache Flink
§ Closing
Apache Flink Meetup @ MapR, 2015-08-27
Stateful processing by example
§ Window aggregations
• Total number of customers
in the last 10 minutes
• State: Current aggregate
§ Machine learning
• Fitting trends to the evolving
stream
• State: Model
Stateful processing by example
§ Pattern recognition
• Detect suspicious financial
activity
• State: Matched prefix
§ Stream-stream joins
• Match ad views and
impressions
• State: Elements in the window
Stateful operators
§ All these examples use a common processing
pattern
§ Stateful operator (in essence):
$f: (\mathit{in}, \mathit{state}) \to (\mathit{out}, \mathit{state})$
§ State hangs around and can be read and
modified as the stream evolves
§ Goal: Get as close as possible to this general
model while maintaining scalability and fault tolerance
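The operator signature above can be made concrete with a small plain-Java sketch. It is not tied to any streaming framework: `OutAndState`, `StatefulOperatorSketch`, `run` and `runningSum` are hypothetical names introduced only for illustration. The sketch threads the state returned by one call of f into the next call, using a running sum as the example operator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical pair type holding the operator's output and its next state.
final class OutAndState<O, S> {
    final O out;
    final S state;

    OutAndState(O out, S state) {
        this.out = out;
        this.state = state;
    }
}

final class StatefulOperatorSketch {
    // Applies f to each input in order, passing the state returned by one
    // call into the next call: f(in, state) -> (out, state).
    static <I, O, S> List<O> run(List<I> inputs, S initial,
                                 BiFunction<I, S, OutAndState<O, S>> f) {
        List<O> outputs = new ArrayList<>();
        S state = initial;
        for (I in : inputs) {
            OutAndState<O, S> result = f.apply(in, state);
            outputs.add(result.out);
            state = result.state;
        }
        return outputs;
    }

    // Example operator: the state is the sum so far, the output is the updated sum.
    static List<Integer> runningSum(List<Integer> inputs) {
        return StatefulOperatorSketch.<Integer, Integer, Integer>run(
                inputs, 0, (in, sum) -> new OutAndState<>(sum + in, sum + in));
    }

    public static void main(String[] args) {
        System.out.println(runningSum(Arrays.asList(1, 2, 3, 4))); // prints [1, 3, 6, 10]
    }
}
```

In a distributed system the loop is replaced by the runtime, and the open question is where `state` lives and how it survives failures.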
State-of-the-art systems
§ Most systems allow developers to
implement stateful programs
§ Trick is to limit the scope of 𝒇 (state access)
while maintaining expressivity
§ Issues to tackle:
• Expressivity
• Exactly-once semantics
• Scalability to large inputs
• Scalability to large states
Apache Storm (Trident)
§ States are available only in the Trident API
§ Dedicated operators for state updates and
queries
§ State access methods
• stateQuery(…)
• partitionPersist(…)
• persistentAggregate(…)
§ It’s very difficult to
implement transactional
states
Exactly-once guarantee
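To give a feel for why transactional states take care to implement, here is a simplified plain-Java sketch of the idea Trident documents for transactional state: each value is stored together with the id of the batch that last updated it, so a replayed batch can be recognised and skipped. This is not Trident's API. `TxCountStore`, `applyBatch` and `get` are hypothetical names, and the sketch assumes that a replayed batch carries the same transaction id and the same contents as the original.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store: keeps each count together with the id of the batch
// that last updated it, so a replayed batch is not applied twice.
final class TxCountStore {
    private static final class Versioned {
        final long txId;
        final long count;

        Versioned(long txId, long count) {
            this.txId = txId;
            this.count = count;
        }
    }

    private final Map<String, Versioned> store = new HashMap<>();

    // Adds delta to the count for key on behalf of batch txId.
    // Assumes batches are applied in order and replays are identical.
    void applyBatch(long txId, String key, long delta) {
        Versioned current = store.get(key);
        if (current != null && current.txId == txId) {
            return; // this batch was already applied to this key
        }
        long base = (current == null) ? 0L : current.count;
        store.put(key, new Versioned(txId, base + delta));
    }

    long get(String key) {
        Versioned current = store.get(key);
        return (current == null) ? 0L : current.count;
    }

    public static void main(String[] args) {
        TxCountStore counts = new TxCountStore();
        counts.applyBatch(1, "flink", 2);
        counts.applyBatch(1, "flink", 2); // replay of batch 1 is ignored
        counts.applyBatch(2, "flink", 3);
        System.out.println(counts.get("flink")); // prints 5
    }
}
```

The check only works if the source can replay exactly the same batch, which is part of what makes this style of state demanding in practice.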
Storm Word Count
Apache Spark Streaming
§ Stateless runtime by design
• No continuous operators
• UDFs are assumed to be stateless
§ State can be generated as a stream of
RDDs: updateStateByKey(…)
$f: (\mathrm{Seq}[\mathit{in}_k], \mathit{state}_k) \to \mathit{state}_k$
§ 𝒇 is scoped to a specific key
§ Exactly-once semantics
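Spark Streaming Word Count
// Assumes ssc (StreamingContext), wordDstream: DStream[(String, Int)] and
// initialRDD: RDD[(String, Int)] are defined elsewhere.
import org.apache.spark.HashPartitioner

// Per-key update: add this batch's counts to the previous total
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}
// Iterator-based form expected by this updateStateByKey overload
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) =>
  iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
val stateDstream = wordDstream.updateStateByKey[Int](
  newUpdateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  true, // rememberPartitioner
  initialRDD)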
Apache Samza
§ Stateful dataflow operators
(Any task can hold state)
§ State changes are stored
as a log by Kafka
§ Custom storage engines can
be plugged in to the log
§ 𝒇 is scoped to a specific task
§ At-least-once processing
semantics
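The log-based approach can be illustrated with a small plain-Java sketch. It is not Samza's API: `ChangelogStore` and `Change` are hypothetical names, and an in-memory list stands in for the durable Kafka changelog. Every write goes to both the local store and the log, and a restarted task rebuilds its local store by replaying the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key-value store backed by a changelog.
final class ChangelogStore {
    // One entry of the changelog: the key and its new value.
    static final class Change {
        final String key;
        final int value;

        Change(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Integer> local = new HashMap<>();
    private final List<Change> changelog; // stands in for a durable log

    // Restores local state by replaying the log; the last write per key wins.
    ChangelogStore(List<Change> changelog) {
        this.changelog = changelog;
        for (Change change : changelog) {
            local.put(change.key, change.value);
        }
    }

    // Writes to the local store and appends the change to the log.
    void put(String key, int value) {
        local.put(key, value);
        changelog.add(new Change(key, value));
    }

    Integer get(String key) {
        return local.get(key);
    }

    public static void main(String[] args) {
        List<Change> log = new ArrayList<>();
        ChangelogStore before = new ChangelogStore(log);
        before.put("flink", 1);
        before.put("flink", 2);
        ChangelogStore after = new ChangelogStore(log); // simulated restart
        System.out.println(after.get("flink")); // prints 2
    }
}
```

Replaying the log restores the store itself, but input processed after the last input checkpoint may be applied again, which is consistent with the at-least-once semantics noted above.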
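Samza Word Count
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class WordCounter implements StreamTask, InitableTask {
    // Some omitted details (init(), which obtains the store, and OUTPUT_STREAM)…
    private KeyValueStore<String, Integer> store;

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Get the current count
        String word = (String) envelope.getKey();
        Integer count = store.get(word);
        if (count == null) count = 0;
        // Increment, store and send
        count += 1;
        store.put(word, count);
        collector.send(
            new OutgoingMessageEnvelope(OUTPUT_STREAM, word, count));
    }
}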
What can we say so far?
§ Trident
+ Consistent state accessible from outside
– Only works well with idempotent states
– States are not part of the operators
§ Spark
+ Integrates well with the system guarantees
– Limited expressivity
– Immutability increases update complexity
§ Samza
+ Efficient log-based state updates
+ States are well integrated with the operators
– Lack of exactly-once semantics
– State access is not fully transparent
State in Apache Flink
§ Take what’s good, make it work, and add
some more
§ Clean and powerful abstractions
• Local (Task) state
• Partitioned (Key) state
§ Proper API integration
• Java: OperatorState interface
• Scala: mapWithState, flatMapWithState…
§ Exactly-once semantics by checkpointing
Flink Word Count
// count is None the first time a word is seen
words.keyBy(x => x).mapWithState {
  (word, count: Option[Int]) => {
    val newCount = count.getOrElse(0) + 1
    val output = (word, newCount)
    // Return the output element and the updated state
    (output, Some(newCount))
  }
}
Local State
§ Task scoped state access
§ Can be used to implement
custom access patterns
§ Typical usage:
• Source operators (offset)
• Machine learning models
• Use cyclic flows to simulate
global state access
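Local State Example (Java)
public class MySource extends RichParallelSourceFunction<Long> {
    // Omitted details (imports, and open(), where offset is obtained)
    private OperatorState<Long> offset;
    private volatile boolean isRunning;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        Object checkpointLock = ctx.getCheckpointLock();
        isRunning = true;
        while (isRunning) {
            // Hold the checkpoint lock so that the state update and the
            // emission happen atomically with respect to checkpoints
            synchronized (checkpointLock) {
                offset.update(offset.value() + 1);
                // ctx.collect(next);
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}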
Partitioned State
§ Key scoped state access
§ Highly scalable
§ Allows for incremental
backup/restore
§ Typical usage:
• Any per-key operation
• Grouped aggregations
• Window buffers
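One way to see why key-scoped state scales is the plain-Java sketch below. It assumes that keys are routed to parallel tasks by hash, which the slide does not state explicitly, and `PartitionedCounter`, `taskFor`, `process` and `keysHeldBy` are hypothetical names. Because every record for a key reaches the same task, each task only holds and updates the state for its own keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-key counter whose state is split across parallel tasks.
final class PartitionedCounter {
    // One independent state map per task; tasks never share keys.
    private final List<Map<String, Long>> taskState = new ArrayList<>();

    PartitionedCounter(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            taskState.add(new HashMap<>());
        }
    }

    // Assumed routing rule: hash the key onto one of the tasks.
    int taskFor(String key) {
        return Math.floorMod(key.hashCode(), taskState.size());
    }

    // Updates the count for key inside the task that owns it.
    long process(String key) {
        Map<String, Long> state = taskState.get(taskFor(key));
        long updated = state.getOrDefault(key, 0L) + 1;
        state.put(key, updated);
        return updated;
    }

    // Number of distinct keys whose state lives in the given task.
    int keysHeldBy(int task) {
        return taskState.get(task).size();
    }

    public static void main(String[] args) {
        PartitionedCounter counter = new PartitionedCounter(4);
        counter.process("berlin");
        counter.process("budapest");
        System.out.println(counter.process("berlin")); // prints 2
    }
}
```

Since the state of each key is independent, it can be backed up, restored, or moved key by key, which is what makes incremental backup and restore possible.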
Partitioned State Example (Scala)
case class Temp(city: String, temp: Double)

// Compute the current average of each city's temperature
temps.keyBy("city").mapWithState {
  (in: Temp, state: Option[(Double, Long)]) => {
    // State per city: (sum of temperatures, number of readings)
    val current = state.getOrElse((0.0, 0L))
    val updated = (current._1 + in.temp, current._2 + 1)
    val avg = Temp(in.city, updated._1 / updated._2)
    (avg, Some(updated))
  }
}
Exactly-once semantics
§ Based on consistent global snapshots
§ Algorithm designed for stateful dataflows
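The effect of a consistent snapshot can be illustrated with a deliberately small, sequential Java sketch. It is not Flink's algorithm, which coordinates snapshots across parallel tasks; it only assumes that the source position and the operator state are captured together and restored together. `CheckpointedSum` and `runWithFailure` are hypothetical names. After a simulated failure the operator rolls back to the snapshot and replays the input, so each element affects the final state exactly once.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical operator: sums its input, with checkpoint and restore.
final class CheckpointedSum {
    private int offset;            // position of the next input element
    private long sum;              // operator state
    private int checkpointOffset;  // snapshot of offset
    private long checkpointSum;    // snapshot of sum

    // Captures source position and state together.
    private void checkpoint() {
        checkpointOffset = offset;
        checkpointSum = sum;
    }

    // Rolls back both source position and state to the last snapshot.
    private void restore() {
        offset = checkpointOffset;
        sum = checkpointSum;
    }

    // Processes the input, snapshotting before element checkpointAt and
    // failing once just before element failAt.
    static long runWithFailure(List<Integer> input, int checkpointAt, int failAt) {
        CheckpointedSum op = new CheckpointedSum();
        boolean failed = false;
        while (op.offset < input.size()) {
            if (!failed && op.offset == checkpointAt) {
                op.checkpoint();
            }
            if (!failed && op.offset == failAt) {
                failed = true;
                op.restore(); // simulated crash and recovery
                continue;
            }
            op.sum += input.get(op.offset);
            op.offset++;
        }
        return op.sum;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(runWithFailure(input, 2, 4)); // prints 15
    }
}
```

If the snapshot held the state but not the matching source position, the replay would start from the wrong element and some inputs would be counted twice or lost.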
Detailed mechanism
Exactly-once semantics
§ Low runtime overhead
§ Checkpointing logic is separated from
application logic
Blog post on streaming fault tolerance
Summary
§ State is essential to many applications
§ Fault-tolerant streaming state is a hard
problem
§ There is a trade-off between expressivity and
scalability/fault tolerance
§ Flink tries to hit the sweet spot by:
• Providing very flexible abstractions
• Keeping good scalability and exactly-once
semantics
Thank you!
