Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
As Apache Flink continues to push the boundaries of stateful stream processing as an integral part of its past releases, increasing numbers of users are starting to realize the potential of stateful stream processing as a promising paradigm for robust and reactive data analytics as well as event-driven applications.

This talk covers the general idea and motivations of stateful stream processing, and how Flink enables it with its set of state management features and programming APIs. We will also look at the recent advancements in Flink's state management and large state handling that were driven by our team at data Artisans in the latest version 1.3 (expected release by end of May / early June).


  1. Stateful Stream Processing with Apache Flink
     Tzu-Li (Gordon) Tai, @tzulitai, tzulitai@apache.org
     Flink Meetup @ Idealo GmbH
  2. Original creators of Apache Flink®. Providers of the dA Platform, a supported Flink distribution.
  3. This will be about ...
     ▪ Why and what is stateful stream processing?
     ▪ How Flink enables stateful stream processing: Runtime & API
     ▪ How Flink has progressed w.r.t. stateful stream processing
  4. (no text)
  5. The ol’ traditional batch way
     ● Continuously ingesting data
     ● Time-bounded batch files
     ● Periodic batch jobs
  6. The ol’ traditional batch way (II)
     ● Consider computing a conversion metric (# of A → B per hour)
     ● What if the conversion crossed time boundaries? → carry intermediate results to the next batch
     ● What if events come out of order? → ???
  7. The ideal way: accumulate state in a long running computation
     ● View of the “history” of the input stream
     ● Counters, in-progress windows
     ● Parameters of incrementally trained ML models, etc.
     ● State influences the output
     ● Output depending on notions of time
     ● Outputs when results are complete
  8. The ideal way (II)
     A stateful stream processor that handles, consistently, robustly, and efficiently:
     ● Large distributed state
     ● Time / order / completeness
     Consequences:
     ● Stateful stream processing as a new paradigm to continuously process continuous data
     ● Produce accurate results
     ● Results being available in real time is only a natural consequence of the model
  9. Stream Processing
     ● Your code processes records one at a time
     ● Long running computation, on an endless stream of input
  10. Distributed Stream Processing
      ● Partitions input streams by some key in the data
      ● Distributes computation across multiple instances
      ● Each instance is responsible for some key range
  11. Stateful Stream Processing
      ● Process: update local variables/structures
      ● var x = … if (condition(x)) { … }
  12. Stateful Stream Processing
      ● Process: update local variables/structures
      ● var x = … if (condition(x)) { … }
      ● Embedded local state backend
      ● State co-partitioned with the input stream by key
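
Slides 6 and 11 to 12 can be tied together with a small example. The following is a minimal sketch in plain Java (no Flink dependency) of "process a record, update local state" applied to the A → B conversion metric from slide 6. The class name `ConversionCounter`, the event types `"A"` and `"B"`, and the per-user matching rule are assumptions made for illustration; they are not taken from the talk.

```java
import java.util.HashSet;
import java.util.Set;

public class ConversionCounter {
    // Local state: users that triggered A and have not converted yet.
    private final Set<String> pendingA = new HashSet<>();
    private long conversions = 0;

    /** Processes one record at a time and updates the local state. */
    public void onEvent(String userId, String type) {
        if ("A".equals(type)) {
            pendingA.add(userId);
        } else if ("B".equals(type) && pendingA.remove(userId)) {
            // B only counts as a conversion if the same user did A before.
            conversions++;
        }
    }

    public long conversions() {
        return conversions;
    }

    public static void main(String[] args) {
        ConversionCounter counter = new ConversionCounter();
        String[][] events = {
            {"u1", "A"}, {"u2", "A"}, {"u1", "B"}, {"u3", "B"}, {"u2", "B"}
        };
        for (String[] e : events) {
            counter.onEvent(e[0], e[1]);
        }
        System.out.println(counter.conversions()); // prints 2
    }
}
```

Because the pending set lives as long as the computation, a conversion that would straddle two batch files in the periodic batch setup needs no special handling here.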
  13. State fault tolerance
      Fault tolerance concerns for a stateful stream processor:
      ● How to ensure exactly-once semantics for the state?
  14. Fault tolerance: simple case
      ● Event log → single process (main memory)
      ● Periodically snapshot the memory
  15. Fault tolerance: simple case
      ● Recovery: restore snapshot and replay events since snapshot
      ● Event log persists events (temporarily)
      ● Single process
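
The simple case on slides 14 and 15 can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Flink code: the log is an in-memory list, the state is a per-key counter, and the names `SnapshotReplay`, `Snapshot`, `apply`, and `recover` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotReplay {
    /** A snapshot pairs a copy of the state with the log offset it reflects. */
    static final class Snapshot {
        final Map<String, Integer> counts;
        final int offset;

        Snapshot(Map<String, Integer> counts, int offset) {
            this.counts = new HashMap<>(counts); // defensive copy of the memory
            this.offset = offset;
        }
    }

    /** Applies the events log[from, to) to the state, counting events per key. */
    static void apply(List<String> log, int from, int to, Map<String, Integer> state) {
        for (int i = from; i < to; i++) {
            state.merge(log.get(i), 1, Integer::sum);
        }
    }

    /** Recovery: restore the snapshot, then replay every event logged after it. */
    static Map<String, Integer> recover(Snapshot snapshot, List<String> log) {
        Map<String, Integer> state = new HashMap<>(snapshot.counts);
        apply(log, snapshot.offset, log.size(), state);
        return state;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "a", "c", "a");

        Map<String, Integer> state = new HashMap<>();
        apply(log, 0, 3, state);                    // process the first three events
        Snapshot snapshot = new Snapshot(state, 3); // periodic snapshot at offset 3
        apply(log, 3, 4, state);                    // progress that a crash would lose

        Map<String, Integer> recovered = recover(snapshot, log);
        System.out.println(recovered.get("a"));     // prints 3
    }
}
```

Each event is reflected in the recovered state exactly once because the snapshot records the log offset it corresponds to; replay starts at that offset, never before or after it.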
  16. State fault tolerance
      Fault tolerance concerns for a stateful stream processor:
      ● How to ensure exactly-once semantics for the state?
      ● How to create consistent snapshots of distributed embedded state?
      ● More importantly, how to do it efficiently without interrupting computation?
  17. State fault tolerance (II)
      ● Consistent snapshotting:
  18. State fault tolerance (II)
      ● Consistent snapshotting: (diagram: the state of each instance is checkpointed to a file system)
  19. State fault tolerance (III)
      (diagram: state is restored from the file system)
      ● Recover all embedded state
      ● Reset position in input stream
  20. State management
      Two important management concerns for a long-running job:
      ● Can I change / fix bugs in my streaming pipeline? How do I handle job downtime?
      ● Can I rescale (change parallelism of) my computation?
  21. About time ...
      When should results be emitted?
      ● Control for determining when the computation has fully processed all required events
      ● It’s mostly about time, e.g. have I received all events for 3 - 4 pm?
      ● Did event B occur within 5 minutes from event A?
      ● Wall clock time is not correct. Event-time awareness is required.
  22. So, in a nutshell ...
      ● State fault tolerance: exactly-once semantics, distributed state snapshotting
      ● Extremely large state (TB+ scale): out-of-core state, efficiently snapshotting large state
      ● State management: handling state w.r.t. job changes & rescales
      ● Event-time aware: controlling when results are complete and emitted w.r.t. event-time
  23. So, in a nutshell ...
      ● State fault tolerance: exactly-once semantics, distributed state snapshotting
      ● Extremely large state (TB+ scale): out-of-core state, efficiently snapshotting large state
      ● State management: handling state w.r.t. job changes & rescales
      ● Event-time aware: controlling when results are complete and emitted w.r.t. event-time
      Apache Flink, as a stateful stream processor, is fundamentally designed to address and provide exactly these ;)
  24. (no text)
  25. Apache Flink Stack
      ● DataStream API: stream processing
      ● DataSet API: batch processing
      ● Libraries
      ● Runtime: distributed streaming data flow
      Streaming and batch as first class citizens.
  26. Programming Model
      Source → Transformation (stateful computations) → Sink
  27. API and Execution
      // Source
      DataStream<String> lines = env.addSource(new FlinkKafkaConsumer010(…));
      // Transformation
      DataStream<Event> events = lines.map(line -> parse(line));
      // Transformation
      DataStream<Statistic> stats = events
          .keyBy("id")
          .timeWindow(Time.seconds(5))
          .apply(new MyAggregationFunction());
      // Sink
      stats.addSink(new BucketingSink(path));
      Streaming dataflow (parallel instances): Source [1], Source [2]; map() [1], map() [2]; keyBy()/window()/apply() [1], keyBy()/window()/apply() [2]; Sink [1]
  28. Distributed Runtime
  29. Levels of abstraction
      ● Process Function (events, state, time): low-level (stateful stream processing)
      ● DataStream API (streams, windows): stream processing & analytics
      ● Table API (dynamic tables): declarative DSL
      ● Stream SQL: high-level language
  30. Process Function Blueprint
      class MyFunction extends ProcessFunction[MyEvent, Result] {
        <STATE DECLARATIONS>
        <PROCESS>
        <ON TIMER CALLBACK>
      }
  31. State Declarations
      class MyFunction extends ProcessFunction[MyEvent, Result] {
        // declare state to use in the program
        lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)
        <PROCESS>
        <ON TIMER CALLBACK>
      }
  32. Process Element / Fire Timers
      class MyFunction extends ProcessFunction[MyEvent, Result] {
        // declare state to use in the program
        lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

        override def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
          // work with event and state
          (event, state.value) match { … }
          out.collect(…)  // emit events
          state.update(…) // modify state
          // schedule a timer callback
          ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
        }
        <ON TIMER CALLBACK>
      }
  33. Timer Callback
      class MyFunction extends ProcessFunction[MyEvent, Result] {
        // declare state to use in the program
        lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

        override def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
          // work with event and state
          (event, state.value) match { … }
          out.collect(…)  // emit events
          state.update(…) // modify state
          // schedule a timer callback
          ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
        }

        override def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
          // handle callback when the event-time or processing-time instant is reached
        }
      }
  34. Distributed Snapshots
      Coordination via markers, injected into the streams
  35. Distributed snapshots
      ● source → stateful operation
      ● State index (hash table or RocksDB)
      ● Events flow without replication or synchronous writes
  36. Distributed snapshots
      ● Trigger checkpoint
      ● Inject checkpoint barrier at the source
  37. Distributed snapshots
      ● Take state snapshot at the stateful operation
      ● Trigger state copy-on-write
  38. Distributed snapshots
      ● Durably persist snapshots asynchronously
      ● Processing pipeline continues
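
The marker idea from slides 34 to 38 can be illustrated with a toy, single-input simulation in plain Java. It is only a sketch: the names `BarrierDemo`, `Barrier`, and `CountingOperator` are hypothetical, the snapshot is an eager in-memory copy rather than an asynchronous copy-on-write persisted to durable storage, and barrier alignment across several inputs is not modelled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BarrierDemo {
    /** Hypothetical marker injected into the stream at the source. */
    static final class Barrier {
        final long checkpointId;

        Barrier(long checkpointId) {
            this.checkpointId = checkpointId;
        }
    }

    /** Toy stateful operator: counts records per key, snapshots on barriers. */
    static final class CountingOperator {
        final Map<String, Integer> counts = new HashMap<>();
        final Map<Long, Map<String, Integer>> snapshots = new HashMap<>();

        void process(Object element) {
            if (element instanceof Barrier) {
                // The state now reflects exactly the records that preceded the barrier.
                long id = ((Barrier) element).checkpointId;
                snapshots.put(id, new HashMap<>(counts));
            } else {
                counts.merge((String) element, 1, Integer::sum);
            }
        }
    }

    /** Feeds the whole stream, records and barriers alike, through one operator. */
    static CountingOperator run(List<Object> stream) {
        CountingOperator op = new CountingOperator();
        for (Object element : stream) {
            op.process(element);
        }
        return op;
    }

    public static void main(String[] args) {
        List<Object> stream = Arrays.asList("a", "b", new Barrier(1L), "a");
        CountingOperator op = run(stream);
        System.out.println(op.snapshots.get(1L).get("a")); // prints 1
        System.out.println(op.counts.get("a"));            // prints 2
    }
}
```

The operator keeps processing records after the barrier while the snapshot for checkpoint 1 stays frozen, which is the property the slides describe as "processing pipeline continues".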
  39. Time: Watermarks
      ● Also a coordination of special markers, called Watermarks
      ● A watermark of timestamp t indicates that no records with timestamp < t should be expected
  40. Time: Watermarks in parallel
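
The watermark rule on slide 39 can be made concrete with a toy event-time window counter in plain Java. This is a sketch under assumptions, not Flink's windowing implementation: windows are tumbling with a fixed size, timestamps are non-negative, and the names `WatermarkDemo`, `WindowCounter`, `onEvent`, and `onWatermark` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WatermarkDemo {
    /** Counts events per tumbling event-time window; emits on watermarks. */
    static final class WindowCounter {
        final long size;
        final TreeMap<Long, Integer> open = new TreeMap<>(); // window start -> count
        final List<String> emitted = new ArrayList<>();

        WindowCounter(long size) {
            this.size = size;
        }

        /** Assigns the event to the window [start, start + size) by its timestamp. */
        void onEvent(long timestamp) {
            long start = timestamp - (timestamp % size);
            open.merge(start, 1, Integer::sum);
        }

        /** Watermark t: no records with timestamp < t are expected any more. */
        void onWatermark(long t) {
            // A window is complete once its end is at or before the watermark.
            while (!open.isEmpty() && open.firstKey() + size <= t) {
                Map.Entry<Long, Integer> w = open.pollFirstEntry();
                emitted.add("[" + w.getKey() + "," + (w.getKey() + size) + ") -> " + w.getValue());
            }
        }
    }

    public static void main(String[] args) {
        WindowCounter counter = new WindowCounter(10);
        counter.onEvent(3);
        counter.onEvent(12);
        counter.onEvent(7);      // out of order, still lands in window [0,10)
        counter.onWatermark(10); // closes [0,10)
        System.out.println(counter.emitted); // prints [[0,10) -> 2]
    }
}
```

Results are emitted when the watermark says a window is complete, not when the wall clock does, so the out-of-order event with timestamp 7 is still counted.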
  41. State management: Savepoints
      ▪ A persistent snapshot of all state
      ▪ When starting an application, state can be initialized from a savepoint
      ▪ In-between savepoint and restore we can update Flink version or user code
  42. State management: Savepoints
      ▪ No stateless point-in-time
  43. State management: Savepoints
      ▪ Before downtime, take a savepoint of the job
  44. State management: Savepoints
      ▪ Recover the job with its state and catch up with event-time processing
      ▪ Event-time awareness allows faster-than-real-time reprocessing
  45. Rescaling State / Elasticity
      ▪ Similar to consistent hashing
      ▪ Split key space into key groups
      ▪ Assign key groups to tasks
      Key space: key group #1, key group #2, key group #3, key group #4
  46. Rescaling State / Elasticity
      ▪ Rescaling changes key group assignment
      ▪ Maximum parallelism defined by #key groups
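
The key-group scheme on slides 45 and 46 can be sketched as two small functions in plain Java. The hash function (`hashCode` with `floorMod`) is a stand-in, and the contiguous-range formula for mapping key groups to tasks is an assumption made for illustration; the names `KeyGroups`, `keyGroupFor`, and `taskFor` are hypothetical.

```java
public class KeyGroups {
    /** Maps a key to one of numKeyGroups key groups; independent of parallelism. */
    static int keyGroupFor(Object key, int numKeyGroups) {
        return Math.floorMod(key.hashCode(), numKeyGroups);
    }

    /** Assigns contiguous ranges of key groups to the parallel tasks. */
    static int taskFor(int keyGroup, int numKeyGroups, int parallelism) {
        return keyGroup * parallelism / numKeyGroups;
    }

    public static void main(String[] args) {
        int numKeyGroups = 4;
        for (int parallelism : new int[] {2, 4}) {
            StringBuilder line = new StringBuilder("parallelism " + parallelism + ":");
            for (int group = 0; group < numKeyGroups; group++) {
                line.append(" group ").append(group)
                    .append(" -> task ").append(taskFor(group, numKeyGroups, parallelism));
            }
            System.out.println(line);
        }
    }
}
```

A key never changes its key group, so rescaling only moves whole key groups between tasks. With more tasks than key groups some tasks would receive no key group at all, which is why the number of key groups bounds the useful parallelism.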
  47. (no text)
  48. Evolution of Programming APIs
  49. Evolution of Large State Handling
  50. (no text)
  51. TL; DR
      ● Stateful stream processing as a paradigm for continuous data
      ● Apache Flink is a sophisticated and battle-tested stateful stream processor with a comprehensive set of features
      ● Efficiency, management, and operational issues for state are taken very seriously
  52. The application hierarchy
      ● Event-driven applications (event sourcing, CQRS): stateful, event-driven, event-time-aware processing
      ● Stream Processing / Analytics (streams, windows, …)
      ● Batch Processing (data sets)
      Further reading: see “Building a Stream Processor for Event-Driven Applications” by Stephan Ewen
  53. Thank you! @tzulitai @ApacheFlink @dataArtisans
  54. We are hiring! data-artisans.com/careers
