An overview of the state management techniques employed in Apache Flink, including pipelined consistent snapshots and their use for reconfiguration, as presented at VLDB 2017.
State Management in Apache Flink: Consistent Stateful Distributed Stream Processing
1. Paris Carbone<parisc@kth.se> - KTH Royal Institute of Technology
Stephan Ewen<stephan@data-artisans.com> - data Artisans
Gyula Fóra<gyula.fora@king.com> - King Digital Entertainment Ltd
Seif Haridi<haridi@kth.se> - KTH Royal Institute of Technology
Stefan Richter<s.richter@data-artisans.com> - data Artisans
Kostas Tzoumas<kostas@data-artisans.com> - data Artisans
@vldb17
2. Overview
• The Apache Flink System Architecture
• Pipelined Consistent Snapshots
• Operations with Snapshots
• Large Scale Deployments and Evaluation
3. The Apache Flink Framework
[Diagram: the Flink stack. Labels: Libraries (SQL, Table, CEP, Graphs, ML); Core API (DataStream, DataSet); Dataflow Runtime; Setup, Runner, Cluster Backend, Metrics]
32. Synchronous Snapshots
• In use: Storm Trident and Spark Streaming
• A conservative approach, equivalent to batching
• Can cause unnecessary latency (master coordination)
• Processing is no longer continuous
• Forces many tasks to be idle
• Instead, in Apache Flink snapshots are pipelined
45. Pipelined Snapshots (cycles)
Problem: we cannot wait indefinitely for records in cycles
Solution: log the in-flight records within a cycle as part of the snapshot, and replay them upon recovery.
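The cycle handling above can be sketched in a few lines of Python. This is a toy model, not Flink's implementation: the class `LoopHead`, its method names, and the summing "state" are all hypothetical, chosen only to show that records arriving on the feedback edge between the snapshot and the returning marker are logged and later replayed.

```python
class LoopHead:
    """Toy task at the head of a cycle (hypothetical, for illustration only)."""

    def __init__(self):
        self.state = 0          # running sum stands in for operator state
        self.logging = False    # True while the marker travels around the cycle
        self.snapshot = None

    def on_marker_from_input(self):
        # Snapshot immediately instead of waiting for the cycle to drain.
        self.snapshot = {"state": self.state, "in_flight": []}
        self.logging = True

    def on_feedback_record(self, value):
        if self.logging:
            # In-flight record of the cycle: it becomes part of the snapshot.
            self.snapshot["in_flight"].append(value)
        self.state += value

    def on_marker_from_feedback(self):
        # The marker has completed the cycle: channel logging stops.
        self.logging = False


def recover(snapshot):
    """Restore the state, then replay the logged in-flight records."""
    head = LoopHead()
    head.state = snapshot["state"]
    for value in snapshot["in_flight"]:
        head.on_feedback_record(value)
    return head
```

Recovering from such a snapshot yields the state the task had once all logged records were applied, so no record inside the cycle is lost.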
46. Technique Highlights
• Offers exactly-once processing guarantees
• Issued periodically/externally by the user
• Naturally respects flow control mechanisms
• Channel state logging limited to cycles only
• Multiple epoch snapshots can be pipelined
• Can offer weaker at-least-once processing guarantees by simply dropping alignment (weaker guarantee vs. no alignment cost)
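As a rough illustration of alignment and of what dropping it costs, here is a minimal Python sketch. It assumes a toy task that sums its inputs; `AligningTask`, the `("marker", epoch)` tuple encoding, and the `align` flag are hypothetical names, not Flink APIs.

```python
from collections import deque


class AligningTask:
    """Toy task with several FIFO inputs (hypothetical, for illustration)."""

    def __init__(self, num_inputs, align=True):
        self.num_inputs = num_inputs
        self.align = align
        self.state = 0                      # running sum stands in for state
        self.blocked = set()                # inputs whose epoch marker arrived
        self.buffered = {i: deque() for i in range(num_inputs)}
        self.snapshots = {}                 # epoch -> copy of the state

    def receive(self, channel, item):
        is_marker = isinstance(item, tuple) and item[0] == "marker"
        if channel in self.blocked and (self.align or is_marker):
            # With alignment, anything behind a marker waits for the
            # remaining markers; it belongs to the next epoch.
            self.buffered[channel].append(item)
            return
        if is_marker:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                self._snapshot(item[1])
        else:
            self.state += item

    def _snapshot(self, epoch):
        self.snapshots[epoch] = self.state
        self.blocked.clear()
        pending = self.buffered
        self.buffered = {i: deque() for i in range(self.num_inputs)}
        for channel in range(self.num_inputs):
            for item in pending[channel]:   # release held-back items in order
                self.receive(channel, item)
```

With `align=True` the snapshot for an epoch contains only the effects of records that preceded the markers. With `align=False` a record that overtook a slower marker is already reflected in the snapshot and would be applied again after replay, which is the at-least-once behaviour mentioned above.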
48. Exactly-Once: Input and Processing
Important Assumptions
• Input streams are persisted with offset indexes (e.g., Kafka, Kinesis)
• Data Channels are FIFO and reliable (no loss)
Each epoch either completes or repeats
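The "completes or repeats" behaviour can be sketched as follows, assuming the input is a replayable log indexed by offset and the operator state is a running sum. The function name and the `fail_at_offset` parameter are hypothetical and exist only to inject a single failure.

```python
def process_with_epochs(log, epoch_size, fail_at_offset=None):
    """Sum a replayable log in epochs; roll back to the last snapshot on failure."""
    snapshot = (0, 0)            # (input offset, state) of the last completed epoch
    offset, state = snapshot
    failed = False
    while offset < len(log):
        if offset == fail_at_offset and not failed:
            failed = True
            # The incomplete epoch is discarded and repeated from its start:
            # the source rewinds to the snapshotted offset.
            offset, state = snapshot
            continue
        state += log[offset]
        offset += 1
        if offset % epoch_size == 0:
            snapshot = (offset, state)   # epoch completed
    return state
```

Because the offset and the state are restored together, the final result is the same with or without the injected failure.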
49. Exactly-Once Output
• Idempotency ~ repeated operations can be tolerated after recovery/rollback (works for mutable stores).
• Transactional Processing ~ requires a two-phase coordination. A snapshot completion eventually leads to an external commit (e.g., Flink’s HDFS RollingSink*).
[Diagram: epoch n in-progress; epochs n-1 and n-2 pending; epoch n-3 committed]
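A minimal sketch of the transactional variant, assuming a sink that keeps its output in three stages (in-progress, pending, committed). The class `TwoPhaseSink` and its callback names are hypothetical; they are not the API of Flink's RollingSink.

```python
class TwoPhaseSink:
    """Toy transactional sink (hypothetical names, for illustration only)."""

    def __init__(self):
        self.in_progress = []    # output of the current epoch, not yet visible
        self.pending = {}        # epoch -> output pre-committed at snapshot time
        self.committed = []      # externally visible output

    def write(self, record):
        self.in_progress.append(record)

    def on_snapshot(self, epoch):
        # Phase 1: the epoch's output is closed and becomes pending.
        self.pending[epoch] = self.in_progress
        self.in_progress = []

    def on_snapshot_complete(self, epoch):
        # Phase 2: the snapshot is globally complete, so every pending
        # epoch up to and including it can be committed externally.
        for e in sorted(k for k in self.pending if k <= epoch):
            self.committed.extend(self.pending.pop(e))

    def on_recovery(self, last_completed_epoch):
        # Commit what the completed snapshot covers, drop everything later;
        # the dropped epochs will be recomputed.
        self.on_snapshot_complete(last_completed_epoch)
        self.pending.clear()
        self.in_progress = []
```

Output only becomes visible once the snapshot covering it has completed, so a rollback never exposes results of an epoch that will be repeated.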
57. Reconfiguration: The Issue
[Diagram: state entries in remote storage (0x100: bob … 0x449: alice) to be redistributed on reconfiguration]
• Case I (full scan): scan remote storage for responsible keys. Too slow.
• Case II: include key locations in snapshot metadata (bob: 0x100, carol: 0x344, …, alice: 0x449, chuck: 0x630, …). Too much metadata.
62. Reconfiguration: Key Groups
Pre-partition state in hash(K) space, into key-groups
• Snapshot Metadata: contains a reference per stored key-group (less metadata)
• Reconfiguration: contiguous key-group allocation to available tasks (less IO)
Note: the number of key groups controls the trade-off between the metadata to keep and reconfiguration speed
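The key-group idea can be sketched as below. This is an illustration under stated assumptions: `zlib.crc32` stands in for whatever hash function the system uses, and the function names are hypothetical. The point is that a key's group is fixed regardless of parallelism, while each task owns a contiguous range of groups.

```python
import zlib


def key_group_of(key, num_key_groups):
    """Map a key to its key-group; independent of the current parallelism."""
    return zlib.crc32(key.encode("utf-8")) % num_key_groups


def task_of(key_group, num_key_groups, parallelism):
    """Assign key-groups to tasks in contiguous ranges."""
    return key_group * parallelism // num_key_groups


def key_group_range(task, num_key_groups, parallelism):
    """All key-groups owned by one task under a given parallelism."""
    return [g for g in range(num_key_groups)
            if task_of(g, num_key_groups, parallelism) == task]
```

With 8 key-groups, scaling from 2 to 4 tasks splits each task's range in half, so a task only reads the key-group ranges it now owns, and the snapshot metadata needs one reference per key-group rather than one per key.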
75. Large Scale Deployment at King
[Plot: total snapshotting time (sec, 0 to 250) vs. global state size (GB, 100 to 500); total time / snapshot (alignment + async copies); ~runtime overhead]
[Plot: total alignment time (msec, 0 to 1400) vs. parallelism (30, 50, 70) for PROC, WIN, OUT; alignment cost]
• #shuffles (keyby)
• parallelism
80. Teaser: More paper highlights
• We can use the same technique to coordinate externally managed state with snapshots.
• Epoch markers can act as on-the-fly reconfiguration points.
• Internals of asynchronous and incremental snapshots.