This document discusses the internal data structures Flink uses to support efficient checkpointing. It describes how Flink uses RocksDB, a log-structured merge-tree database, as a backend for asynchronous and incremental checkpoints. RocksDB keeps checkpoint overhead low by snapshotting its immutable on-disk data structures, and it enables incremental checkpoints because state changes between checkpoints can be detected from the creation and deletion of immutable sorted string tables (SSTables). A worked example illustrates how RocksDB interacts with the distributed filesystem and the job manager during incremental checkpointing. The document also explains how the heap state backend uses a copy-on-write hash map to support asynchronous checkpoints while detecting state modifications.
4. Flink State and Distributed Snapshots
[Diagram: source → stateful operation, illustrating „Asynchronous Barrier Snapshotting“.]
▪ Take state snapshot: synchronously trigger the state snapshot (e.g. copy-on-write).
5. Flink State and Distributed Snapshots
[Diagram: source → stateful operation → DFS; the processing pipeline continues while the snapshot is written.]
▪ Durably persist full snapshots asynchronously to DFS.
▪ For each state backend: one writer, n readers.
6. Challenges for Data Structures
▪ Asynchronous checkpoints:
▪ Minimize pipeline stall time while taking the snapshot.
▪ Keep overhead (memory, CPU, …) as low as possible while writing the snapshot.
▪ Support multiple parallel checkpoints.
8. Full Snapshots
[Diagram: the task-local state backend evolves through updates; at each checkpoint, the complete key/state (K/S) table is copied to the checkpoint directory (DFS).]
Checkpoint 1: {2→B, 4→W, 6→N}
Checkpoint 2: {2→B, 3→K, 4→L, 6→N}
Checkpoint 3: {2→Q, 3→K, 6→N, 9→S}
Each checkpoint writes the full table, so the DFS copy is identical to the task-local state at that point in time.
9. Incremental Snapshots
[Diagram: the same task-local evolution, but each checkpoint writes only a delta to the checkpoint directory (DFS); "-" marks a deletion.]
Local state at iCheckpoint 1: {2→B, 4→W, 6→N}, written: Δ(-, icp1) = {2→B, 4→W, 6→N}
Local state at iCheckpoint 2: {2→B, 3→K, 4→L, 6→N}, written: Δ(icp1, icp2) = {3→K, 4→L}
Local state at iCheckpoint 3: {2→Q, 3→K, 6→N, 9→S}, written: Δ(icp2, icp3) = {2→Q, 4→-, 9→S}
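To make the delta concrete, here is a minimal Java sketch (matching the figures, not Flink code) of how the logical difference between two checkpoints could be computed for a heap-style map; RocksDB instead derives the delta from created and deleted SST files, as later slides show. A key mapped to null encodes a deletion (the "4 -" entry above).

import java.util.HashMap;
import java.util.Map;

class DeltaSketch {
    // Computes Δ(previous, current): new or changed entries are copied,
    // deleted keys become tombstones (null values).
    static Map<Integer, String> delta(Map<Integer, String> previous,
                                      Map<Integer, String> current) {
        Map<Integer, String> d = new HashMap<>();
        current.forEach((k, v) -> {
            if (!v.equals(previous.get(k))) {
                d.put(k, v); // new or changed since the previous checkpoint
            }
        });
        previous.keySet().forEach(k -> {
            if (!current.containsKey(k)) {
                d.put(k, null); // deleted -> tombstone ("4 -" in the figure)
            }
        });
        return d;
    }
}

For example, delta(icp2, icp3) over the tables above yields {2→Q, 4→null, 9→S}, exactly the Δ(icp2, icp3) written to DFS.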
10. Incremental vs Full Snapshots
[Plot: checkpoint duration over state size (5 TB, 10 TB, 15 TB), comparing incremental and full snapshots.]
11. Incremental Recovery
[Diagram: to recover, the task-local state backend is rebuilt from the checkpoint directory (DFS) by combining the retained deltas: Δ(-, icp1) + Δ(icp1, icp2) + Δ(icp2, icp3).]
Δ(-, icp1) = {2→B, 4→W, 6→N}
Δ(icp1, icp2) = {3→K, 4→L}
Δ(icp2, icp3) = {2→Q, 4→-, 9→S}
Applying the deltas in order reproduces the state at iCheckpoint 3: {2→Q, 3→K, 6→N, 9→S}.
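Recovery is the inverse of the delta computation: replay the retained deltas from oldest to newest. A minimal Java sketch (illustrative, not Flink code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IncrementalRecovery {
    // A tombstone (null value) removes the key; everything else overwrites.
    static Map<Integer, String> restore(List<Map<Integer, String>> deltasOldestFirst) {
        Map<Integer, String> state = new HashMap<>();
        for (Map<Integer, String> delta : deltasOldestFirst) {
            delta.forEach((k, v) -> {
                if (v == null) {
                    state.remove(k); // tombstone: key was deleted
                } else {
                    state.put(k, v);
                }
            });
        }
        return state;
    }
}

Restoring from the three deltas above yields {2→Q, 3→K, 6→N, 9→S}, the local state at iCheckpoint 3.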
12. Challenges for Data Structures
▪ Incremental checkpoints:
▪ Efficiently detect the (minimal) set of state changes between two checkpoints.
▪ Avoid unbounded checkpoint history.
13. Two Keyed State Backends
HeapKeyedStateBackend:
• State lives in memory, on the Java heap.
• Operates on objects.
• Think of a hash map {key obj -> state obj}.
• Goes through ser/de during snapshot/restore.
• Async snapshots supported.
RocksDBKeyedStateBackend:
• State lives in off-heap memory and on disk.
• Operates on serialized bytes.
• Think of a K/V store {key bytes -> state bytes}.
• Based on a log-structured merge (LSM) tree.
• Goes through ser/de for each state access.
• Async and incremental snapshots supported.
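As a usage-level illustration, this is roughly how a job selects between the two backends, sketched against the Flink 1.x API of the era of this talk (the paths are placeholders):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap alternative: env.setStateBackend(new FsStateBackend("hdfs:///checkpoints"));
        // The boolean flag enables incremental checkpoints.
        env.setStateBackend(
            new RocksDBStateBackend("hdfs:///checkpoints", /* incremental */ true));

        env.enableCheckpointing(60_000); // checkpoint every minute
    }
}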
15. What is RocksDB?
▪ Ordered key/value store (bytes).
▪ Based on LSM trees.
▪ Embedded, written in C++, used via JNI.
▪ Write-optimized (all bulk sequential), with acceptable read performance.
16. RocksDB Architecture (simplified)
[Diagram: Memtable in memory holding K/V pairs such as (2→T), (8→W); local disk below.]
Memtable:
- Mutable memory buffer for K/V pairs (bytes).
- All reads/writes go here first.
- Aggregating (overwrites same unique keys).
- Asynchronously flushed to disk when full.
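The memtable behaviour can be illustrated with a toy sketch in plain Java (not RocksDB code; RocksDB operates on bytes, the sketch uses Integer/String keys and values for readability):

import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

class Memtable {
    private final TreeMap<Integer, String> buffer = new TreeMap<>();
    private final int capacity;

    Memtable(int capacity) { this.capacity = capacity; }

    // Writes aggregate: a second put for the same key overrides the first.
    // Returns true when the buffer is full and the caller should flush.
    boolean put(Integer key, String value) {
        buffer.put(key, value);
        return buffer.size() >= capacity;
    }

    // Reads check the memtable first (before any on-disk tables).
    String get(Integer key) { return buffer.get(key); }

    // "Flush": freeze the buffer into an immutable, key-sorted table.
    SortedMap<Integer, String> flushToSstable() {
        return Collections.unmodifiableSortedMap(new TreeMap<>(buffer));
    }
}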
17. RocksDB Architecture (simplified)
[Diagram: new Memtable in memory with (4→Q); on local disk, SSTable-(1) with (2→T), (8→W) plus an index (Idx) and Bloom filter (BF-1).]
- A flushed Memtable becomes an immutable SSTable.
- Sorted by key (unique).
- Read: first check the Memtable, then the SSTables.
- Bloom filter & index as optimisations.
18. RocksDB Architecture (simplified)
[Diagram: Memtable with (2→A), (5→N); on local disk, SSTable-(3) with (2→C), (7→-), SSTable-(2) with (4→Q), (7→S), and SSTable-(1) with (2→T), (8→W), each with its index and Bloom filter (BF-1..3).]
- Deletes are explicit entries (tombstones).
- Natural support for snapshots.
- Iteration: online merge.
- SSTables can accumulate.
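The resulting read path across the memtable and multiple SSTables can be sketched as follows (same toy types as above; TOMBSTONE is a stand-in for RocksDB's explicit delete entries):

import java.util.List;
import java.util.SortedMap;

class LsmReader {
    static final String TOMBSTONE = "\0DEL";

    // Check the memtable first, then SSTables from newest to oldest;
    // a tombstone hides any older entry for the same key.
    static String get(Integer key,
                      SortedMap<Integer, String> memtable,
                      List<SortedMap<Integer, String>> sstablesNewestFirst) {
        if (memtable.containsKey(key)) {
            String v = memtable.get(key);
            return TOMBSTONE.equals(v) ? null : v;
        }
        for (SortedMap<Integer, String> sst : sstablesNewestFirst) {
            // A real implementation consults the Bloom filter and index
            // here before touching the table's data blocks.
            if (sst.containsKey(key)) {
                String v = sst.get(key);
                return TOMBSTONE.equals(v) ? null : v;
            }
        }
        return null; // key absent in all levels
    }
}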
19. RocksDB Compaction (simplified)
[Diagram: SSTable-(3) {2→C, 7→-}, SSTable-(2) {4→Q, 7→S}, and SSTable-(1) {2→T, 8→W} are combined by a multiway sorted merge („log compaction“) into SSTable-(1, 2, 3) = {2→C, 4→Q, 8→W}; the tombstone for key 7 removes (rm) the older entry and is itself dropped.]
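The merge itself can be sketched like this (toy types again; real compaction streams the sorted files instead of materialising maps):

import java.util.SortedMap;
import java.util.TreeMap;

class Compaction {
    static final String TOMBSTONE = "\0DEL";

    // Merge tables newest-first: the newest version of each key wins,
    // and tombstones are dropped from the compacted output.
    @SafeVarargs
    static SortedMap<Integer, String> merge(SortedMap<Integer, String>... newestFirst) {
        TreeMap<Integer, String> out = new TreeMap<>();
        for (SortedMap<Integer, String> sst : newestFirst) {
            // putIfAbsent keeps the value from the newest table that has the key
            sst.forEach(out::putIfAbsent);
        }
        out.values().removeIf(TOMBSTONE::equals); // (rm): drop tombstones
        return out;
    }
}

Calling merge(sst3, sst2, sst1) on the tables above yields {2→C, 4→Q, 8→W}: the tombstone for key 7 removes SSTable-(2)'s entry and is then discarded.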
20. RocksDB Asynchronous Checkpoint (simplified)
▪ Flush the Memtable.
▪ Then create an iterator over all current SSTables (they are immutable).
▪ New changes go to the Memtable and future SSTables and are not considered by the iterator.
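Outside of Flink, the same idea is exposed by RocksDB's own Java API: a Checkpoint hard-links the current, immutable SST files into a separate directory, so taking it is cheap and later writes do not affect it. A minimal sketch, assuming the org.rocksdb Java bindings:

import org.rocksdb.Checkpoint;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class RocksDbSnapshot {
    // Creates a consistent, self-contained copy of the database state
    // in checkpointDir by hard-linking the immutable SST files.
    static void snapshot(RocksDB db, String checkpointDir) throws RocksDBException {
        try (Checkpoint checkpoint = Checkpoint.create(db)) {
            checkpoint.createCheckpoint(checkpointDir);
        }
    }
}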
21. RocksDB Incremental Checkpoint
▪ Flush the Memtable.
▪ For all SSTable files in the local working directory: upload files that are new since the last checkpoint to DFS, re-reference the other files.
▪ Do reference counting on the JobManager for uploaded files.
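A hypothetical sketch of the upload step (uploadToDfs and the bookkeeping set are illustrative assumptions, not Flink's actual classes):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class IncrementalUpload {
    private final Set<String> uploadedInPreviousCheckpoints = new HashSet<>();

    void checkpoint(Path localWorkingDir) throws IOException {
        Set<String> referenced = new HashSet<>();
        try (DirectoryStream<Path> ssts =
                 Files.newDirectoryStream(localWorkingDir, "*.sst")) {
            for (Path sst : ssts) {
                String name = sst.getFileName().toString();
                if (!uploadedInPreviousCheckpoints.contains(name)) {
                    uploadToDfs(sst);  // new since the last checkpoint
                }
                referenced.add(name);  // re-referenced in this checkpoint
            }
        }
        // The JobManager's reference counts would be updated here.
        uploadedInPreviousCheckpoints.clear();
        uploadedInPreviousCheckpoints.addAll(referenced);
    }

    private void uploadToDfs(Path file) { /* copy to DFS; placeholder */ }
}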
22. Incremental Checkpoints - Example
▪ Illustrates the (simplified) interactions between:
▪ the local RocksDB instance,
▪ the remote checkpoint directory (DFS),
▪ the Job Manager (checkpoint coordinator).
▪ In this example: the 2 latest checkpoints are retained.
37. Heap Backend: Chained Hash Map
Map<K, S> {
  Entry<K, S>[] table;
}
Entry<K, S> {
  final K key;
  S state;
  Entry next;
}
[Diagram: a Map<K, S> whose table buckets chain Entry objects (K1→S1, K2→S2) via next pointers.]
„Structural changes to map“ are easy to detect; „user modifies state objects“ is hard to detect.
38. Copy-on-Write Hash Map
Map<K, S> {
  Entry<K, S>[] table;
  int mapVersion;
  int requiredVersion;
  OrderedSet<Integer> snapshots;
}
Entry<K, S> {
  final K key;
  S state;
  Entry next;
  int stateVersion;
  int entryVersion;
}
[Diagram: the same chained structure (K1→S1, K2→S2), now with version fields to handle both the easy case („structural changes to map“) and the hard case („user modifies state objects“).]
39. Copy-on-Write Hash Map - Snapshots
(Map and Entry fields as defined on slide 38.)
Create Snapshot:
1. Flat array-copy of the table array.
2. snapshots.add(mapVersion);
3. ++mapVersion;
4. requiredVersion = mapVersion;
Release Snapshot:
1. snapshots.remove(snapVersion);
2. requiredVersion = snapshots.getMax();
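Transcribed into Java, the pseudocode above could look like this (a sketch, not Flink's actual CopyOnWriteStateTable; the empty-set default in release is an added assumption):

import java.util.Arrays;
import java.util.TreeSet;

class CowMap<K, S> {
    Entry<K, S>[] table;
    int mapVersion;
    int requiredVersion;
    final TreeSet<Integer> snapshots = new TreeSet<>();

    CowMap(Entry<K, S>[] table) { this.table = table; }

    Entry<K, S>[] createSnapshot() {
        Entry<K, S>[] snapshotTable = Arrays.copyOf(table, table.length); // 1. flat array-copy
        snapshots.add(mapVersion);                                        // 2.
        ++mapVersion;                                                     // 3.
        requiredVersion = mapVersion;                                     // 4.
        return snapshotTable;
    }

    void releaseSnapshot(int snapVersion) {
        snapshots.remove(snapVersion);                                    // 1.
        requiredVersion = snapshots.isEmpty() ? 0 : snapshots.last();     // 2. getMax()
    }

    static class Entry<K, S> {
        final K key;
        S state;
        Entry<K, S> next;
        int stateVersion;
        int entryVersion;
        Entry(K key, S state) { this.key = key; this.state = state; }
    }
}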
40. Copy-on-Write Hash Map - 2 Golden Rules
1. Whenever a map entry e is modified and e.entryVersion < map.requiredVersion, first copy the entry and redirect pointers that are reachable through a snapshot to the copy. Set e.entryVersion = map.mapVersion. Pointer redirection can trigger recursive application of rule 1 to other entries.
2. Before returning the state object s of entry e to a caller, if e.stateVersion < map.requiredVersion, create a deep copy of s and redirect the pointer in e to the copy. Set e.stateVersion = map.mapVersion. Then return the copy to the caller. Applying rule 2 can trigger rule 1.
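A simplified sketch of both rules on the lookup path, building on the CowMap sketch above (deepCopy is a placeholder for Flink's serializer-based copy, and the recursive pointer redirection of rule 1 is reduced to copying the affected entry within its chain):

class CowMapAccess {
    static <K, S> S getStateForModification(CowMap<K, S> map, K key, int bucket) {
        CowMap.Entry<K, S> e = map.table[bucket];
        CowMap.Entry<K, S> prev = null;
        while (e != null && !e.key.equals(key)) { prev = e; e = e.next; }
        if (e == null) return null;

        // Rule 1 (simplified): entry may still be reachable from a snapshot -> copy it.
        if (e.entryVersion < map.requiredVersion) {
            CowMap.Entry<K, S> copy = new CowMap.Entry<>(e.key, e.state);
            copy.next = e.next;
            copy.stateVersion = e.stateVersion;
            copy.entryVersion = map.mapVersion;
            if (prev == null) map.table[bucket] = copy; else prev.next = copy;
            e = copy;
        }
        // Rule 2: state object may be shared with a snapshot -> deep copy before returning.
        if (e.stateVersion < map.requiredVersion) {
            e.state = deepCopy(e.state);
            e.stateVersion = map.mapVersion;
        }
        return e.state;
    }

    static <S> S deepCopy(S state) {
        // Placeholder: real code would copy via the state's TypeSerializer.
        return state;
    }
}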
59. Copy-on-Write Hash Map Benefits
▪ Fine-granular, lazy copy-on-write.
▪ At most one copy per object, per snapshot.
▪ No thread synchronisation between readers and writer after the array-copy step.
▪ Supports multiple concurrent snapshots.
▪ Also implements efficient incremental rehashing.
▪ Future: could serve as a basis for incremental snapshots.
I would like to start with a very brief recap of how state and checkpoints work in Flink, to motivate the challenges they impose on the data structures of our backends.
The state backend is local to the operator, so we can operate at close to the speed of local variables in memory. But how can we get a consistent snapshot of the distributed state?
Users want to go to 10 TB or even up to 50 TB of state! Full snapshots at that scale mean significant overhead!
The beauty of full snapshots: they are self-contained and can easily be deleted or recovered from.
RocksDB is an embeddable persistent k/v store for fast storage.
If the user modifies the state under key 42 three times, only the latest change is reflected in the memtable.
The checkpoint coordinator lives in the JobMaster.
The state backend's working area is local memory / local disk.
The checkpointing area is DFS (stable, remote, replicated).
FLUSH the memtable as the first step, so that all deltas are on disk.
SST files are immutable; their creation and deletion is the delta.
We have a shared folder for the deltas (SSTables) that belong to the operator instance.
Important new step: we can only reference successful incremental checkpoints in future checkpoints!
New set of SST tables: 225 is missing, so is 227… we can assume they got merged and consolidated into 228 or 229.
Think of events from user interactions: the key is a user id and we store a click count for each user. We are talking about keyed state.
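The click-count example from the notes, sketched as a Flink keyed-state function (ClickEvent is a hypothetical input type; the ValueState lives in whichever backend, heap or RocksDB, is configured):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical event type: one record per user interaction.
class ClickEvent {
    long userId;
}

class ClickCounter extends RichFlatMapFunction<ClickEvent, Long> {
    private transient ValueState<Long> count; // keyed state: one value per user id

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("clicks", Long.class));
    }

    @Override
    public void flatMap(ClickEvent event, Collector<Long> out) throws Exception {
        Long current = count.value(); // null on first access for this key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(updated);
    }
}

// Usage: events.keyBy(e -> e.userId).flatMap(new ClickCounter())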