Deep Dive on Apache Flink State and Checkpointing

© 2019 Ververica
Seth Wiesman, Solutions Architect
Deep Dive on Apache Flink State

Agenda
• Serialization
• State Backends
• Checkpoint Tuning
• Schema Migration
• Upcoming Features

Flink’s Serialization System
• Natively Supported Types
  • Primitive Types
  • Tuples, Scala Case Classes
  • POJOs
• Unsupported Types Fall Back to Kryo
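Whether a class counts as a POJO decides if it gets the native serializer or falls back to Kryo. The sketch below approximates the commonly documented POJO rules with plain reflection. `PojoRules`, `looksLikePojo`, `Good`, and `NoSetter` are hypothetical names used only for illustration. The check is simplified (it ignores superclass fields and `is...` getters) and is not Flink's own type extraction logic.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/** Simplified, hypothetical check that approximates the documented POJO rules. */
final class PojoRules {

    static boolean looksLikePojo(Class<?> type) {
        // Rule 1: the class itself must be public.
        if (!Modifier.isPublic(type.getModifiers())) {
            return false;
        }
        // Rule 2: a public no-argument constructor must exist.
        try {
            type.getConstructor();
        } catch (NoSuchMethodException e) {
            return false;
        }
        // Rule 3: every non-static, non-transient field is public or has a getter and a setter.
        for (Field field : type.getDeclaredFields()) {
            int mods = field.getModifiers();
            if (field.isSynthetic() || Modifier.isStatic(mods) || Modifier.isTransient(mods)) {
                continue;
            }
            if (!Modifier.isPublic(mods) && !hasAccessors(type, field)) {
                return false;
            }
        }
        return true;
    }

    private static boolean hasAccessors(Class<?> type, Field field) {
        String name = field.getName();
        String suffix = Character.toUpperCase(name.charAt(0)) + name.substring(1);
        try {
            type.getMethod("get" + suffix);
            type.getMethod("set" + suffix, field.getType());
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Passes: public class, implicit public constructor, private field with accessors. */
    public static class Good {
        public int id;
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    /** Fails: the private field has no setter, so a type like this would fall back to Kryo. */
    public static class NoSetter {
        private String name;
        public String getName() { return name; }
    }
}
```

Running a check like this in a unit test is one way to notice an accidental Kryo fallback early.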

Flink’s Serialization System
Benchmark Results For Flink 1.8

Serializer                     Ops/s
PojoSerializer                 305 / 293*
RowSerializer                  475
TupleSerializer                498
Kryo                           102 / 67*
Avro (Reflect API)             127
Avro (SpecificRecord API)      297
Protobuf (via Kryo)            376
Apache Thrift (via Kryo)       129 / 112*

public static class MyPojo {
  public int id;
  private String name;
  private String[] operationNames;
  private MyOperation[] operations;
  private int otherId1;
  private Object someObject; // used with String
}

// Nested type held in MyPojo.operations
public static class MyOperation {
  int id;
  protected String name;
}

Custom Serializers
Registration with Kryo via ExecutionConfig
• registerKryoType(Class<?>)
  • Registers a type with Kryo for a more compact binary format
• registerTypeWithKryoSerializer(Class<?>, Class<? extends Serializer>)
  • Provides a default serializer for the given class
  • The provided serializer class must extend com.esotericsoftware.kryo.Serializer
• addDefaultKryoSerializer(Class<?>, Serializer<?> serializer)
  • Registers a serializer as the default serializer for the given type

Custom Serializers
@TypeInfo Annotation
@TypeInfo(MyTupleTypeInfoFactory.class)
public class MyTuple<T0, T1> {
  public T0 myfield0;
  public T1 myfield1;
}

public class MyTupleTypeInfoFactory extends TypeInfoFactory<MyTuple> {
  @Override
  public TypeInformation<MyTuple> createTypeInfo(Type t, Map<String, TypeInformation<?>> genericParameters) {
    // Build the type information from the type information of the two generic parameters
    return new MyTupleTypeInfo(genericParameters.get("T0"), genericParameters.get("T1"));
  }
}

State Backends

Task Manager Process Memory Layout
[Figure: the Task Manager JVM process, split into Java Heap and Off Heap / Native memory. Labeled regions: Flink Framework etc., Network Buffers, Timer State, Keyed State. The figure indicates the typical size of each region.]


Keyed State Backends
• Based on Java Heap Objects
• Based on RocksDB

Heap Keyed State Backend
• State lives as Java objects on the heap
• Organized as chained hash table, key ↦ state
• One hash table per registered state
• Supports asynchronous state snapshots
• Data is (de)serialized only during state snapshot and restore
• Highest performance
• Affected by garbage collection overhead / pauses
• Currently no incremental checkpoints
• High memory overhead of representation
• State is limited by available heap memory


Heap State Table Architecture
- Hash buckets (Object[]), 4B-8B per slot
- Load factor <= 75%
- Incremental rehash
Each Entry holds:
▪ 4 References, 4 x (4B-8B):
  ▪ Key (K)
  ▪ Namespace (N)
  ▪ State (S)
  ▪ Next
▪ 3 int, 3 x 4B:
  ▪ Entry Version
  ▪ State Version
  ▪ Hash Code
▪ Object overhead: ~8B-16B
The K, N, and S objects add their own sizes and overhead. Some objects might be shared.
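The per-entry figures above can be turned into a rough back-of-the-envelope estimate. The sketch below only adds up the numbers from the slide. `HeapEntryOverhead` and its method names are hypothetical, and JVM alignment padding and the key, namespace, and state objects themselves are deliberately left out.

```java
/** Back-of-the-envelope arithmetic for the entry layout on the slide; names are hypothetical. */
final class HeapEntryOverhead {

    /** Bytes for one table entry, excluding the key, namespace, and state objects themselves. */
    static long entryBytes(int referenceBytes, int objectHeaderBytes) {
        long references = 4L * referenceBytes; // key, namespace, state, next
        long ints = 3L * 4;                    // entry version, state version, hash code
        return references + ints + objectHeaderBytes;
    }

    /** Adds the amortized share of the bucket array: 1 / loadFactor slots per entry. */
    static double bytesPerEntryWithBucket(int referenceBytes, int objectHeaderBytes, double loadFactor) {
        return entryBytes(referenceBytes, objectHeaderBytes) + referenceBytes / loadFactor;
    }
}
```

With 4-byte references and an 8-byte header this gives 36 bytes per entry, and with 8-byte references and a 16-byte header 60 bytes, so ten million entries cost roughly 360 MB to 600 MB before any user data is counted.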

Heap State Table Snapshot
[Figure: the original table and the snapshot share the entries A, B, and C.]
Copy of hash bucket array is snapshot overhead.
[Figure: entry D is added to the original table after the snapshot was taken.]
No conflicting modification = no overhead.
[Figure: A is modified to A’ in the original table; the snapshot keeps the old A.]
Modifications trigger a deep copy of the entry, only as much as required. This depends on what was modified and what is immutable (as determined by the type serializer).
Worst case overhead = size of original at time of snapshot.
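The three snapshot cases above can be reproduced with a very small copy-on-write table. This is a teaching sketch under simplifying assumptions, not Flink's implementation: `CowTable` and all of its members are hypothetical, the bucket array has a fixed size, values are plain `int`s (so the "deep copy" is trivial), and snapshots are never released.

```java
/** Minimal copy-on-write hash table sketch; a teaching aid, not Flink's implementation. */
final class CowTable {

    static final class Entry {
        final String key;
        int value;
        Entry next;
        final int version; // table version in which this entry object was created

        Entry(String key, int value, Entry next, int version) {
            this.key = key;
            this.value = value;
            this.next = next;
            this.version = version;
        }
    }

    private final Entry[] buckets = new Entry[8]; // fixed size: no rehash in this sketch
    private int version = 0;
    int copiedEntries = 0; // counts copy-on-write copies, for illustration only

    /** A snapshot is a shallow copy of the bucket array; entries stay shared until modified. */
    Entry[] snapshot() {
        version++;
        return buckets.clone();
    }

    void put(String key, int value) {
        int i = Math.floorMod(key.hashCode(), buckets.length);
        if (find(buckets[i], key) == null) {
            // New keys go to the head of the chain, so no shared entry is touched.
            buckets[i] = new Entry(key, value, buckets[i], version);
        } else {
            buckets[i] = update(buckets[i], key, value);
        }
    }

    Integer get(String key) {
        return lookup(buckets, key);
    }

    /** Reads from either the live bucket array or a snapshot array. */
    static Integer lookup(Entry[] table, String key) {
        Entry e = find(table[Math.floorMod(key.hashCode(), table.length)], key);
        return e == null ? null : e.value;
    }

    private static Entry find(Entry e, String key) {
        while (e != null && !e.key.equals(key)) {
            e = e.next;
        }
        return e;
    }

    /** Rewrites the chain; entries older than the current version are copied, never mutated. */
    private Entry update(Entry e, String key, int value) {
        boolean owned = e.version == version;
        if (e.key.equals(key)) {
            if (owned) {
                e.value = value;
                return e;
            }
            copiedEntries++;
            return new Entry(key, value, e.next, version);
        }
        Entry newNext = update(e.next, key, value);
        if (newNext == e.next) {
            return e;
        }
        if (owned) {
            e.next = newNext;
            return e;
        }
        copiedEntries++;
        return new Entry(e.key, e.value, newNext, version);
    }
}
```

Inserting D after a snapshot copies nothing, while updating A copies exactly the entries that the snapshot still needs. In the worst case every entry is modified during the snapshot and the overhead equals the size of the original table.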

Heap Backend Tuning Considerations
• Choose TypeSerializers with efficient copy-methods
• Flag immutability of objects where possible to avoid copy completely
• Flatten POJOs / avoid deep objects
  • Reduces object overheads and following references
• GC choice / tuning
• Scale out using multiple task managers per node

RocksDB Keyed State Backend Characteristics
• State lives as serialized byte-strings in off-heap memory and on local disk
• One column family per registered state (~table)
• Key / Value store, organized as a log-structured merge tree (LSM tree)
• Key: serialized bytes of <keygroup, key, namespace>
• LSM naturally supports MVCC
• Data is (de)serialized on every read and update
• Not affected by garbage collection
• Relatively low overhead of representation
• LSM naturally supports incremental snapshots
• State size is limited by available local disk space
• Lower performance (roughly an order of magnitude slower than the heap state backend)
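To make the LSM-tree bullet more concrete, here is a toy model of the write path, read path, and compaction. It is a sketch only: `TinyLsm` and its methods are hypothetical names, the "SST files" are in-memory sorted maps, and deletes (tombstones), the WAL, and the block cache are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.TreeMap;

/** Toy log-structured merge tree: one mutable MemTable plus immutable sorted runs. */
final class TinyLsm {
    private final int memTableLimit;
    private TreeMap<String, String> memTable = new TreeMap<>();
    private final Deque<TreeMap<String, String>> sstFiles = new ArrayDeque<>(); // newest first

    TinyLsm(int memTableLimit) {
        this.memTableLimit = memTableLimit;
    }

    /** Writes only touch the MemTable; a full MemTable is flushed as a new sorted run. */
    void put(String key, String value) {
        memTable.put(key, value);
        if (memTable.size() >= memTableLimit) {
            flush();
        }
    }

    void flush() {
        if (memTable.isEmpty()) {
            return;
        }
        sstFiles.addFirst(memTable); // treated as immutable from here on
        memTable = new TreeMap<>();
    }

    /** Reads check the MemTable first, then the runs from newest to oldest. */
    String get(String key) {
        String value = memTable.get(key);
        if (value != null) {
            return value;
        }
        for (TreeMap<String, String> sst : sstFiles) {
            value = sst.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** Compaction merges all runs into one; newer values overwrite older ones. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<TreeMap<String, String>> oldestFirst = sstFiles.descendingIterator();
        while (oldestFirst.hasNext()) {
            merged.putAll(oldestFirst.next());
        }
        sstFiles.clear();
        if (!merged.isEmpty()) {
            sstFiles.addFirst(merged);
        }
    }

    int sstFileCount() {
        return sstFiles.size();
    }
}
```

Because flushed runs are never modified, an older set of runs remains a consistent view of the data, which is the property the MVCC and incremental snapshot bullets rely on.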

RocksDB Architecture
[Figure: RocksDB write and read paths. In memory: active MemTable, read-only MemTable, read-only block cache. Persistent store on local disk: WAL and SST files.]
• WriteOp: goes to the WAL and the active MemTable
• Full/Switch: a full active MemTable becomes a read-only MemTable
• Flush: read-only MemTables are flushed to SST files on local disk
• Compaction: SST files are merged
• ReadOp: served from the MemTables, the read-only block cache, and the SST files
• MemTables are set per column family (~table)
In Flink:
- disable WAL and sync
- persistence via checkpoints
© 2019 Ververica
RocksDB Resource Consumption
• One RocksDB instance per operator subtask
• block_cache_size
• Size of the block cache
• write_buffer_size
• Max size of a MemTable
• max_write_buffer_number
• The maximum number of MemTable’s allowed in memory before flush to SST file
• Indexes and bloom filters
• Optional
• Table Cache
• Caches open file descriptors to SST files
• Default: unlimited!
26
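The settings above can be combined into a rough memory estimate. The sketch below assumes that every column family (one per registered state) gets its own block cache and its own MemTables, and it ignores indexes, bloom filters, and the table cache. `RocksDbMemoryEstimate` and its methods are hypothetical names, and the input values in the usage note are illustrative, not recommendations.

```java
/** Rough RocksDB memory estimate under the stated assumptions; not an exact accounting. */
final class RocksDbMemoryEstimate {

    /** Block cache plus the maximum number of MemTables for one column family. */
    static long perColumnFamilyBytes(long blockCacheSize, long writeBufferSize, int maxWriteBufferNumber) {
        return blockCacheSize + writeBufferSize * maxWriteBufferNumber;
    }

    /** One RocksDB instance per operator subtask, one column family per registered state. */
    static long perTaskManagerBytes(int subtasks, int statesPerSubtask,
                                    long blockCacheSize, long writeBufferSize, int maxWriteBufferNumber) {
        return (long) subtasks * statesPerSubtask
                * perColumnFamilyBytes(blockCacheSize, writeBufferSize, maxWriteBufferNumber);
    }
}
```

For example, with an 8 MB block cache, 64 MB write buffers, and at most 2 write buffers, one column family needs about 136 MB. Four subtasks with three states each then come to about 1632 MB of native memory on one task manager.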


Performance Tuning
Amplification Factors
[Figure: the parameter space between Write Amplification, Read Amplification, and Space Amplification.]
Example: more compaction effort = increased write amplification and reduced read amplification
More details: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

General Performance Considerations
• Use efficient TypeSerializers and serialization formats
• Decompose user code objects
  • ValueState<List<Integer>> → ListState<Integer>
  • ValueState<Map<Integer, Integer>> → MapState<Integer, Integer>
• Use the correct configuration for your hardware setup
• Consider enabling RocksDB native metrics to profile your applications
• File Systems
  • Working directory on fast storage, ideally local SSD. Could even be memory.
  • EBS performance can be problematic
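The advice to decompose ValueState<List<Integer>> into ListState<Integer> can be illustrated with a simple cost model. The sketch assumes a backend that serializes state on every update (as described for RocksDB), 4 bytes per integer, and a list state that can append a single element without rewriting the existing ones. `StateAccessCost` and its methods are hypothetical, and length prefixes and read costs are ignored.

```java
/** Simplified cost model for repeated appends; an illustration, not a benchmark. */
final class StateAccessCost {
    static final int INT_BYTES = 4;

    /** ValueState<List<Integer>> style: every append serializes the whole list again. */
    static long bytesWrittenAsValueState(int appends) {
        long total = 0;
        for (int size = 1; size <= appends; size++) {
            total += (long) size * INT_BYTES; // list of `size` elements is rewritten
        }
        return total;
    }

    /** ListState<Integer> style: every append serializes only the new element. */
    static long bytesWrittenAsListState(int appends) {
        return (long) appends * INT_BYTES;
    }
}
```

Under these assumptions, 100 appends write 20,200 bytes in the value-state layout but only 400 bytes in the list-state layout, so the gap grows quadratically with the list size.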

Timer Service

Heap Timers
Each Timer holds:
▪ 2 References:
  ▪ Key (K)
  ▪ Namespace (N)
▪ 1 long:
  ▪ Timestamp
▪ 1 int:
  ▪ Array Index
The K and N objects add their own sizes and overhead. Some objects might be shared.

Binary heap of timers in array
Peek: O(1)
Poll: O(log(n))
Insert: O(log(n))
Delete: O(n)
Contains: O(n)

Binary heap plus HashMap<Timer, Timer>: fast deduplication and deletes
Peek: O(1)
Poll: O(log(n))
Insert: O(log(n))
Delete: O(log(n))
Contains: O(1)

Snapshot (net values of a timer are immutable)
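The combination of a binary heap, a stored array index, and a HashMap can be sketched as follows. This is an illustrative reimplementation of the idea on the slide, not Flink's timer queue. `TimerHeap` and its members are hypothetical names, and timers are ordered by timestamp only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Binary min-heap of timers plus a HashMap for deduplication; a sketch, not Flink's class. */
final class TimerHeap {

    static final class Timer {
        final long timestamp;
        final String key;
        final String namespace;
        int index = -1; // current position in the heap array

        Timer(long timestamp, String key, String namespace) {
            this.timestamp = timestamp;
            this.key = key;
            this.namespace = namespace;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Timer)) return false;
            Timer t = (Timer) o;
            return timestamp == t.timestamp && key.equals(t.key) && namespace.equals(t.namespace);
        }

        @Override
        public int hashCode() { return Objects.hash(timestamp, key, namespace); }
    }

    private final List<Timer> heap = new ArrayList<>();
    private final Map<Timer, Timer> dedup = new HashMap<>();

    /** Insert: O(log n); duplicates are rejected through the map. */
    boolean add(Timer t) {
        if (dedup.containsKey(t)) return false;
        dedup.put(t, t);
        t.index = heap.size();
        heap.add(t);
        siftUp(t.index);
        return true;
    }

    /** Contains: O(1) through the map. */
    boolean contains(Timer t) { return dedup.containsKey(t); }

    /** Peek: O(1). */
    Timer peek() { return heap.isEmpty() ? null : heap.get(0); }

    /** Poll: O(log n). */
    Timer poll() {
        Timer head = peek();
        if (head != null) remove(head);
        return head;
    }

    /** Delete: O(log n), because the map yields the stored timer and its array index. */
    boolean remove(Timer probe) {
        Timer stored = dedup.remove(probe);
        if (stored == null) return false;
        int i = stored.index;
        Timer last = heap.remove(heap.size() - 1);
        if (last != stored) {
            heap.set(i, last);
            last.index = i;
            siftDown(i);
            siftUp(last.index);
        }
        stored.index = -1;
        return true;
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (heap.get(parent).timestamp <= heap.get(i).timestamp) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < heap.size() && heap.get(left).timestamp < heap.get(smallest).timestamp) smallest = left;
            if (right < heap.size() && heap.get(right).timestamp < heap.get(smallest).timestamp) smallest = right;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }

    private void swap(int a, int b) {
        Timer ta = heap.get(a), tb = heap.get(b);
        heap.set(a, tb);
        heap.set(b, ta);
        ta.index = b;
        tb.index = a;
    }
}
```

Without the map, delete and contains would have to scan the array, which is the O(n) cost listed for the plain binary heap.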

RocksDB Timers
Column Family - only key, no value
Lexicographically ordered byte sequences as key, no value

Key Group  Timestamp  Key  Namespace
0          20         A    X
0          40         D    Z
1          10         D    Z
1          20         C    Y
2          50         B    Y
2          60         A    X
…          …          …    …

• Key group queues (caching first k timers)
• Priority queue of key group queues
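The layout above works because a big-endian encoding of <key group, timestamp, key, namespace> sorts correctly when compared byte by byte. The sketch below shows one possible encoding. The fixed 2-byte key group, 8-byte timestamp, and UTF-8 strings are assumptions made for illustration (the real byte format depends on Flink's serializers), timestamps are assumed to be non-negative, and `TimerKeys` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

/** Encodes timers as byte strings whose lexicographic order matches (key group, timestamp). */
final class TimerKeys {

    /** Layout assumed here: 2-byte key group, 8-byte timestamp, key bytes, namespace bytes. */
    static byte[] encode(int keyGroup, long timestamp, String key, String namespace) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] n = namespace.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(2 + 8 + k.length + n.length) // ByteBuffer is big-endian by default
                .putShort((short) keyGroup)
                .putLong(timestamp)
                .put(k)
                .put(n)
                .array();
    }

    /** Unsigned byte-wise comparison, the usual default ordering of keys in an LSM store. */
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    static int keyGroupOf(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getShort(0) & 0xFFFF;
    }

    static long timestampOf(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getLong(2);
    }
}
```

Placing the encoded timers into a sorted set with `UNSIGNED_LEX` reproduces the row order of the table: the first element within each key group prefix is that key group's earliest timer, so no value and no extra index are needed.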

3 Task Manager Memory Layouts
[Figure: three Task Manager memory layouts side by side. Each layout places Network Buffers, Timer State, and Keyed State in either the Java Heap or Off Heap / Native memory.]

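Full Checkpoint
[Figure: Checkpoint 1, Checkpoint 2, and Checkpoint 3 taken at @t1, @t2, and @t3. State at @t1: A, B, C, D. State at @t2: A, F, C, D, E. State at @t3: G, H, C, D, I, E. Each checkpoint contains the complete state at its point in time.]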

Full Checkpoint Overview
• Creation iterates and writes full database snapshots as a stream to stable storage
• Restore reads data as a stream from stable storage and re-inserts it into the state backend
• Each checkpoint is self-contained, and its size is proportional to the size of the full state
• Optional: compression with Snappy

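Incremental Checkpoint
[Figure: Checkpoint 1, Checkpoint 2, and Checkpoint 3 taken at @t1, @t2, and @t3 for the same state evolution (A, B, C, D; then A, F, C, D, E; then G, H, C, D, I, E). Checkpoint 1 stores A, B, C, D. Checkpoint 2 stores only the delta E, F and builds upon Checkpoint 1. Checkpoint 3 stores only the delta G, H, I and builds upon Checkpoint 2.]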

Incremental Checkpoints with RocksDB
[Figure: the RocksDB architecture as before: WriteOp, WAL, active MemTable, Full/Switch to read-only MemTable, Flush to SST files on local disk, Compaction merging SST files.]
Incremental checkpoint: observe created/deleted SST files since last checkpoint

Incremental Checkpoint Overview
• Expected trade-off: faster* checkpoints, slower recovery
• Creation only copies deltas (new local SST files) to stable storage
• Creates write amplification because we also upload compacted SST files so that we can prune checkpoint history
• Sum of all increments that we read from stable storage can be larger than the full state size
• No rebuild is required because we simply re-open the RocksDB backend from the SST files
• SST files are snappy compressed by default
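The bookkeeping behind "only copy deltas" and "prune checkpoint history" can be sketched with plain sets of file names. This is a simplified model, not Flink's shared state registry: `IncrementalCheckpoints`, `checkpoint`, and `discard` are hypothetical names, and file uploads are represented only by the returned sets.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Tracks which SST files each retained checkpoint references; a sketch with hypothetical names. */
final class IncrementalCheckpoints {
    private final Map<Long, Set<String>> retained = new TreeMap<>(); // checkpoint id -> referenced files
    private final Set<String> uploaded = new HashSet<>();            // files already on stable storage

    /** Registers a checkpoint and returns only the SST files that still need to be uploaded. */
    Set<String> checkpoint(long checkpointId, Set<String> liveSstFiles) {
        Set<String> delta = new TreeSet<>(liveSstFiles);
        delta.removeAll(uploaded);
        uploaded.addAll(delta);
        retained.put(checkpointId, new TreeSet<>(liveSstFiles));
        return delta;
    }

    /** Drops a checkpoint and returns the files no retained checkpoint references any more. */
    Set<String> discard(long checkpointId) {
        Set<String> files = retained.remove(checkpointId);
        if (files == null) {
            return Collections.emptySet();
        }
        Set<String> deletable = new TreeSet<>(files);
        for (Set<String> stillReferenced : retained.values()) {
            deletable.removeAll(stillReferenced);
        }
        uploaded.removeAll(deletable);
        return deletable;
    }
}
```

When compaction replaces several old SST files with one new file, the next checkpoint uploads the compacted file again. That is the write amplification mentioned above, and it is also what eventually lets older files be deleted once the checkpoints referencing them are discarded.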

Anatomy of a Flink Stream Job Upgrade
[Figure: Flink job user code performs local reads / writes that manipulate state in the Local State Backend; a Persistent Savepoint is shown alongside.]
