Tech Talk @ Google on Flink Fault Tolerance and HA
1. Apache Flink Streaming
Resiliency and Consistency
Paris Carbone - PhD Candidate
KTH Royal Institute of Technology
<parisc@kth.se, senorcarbone@apache.org>
Technical Talk @ Google
2. Overview
• The Flink Execution Engine Architecture
• Exactly-once processing challenges
• Using Snapshots for Recovery
• The ABS Algorithm for DAGs and Cyclic Dataflows
• Recovery, Cost and Performance
• Job Manager High Availability
6. The Flink Runtime Engine
• Tasks run operator logic in a pipelined fashion
• They are scheduled among workers
• State is kept within tasks
[Diagram: long-lived task execution; the Job Manager handles scheduling and monitoring]
7. Task Failures
• Task failures are guaranteed to occur
• We need to make them transparent
• Can we simply recover a task from scratch?
[Diagram: desired guarantees: no loss, no duplicates, consistent task state]
8. Record Acknowledgements
• We can monitor the consumption of every event
• It guarantees no loss
• Duplicates and inconsistent state are not handled
[Diagram: per-task acknowledgement bookkeeping; no loss is guaranteed, but duplicates and inconsistent state remain possible]
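Acknowledgement bookkeeping of this kind is often implemented with Storm-style XOR ack tracking. Below is a minimal sketch (class and method names are invented for illustration, not any real API): the XOR for a source record's tuple tree collapses to zero once every derived tuple is acked, which detects loss, but says nothing about duplicates or state consistency.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-record acknowledgement bookkeeping (Storm-style XOR acker). */
public class AckTracker {
    // pending[rootId] holds the XOR of all outstanding tuple ids spawned from that source record
    private final Map<Long, Long> pending = new HashMap<>();

    /** A tuple derived from source record rootId enters the topology. */
    public void emit(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /** A task finished processing tupleId and (optionally) spawned childId (0 for none). */
    public void ack(long rootId, long tupleId, long childId) {
        pending.merge(rootId, tupleId ^ childId, (a, b) -> a ^ b);
        // when the XOR collapses to zero, every tuple in the tree was acked
        if (pending.get(rootId) == 0L) pending.remove(rootId);
    }

    /** Records still in flight must be replayed on failure: no loss, but possible duplicates. */
    public boolean fullyAcked(long rootId) {
        return !pending.containsKey(rootId);
    }
}
```

A source record is replayed whenever its tree is not fully acked within a timeout, which is exactly why duplicates can appear downstream.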
9. Atomic Transactions per Update
• Guarantees no loss and consistent state
• No-duplicates can be achieved, e.g. with Bloom filters or caching
• Fine-grained recovery
• Non-constant load on external storage can lead to unsustainable execution
[Diagram: each state update is a transaction against distributed storage, as in Storm Trident]
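A minimal sketch of the duplicate-elimination idea, assuming every update carries a unique id; a real system would use a Bloom filter or a bounded cache instead of an unbounded set, trading memory for a small false-positive rate. All names here are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: per-update transactions with duplicate elimination via an id cache. */
public class TransactionalSink {
    private final Map<String, Integer> store = new HashMap<>(); // stands in for external storage
    private final Set<Long> applied = new HashSet<>();          // ids of already-applied updates

    /** Applies the update exactly once; replayed duplicates are ignored. */
    public boolean apply(long updateId, String key, int delta) {
        if (applied.contains(updateId)) {
            return false;                      // duplicate from a replay: drop it
        }
        store.merge(key, delta, Integer::sum); // the "transaction" against storage
        applied.add(updateId);
        return true;
    }

    public int get(String key) {
        return store.getOrDefault(key, 0);
    }
}
```

The cost noted on the slide shows up here: every single record implies a round trip to external storage, so the load grows with the input rate rather than staying constant.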
10. Lessons Learned from Batch
[Diagram: a stream split into batch-1, batch-2, …]
• If a batch computation fails, simply repeat the computation as a transaction
• The transaction rate is constant
• Can we apply these principles to a true streaming execution?
15. Distributed Snapshots
“A collection of operator states and records in transit (channels) that reflects a moment at a valid execution”
Assumptions:
• repeatable sources
• reliable FIFO channels
[Diagram: event timelines of processes p1–p4; a consistent cut vs. an inconsistent cut, where the inconsistent cut exhibits a causality violation]
Idea: We can resume from a snapshot that defines a consistent cut
31. Lamport to the Rescue
“The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation”
32. Chandy-Lamport Snapshots
Using markers/barriers:
• trigger snapshots
• separate pre-snapshot from post-snapshot events
• lead to a consistent execution cut
[Diagram: markers propagate the snapshot through the execution under continuous ingestion]
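A minimal single-operator sketch of the Chandy-Lamport rule, with hypothetical names: the first marker snapshots local state, and records that later arrive on channels that have not yet delivered their marker are logged as in-transit state belonging to the snapshot. A real implementation would also forward the marker downstream.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: one operator with two FIFO input channels running Chandy-Lamport. */
public class ClOperator {
    static final int MARKER = Integer.MIN_VALUE; // sentinel standing in for a real marker event

    int state = 0;                               // operator state: a running sum
    Integer snapshotState = null;                // state at the cut, once taken
    final boolean[] markerSeen = new boolean[2];
    final List<Integer> inTransit = new ArrayList<>(); // channel records captured by the snapshot

    /** Deliver one event from FIFO input channel ch. */
    void receive(int ch, int record) {
        if (record == MARKER) {
            if (snapshotState == null) snapshotState = state; // first marker: snapshot local state
            markerSeen[ch] = true;                            // stop recording this channel
            return;
        }
        state += record;
        // after the snapshot started, records on not-yet-marked channels
        // were "in the channel" at the moment of the cut: record them
        if (snapshotState != null && !markerSeen[ch]) inTransit.add(record);
    }
}
```

Note that the snapshot here is (operator state + recorded channel contents); the ABS refinement discussed next removes the channel-contents part.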
51. Asynchronous Barrier Snapshotting Benefits
• Takes advantage of the execution graph structure
• No records in transit are included in the snapshot
• Aligning has a lower impact than halting the execution
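The alignment idea can be sketched as follows (hypothetical names, not Flink's actual task code): once a barrier arrives on one input, further records from that input are buffered until barriers have arrived on all inputs; the snapshot then contains operator state only, with no in-transit records.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch: one operator with two inputs performing ABS barrier alignment. */
public class AlignedOperator {
    static final int BARRIER = Integer.MIN_VALUE; // sentinel standing in for a real barrier event

    int state = 0;                                // operator state: a running sum
    Integer snapshotState = null;                 // state at the cut, once taken
    final boolean[] barrierSeen = new boolean[2];
    final Queue<Integer> buffered = new ArrayDeque<>(); // post-barrier records held back

    /** Deliver one event from FIFO input channel ch. */
    void receive(int ch, int record) {
        if (record == BARRIER) {
            barrierSeen[ch] = true;
            if (barrierSeen[0] && barrierSeen[1]) {        // aligned: snapshot, then release buffer
                snapshotState = state;                     // operator state only, no channel records
                while (!buffered.isEmpty()) state += buffered.poll();
                barrierSeen[0] = barrierSeen[1] = false;   // ready for the next barrier round
            }
            return;
        }
        if (barrierSeen[ch]) buffered.add(record);         // this input is blocked until alignment
        else state += record;
    }
}
```

Compared with the Chandy-Lamport sketch, nothing from the channels ends up in the snapshot: alignment briefly back-pressures the already-barriered inputs instead of logging their records.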
67. Implementation in Flink
• A coordinator/driver actor per job in the Job Manager
  • sends periodic markers to the sources
  • collects acknowledgements with state handles and registers complete snapshots
  • injects references to the latest consistent state handles and back-edge records into pending execution graph tasks
• Tasks
  • snapshot state transparently in the given backend upon receiving a barrier and send an ack to the coordinator
  • propagate barriers downstream like normal records
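The coordinator's ack-collection step might look like this sketch (a deliberately simplified, hypothetical API, not Flink's actual CheckpointCoordinator): a checkpoint is registered as complete once every task has acknowledged with its state handle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Sketch: a per-job checkpoint coordinator collecting task acknowledgements. */
public class CheckpointCoordinator {
    private final Set<String> tasks; // all tasks of the execution graph
    // checkpointId -> (task -> state handle), for in-flight and completed checkpoints
    private final Map<Long, Map<String, String>> pending = new HashMap<>();
    private final Map<Long, Map<String, String>> completed = new HashMap<>();

    public CheckpointCoordinator(Set<String> tasks) {
        this.tasks = tasks;
    }

    /** In a real system this would also inject markers/barriers at the sources. */
    public void trigger(long checkpointId) {
        pending.put(checkpointId, new HashMap<>());
    }

    /** A task snapshotted its state in the backend and reports the handle back. */
    public void ack(long checkpointId, String task, String stateHandle) {
        Map<String, String> acks = pending.get(checkpointId);
        if (acks == null) return;                 // unknown or already-completed checkpoint
        acks.put(task, stateHandle);
        if (acks.keySet().containsAll(tasks)) {   // every task acked: register as complete
            completed.put(checkpointId, pending.remove(checkpointId));
        }
    }

    public boolean isComplete(long checkpointId) {
        return completed.containsKey(checkpointId);
    }
}
```

Only completed checkpoints are usable for recovery; a checkpoint that never gathers all acks (e.g. because a task failed mid-snapshot) is simply discarded.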
68. Recovery
• Full rescheduling of the execution graph from the latest checkpoint: simple, requires no further modifications
• Partial rescheduling of the execution graph for all upstream dependencies: trickier, since duplicate elimination must be guaranteed
96. HA with ZooKeeper
• Task Managers query ZooKeeper for the current leader before connecting (with a lease)
• Upon standby failover, pending execution graphs are restarted
• State handles from the last global snapshot are injected
97. Summary
• The Flink execution engine monitors long-running tasks
• Exactly-once processing guarantees can be established with snapshots without halting the execution
• The ABS algorithm persists the minimal state at a low cost
• Master HA is achieved through ZooKeeper and passive failover