Summary of key points:
1) After submitting a Flink job, the client creates and submits the job graph to the JobManager, which then creates an execution graph and deploys tasks across TaskManagers for parallel execution.
2) The batch optimizer chooses optimal execution plans by evaluating physical execution strategies like join algorithms and data shipping approaches to minimize data shuffling and network usage.
3) Flink iterations are optimized by having the runtime directly handle caching, state maintenance, and pushing work out of loops to avoid scheduling overhead between iterations. Delta iterations further improve efficiency by only updating changed elements in each iteration.
Apache Flink Deep Dive: Job Life-Cycle, Optimizer, Delta Iterations
1. Apache Flink
Deep Dive
Vasia Kalavri
Flink Committer & KTH PhD student
vasia@apache.org
1st Apache Flink Meetup Stockholm
May 11, 2015
2. Flink Internals
● Job Life-Cycle
○ what happens after you submit a Flink job?
● The Batch Optimizer
○ how are execution plans chosen?
● Delta Iterations
○ how are Flink iterations special for Graph and ML
apps?
4. The Flink Stack
Libraries & APIs: Python, Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow
Core APIs: DataSet (Java/Scala) with the Batch Optimizer; DataStream (Java/Scala) with the Streaming Optimizer
Flink Runtime
Deployment: Local, Remote, Yarn, Tez, Embedded
*current Flink master + few PRs
5. DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
  .flatMap((str, out) -> {
    for (String token : str.split("\\W+")) {
      out.collect(new Tuple2<>(token, 1));
    }
  })
  .groupBy(0).aggregate(SUM, 1);
Program Life-Cycle
6. Flink Client & Optimizer → JobManager → TaskManagers
● Flink Client & Optimizer: creates and submits the job graph
● JobManager: creates the execution graph and deploys tasks
● TaskManagers: execute tasks and send status updates

Example (WordCount): the inputs "O Romeo, Romeo, wherefore art thou Romeo?" and "Nor arm, nor face, nor any other part" produce
(O, 1), (Romeo, 3), (wherefore, 1), (art, 1), (thou, 1) and (nor, 3), (arm, 1), (face, 1), (any, 1), (other, 1), (part, 1).
7. Input → Operator X → First → Operator Y → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);
DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);
second.print();
env.execute();
Series of Transformations
8. DataSet Abstraction
Think of it as a collection of data elements that can be
produced/recovered in several ways:
… like a Java collection
… like an RDD
… perhaps it is never fully materialized (because the program does not
need it to)
… implicitly updated in an iteration
→ this is transparent to the user
10. Example: load a log ("Romeo, Romeo, where art thou Romeo?") and grep it for three patterns (str1, str2, str3).
Stage 1: Load Log — create/cache the Log DataSet, caching in memory and on disk if needed.
Subsequent stages: Grep 1, Grep 2, Grep 3 each search the cached log for matches.
Staged (batch) execution
11. Same job, pipelined: Load Log streams its records directly into Grep 1, Grep 2, and Grep 3.
Stage 1: Deploy and start all operators — data transfer happens in memory, and via disk if needed.
Note: the Log DataSet is never "created" (materialized)!
Pipelined execution
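As a rough analogy (plain Java streams, not Flink's runtime), lazy stream pipelines show the same idea: the intermediate data set is never materialized, each record flows straight through the operators. The class and method names here are illustrative, not from the deck.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelinedGrep {
    // Analogy to pipelined execution: the intermediate "log" is never
    // materialized as a collection; each line flows through the filter
    // as it is produced, and only the final result is collected.
    static List<String> grep(Stream<String> logLines, String pattern) {
        return logLines
                .filter(line -> line.contains(pattern)) // lazy, pipelined
                .collect(Collectors.toList());          // only the result is materialized
    }
}
```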
18. ● Evaluates physical execution strategies
○ e.g. hash-join vs. sort-merge join
● Chooses data shipping strategies
○ e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in
iterations
Optimization Examples
19. case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)
// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...
// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))
// join data sets
val germanVisits: DataSet[(PageVisit, User)] =
// equi-join condition (PageVisit.userId = User.id)
visits.join(germanUsers).where("userId").equalTo("id")
Example: Distributed Joins
The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true.
20. Example: Distributed Joins
● Ship Strategy: The input data is distributed across all
parallel instances that participate in the join
● Local Strategy: Each parallel instance performs a join
algorithm on its local partition
For both steps, there are multiple valid strategies which are
favorable in different situations.
21. Repartition-Repartition Strategy
Partitions both inputs using the same partitioning function. All elements that share the same join key are shipped to the same parallel instance and can be locally joined.
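A minimal sketch of repartition-repartition in plain Java (not Flink's API; the class name, the record layout `int[]{key, value}`, and the hash partitioner are assumptions for illustration): both inputs are hash-partitioned on the join key, then each partition is joined locally.

```java
import java.util.*;

public class RepartitionJoin {
    // Repartition-repartition sketch: hash-partition both inputs on the join
    // key (field 0) so matching keys land on the same "parallel instance",
    // then run a local hash join per partition.
    static List<int[]> join(List<int[]> left, List<int[]> right, int parallelism) {
        List<List<int[]>> leftParts = partition(left, parallelism);
        List<List<int[]>> rightParts = partition(right, parallelism);
        List<int[]> result = new ArrayList<>();
        for (int p = 0; p < parallelism; p++) {
            // local hash join within partition p
            Map<Integer, List<int[]>> index = new HashMap<>();
            for (int[] l : leftParts.get(p))
                index.computeIfAbsent(l[0], k -> new ArrayList<>()).add(l);
            for (int[] r : rightParts.get(p))
                for (int[] l : index.getOrDefault(r[0], List.of()))
                    result.add(new int[]{l[0], l[1], r[1]});
        }
        return result;
    }

    // the "ship" step: route each record by a hash of its join key
    static List<List<int[]>> partition(List<int[]> data, int n) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
        for (int[] rec : data) parts.get(Math.floorMod(rec[0], n)).add(rec);
        return parts;
    }
}
```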
22. Broadcast-Forward Strategy
Sends one complete data set to each parallel instance that holds a partition of the other data set. The other data set remains local and is not shipped at all.
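The same join under broadcast-forward, again as a plain-Java sketch (class name and `int[]{key, value}` layout are assumptions, not Flink's API): the full broadcast side is copied to every instance, while each instance's local partition never moves.

```java
import java.util.*;

public class BroadcastForwardJoin {
    // Broadcast-forward sketch: every "parallel instance" receives the full
    // broadcast side and probes it with its local partition, which is never
    // shipped over the network.
    static List<int[]> join(List<List<int[]>> localPartitions, List<int[]> broadcastSide) {
        List<int[]> out = new ArrayList<>();
        for (List<int[]> partition : localPartitions) {    // one instance per partition
            Map<Integer, Integer> index = new HashMap<>(); // built from the broadcast copy
            for (int[] b : broadcastSide) index.put(b[0], b[1]);
            for (int[] rec : partition) {
                Integer match = index.get(rec[0]);
                if (match != null) out.add(new int[]{rec[0], rec[1], match});
            }
        }
        return out;
    }
}
```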
23. The optimizer computes cost estimates for execution plans and picks the "cheapest" plan, based on:
● the amount of data shipped over the network
● whether the data of one input is already partitioned
R-R Cost: a full shuffle of both data sets over the network.
B-F Cost: depends on the size of the broadcast data set and the number of parallel instances.
Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
How does the Optimizer choose?
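A back-of-envelope version of this trade-off (an assumption for illustration, not Flink's actual cost estimator) counts only bytes shipped over the network:

```java
public class ShipCostModel {
    // Illustrative cost model, NOT Flink's estimator: network bytes only.
    static long repartitionRepartition(long leftBytes, long rightBytes) {
        return leftBytes + rightBytes;              // full shuffle of both inputs
    }
    static long broadcastForward(long broadcastBytes, int parallelism) {
        return broadcastBytes * (long) parallelism; // one full copy per instance
    }
}
```

With a 1000 MB and a 10 MB input on 20 instances, broadcasting the small side ships 200 MB versus 1010 MB for a full repartition, so B-F wins; with similar-sized inputs or very high parallelism, R-R wins.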
25. ● a for/while loop in the client submits one job per iteration step
● data reuse by caching in memory and/or disk
Client: Step → Step → Step → Step → Step
Iterate by unrolling
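A minimal sketch of this unrolled style (plain Java; the class name and the `x + 1` step function are placeholders): the client-side loop drives the iteration, conceptually submitting one job per step and feeding the previous result back in.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnrolledIteration {
    // Iterate-by-unrolling sketch: the driver loop runs in the client;
    // each pass stands in for one submitted job over the cached result.
    static List<Integer> run(List<Integer> data, int steps) {
        for (int i = 0; i < steps; i++) {
            data = data.stream().map(x -> x + 1).collect(Collectors.toList());
        }
        return data;
    }
}
```

The per-step scheduling this implies is exactly the overhead that native iterations (next slide) avoid.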
26. Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically
Optimizations: caching loop-invariant data, pushing work "out of the loop", maintaining state as an index
27. Flink Iteration Operators
● Iterate: Input → Iterative Update Function → Result; the result replaces the input for the next iteration.
● IterateDelta: Workset → Iterative Update Function → Result; a Solution Set is kept as state and updated incrementally.
28. Delta Iteration
● Not all elements of the state are updated in each iteration.
● The elements that require an update are stored in the workset.
● The step function is applied only to the workset elements.
29. Partition a graph into components by iteratively
propagating the min vertex ID among neighbors
Example: Connected Components
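To make the delta-iteration behavior concrete, here is min-id propagation with an explicit solution set and workset in plain Java (a sketch, not Flink's Gelly/IterateDelta API; the class name and adjacency-map representation are assumptions):

```java
import java.util.*;

public class ConnectedComponents {
    // Connected components as a delta iteration: each vertex starts with its
    // own id as component id (solution set); in each superstep, only vertices
    // whose component id changed (the workset) propagate their minimum.
    static Map<Integer, Integer> run(Map<Integer, List<Integer>> adj) {
        Map<Integer, Integer> component = new HashMap<>(); // solution set
        Set<Integer> workset = new HashSet<>(adj.keySet());
        for (int v : adj.keySet()) component.put(v, v);
        while (!workset.isEmpty()) {
            Set<Integer> next = new HashSet<>();
            for (int v : workset)
                for (int n : adj.get(v)) {
                    int candidate = component.get(v);
                    if (candidate < component.get(n)) { // min-id propagation
                        component.put(n, candidate);
                        next.add(n);                    // only changed vertices stay active
                    }
                }
            workset = next; // converged when no element changed
        }
        return component;
    }
}
```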
33. Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance
Want to learn more?