Stephan Ewen - Experiences running Flink at Very Large Scale

This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk explains which aspects currently render a job particularly demanding, shows how to configure and tune a large-scale Flink job, and outlines what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We dive, for example, into:
  • analyzing and tuning checkpointing
  • selecting and configuring state backends
  • understanding common bottlenecks
  • understanding and configuring network parameters

Published in: Data & Analytics
1. Experiences Running Apache Flink at Very Large Scale (@StephanEwen, Berlin Buzzwords 2017)
2. Some large-scale use cases
3. Various use cases
   • Example: stream ingestion, routing events to Kafka, ES, Hive
   • Example: modeling user interaction sessions
   • A mix of stateless / moderate state / large state
   • Stream Processing as a Service: launching, monitoring, scaling, updating
4. (logo slide)
5. Blink, based on Flink
   • A core system in Alibaba Search
     - machine learning, search, recommendations
     - A/B testing of search algorithms
     - online feature updates to boost the conversion rate
   • Alibaba is a major contributor to Flink, contributing many changes back to open source
6. (logo slide)
7. A social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation) on Kafka/Flink/Elasticsearch/Redis
   More: https://data-artisans.com/blog/drivetribe-cqrs-apache-flink
8. How we learned to view Flink through its users
9. System for Event-Driven Applications
   Stateful, event-driven, event-time-aware processing spans:
   • event-driven applications (event sourcing, CQRS, ...)
   • stream processing (streams, windows, ...)
   • batch processing (data sets)
10. Event Sourcing + Memory Image
   • An event log persists events (temporarily)
   • A process handles each event / command and updates local variables/structures in main memory
   • Periodically, the memory is snapshotted
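   As a hedged illustration of the pattern (plain Scala, not a Flink API; all names here are made up):

      case class Event(userId: String, amount: Long)

      class MemoryImage(snapshotEvery: Int) {
        private val balances = scala.collection.mutable.Map[String, Long]()
        private var applied = 0L

        // handle one event/command: update local in-memory structures
        def handle(e: Event): Unit = {
          balances(e.userId) = balances.getOrElse(e.userId, 0L) + e.amount
          applied += 1
          if (applied % snapshotEvery == 0) snapshot()
        }

        // periodically persist (applied, balances); recovery restores the
        // latest snapshot and replays the log from position `applied`
        private def snapshot(): Unit = { /* write to durable storage */ }
      }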
11. Event Sourcing + Memory Image
   • Recovery: restore the latest snapshot and replay the events persisted since that snapshot
12. Distributed Memory Image
   • A distributed application with many memory images; the snapshots are all consistent with each other
13. Stateful Event & Stream Processing
   • Scalable embedded state: access at memory speed, scales with the parallel operators
14. Stateful Event & Stream Processing
   • Rolling back computation / re-processing: re-load state and reset the positions in the input streams
15. Stateful Event & Stream Processing
   • Restore to different programs: bugfixes, upgrades, A/B testing, etc.
16. Compute, State, and Storage
   • Classic tiered architecture: a compute layer on top of a database layer that holds application state + backup
   • Streaming architecture: compute + application state together, backed by stream storage and snapshot storage (backup)
17. System for Event-Driven Applications (the same picture as slide 9)
18. Apache Flink's Layered APIs
   • Process Function (events, state, time): stateful event-driven applications
   • DataStream API (streams, windows): stream & batch processing
   • Table API (dynamic tables) / Stream SQL: analytics
19. Lessons Learned from Running Flink
20. The event/stream pipeline generally just works :)
21. Interacting with the environment
   • Dependency conflicts are among the biggest problems
     - next versions aim to radically reduce dependencies
     - make Hadoop an optional dependency
     - rework the shading techniques
   • The deployment ecosystem is crazy complex
     - YARN, Mesos & DC/OS, Docker & K8s, standalone, ...
     - containers and overlay networks are tricky
     - the authorization and authentication ecosystem is complex in itself
     - continuous work to improve the integration
22. External systems
   • A dependency on any external system eventually causes downtime
     - mainly HDFS / S3 / NFS / ... for checkpoints
   • We plan to reduce the dependency on those more and more in the next versions
23. Type serialization
   • Type serialization is a harder problem in streaming than in batch
     - data structure updates require more serialization
     - types are often more complex than in batch
   • State lives long and across jobs
     - requires "versioning" state and serializers
     - requires a "schema evolution" path
     - much enhanced support in Flink 1.3, more still to come
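   A small sketch of where the serializer enters the picture (the Session type is a made-up example): state is declared through a descriptor, and the serializer derived from its type information is what must remain compatible when the job is upgraded.

      import org.apache.flink.api.common.state.ValueStateDescriptor
      import org.apache.flink.api.common.typeinfo.TypeInformation

      // a hypothetical state type whose schema may evolve across job versions
      case class Session(userId: String, clicks: Int)

      // declaring keyed state with explicit type information
      val descriptor = new ValueStateDescriptor[Session](
        "session-state", TypeInformation.of(classOf[Session]))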
24. Robustly checkpointing... is the most important part of running a large-scale Flink application
25. Review: Checkpoints
   • The checkpoint is triggered, and the sources / transforms inject a checkpoint barrier into the stream, flowing towards the stateful operations
26. Review: Checkpoints
   • When the barrier reaches a stateful operation, it triggers a state snapshot of that operator
27. Review: Checkpoint Alignment
   • When checkpoint barrier n arrives at an operator on one input, the operator begins aligning: records from that input go to the input buffer while the other inputs continue to be processed
28. Review: Checkpoint Alignment
   • Once barrier n has arrived on all inputs, the operator checkpoints, emits barrier n downstream, and then continues, draining the buffered records first
29. Understanding Checkpoints
30. Understanding Checkpoints
   • How well does the alignment behave? (lower is better): alignment delay = end_to_end - sync - async
   • How long do the snapshots take?
31. Understanding Checkpoints
   • Alignment delay = end_to_end - sync - async: the most important robustness metric (lower is better)
     - a long delay means the job is under backpressure; constant backpressure means the application is under-provisioned
   • How long do the snapshots take? Too long means:
     - too much state per node
     - the snapshot store cannot keep up with the load (low bandwidth)
     - vastly improved with incremental checkpoints in Flink 1.3
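   To make the formula concrete with made-up numbers: a checkpoint with an end-to-end duration of 60 s, a synchronous part of 1 s, and an asynchronous part of 9 s has an alignment delay of 60 - 1 - 9 = 50 s, a strong sign that the job is running under backpressure.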
32. Heavy alignments
   • A heavy alignment typically happens at some point:
     - different load on different paths
     - skewed window emission (lots of data on one node)
     - a stall of one operator on the path
33. Heavy alignments (continued)
34. Heavy alignments (continued)
   • Example of a stall: a GC pause on one operator
35. Catching up from heavy alignments
   • Operators that did a heavy alignment need to catch up again; otherwise the next checkpoint will have a heavy alignment as well
   • The records buffered during alignment are consumed first, after the checkpoint completes
36. Catching up from heavy alignments
   • Give the computation time to catch up before starting the next checkpoint
     - set the min-time-between-checkpoints (see the sketch below)
     - there are ideas to make checkpoints policy-based (spend x% of capacity on checkpoints)
   • Asynchronous checkpoints mitigate most of the problem
     - very short stalls in the pipelines mean a shorter alignment phase
     - catching up already happens concurrently with state materialization
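   A minimal sketch of setting that pause through the DataStream API (the interval values are illustrative, not from the talk):

      import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

      val env = StreamExecutionEnvironment.getExecutionEnvironment

      // take a checkpoint roughly every 10 seconds ...
      env.enableCheckpointing(10000)

      // ... but leave at least 5 seconds between the end of one checkpoint
      // and the start of the next, so operators can drain buffered records
      env.getCheckpointConfig.setMinPauseBetweenCheckpoints(5000)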
37. Asynchrony of different state types

    State                  | Flink 1.2                | Flink 1.3 | Flink 1.4
    -----------------------+--------------------------+-----------+----------
    Keyed state (RocksDB)  | ✔                        | ✔         | ✔
    Keyed state on heap    | ✘ ((✔) hidden in 1.2.1)  | ✔         | ✔
    Timers                 | ✘                        | ✘         | ✔ (PR)
    Operator state         | ✘                        | ✔         | ✔

38. When to use which state backend? (a bit simplified)
   • Is the state larger than memory? → RocksDB
   • Are the state objects complex, i.e., expensive to serialize? → async heap/FS backend
   • Is the data rate high? → async heap/FS backend
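   A hedged sketch of how the choice looks in code (the checkpoint URI is a placeholder; RocksDBStateBackend lives in the separate flink-statebackend-rocksdb dependency):

      import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
      import org.apache.flink.runtime.state.filesystem.FsStateBackend
      import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

      val env = StreamExecutionEnvironment.getExecutionEnvironment

      // state larger than memory: RocksDB keeps it on disk
      env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))

      // alternative for state that fits in memory, complex objects, or very
      // high rates: the heap-based backend (asynchronous snapshots in 1.3+)
      // env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))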
39. (image slide)
40. We are hiring! data-artisans.com/careers
41. Backup Slides
42. Avoiding DDoSing other systems
43. Exceeding FS request capacity
   • Job size: multiple thousands of operators
   • Checkpoint interval: a few seconds
   • State size: KBs per operator, thousands of state chunks
   • Via the S3 FS (from Hadoop), writes ensure the "directory" exists: 2 HEAD requests per write
   • Symptom: S3 blocked off connections after exceeding thousands of HEAD requests / sec
44. Reducing FS stress for small state
   • Fs/RocksDB state backend for most states: the Checkpoint Coordinator on the JobManager tracks a root checkpoint file (metadata), while the tasks on the TaskManagers write separate checkpoint data files
45. Reducing FS stress for small state
   • Fs/RocksDB state backend for small states: the tasks acknowledge the checkpoint with the data attached ("ack+data"), and the state goes directly into the metadata file instead of separate files
   • Increasing the small-state threshold reduces the number of files (default: 1 KB)
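   A sketch of raising that threshold programmatically, assuming the FsStateBackend constructor that takes a file-size threshold (the URI and the 100 KB value are illustrative):

      import java.net.URI
      import org.apache.flink.runtime.state.filesystem.FsStateBackend
      import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

      val env = StreamExecutionEnvironment.getExecutionEnvironment

      // state chunks below the threshold (here 100 KB instead of the default
      // 1 KB) travel with the checkpoint ack and end up in the root metadata
      // file, instead of producing one small checkpoint file each
      env.setStateBackend(
        new FsStateBackend(new URI("hdfs:///flink/checkpoints"), 100 * 1024))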
46. Distributed Coordination
47. Deploying Tasks
   • Happens during initial deployment and during recovery
   • The JobManager sends a deployment RPC call (Akka) to each TaskManager, containing:
     - the job configuration
     - the task code and objects
     - the recover-state handle
     - correlation IDs
48. Deploying Tasks (message sizes)
   • Job configuration: KBs
   • Task code and objects: KBs up to MBs
   • Recover-state handle: KBs
   • Correlation IDs: a few bytes
49. RPC volume during deployment (back-of-the-napkin calculation)
   • number of tasks (1000) x parallelism (10) x size of the task objects (2 MB) = 20 GB of RPC volume
   • ~20 seconds on a full 10 GBit/s network
   • > 1 min with an average of 3 GBit/s
   • > 3 min with an average of 1 GBit/s
50. Timeouts and failure detection
   • Moving 20 GB takes ~20 s on a full 10 GBit/s net, > 1 min at an average of 3 GBit/s, and > 3 min at 1 GBit/s, while the default RPC timeout is 10 s
   • The default settings therefore lead to failed deployments with RPC timeouts
   • Solution: increase the RPC timeout
   • Caveat: increasing the timeout makes failure detection slower
   • Future: reduce the RPC load (next slides)
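   (The RPC timeout in question is, as far as I know, the akka.ask.timeout key in flink-conf.yaml.)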
51. Dissecting the RPC messages

    Message part           | Size       | Variance across subtasks and redeploys
    -----------------------+------------+---------------------------------------
    Job configuration      | KBs        | constant
    Task code and objects  | up to MBs  | constant
    Recover-state handle   | KBs        | variable
    Correlation IDs        | few bytes  | variable

52. Upcoming: Deploying Tasks
   • Out-of-band transfer and caching of the large, constant message parts:
     (1) the deployment RPC call carries only the recover-state handle, the correlation IDs, and BLOB pointers (KBs)
     (2) the TaskManager's Blob Cache downloads and caches the BLOBs (job configuration, task objects; MBs) from the JobManager's Blob Server
53. Layers of abstraction
   • Ogres have layers. So do squirrels.
54. Apache Flink's Layered APIs (the same picture as slide 18)
55. Process Function

    class MyFunction extends ProcessFunction[MyEvent, Result] {

      // declare state to use in the program
      lazy val state: ValueState[CountWithTimestamp] =
        getRuntimeContext().getState(…)

      override def processElement(event: MyEvent, ctx: Context,
                                  out: Collector[Result]): Unit = {
        // work with event and state
        (event, state.value) match { … }

        out.collect(…)   // emit events
        state.update(…)  // modify state

        // schedule a timer callback
        ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
      }

      override def onTimer(timestamp: Long, ctx: OnTimerContext,
                           out: Collector[Result]): Unit = {
        // handle the callback when the event-/processing-time instant is reached
      }
    }
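   A function like this is applied to a keyed stream, e.g. stream.keyBy(…).process(new MyFunction), since the timers need a keyed context.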
56. Data Stream API

    val lines: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer09[String](…))

    val events: DataStream[Event] = lines.map((line) => parse(line))

    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .apply(new MyAggregationFunction())

    stats.addSink(new RollingSink(path))
57. Table API & Stream SQL
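   The example on this slide did not survive the export; the following is only a rough sketch of what the Table API / Stream SQL layer looked like around Flink 1.3 (the table name and fields are made up):

      import org.apache.flink.streaming.api.scala._
      import org.apache.flink.table.api.TableEnvironment
      import org.apache.flink.table.api.scala._

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      val tEnv = TableEnvironment.getTableEnvironment(env)

      // register a stream of (sensor, value) readings as a dynamic table
      val readings: DataStream[(String, Double)] = ???  // some source
      tEnv.registerDataStream("sensors", readings, 'sensor, 'value)

      // a continuous SQL query over the dynamic table
      val counts = tEnv.sql(
        "SELECT sensor, COUNT(*) AS cnt FROM sensors GROUP BY sensor")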
58. Events, State, Time, and Snapshots
59. Events, State, Time, and Snapshots
   • f(a,b): an event-driven function, executed in a distributed fashion
60. Events, State, Time, and Snapshots
   • f(a,b) maintains fault-tolerant local state, similar to any normal application: main memory plus out-of-core (for maps)
61. Events, State, Time, and Snapshots
   • f(a,b) can access and react to notions of time and progress (wall clock, event-time clock) and handle out-of-order events
62. Events, State, Time, and Snapshots
   • Snapshots give a point-in-time view for recovery, rollback, cloning, versioning, etc.
63. Stateful Event & Stream Processing
   • The program (the Data Stream API example from slide 56: source, transformations, sink) compiles into a streaming dataflow: Source → Transform → Window (state read/write) → Sink
64. Stateful Event & Stream Processing
   • Source → Filter / Transform → State read/write → Sink
65. Stateful Event & Stream Processing (as on slide 13)
   • Scalable embedded state: access at memory speed, scales with the parallel operators
66. Stateful Event & Stream Processing (as on slide 14)
   • Rolling back computation / re-processing: re-load state and reset the positions in the input streams
67. Stateful Event & Stream Processing (as on slide 15)
   • Restore to different programs: bugfixes, upgrades, A/B testing, etc.
68. "Classical" versus Streaming Architecture
69. Compute, State, and Storage (the same comparison as slide 16)
70. Performance
   • Classic tiered architecture: synchronous reads/writes across the tier boundary
   • Streaming architecture: all modifications are local; asynchronous writes of large blobs
71. Consistency
   • Classic tiered architecture: distributed transactions, at scale typically degraded to at-most / at-least once
   • Streaming architecture: exactly once per state, with snapshot consistency across states
72. Scaling a Service
   • Classic tiered architecture: provision compute and separately provision additional database capacity
   • Streaming architecture: provision compute and state together
73. Rolling out a new Service
   • Classic tiered architecture: provision a new database (or add capacity to an existing one)
   • Streaming architecture: provision compute and state together; the new service simply occupies some additional backup space
74. Repair External State (streaming architecture)
   • A live application consumes events and produces external state; a bug has written wrong results
   • The input events are also backed up (HDFS, S3, etc.)
75. Repair External State (streaming architecture)
   • Run the application on the backed-up input events and overwrite the external state with the correct results
76. Repair External State (streaming architecture)
   • Each application doubles as a batch job: replaying it over the backed-up data (HDFS, S3, etc.) overwrites the external state with correct results
