Portable stateful big data processing
in Apache Beam
Aljoscha Krettek/Kenneth Knowles
Apache Beam/Flink PMC
Software Engin...
What I’d like to talk about
1. Why stateful stream processing?
2. What is Apache Beam?
3. State
4. Timers
2
Why stateful stream
processing?
3
What is stream processing?
Computation
Computations on
never-ending
“streams” of data
records (“events”)
4
Distributed stream processing
Computation
Computation
spread across
many machines
Computation Computation
5
Stateful stream processing
Computation
State is usually
“scoped” to some
key in the data:
event color
State
6
Timely, stateful stream processing
Computation
*timers are a special kind of state
State
Timers*
7
Use cases that require stateful stream processing
● Operations and manufacturing
● Mobile gaming
● Wearables
● Automotive
...
Common Requirements
Stateful processing
Event-time processing*
Handling of out-of-order events*
9*check out watermarks
Example: mobile gaming
First event comes in
-> put into state AND register session timeout timer
Process events as they co...
What is
Apache Beam?
11
20142004 2006 2008 2010 2012 20162005 2007 2009 2013 20152011
MapReduce
(paper)
Apache
Hadoop
Dataflow Model
(paper)
MillW...
The Beam Model
13
(Flink draws it more like this)
The Beam Model
Pipeline
14
PTransform
PCollection
(bounded or unbounded)
The Beam Vision
Sum Per Key
15
input.apply(
Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
⋮
Apache Flink
local, o...
The Beam Vision
KafkaIO
16
input | KafkaIO.read()
Python
⋮
class KafkaIO extends
UnboundedSource { … }
Java
Apache Flink
l...
The Beam Model
What are you computing? (read, map, reduce)
17
Where in event time? (event time windowing)
When in processi...
What are you computing?
18
Read
Parallel connectors to
external systems
ParDo
Per element
"Map"
Grouping
Group by key,
Com...
State
19
What are you computing?
20
Read
Parallel connectors to
external systems
ParDo
Per element
"Map"
Grouping
Group by key,
Com...
Stateful ParDo
21
ParDo(DoFn)
Stateful ParDo "Example"
22
Per Key Quantiles
ParDo.of(new DoFn<...>() {
// declare some state
@StateId("quantiles")
priva...
Partitioning for efficient parallel execution
23
Quantiles Quantiles Quantiles
Parallelism!
24
Quantiles
DoFn
Quantiles
DoFn
Quantiles
DoFn
new DoFn<Foo, Quantiles<Foo>>() {
@StateId("quantiles")
priva...
Windowed State
25
Quantiles
DoFn
Quantiles
DoFn
Quantiles
DoFn
Window into Fixed windows of one hour
Expected result: Quan...
Windowed State
26
Quantiles
DoFn
Quantiles
DoFn
Quantiles
DoFn
Window into windows of 30 min sliding by 10 min
Expected re...
<k, w>1
<k, w>2
<k, w>3
...
"x" 3 7 15
"y" "fizz" "7" "fizzbuzz"
...
State is per key and window
Bonus: automatically garb...
Kinds of state
● Value - just a mutable cell for a value
● Bag - supports "blind writes"
● Combining - has a CombineFn bui...
Timers
29
Timely ParDo
30
Stateful ParDo
new DoFn<...>() {
@TimerId("timeout")
private final TimerSpec timeoutTimer =
TimerSpecs.tim...
Timers in processing time
"call me back in 5
seconds"
output based only
on incoming
element
output return
value of batched...
Timers in event time
"call me back when
the watermark hits
the end of the
window"
output
speculative result
immediately
ou...
● Per-key arbitrary numbering
● Output only when result changes
● Tighter "side input" management for slowly changing dime...
Performance considerations (cross-runner)
34
● Shuffle to colocate keys
● Linear processing of elements for key+window
● W...
TL;DR
State and Timers in Beam...
● … unlock new uses cases
● … they "just work" with event time windowing
● … are portabl...
Thank you for listening!
This talk:
● Me - @aljoscha / aljoscha@apache.org / @KennKnowles / kenn@apache.org
● These Slides...
Backup Slides
37
What about Combine?
38
Quantiles
Combiner
Window into Fixed windows of one hour
Expected result: Quantiles for each hour
Q...
Combine vs State (naive conceptualization)
39
Window hourly
Expected result:
Quantiles for each hour
Quantiles
Combiner
Wi...
Combine vs State (likely execution plan)
Quantiles
Combiner
Quantiles
Combiner
Quantiles
Combiner
Shuffle
Accumulators
Qua...
Combine vs State
Quantiles
Combiner
Quantiles
DoFn
output governed by trigger
(data/computation unaware) "output only when...
Example: Arbitrary indices
D,0
A
Assign indices
B
C
D
E
C,1
A,2
B,3
E,4
non-associative
non-commutative
non-deterministic
...
Nächste SlideShare
Wird geladen in …5
×

Aljoscha Krettek - Portable stateful big data processing in Apache Beam

295 Aufrufe

Veröffentlicht am

Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.

Veröffentlicht in: Daten & Analysen
  • Als Erste(r) kommentieren

Aljoscha Krettek - Portable stateful big data processing in Apache Beam

  1. 1. Portable stateful big data processing in Apache Beam Aljoscha Krettek/Kenneth Knowles Apache Beam/Flink PMC Software Engineer @ dataArtisans aljoscha@apache.org / @aljoscha https://s.apache.org/stateful-beam-strata-london-2017 Strata London 2017
  2. 2. What I’d like to talk about 1. Why stateful stream processing? 2. What is Apache Beam? 3. State 4. Timers 2
  3. 3. Why stateful stream processing? 3
  4. 4. What is stream processing? Computation Computations on never-ending “streams” of data records (“events”) 4
  5. 5. Distributed stream processing Computation Computation spread across many machines Computation Computation 5
  6. 6. Stateful stream processing Computation State is usually “scoped” to some key in the data: event color State 6
  7. 7. Timely, stateful stream processing Computation *timers are a special kind of state State Timers* 7
  8. 8. Use cases that require stateful stream processing ● Operations and manufacturing ● Mobile gaming ● Wearables ● Automotive ● Solar power/power grid ● Network monitoring ● (Mobile) banking 8
  9. 9. Common Requirements Stateful processing Event-time processing* Handling of out-of-order events* 9*check out watermarks
  10. 10. Example: mobile gaming First event comes in -> put into state AND register session timeout timer Process events as they come and update state Timer fires -> compute result for session and delete state 10
  11. 11. What is Apache Beam? 11
  12. 12. 20142004 2006 2008 2010 2012 20162005 2007 2009 2013 20152011 MapReduce (paper) Apache Hadoop Dataflow Model (paper) MillWheel (paper) Heron Apache Spark Apache Storm Apache Gearpump (incubating) Apache Apex Apache Flink Cloud Dataflow FlumeJava (paper) Apache Beam DAGs, DAGs, DAGs Apache Samza 12
  13. 13. The Beam Model 13 (Flink draws it more like this)
  14. 14. The Beam Model Pipeline 14 PTransform PCollection (bounded or unbounded)
  15. 15. The Beam Vision Sum Per Key 15 input.apply( Sum.integersPerKey()) Java input | Sum.PerKey() Python ⋮ Apache Flink local, on-prem, cloud Apache Spark local, on-prem, cloud Cloud Dataflow: fully managed ⋮ Apache Apex local, on-prem, cloud Apache Gearpump (incubating)
  16. 16. The Beam Vision KafkaIO 16 input | KafkaIO.read() Python ⋮ class KafkaIO extends UnboundedSource { … } Java Apache Flink local, on-prem, cloud Apache Spark local, on-prem, cloud Cloud Dataflow: fully managed ⋮ Apache Apex local, on-prem, cloud Apache Gearpump (incubating)
  17. 17. The Beam Model What are you computing? (read, map, reduce) 17 Where in event time? (event time windowing) When in processing time are results produced? (triggers) How do refinements relate? (accumulation mode) The focus of today
  18. 18. What are you computing? 18 Read Parallel connectors to external systems ParDo Per element "Map" Grouping Group by key, Combine per key, "Reduce" Composite Encapsulated subgraph
  19. 19. State 19
  20. 20. What are you computing? 20 Read Parallel connectors to external systems ParDo Per element "Map" Grouping Group by key, Combine per key, "Reduce" Composite Encapsulated subgraph
  21. 21. Stateful ParDo 21 ParDo(DoFn)
  22. 22. Stateful ParDo "Example" 22 Per Key Quantiles ParDo.of(new DoFn<...>() { // declare some state @StateId("quantiles") private final StateSpec<...> quantilesState = StateSpecs.combining(); @ProcessElement public void process(..., @StateId("quantiles") state) { // update quantiles state.add(...) // output if needed } })
  23. 23. Partitioning for efficient parallel execution 23 Quantiles Quantiles Quantiles
  24. 24. Parallelism! 24 Quantiles DoFn Quantiles DoFn Quantiles DoFn new DoFn<Foo, Quantiles<Foo>>() { @StateId("quantiles") private final StateSpec<...> quantilesState = StateSpecs.combining(); … }
  25. 25. Windowed State 25 Quantiles DoFn Quantiles DoFn Quantiles DoFn Window into Fixed windows of one hour Expected result: Quantiles for each hour
  26. 26. Windowed State 26 Quantiles DoFn Quantiles DoFn Quantiles DoFn Window into windows of 30 min sliding by 10 min Expected result: Quantiles for 30 minutes sliding by 10 min
  27. 27. <k, w>1 <k, w>2 <k, w>3 ... "x" 3 7 15 "y" "fizz" "7" "fizzbuzz" ... State is per key and window Bonus: automatically garbage collected when a window expires (vs manual clearing of keyed state) 27
  28. 28. Kinds of state ● Value - just a mutable cell for a value ● Bag - supports "blind writes" ● Combining - has a CombineFn built in; can support blind writes and lazy accumulation ● Set - membership checking ● Map - lookups and partial writes 28
  29. 29. Timers 29
  30. 30. Timely ParDo 30 Stateful ParDo new DoFn<...>() { @TimerId("timeout") private final TimerSpec timeoutTimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); @OnTimer("timout") public void timeout(...) { // access state, set new timers, output } … }
  31. 31. Timers in processing time "call me back in 5 seconds" output based only on incoming element output return value of batched RPC buffer request batched RPC On Timer On Element 31
  32. 32. Timers in event time "call me back when the watermark hits the end of the window" output speculative result immediately output replacement value only if changed store speculative result On Timer On Element 32
  33. 33. ● Per-key arbitrary numbering ● Output only when result changes ● Tighter "side input" management for slowly changing dimension ● Fine-grained combine aggregation and output control ● Per-key "workflows" like user sign up flow w/ expiration ● Low-latency deduplication (let the first through, squash the rest) More example uses for state & timers 33
  34. 34. Performance considerations (cross-runner) 34 ● Shuffle to colocate keys ● Linear processing of elements for key+window ● Window merging ● Storage of state and timers ● GC of state
  35. 35. TL;DR State and Timers in Beam... ● … unlock new uses cases ● … they "just work" with event time windowing ● … are portable across runners (implementation ongoing) 35
  36. 36. Thank you for listening! This talk: ● Me - @aljoscha / aljoscha@apache.org / @KennKnowles / kenn@apache.org ● These Slides - https://s.apache.org/stateful-beam-strata-london-2017 Go Deeper ● Design doc - https://s.apache.org/beam-state ● Blog post - https://beam.apache.org/blog/2017/02/13/stateful-processing.html Join the Beam community: ● User discussions - user@beam.apache.org ● Development discussions - dev@beam.apache.org ● Follow @ApacheBeam on Twitter https://beam.apache.org 36
  37. 37. Backup Slides 37
  38. 38. What about Combine? 38 Quantiles Combiner Window into Fixed windows of one hour Expected result: Quantiles for each hour Quantiles Combiner Quantiles Combiner
  39. 39. Combine vs State (naive conceptualization) 39 Window hourly Expected result: Quantiles for each hour Quantiles Combiner Window hourly Quantiles DoFn Expected result: Quantiles for each hour
  40. 40. Combine vs State (likely execution plan) Quantiles Combiner Quantiles Combiner Quantiles Combiner Shuffle Accumulators Quantiles DoFn Shuffle Elements associative commutative single-output enables optimizations (engine is in control) non-associative* non-commutative* multi-output side outputs (user is in control) 40
  41. 41. Combine vs State Quantiles Combiner Quantiles DoFn output governed by trigger (data/computation unaware) "output only when there's an interesting change" (data/computation aware) 41
  42. 42. Example: Arbitrary indices D,0 A Assign indices B C D E C,1 A,2 B,3 E,4 non-associative non-commutative non-deterministic and totally fine! 42

×