Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Aljoscha Krettek - Portable stateful big data processing in Apache Beam

763 Aufrufe

Veröffentlicht am

Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.

Veröffentlicht in: Daten & Analysen
  • Als Erste(r) kommentieren

Aljoscha Krettek - Portable stateful big data processing in Apache Beam

  1. 1. Portable stateful big data processing in Apache Beam Aljoscha Krettek/Kenneth Knowles Apache Beam/Flink PMC Software Engineer @ dataArtisans aljoscha@apache.org / @aljoscha https://s.apache.org/stateful-beam-strata-london-2017 Strata London 2017
  2. 2. What I’d like to talk about 1. Why stateful stream processing? 2. What is Apache Beam? 3. State 4. Timers 2
  3. 3. Why stateful stream processing? 3
  4. 4. What is stream processing? Computation Computations on never-ending “streams” of data records (“events”) 4
  5. 5. Distributed stream processing Computation Computation spread across many machines Computation Computation 5
  6. 6. Stateful stream processing Computation State is usually “scoped” to some key in the data: event color State 6
  7. 7. Timely, stateful stream processing Computation *timers are a special kind of state State Timers* 7
  8. 8. Use cases that require stateful stream processing ● Operations and manufacturing ● Mobile gaming ● Wearables ● Automotive ● Solar power/power grid ● Network monitoring ● (Mobile) banking 8
  9. 9. Common Requirements Stateful processing Event-time processing* Handling of out-of-order events* 9*check out watermarks
  10. 10. Example: mobile gaming First event comes in -> put into state AND register session timeout timer Process events as they come and update state Timer fires -> compute result for session and delete state 10
  11. 11. What is Apache Beam? 11
  12. 12. 20142004 2006 2008 2010 2012 20162005 2007 2009 2013 20152011 MapReduce (paper) Apache Hadoop Dataflow Model (paper) MillWheel (paper) Heron Apache Spark Apache Storm Apache Gearpump (incubating) Apache Apex Apache Flink Cloud Dataflow FlumeJava (paper) Apache Beam DAGs, DAGs, DAGs Apache Samza 12
  13. 13. The Beam Model 13 (Flink draws it more like this)
  14. 14. The Beam Model Pipeline 14 PTransform PCollection (bounded or unbounded)
  15. 15. The Beam Vision Sum Per Key 15 input.apply( Sum.integersPerKey()) Java input | Sum.PerKey() Python ⋮ Apache Flink local, on-prem, cloud Apache Spark local, on-prem, cloud Cloud Dataflow: fully managed ⋮ Apache Apex local, on-prem, cloud Apache Gearpump (incubating)
  16. 16. The Beam Vision KafkaIO 16 input | KafkaIO.read() Python ⋮ class KafkaIO extends UnboundedSource { … } Java Apache Flink local, on-prem, cloud Apache Spark local, on-prem, cloud Cloud Dataflow: fully managed ⋮ Apache Apex local, on-prem, cloud Apache Gearpump (incubating)
  17. 17. The Beam Model What are you computing? (read, map, reduce) 17 Where in event time? (event time windowing) When in processing time are results produced? (triggers) How do refinements relate? (accumulation mode) The focus of today
  18. 18. What are you computing? 18 Read Parallel connectors to external systems ParDo Per element "Map" Grouping Group by key, Combine per key, "Reduce" Composite Encapsulated subgraph
  19. 19. State 19
  20. 20. What are you computing? 20 Read Parallel connectors to external systems ParDo Per element "Map" Grouping Group by key, Combine per key, "Reduce" Composite Encapsulated subgraph
  21. 21. Stateful ParDo 21 ParDo(DoFn)
  22. 22. Stateful ParDo "Example" 22 Per Key Quantiles ParDo.of(new DoFn<...>() { // declare some state @StateId("quantiles") private final StateSpec<...> quantilesState = StateSpecs.combining(); @ProcessElement public void process(..., @StateId("quantiles") state) { // update quantiles state.add(...) // output if needed } })
  23. 23. Partitioning for efficient parallel execution 23 Quantiles Quantiles Quantiles
  24. 24. Parallelism! 24 Quantiles DoFn Quantiles DoFn Quantiles DoFn new DoFn<Foo, Quantiles<Foo>>() { @StateId("quantiles") private final StateSpec<...> quantilesState = StateSpecs.combining(); … }
  25. 25. Windowed State 25 Quantiles DoFn Quantiles DoFn Quantiles DoFn Window into Fixed windows of one hour Expected result: Quantiles for each hour
  26. 26. Windowed State 26 Quantiles DoFn Quantiles DoFn Quantiles DoFn Window into windows of 30 min sliding by 10 min Expected result: Quantiles for 30 minutes sliding by 10 min
  27. 27. <k, w>1 <k, w>2 <k, w>3 ... "x" 3 7 15 "y" "fizz" "7" "fizzbuzz" ... State is per key and window Bonus: automatically garbage collected when a window expires (vs manual clearing of keyed state) 27
  28. 28. Kinds of state ● Value - just a mutable cell for a value ● Bag - supports "blind writes" ● Combining - has a CombineFn built in; can support blind writes and lazy accumulation ● Set - membership checking ● Map - lookups and partial writes 28
  29. 29. Timers 29
  30. 30. Timely ParDo 30 Stateful ParDo new DoFn<...>() { @TimerId("timeout") private final TimerSpec timeoutTimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); @OnTimer("timout") public void timeout(...) { // access state, set new timers, output } … }
  31. 31. Timers in processing time "call me back in 5 seconds" output based only on incoming element output return value of batched RPC buffer request batched RPC On Timer On Element 31
  32. 32. Timers in event time "call me back when the watermark hits the end of the window" output speculative result immediately output replacement value only if changed store speculative result On Timer On Element 32
  33. 33. ● Per-key arbitrary numbering ● Output only when result changes ● Tighter "side input" management for slowly changing dimension ● Fine-grained combine aggregation and output control ● Per-key "workflows" like user sign up flow w/ expiration ● Low-latency deduplication (let the first through, squash the rest) More example uses for state & timers 33
  34. 34. Performance considerations (cross-runner) 34 ● Shuffle to colocate keys ● Linear processing of elements for key+window ● Window merging ● Storage of state and timers ● GC of state
  35. 35. TL;DR State and Timers in Beam... ● … unlock new uses cases ● … they "just work" with event time windowing ● … are portable across runners (implementation ongoing) 35
  36. 36. Thank you for listening! This talk: ● Me - @aljoscha / aljoscha@apache.org / @KennKnowles / kenn@apache.org ● These Slides - https://s.apache.org/stateful-beam-strata-london-2017 Go Deeper ● Design doc - https://s.apache.org/beam-state ● Blog post - https://beam.apache.org/blog/2017/02/13/stateful-processing.html Join the Beam community: ● User discussions - user@beam.apache.org ● Development discussions - dev@beam.apache.org ● Follow @ApacheBeam on Twitter https://beam.apache.org 36
  37. 37. Backup Slides 37
  38. 38. What about Combine? 38 Quantiles Combiner Window into Fixed windows of one hour Expected result: Quantiles for each hour Quantiles Combiner Quantiles Combiner
  39. 39. Combine vs State (naive conceptualization) 39 Window hourly Expected result: Quantiles for each hour Quantiles Combiner Window hourly Quantiles DoFn Expected result: Quantiles for each hour
  40. 40. Combine vs State (likely execution plan) Quantiles Combiner Quantiles Combiner Quantiles Combiner Shuffle Accumulators Quantiles DoFn Shuffle Elements associative commutative single-output enables optimizations (engine is in control) non-associative* non-commutative* multi-output side outputs (user is in control) 40
  41. 41. Combine vs State Quantiles Combiner Quantiles DoFn output governed by trigger (data/computation unaware) "output only when there's an interesting change" (data/computation aware) 41
  42. 42. Example: Arbitrary indices D,0 A Assign indices B C D E C,1 A,2 B,3 E,4 non-associative non-commutative non-deterministic and totally fine! 42

×