Guarantees for scalable stream processing go by many misleading names today: exactly-once processing, at-least-once processing, end-to-end fault tolerance, and so on. In this talk, we instead present a rigorous overview of epoch-based stream processing, the clear and consistent processing model that underlies Apache Flink. Epoch-based stream processing relies on the notion of epoch cuts, a restricted type of Chandy and Lamport's consistent cut. We then examine different approaches to achieving epoch cuts and their performance implications, showcasing the benefits of the asynchronous epoch snapshots employed by Apache Flink. Finally, we summarize how Flink's epoch commit protocol coordinates operations with locally embedded and externally persisted state systems (e.g., Kafka, HDFS, Pravega) in practice to offer an externally consistent view of the state built by its applications.
12. How can we achieve reliable processing in the presence of failures, reconfiguration, migration, etc.?
Task computation is not staged but can go on indefinitely.
29. Reliable Stream Processing
• Existing approaches* typically adopt a fail-recovery model to repair individual task executions and reproduce computations that may have been lost
• Complex workarounds (e.g., duplicate elimination, input logging, acks)
• Strong assumptions (idempotent operations, key-level vs. task-level causal order)
• External state management (transactional external commits per action)
* "MillWheel: Fault-tolerant stream processing at internet scale," in VLDB, 2013.
"Integrating scale out and fault tolerance in stream processing using operator state management," in SIGMOD, 2013.
"Fault-tolerance and high availability in data stream management systems," in Encyclopedia of Database Systems, 2009.
"Fault-tolerance in the Borealis distributed stream processing system," in SIGMOD, 2005.
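To make the "duplicate elimination" workaround on this slide concrete, here is a minimal Python sketch. It is not taken from any of the cited systems: `DedupSink`, `run_with_replay`, and the record layout `(record_id, value)` are hypothetical names chosen for illustration. It assumes every record carries a unique ID and that the source can only recover by replaying its input from the start.

```python
class DedupSink:
    """Hypothetical sink that ignores records it has already applied."""

    def __init__(self):
        # In a real system this set would itself have to survive failures.
        self.seen_ids = set()
        self.total = 0

    def apply(self, record_id, value):
        if record_id in self.seen_ids:
            return False  # duplicate caused by a replay; drop it
        self.seen_ids.add(record_id)
        self.total += value
        return True


def run_with_replay(records, sink, fail_after):
    """Process `fail_after` records, simulate a crash, then replay everything."""
    for record_id, value in records[:fail_after]:
        sink.apply(record_id, value)
    # Recovery: the source tracks no progress, so it replays from the beginning.
    for record_id, value in records:
        sink.apply(record_id, value)


records = [(0, 10), (1, 20), (2, 30)]
sink = DedupSink()
run_with_replay(records, sink, fail_after=2)
print(sink.total)
```

The sink stays correct only because each record has a stable ID and the set of seen IDs is never lost, which is the kind of strong assumption and per-action bookkeeping the slide points at.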
30. Fault Tolerance is not enough
• Are output and states always correct?
• Can we reconfigure the system without losing computation?
• Can applications migrate without loss?
• Is external state access isolation possible?
• We need a system-wide coarse-grained commit mechanism.
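One way to picture a coarse-grained commit is the toy Python sketch below. It is an illustration under stated assumptions, not Flink's implementation: `EpochCommitter`, `run`, `epoch_size`, and `crash_at` are hypothetical names, the pipeline is a single-process keyed counter, and an epoch is simply a fixed number of input records. State and input offset are committed together at each epoch boundary, and recovery rolls both back to the last committed epoch.

```python
import copy


class EpochCommitter:
    """Hypothetical pipeline that commits state and input offset once per epoch."""

    def __init__(self, epoch_size):
        self.epoch_size = epoch_size
        self.state = {}            # working state, may hold uncommitted updates
        self.offset = 0            # next input position to read
        self.committed = ({}, 0)   # (state snapshot, offset) of the last complete epoch

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.offset += 1
        if self.offset % self.epoch_size == 0:
            # Epoch boundary: state and offset are committed as one unit.
            self.committed = (copy.deepcopy(self.state), self.offset)

    def recover(self):
        # Discard all effects that came after the last committed epoch.
        state, offset = self.committed
        self.state = copy.deepcopy(state)
        self.offset = offset


def run(stream, pipeline, crash_at=None):
    """Drive the pipeline over `stream`, optionally failing once at `crash_at`."""
    while pipeline.offset < len(stream):
        if crash_at is not None and pipeline.offset == crash_at:
            pipeline.recover()
            crash_at = None  # fail only once
            continue
        pipeline.process(stream[pipeline.offset])
    return pipeline.state


stream = ["a", "b", "a", "c", "b", "a", "a"]
with_failure = run(stream, EpochCommitter(epoch_size=3), crash_at=5)
without_failure = run(stream, EpochCommitter(epoch_size=3))
print(with_failure == without_failure)
```

Because nothing is committed per record, no duplicate elimination or idempotence is needed inside the pipeline: a failure only ever rewinds to a point where state and input position agree.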