This is the slide deck that I used during my tutorial presentation at the ACM DEBS Conference (http://www.debs2017.org/) that was held in Barcelona between June 19 and June 23, 2017.
The tutorial paper itself can be accessed here: http://dl.acm.org/citation.cfm?id=3095110
Reflections on Almost Two Decades of Research into Stream Processing
1. 1
DEBS’17
Tutorial 5:
Reflections on Almost Two Decades
of Research into Stream Processing
Kyumars Sheykh Esmaili
Real-Time Information Processing (RTIP) research team
Bell Labs, Nokia Inc
20-06-2017
2. 2
A Short Bio
Streaming platform for IoT applications
Home networks monitoring
Hadoop/HDFS
Stream schema, Stream provenance, Continuous query modification
@kyumarss
4. 4
This Tutorial: Reflections on a Research History
• Highlights
- trends
- best practices
• Based on a select set of
- major stream processing systems
- landmark papers
• Also lists a few directions for future research
5. 5
What this Tutorial Is NOT: A Survey of the Field
Cugola, Gianpaolo, et al. "Processing flows of information: From data stream to complex event processing." ACM CSUR, 2012. (based on a DEBS tutorial)
Heinze, Thomas, et al. "Tutorial: Cloud-based Data Stream Processing." ACM DEBS, 2014.
6. 6
Scope: Stream Processing vs Related Research Domains
Active Databases
Temporal Databases
Sequence Databases
CEP Systems
Stream Processing
7. 7
Main DBMS Principles
• Set data model
- Bounded
- Unordered
• Relational algebra/operators
• Tuples updatable/replaceable
- Random access
• Passive
• Query plan
8. 8
All Depart from the Established Principles of DBMSs
Active Databases
Temporal Databases
Sequence Databases
CEP Systems
Stream Processing
• Main DBMS Principles
- Set data model
• Bounded
• Unordered
- Relational algebra/operators
- Tuples updatable/replaceable
• Random access
- Passive
- Query plan
(Diagram annotations marking the principles that the domains depart from: Unordered ×4, Passive ×3, Bounded ×2, Random Access ×2, Query Plan ×1, Relational operators ×1.)
9. 9
Outline
• Introduction (~10’)
• Part I: Notable Systems (~35’)
• Break (10’)
• Part II: Trends (~15’)
• Part III: Best Practices (~10’)
• Part IV: Future Research Directions (~10’)
13. 13
Stream Processing Timeline: 1st Generation
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
Tribeca
- Append-only model; fast sequential access (tape, live from network)
- Impressive ideas: window, multiplex, demultiplex, flow language, sequential reads, minimal copying
- Shared sub-queries
- Upside-down tree!
Gigascope
- Main requirements: performance and flexibility
- Defines order attributes with ordering properties
- GSQL (SQL + merge)
- Dedicated operators
- Punctuations/heartbeats to unlock operators
- No explicit window
- Edge processing (i.e., on the NIC)
NiagaraCQ
- University of Wisconsin-Madison
- CQ subsystem of Niagara (“net” data management)
- On XML datasets, using XML-QL
- Key insight: many continuous queries share large commonalities
- Inter-query optimization (large scale + incremental)
- It also splits queries
14. 14
Stream Processing Timeline: 1st Generation (cont.)
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
Aurora
- Brown Uni, Brandeis Uni, MIT
- Aimed at monitoring streams
- Lots of emphasis on QoS, approximate query answering
- Arrows and Boxes (via GUI)
- Notions of Slack and Bounded Sort
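The notion of slack combined with a bounded sort can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator of any particular system: the function name `bounded_sort` and the `slack` parameter are hypothetical. The operator buffers tuples in a min-heap, starts emitting the smallest one once more than `slack` tuples are buffered, and drops any tuple that arrives after a later tuple has already been emitted.

```python
import heapq
from typing import Iterable, Iterator


def bounded_sort(timestamps: Iterable[int], slack: int) -> Iterator[int]:
    """Sort a nearly ordered stream while buffering at most slack + 1 items.

    A tuple is dropped when it is older than something already emitted,
    because emitting it would break the output order.
    """
    heap = []
    last_emitted = None
    for ts in timestamps:
        if last_emitted is not None and ts < last_emitted:
            continue  # arrived too late: outside the slack bound
        heapq.heappush(heap, ts)
        if len(heap) > slack:
            last_emitted = heapq.heappop(heap)
            yield last_emitted
    while heap:  # end of input: flush the remaining buffer in order
        yield heapq.heappop(heap)
```

A larger `slack` tolerates more disorder at the cost of higher latency and memory; with `slack=0` every out-of-order tuple is dropped.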
TelegraphCQ
- UC Berkeley
- Next step in the Telegraph project
- Focused on adaptive query processing
- Eddies
- Flux
- Fjords
- Initially in Java; re-implemented based on PostgreSQL
STREAM
- Stanford Uni
- DSMS for processing continuous queries over streams and relations
- An abstract semantics
- CQL: a concrete declarative query language
15. 15
Stream Processing Timeline: 2nd Generation
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
Borealis
- Initially named Aurora*
- Focused on distribution
- Relies on Aurora for single-node stream processing and Medusa for the distribution
- Revision processing
- High availability (HA)
- Connection points and time travel (replay mechanism)
System S (IBM)
- One of the most mature systems out there
- SPC & SPL
- SPC:
  - Distributed, dynamic, and scalable
  - Beyond relational operators
  - Processing Elements (PEs) and PE Containers
  - Notions such as subscription & discovery
  - A very elaborate transport layer (Data Fabric)
- SPL:
  - Custom language
  - Procedural
  - Code generation (C++)
  - Originally SPADE (mostly relational operators)
  - SPL focuses on UDFs
  - Operator spec includes selectivity, partitionability
  - Optional deployment and optimization hints
16. 16
Stream Processing Timeline: 3rd Generation
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
S4
- Partially fault-tolerant
- No node addition/removal from the cluster
- Design influenced by System S and MapReduce
- One PE per key value
- TTL-based removal
- Abandoned in favor of Storm
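The "one PE per key value" design with TTL-based removal can be illustrated with a small sketch. It is a simplification under stated assumptions: the class `KeyedPEs`, its methods, and the choice of a per-key counter as the processing element are all hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
from typing import Dict, Hashable, Tuple


class KeyedPEs:
    """Keeps one processing element (here: a counter) per key value
    and removes elements that have been idle for longer than `ttl`."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        # key -> (event count, time the key was last seen)
        self._pes: Dict[Hashable, Tuple[int, float]] = {}

    def process(self, key: Hashable, now: float) -> int:
        """Route an event to the PE for `key`, creating it if needed."""
        self._evict(now)
        count, _ = self._pes.get(key, (0, now))
        self._pes[key] = (count + 1, now)
        return count + 1

    def _evict(self, now: float) -> None:
        """Drop every PE whose key has not been seen within the TTL."""
        expired = [k for k, (_, seen) in self._pes.items()
                   if now - seen > self.ttl]
        for k in expired:
            del self._pes[k]

    def live_keys(self) -> int:
        return len(self._pes)
```

The TTL bounds memory when the key space is large, at the price of losing the state of keys that reappear after a long pause.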
Storm
- First popular streaming platform
- Simple abstractions: spouts and bolts
- Allows building topologies
- Platform takes care of shuffling, transport
- At-least-once semantics
- Enriched with Trident:
- Overhauled in Heron
Hadoop Online Prototype (HOP)
- UC Berkeley
- Pipelines data between MapReduce operators
- Co-scheduling
- Pull-based Reduce => push-based Map
- Retains the fault tolerance properties of Hadoop
- Can run unmodified MapReduce programs
17. 17
Stream Processing Timeline: 4th Generation
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
Spark Streaming
- UC Berkeley
- Builds upon the Spark Core features
- Micro-batching
- A few new operators
- State is also treated as RDD
- Inherits fault tolerance capabilities of Spark
- Offers exactly-once semantics
- High throughput, “high” latency
- Has taken a back seat to Structured Streaming
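Micro-batching, and the idea of treating state as just another immutable dataset, can be sketched in plain Python. This is not Spark code: the helper names `micro_batches` and `running_counts` are hypothetical, events are assumed to arrive in timestamp order, and empty intervals are simply skipped.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  interval: float) -> Iterator[List[str]]:
    """Cut a time-ordered stream of (timestamp, value) pairs into
    consecutive batches, one per `interval` of time."""
    current, batch = None, []
    for ts, value in events:
        index = int(ts // interval)  # which interval this event falls into
        if current is not None and index != current:
            yield batch
            batch = []
        current = index
        batch.append(value)
    if batch:
        yield batch


def running_counts(batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    """Derive a new state from the previous state and each batch,
    never mutating the old state in place."""
    state = Counter()
    for batch in batches:
        state = state + Counter(batch)
        yield dict(state)
```

Each batch is processed as a small bounded job, which explains the trade-off listed above: throughput is high, but no result can appear before its interval has closed.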
MillWheel
- Real use cases at Google
- UDFs
- Out-of-order processing (via watermarks)
- Fault tolerance and exactly-once semantics
- State management
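Out-of-order processing via watermarks can be illustrated with a tumbling-window counter. The sketch below makes a simplifying assumption that should not be read as how any production system computes its watermark: the watermark is taken to be the largest event time seen so far minus a fixed `max_delay`. The class and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


class WatermarkedWindowCounter:
    """Counts events per tumbling window of event time and emits a window
    only once the watermark has passed the end of that window."""

    def __init__(self, window_size: int, max_delay: int):
        self.window_size = window_size
        self.max_delay = max_delay
        self.max_event_time: Optional[int] = None
        self.open_windows = defaultdict(int)  # window start -> count

    def watermark(self) -> Optional[int]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def on_event(self, event_time: int) -> List[Tuple[int, int]]:
        """Add one event; return the (window_start, count) pairs it closes."""
        start = (event_time // self.window_size) * self.window_size
        wm = self.watermark()
        if wm is None or start + self.window_size > wm:
            self.open_windows[start] += 1
        # else: the window was already closed, so the event is dropped as late
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        wm = self.watermark()
        closed = sorted(s for s in self.open_windows
                        if s + self.window_size <= wm)
        return [(s, self.open_windows.pop(s)) for s in closed]
```

With `window_size=10` and `max_delay=5`, an event with time 8 that arrives after an event with time 12 is still counted in window [0, 10), whereas an event with time 3 that arrives after time 16 is dropped.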
18. 18
Stream Processing Timeline: 4th Generation (cont.)
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
Google Dataflow
- Streaming as superset of batch
- Session windows
- Windowing, watermarks, triggers, refinement
- FlumeJava + MillWheel
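Session windows group events by periods of activity separated by a gap of inactivity, rather than by fixed boundaries. The sketch below shows the grouping rule only; it sorts a finite list of timestamps up front, which a streaming implementation cannot do (there, sessions have to be merged as late events arrive). The function name `session_windows` and the `gap` parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def session_windows(timestamps: Iterable[int],
                    gap: int) -> List[Tuple[int, int, int]]:
    """Group timestamps into sessions: a new session starts whenever an
    event is more than `gap` later than the previous one.

    Returns (session_start, session_end, event_count) triples, where the
    session end is the last event time plus the gap.
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity gap exceeded: new session
    return [(s[0], s[-1] + gap, len(s)) for s in sessions]
```

Unlike tumbling or sliding windows, the number and length of session windows depend on the data itself.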
Samza
- Built on top of Kafka
- Heavily tied to it
- Unix philosophy
- At-least-once semantics
- Relies on YARN for deployment
- Alternative: Kafka Streams
Flink
- TU Berlin
- A collection of academic prototypes
- Aiming at batch and iterative computations
- Native support for streaming
- UDFs as first-class citizens
- Stateful by default
- Stratosphere => Flink
20. 20
Trends: Overview
1. From DSMSs to Big “Streaming” Data Frameworks
2. Domain-specific to General-purpose
3. Increased Importance of Exact Results
4. Richer Window Specifications
5. Unification of Batch and Streaming Models
23. 23
Examples of DBMS Influence on Early Stream Processing Systems
Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM Sigmod Record, 2003.
39. 39
A Little More on “Semantic” Windows
• Has always been supported by CEP systems
• Main Challenge: Unpredictability
Artikis, Alexander, et al. "Complex Event Recognition Languages: Tutorial." ACM DEBS, 2017.
47. 47
Use of Punctuation in Stream Processing Platforms
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
48. 48
Use of Punctuation for Optimization
Tucker, Peter A., et al. "Exploiting punctuation semantics in continuous data streams." IEEE TKDE, 2003.
Li, Jin, et al. "Semantics and evaluation techniques for window aggregates in data streams." ACM SIGMOD, 2005.
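How a punctuation can unblock an otherwise blocking operator is sketched below. A group-by count over an unbounded stream can never emit a final result on its own; a punctuation asserting that no further tuple with a given key will arrive lets the operator emit that group and purge its state. The `Punctuation` class and the `count_by_key` function are hypothetical names for this illustration, and the punctuation format is far simpler than the pattern-based punctuations described in the literature.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class Punctuation:
    """Asserts that no further tuple with this key will appear."""
    key: str


def count_by_key(stream: Iterable[Union[str, Punctuation]]
                 ) -> Iterator[Tuple[str, int]]:
    """Count tuples per key; emit and discard a group as soon as a
    punctuation guarantees that the group is complete."""
    counts: Dict[str, int] = {}
    for item in stream:
        if isinstance(item, Punctuation):
            if item.key in counts:
                yield item.key, counts.pop(item.key)  # final result, state purged
        else:
            counts[item] = counts.get(item, 0) + 1
```

Groups whose key is never punctuated stay in memory and are never emitted, which is exactly the blocking behaviour that punctuations are meant to resolve.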
49. 49
Use of Punctuation for Query Modification
Sheykh Esmaili, Kyumars, et al. "Changing flights in mid-air: a model for safely modifying continuous queries." ACM SIGMOD, 2011.
50. 50
Use of Punctuation for Snapshotting
Carbone, Paris, et al. "Lightweight asynchronous snapshots for distributed dataflows." arXiv preprint arXiv:1506.08603 (2015).
52. 52
State Management: Different Aspects
To, Quoc-Cuong, et al. "A Survey of State Management in Big Data Processing Systems." arXiv preprint arXiv:1702.01596 (2017).
54. 54
Support for State Management in Stream Processing Systems
Timeline axis: 1998, 2001, 2004, 2007, 2010, 2013, 2016
55. 55
State Management Examples: Load Balancing and Auto-Parallelization
Gedik, Buğra, et al. "Elastic scaling for data stream processing." IEEE Transactions on Parallel and Distributed Systems, 2014.
Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." ICDE, 2003.
58. 58
IoT-induced Requirements for Stream Processing Platforms
Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.
59. 59
Nokia Bell Labs’ World Wide Streams (WWS) Platform: Bird’s Eye View
Architecture (diagram labels): XStream Language & XStream Studio; Compiler; Deployer; Placement Algorithm; Site Monitor; Registry; Gateway; Dispatcher; StreamBridge; Message Broker; Media Server; Media Processor; XStream Processor; Geo Processor; Processing Sites; Orchestration Layer; External Interfaces
60. 60
References
• Esmaili, Kyumars Sheykh. "Reflections on Almost Two Decades of Research into Stream Processing." ACM DEBS, 2017.
• Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.