1
DEBS’17
Tutorial 5:
Reflections on Almost Two Decades
of Research into Stream Processing
Kyumars Sheykh Esmaili
Real-Time Information Processing (RTIP) research team
Bell Labs, Nokia Inc
20-06-2017
2
A Short Bio
Streaming platform for IoT applications
Home networks monitoring
Hadoop/HDFS
Stream schema, Stream provenance, Continuous query modification
@kyumarss
3
Introduction: Streaming Has Gone Mainstream
4
This Tutorial: Reflections on a Research History
• Highlights
- trends
- best practices
• Based on a select set of
- major stream processing systems
- landmark papers
• Also lists a few directions for future research
5
What this Tutorial Is NOT: A Survey of the Field
Cugola, Gianpaolo, et al. "Processing flows of information:
From data stream to complex event processing." ACM
CSUR, 2012. (based on a DEBS tutorial)
Heinze, Thomas, et al. “Tutorial: Cloud-based Data
Stream Processing.” ACM DEBS, 2014.
6
Scope: Stream Processing vs Related Research Domains
• Active Databases
• Temporal Databases
• Sequence Databases
• CEP Systems
• Stream Processing
7
Main DBMS Principles
• Set data model
- Bounded
- Unordered
• Relational algebra/operators
• Tuples updatable/replaceable
- Random access
• Passive
• Query plan
8
All Depart from the Established Principles of DBMSs
• Active Databases
• Temporal Databases
• Sequence Databases
• CEP Systems
• Stream Processing
• Main DBMS Principles
- Set data model
• Bounded
• Unordered
- Relational algebra/operators
- Tuples updatable/replaceable
• Random access
- Passive
- Query plan
[Figure: each related domain annotated with the DBMS principles it departs from. Annotation labels: Unordered, Bounded, Passive, Random Access, Query Plan, Relational operators]
9
Outline
• Introduction (~10’)
• Part I: Notable Systems (~35’)
• Break (10’)
• Part II: Trends (~15’)
• Part III: Best Practices (~10’)
• Part IV: Future Research Directions (~10’)
10
Part I: Notable Systems
11
Stream Processing Timeline
1998 2001 2004 2007 2010 2013 2016
12
Stream Processing Timeline
1998 2001 2004 2007 2010 2013 2016
1st Generation
2nd Generation
3rd Generation
4th Generation
13
Stream Processing Timeline: 1st Generation
1998 2001 2004 2007 2010 2013 2016
Tribeca
- Append-only model; fast sequential access (tape, live from network)
- Impressive ideas: window, multiplex, demultiplex, flow language, sequential reads, minimal copying
- Shared sub-queries
- Upside-down tree!
Gigascope
- Main requirements: performance and flexibility
- Defines order attributes with ordering properties
- GSQL (SQL + merge)
- Dedicated operators
- Punctuations/heartbeats to unlock operators
- No explicit window
- Edge processing (i.e., on the NIC)
NiagaraCQ
- University of Wisconsin-Madison
- CQ subsystem of Niagara (“net” data management)
- On XML datasets, using XML-QL
- Key insight: large commonalities among queries
- Inter-query optimization (large scale + incremental)
- It also splits queries
14
Stream Processing Timeline: 1st Generation (cont.)
1998 2001 2004 2007 2010 2013 2016
Aurora
- Brown University, Brandeis University, MIT
- Aimed at monitoring streams
- Strong emphasis on QoS and approximate query answering
- Boxes and arrows (via GUI)
- Notions of slack and bounded sort
TelegraphCQ
- UC Berkeley
- Next step in the Telegraph project
- Focused on adaptive query processing
- Eddies
- Flux
- Fjords
- Initially in Java
- Re-implemented based on PostgreSQL
STREAM
- Stanford University
- DSMS for processing continuous queries over streams and relations
- An abstract semantics
- CQL: a concrete declarative query language
15
Stream Processing Timeline: 2nd Generation
1998 2001 2004 2007 2010 2013 2016
Borealis
- Initially named Aurora*
- Focused on distribution
- Relies on Aurora for single-node stream processing and on Medusa for distribution
- Revision processing
- High availability (HA)
- Connection points and time travel (replay mechanism)
IBM System S
- One of the most mature systems out there
- SPC & SPL
- SPC:
  - Distributed, dynamic, and scalable
  - Beyond relational operators
  - Processing Elements (PEs) and PE Containers
  - Notions such as subscription & discovery
  - A very elaborate transport layer (Data Fabric)
- SPL:
  - Custom language
  - Procedural
  - Code generation (C++)
  - Originally SPADE (mostly relational operators)
  - SPL focuses on UDFs
  - Operator spec includes selectivity, partitionability
  - Optional deployment and optimization hints
16
Stream Processing Timeline: 3rd Generation
1998 2001 2004 2007 2010 2013 2016
S4
- Partially fault-tolerant
- No node addition/removal from the cluster
- Design influenced by System S and MapReduce
- One PE per key value
- TTL-based removal
- Abandoned in favor of Storm
Storm
- First popular streaming platform
- Simple abstractions: spouts and bolts
- Allows building topologies
- Platform takes care of shuffling and transport
- At-least-once semantics
- Enriched with Trident
- Overhauled in Heron
MapReduce Online (HOP)
- UC Berkeley
- Hadoop Online Prototype
- Pipelines data between MapReduce operators
- Co-scheduling
- Pull-based Reduce => push-based Map
- Retains the fault tolerance properties of Hadoop
- Can run unmodified MapReduce programs
17
Stream Processing Timeline: 4th Generation
1998 2001 2004 2007 2010 2013 2016
Spark Streaming
- UC Berkeley
- Builds upon the Spark Core features
- Micro-batching
- A few new operators
- State is also treated as an RDD
- Inherits the fault tolerance capabilities of Spark
- Offers exactly-once semantics
- High throughput, “high” latency
- Has taken a back seat to Structured Streaming
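To make the micro-batching idea concrete, here is a minimal sketch in plain Python. It is not Spark Streaming's API: the function name `micro_batches`, the `(arrival_time, value)` event shape, and the fixed `interval` parameter are all assumptions made for illustration. It only shows how a stream can be chopped into small batches by arrival time, each of which could then be handed to a batch engine.

```python
from typing import Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]], interval: float) -> Iterator[List[str]]:
    """Chop a stream of (arrival_time, value) pairs into fixed-interval batches.

    Hypothetical helper: events are assumed to be given in arrival order.
    Intervals in which nothing arrived produce an empty batch.
    """
    batch: List[str] = []
    batch_end = None
    for arrival_time, value in events:
        if batch_end is None:
            # The first batch ends at the next interval boundary after the first event.
            batch_end = (arrival_time // interval + 1) * interval
        while arrival_time >= batch_end:
            # The current interval is over: hand the batch to the "batch engine".
            yield batch
            batch = []
            batch_end += interval
        batch.append(value)
    if batch:
        yield batch


stream = [(0, "a"), (0.4, "b"), (1.2, "c"), (3.1, "d")]
for batch in micro_batches(stream, interval=1):
    print(batch)
```

The batch interval puts a floor under end-to-end latency, which is one way to read the high-throughput, “high”-latency trade-off listed above.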
MillWheel
- Real use cases at Google
- UDFs
- Out-of-order processing (via watermarks)
- Fault tolerance and exactly-once semantics
- State management
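The role of watermarks in out-of-order processing can be sketched as follows. This is an illustrative toy, not MillWheel's mechanism: the watermark here is a simple heuristic (largest event time seen minus an assumed `max_delay`), and the names `windowed_counts`, `window_size`, and `max_delay` are hypothetical. A tumbling window's count is emitted only once the watermark passes the window's end; events that arrive after that are reported as dropped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def windowed_counts(
    event_times: Iterable[int], window_size: int, max_delay: int
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Count events per tumbling event-time window, tolerating out-of-order arrival.

    event_times are given in arrival order. Returns (results, dropped), where
    results holds (window_start, count) pairs in emission order.
    """
    counts: Dict[int, int] = defaultdict(int)
    results: List[Tuple[int, int]] = []
    dropped: List[int] = []
    watermark = float("-inf")
    for ts in event_times:
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append(ts)  # its window was already emitted
        else:
            counts[start] += 1
        # Heuristic watermark (an assumption): no event older than this is expected.
        watermark = max(watermark, ts - max_delay)
        for closed in sorted(w for w in counts if w + window_size <= watermark):
            results.append((closed, counts.pop(closed)))
    for remaining in sorted(counts):  # end of input: flush open windows
        results.append((remaining, counts.pop(remaining)))
    return results, dropped


results, dropped = windowed_counts([1, 3, 12, 4, 25, 8], window_size=10, max_delay=5)
print(results, dropped)
```

A larger `max_delay` drops fewer late events but delays every result, which is the basic tension any watermark policy has to settle.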
18
Stream Processing Timeline: 4th Generation (cont.)
1998 2001 2004 2007 2010 2013 2016
Google Dataflow
- Streaming as a superset of batch
- Session windows
- Windowing, watermarks, triggers, refinement
- FlumeJava + MillWheel
Samza
- Built on top of Kafka
- Heavily tied to it
- Unix philosophy
- At-least-once semantics
- Relies on YARN for deployment
- Alternative: Kafka Streams
Flink
- TU Berlin
- A collection of academic prototypes
- Aiming at batch and iterative computations
- Native support for streaming
- UDFs as first-class citizens
- Stateful by default
- Stratosphere => Flink
19
Part II: Trends
20
Trends: Overview
1. From DSMSs to Big “Streaming” Data Frameworks
2. Domain-specific to General-purpose
3. Increased Importance of Exact Results
4. Richer Window Specifications
5. Unification of Batch and Streaming Models
21
Trend 1:
From DSMSs to Big “Streaming” Data Frameworks
22
Primary Influencer: DBMS vs Big Data Frameworks
1998 2001 2004 2007 2010 2013 2016
23
Examples of DBMS Influence on Early Stream Processing Systems
Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM Sigmod Record, 2003.
24
Examples of DBMS Influence on Early Stream Processing Systems
25
Trend 2:
Domain-specific to General-purpose
26
Initial Streaming Use Cases: Network Traffic + Sensor Networks
1998 2001 2004 2007 2010 2013 2016
27
Early Use Cases
Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM Sigmod Record, 2003.
28
Trend 3:
Increased Importance of Exact Results
29
Approximate Query Processing vs Exact Results
1998 2001 2004 2007 2010 2013 2016
30
Example: Approximate Query Processing in Aurora/Borealis
31
Another Angle: One-pass Computation vs Replayability
1998 2001 2004 2007 2010 2013 2016
32
Going Beyond Guaranteed Delivery: Transactional Stream Processing
Meehan, John, et al. "S-store: Streaming meets transaction processing." VLDB, 2015.
Affetti, Lorenzo, et al. "FlowDB: Integrating Stream Processing and Consistent State Management." ACM DEBS, 2017.
33
Trend 4:
Richer Window Specifications
34
Window Types Supported by Almost All Systems
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
35
Sessions: New Window Type
36
Support for Session Windows
1998 2001 2004 2007 2010 2013 2016
37
Frames: Data-driven Windows
1998 2001 2004 2007 2010 2013 2016
38
Frames
Grossniklaus, Michael, et al. “Frames: data-driven windows.” ACM DEBS, 2016.
39
A Little More on “Semantic” Windows
• Have always been supported by CEP systems
• Main Challenge: Unpredictability
Artikis, Alexander, et al. "Complex Event Recognition Languages: Tutorial." ACM DEBS, 2017.
40
Trend 5:
Unification of Batch and Streaming Models
41
First Attempt: Lambda Architecture
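The general shape of a Lambda Architecture can be sketched in a few lines: a batch layer periodically recomputes a view from the full dataset, a speed layer keeps an incremental view of events the last batch run has not yet covered, and queries merge the two. The example below is a toy counting use case; all function names are hypothetical and no particular system's API is implied.

```python
from collections import Counter
from typing import Iterable, Mapping


def build_batch_view(master_dataset: Iterable[str]) -> Counter:
    """Batch layer: recompute the view from the full, immutable dataset."""
    return Counter(master_dataset)


def update_speed_view(speed_view: Counter, new_event: str) -> None:
    """Speed layer: incrementally fold in an event not yet covered by a batch run."""
    speed_view[new_event] += 1


def query(key: str, batch_view: Mapping[str, int], speed_view: Mapping[str, int]) -> int:
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


batch = build_batch_view(["click", "view", "click"])
speed: Counter = Counter()
update_speed_view(speed, "click")
print(query("click", batch, speed))
```

Note that the same counting logic exists twice, once per layer; maintaining two code paths is the usual argument for a single engine that handles both batch and streaming.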
42
The New Alternative: Unified Engines
1998 2001 2004 2007 2010 2013 2016
43
Examples of Unified Engines
44
Part III: Best Practices
45
Best Practices: Overview
1. Simplified Reasoning and Coordination via Punctuation
2. System-wide State Management
46
Best Practice 1:
Simplified Reasoning and Coordination via Punctuation
47
Use of Punctuation in Stream Processing Platforms
1998 2001 2004 2007 2010 2013 2016
48
Use of Punctuation for Optimization
Tucker, Peter A., et al. "Exploiting punctuation semantics in continuous data streams." IEEE TKDE, 2003.
Li, Jin, et al. "Semantics and evaluation techniques for window aggregates in data streams." ACM SIGMOD, 2005.
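A small sketch may help show why punctuation matters for a blocking operator such as a grouped aggregate. Here a punctuation is an in-stream marker asserting that no further records at or before a given hour will arrive, so the operator can emit those groups and purge their state. The `Record` and `Punctuation` types and the `sum_per_hour` function are hypothetical and only illustrate the general idea, not the specific techniques of the papers cited above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Union


@dataclass(frozen=True)
class Record:
    hour: int
    value: float


@dataclass(frozen=True)
class Punctuation:
    hour: int  # asserts: no more records with record.hour <= hour will arrive


def sum_per_hour(
    stream: Iterable[Union[Record, Punctuation]]
) -> Tuple[List[Tuple[int, float]], Dict[int, float]]:
    """Grouped sum that emits a group only once a punctuation closes it.

    Returns (emitted_results, state_still_held).
    """
    state: Dict[int, float] = {}
    emitted: List[Tuple[int, float]] = []
    for item in stream:
        if isinstance(item, Punctuation):
            # The punctuation unblocks every group it covers and lets us free its state.
            for hour in sorted(h for h in state if h <= item.hour):
                emitted.append((hour, state.pop(hour)))
        else:
            state[item.hour] = state.get(item.hour, 0.0) + item.value
    return emitted, state


stream = [Record(1, 2.0), Record(2, 1.0), Record(1, 3.0), Punctuation(1), Record(2, 4.0)]
print(sum_per_hour(stream))
```

Without the punctuation, the operator could never be sure that hour 1 was complete and would have to hold its state indefinitely.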
49
Use of Punctuation for Query Modification
Sheykh Esmaili, Kyumars, et al. "Changing flights in mid-air: a model for safely modifying continuous queries." ACM SIGMOD, 2011.
50
Use of Punctuation for Snapshotting
Carbone, Paris, et al. "Lightweight asynchronous snapshots for distributed dataflows." arXiv preprint
arXiv:1506.08603 (2015).
51
Best Practice 2:
System-wide State Management
52
State Management: Different Aspects
To, Quoc-Cuong, et al. "A Survey of State Management in Big Data Processing Systems." arXiv preprint
arXiv:1702.01596 (2017).
53
State Management in Stream Processing: Main Cases
54
Support for State Management in Stream Processing Systems
1998 2001 2004 2007 2010 2013 2016
55
State Management Examples: Load Balancing and Auto-Parallelization
Gedik, Buğra, et al. "Elastic scaling for data
stream processing." IEEE Transactions on
Parallel and Distributed Systems, 2014.
Shah, Mehul A., et al. "Flux: An adaptive partitioning
operator for continuous query systems." ICDE, 2003.
56
State Management Examples: Fault Tolerance
57
Part IV: Future Research Directions
58
IoT-induced Requirements for Stream Processing Platforms
Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.
59
Nokia Bell Labs’ World Wide Streams (WWS) Platform: Bird’s Eye View
[Figure: WWS architecture. Labels: External Interfaces, Orchestration Layer, Processing Sites, XStream Language & XStream Studio, Compiler, Deployer, Placement Algorithm, Site Monitor, Registry, Gateway, Dispatcher, StreamBridge, Message Broker, Media Server, Media Processor, XStream Processor, Geo Processor]
60
References
• Esmaili, Kyumars Sheykh. "Reflections on Almost Two Decades of Research into Stream Processing." ACM DEBS, 2017.
• Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.

Reflections on Almost Two Decades of Research into Stream Processing

  • 1. 1 DEBS’17 Tutorial 5: Reflections on Almost Two Decades of Research into Stream Processing Kyumars Sheykh Esmaili Real-Time Information Processing (RTIP) research team Bell Labs, Nokia Inc 20-06-2017
  • 2. 2 Streaming platform for IoT applications Home networks monitoring Hadoop/HDFS Stream schema, Stream provenance, Continuous query modification A Short Bio @kyumarss
  • 4. 4 This Tutorial: Reflections on a Research History • Highlights - trends - best practices • Based on a select set of - major stream processing systems - landmark papers • Also lists a few directions for future research
  • 5. 5 What this Tutorial Is NOT: A Survey of the Field Cugola, Gianpaolo, et al. "Processing flows of information: From data stream to complex event processing." ACM CSUR, 2012. (based on a DEBS tutorial) Heinze, Thomas, et al. “Tutorial: Cloud-based Data Stream Processing.” ACM DEBS, 2014.
  • 6. 6 Scope: Stream Processing vs Related Research Domains Active Databases Temporal Databases Sequence Databases CEP Systems Stream Processing
  • 7. 7 Main DBMS Principles • Set data model - Bounded - Unordered • Relational algebra/operators • Tuples updatable/replaceable - Random access • Passive • Query plan
  • 8. 8 All Depart from the Established Principles of DBMSs Active Databases Temporal Databases Sequence Databases CEP Systems Stream Processing • Main DBMS Principles - Set data model • Bounded • Unordered - Relational algebra/operators - Tuples updatable/replaceable • Random access - Passive - Query plan Unordered Bounded Unordered Unordered Unordered Passive Passive Random Access Random Access Passive Bounded Query Plan Relational operators
  • 9. 9 • Introduction (~10’) • Part I: Notable Systems (~35’) • ----------- Break (10’)----------- • Part II: Trends (~15’) • Part III: Best Practices (~10’) • Part IV: Future Research Directions (~10’) Outline
  • 11. 11 Stream Processing Timeline (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 12. 12 Stream Processing Timeline (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) 1st Generation, 2nd Generation, 3rd Generation, 4th Generation
  • 13. 13 Stream Processing Timeline: 1st Generation (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) - Append-only model; fast sequential access (tape, live from network) - Impressive ideas: window, multiplex, demultiplex, flow language, sequential reads, minimal copying - Shared sub-queries - Upside-down tree! - Main requirements: performance and flexibility - Defines order attributes with ordering properties - GSQL (SQL + merge) - Dedicated operators - Punctuations/heartbeats to unlock operators - No explicit window - Edge processing (i.e. on the NIC) - University of Wisconsin-Madison - CQ subsystem of Niagara (“net” data management) - On XML datasets, using XML-QL - Key insight: large commonalities - Inter-query optimization (large scale + incremental) - It also splits queries
  • 14. 14 Stream Processing Timeline: 1st Generation (cont.) (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) - Brown University, Brandeis University, MIT - Aimed at monitoring streams - Strong emphasis on QoS and approximate query answering - Arrows and boxes (via GUI) - Notion of slack and bounded sort - UC Berkeley - Next step in the Telegraph project - Focused on adaptive query processing - Eddies - Flux - Fjords - Initially in Java, later re-implemented on top of PostgreSQL - Stanford University - DSMS for processing continuous queries over streams and relations - An abstract semantics - CQL: a concrete declarative query language
  • 15. 15 Stream Processing Timeline: 2nd Generation (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) - Initially named Aurora* - Focused on distribution - Relies on Aurora for single-node stream processing and on Medusa for distribution - Revision processing - HA - Connection points and time travel (replay mechanism) - One of the most mature systems out there - SPC & SPL - SPC: - Distributed, dynamic, and scalable - Beyond relational operators - Processing Elements (PEs) and PE Containers - Notions such as subscription & discovery - A very elaborate transport layer (Data Fabric) - SPL: - Custom language - Procedural - Code generation (C++) - Originally SPADE (mostly relational operators) - SPL focuses on UDFs - Operator spec includes selectivity, partitionability - Optional deployment and optimization hints
  • 16. 16 Stream Processing Timeline: 3rd Generation (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) - Partially fault-tolerant - No node addition/removal from the cluster - Design influenced by System S and MapReduce - One PE per key value - TTL-based removal - Abandoned in favor of Storm - First popular streaming platform - Simple abstractions: spouts and bolts - Allows building topologies - Platform takes care of shuffling and transport - At-least-once semantics - Enriched with Trident - Overhauled in Heron - UC Berkeley - Hadoop Online Prototype - Pipelines data between MapReduce operators - Co-scheduling - Pull-based Reduce => push-based Map - Retains the fault tolerance properties of Hadoop - Can run unmodified MapReduce programs
  • 17. 17 Stream Processing Timeline: 4th Generation (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) - UC Berkeley - Builds upon the Spark Core features - Micro-batching - A few new operators - State is also treated as an RDD - Inherits the fault tolerance capabilities of Spark - Offers exactly-once semantics - High throughput, “high” latency - Has taken a back seat to Structured Streaming - Real use cases at Google - UDFs - Out-of-order processing (via watermarks) - Fault tolerance and exactly-once semantics - State management
  • 18. 18 Stream Processing Timeline: 4th Generation (cont.) (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016) - Streaming as a superset of batch - Session windows - Windowing, watermarks, triggers, refinement - FlumeJava + MillWheel - Built on top of Kafka - Heavily tied to it - Unix philosophy - At-least-once semantics - Relies on YARN for deployment - Alternative: Kafka Streams - TU Berlin - A collection of academic prototypes - Aiming at batch and iterative computations - Native support for streaming - UDFs as first-class citizens - Stateful is the default - Stratosphere => Flink
  • 20. 20 Trends: Overview 1. From DSMSs to Big “Streaming” Data Frameworks 2. Domain-specific to General-purpose 3. Increased Importance of Exact Results 4. Richer Window Specifications 5. Unification of Batch and Streaming Models
  • 21. 21 Trend 1: From DSMSs to Big “Streaming” Data Frameworks
  • 22. 22 Primary Influencer: DBMS vs Big Data Frameworks (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 23. 23 Examples of DBMS Influence on Early Stream Processing Systems Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM Sigmod Record, 2003.
  • 24. 24 Examples of DBMS Influence on Early Stream Processing Systems
  • 26. 26 Initial Streaming Use Cases: Network Traffic + Sensor Networks (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 27. 27 Early Use Cases Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM Sigmod Record, 2003.
  • 29. 29 Approximate Query Processing vs Exact Results (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 30. 30 Example: Approximate Query Processing in Aurora/Borealis
  • 31. 31 Another Angle: One-pass Computation vs Replayability (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 32. 32 Going Beyond Guaranteed Delivery: Transactional Stream Processing Meehan, John, et al. "S-store: Streaming meets transaction processing." VLDB, 2015. Affetti, Lorenzo, et al. "FlowDB: Integrating Stream Processing and Consistent State Management." ACM DEBS, 2017.
  • 33. 33 Trend 4: Richer Window Specifications
  • 34. 34 Window Types Supported by Almost All Systems https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  • 36. 36 Support for Session Windows (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 37. 37 Frames: Data-driven Windows (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 38. 38 Frames Grossniklaus, Michael, et al. “Frames: data-driven windows.” ACM DEBS, 2016.
  • 39. 39 A Little More on “Semantic” Windows • Have always been supported by CEP systems • Main challenge: unpredictability Artikis, Alexander, et al. "Complex Event Recognition Languages: Tutorial." ACM DEBS, 2017.
  • 40. 40 Trend 5: Unification of Batch and Streaming Models
  • 41. 41 First Attempt: Lambda Architecture
  • 42. 42 The New Alternative: Unified Engines (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 44. 44 Part III: Best Practices
  • 45. 45 Best Practices: Overview 1. Simplified Reasoning and Coordination via Punctuation 2. System-wide State Management
  • 46. 46 Best Practice 1: Simplified Reasoning and Coordination via Punctuation
  • 47. 47 Use of Punctuation in Stream Processing Platforms (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 48. 48 Use of Punctuation for Optimization Tucker, Peter A., et al. "Exploiting punctuation semantics in continuous data streams." IEEE TKDE, 2003. Li, Jin, et al. "Semantics and evaluation techniques for window aggregates in data streams." ACM SIGMOD, 2005.
  • 49. 49 Use of Punctuation for Query Modification Sheykh Esmaili, Kyumars, et al. "Changing flights in mid-air: a model for safely modifying continuous queries." ACM SIGMOD, 2011.
  • 50. 50 Use of Punctuation for Snapshotting Carbone, Paris, et al. "Lightweight asynchronous snapshots for distributed dataflows." arXiv preprint arXiv:1506.08603 (2015).
  • 52. 52 State Management: Different Aspects To, Quoc-Cuong, et al. "A Survey of State Management in Big Data Processing Systems." arXiv preprint arXiv:1702.01596 (2017).
  • 53. 53 State Management in Stream Processing: Main Cases
  • 54. 54 Support for State Management in Stream Processing Systems (timeline: 1998, 2001, 2004, 2007, 2010, 2013, 2016)
  • 55. 55 State Management Examples: Load Balancing and Auto-Parallelization Gedik, Buğra, et al. "Elastic scaling for data stream processing." IEEE Transactions on Parallel and Distributed Systems, 2014. Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." ICDE, 2003.
  • 56. 56 State Management Examples: Fault Tolerance
  • 57. 57 Part IV: Future Research Directions
  • 58. 58 IoT-induced Requirements for Stream Processing Platforms Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.
  • 59. 59 Nokia Bell Labs' World Wide Streams (WWS) Platform: Bird's Eye View (architecture diagram; labels: XStream Language & XStream Studio, Deployer, Placement Algorithm, Site Monitor, Media Processor, Processing Sites, XStream Processor, Geo Processor, Media Server, Message Broker, StreamBridge, Dispatcher, Registry, Gateway, Compiler, Orchestration Layer, External Interfaces)
  • 60. 60 References • Esmaili, Kyumars Sheykh. "Reflections on Almost Two Decades of Research into Stream Processing." ACM DEBS, 2017. • Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.
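
The idea noted on slide 13 ("punctuations/heartbeats to unlock operators") and revisited under Best Practice 1 can be illustrated with a minimal Python sketch. It is not taken from the slides or from any of the systems listed: the names `Record`, `Punctuation`, and `windowed_sum` are hypothetical, and the sketch assumes a punctuation with timestamp `ts` promises that no later record carries a timestamp below `ts`. Under that assumption a blocking tumbling-window aggregate may emit a window as soon as a punctuation passes the window's end, even if records arrived out of order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class Record:
    ts: int      # event timestamp (hypothetical integer time units)
    value: int


@dataclass(frozen=True)
class Punctuation:
    ts: int      # assumed promise: no later record has a timestamp < ts


def windowed_sum(
    stream: Iterable[Union[Record, Punctuation]], size: int
) -> Iterator[Tuple[int, int]]:
    """Sum values per tumbling window [start, start + size).

    A window is emitted only when a punctuation shows it is complete.
    """
    open_windows: Dict[int, int] = {}  # window start -> running sum
    for item in stream:
        if isinstance(item, Record):
            start = (item.ts // size) * size
            open_windows[start] = open_windows.get(start, 0) + item.value
        else:
            # Collect the closed windows first, then remove them while emitting.
            closed = sorted(s for s in open_windows if s + size <= item.ts)
            for start in closed:
                yield start, open_windows.pop(start)


if __name__ == "__main__":
    events = [
        Record(1, 10), Record(12, 5), Record(3, 7),  # out-of-order arrival
        Punctuation(10),                             # unlocks window [0, 10)
        Record(14, 1),
        Punctuation(20),                             # unlocks window [10, 20)
    ]
    for start, total in windowed_sum(events, size=10):
        print(start, total)
```

Without the punctuations the operator could never safely emit a result on an unbounded stream, which is the sense in which punctuation "unlocks" it. Windows that no punctuation has passed simply stay open in this sketch.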