This project has received funding from the European Union’s Horizon 2020
research and innovation program under grant agreement No. 688191.
@tobiaslindener
lindener@kth.se
Approximate Standing Queries on Apache Flink
Overview
1. Introduction
2. Background
3. Design & Implementation
4. Results
04.09.2018 2
It is better to use a crude approximation and know the truth, plus or minus
10 percent, than demand an exact solution and know nothing at all
In Arthur Bloch, The Complete Murphy's Law: A Definitive Collection (1991), 126
Unbounded Stream
ID_A
P_A
Key Count
A 1
Time
Infinite
Growth
Unbounded Stream
ID_A
P_A
ID_B
P_A
Key Count
A 1
B 1
Time
Infinite
Growth
Unbounded Stream
ID_A
P_A
ID_C
P_A
ID_B
P_A
Key Count
A 1
B 1
C 1
Time
Infinite
Growth
Unbounded Stream
ID_A
P_A
ID_C
P_A
ID_D
P_A
ID_B
P_A
Key Count
A 1
B 1
C 1
D 1
Time
Infinite
Growth
Unbounded Stream
ID_A
P_A
ID_C
P_A
ID_D
P_A
ID_C
P_A
ID_B
P_A
Key Count
A 1
B 1
C 2
D 1
Time
Infinite
Growth
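The build-up above can be reproduced with a minimal sketch of exact counting. It assumes the arrival order A, B, C, D, C (read off the growing Key/Count table, not stated on the slides), and the names `exact_counts`, `states` and `state_sizes` are illustrative. The point it shows: exact state holds one entry per distinct key, so it grows as long as new keys keep arriving.

```python
from collections import Counter


def exact_counts(stream):
    """Yield a snapshot of the exact per-key counts after every element."""
    counts = Counter()
    for key in stream:
        counts[key] += 1
        yield dict(counts)


# Assumed arrival order, taken from the slide build-up.
states = list(exact_counts(["A", "B", "C", "D", "C"]))
final = states[-1]                        # counts after the last element
state_sizes = [len(s) for s in states]    # number of keys held at each step
```

Here `state_sizes` only stops growing when a key repeats. On an unbounded stream with an unbounded key space, that never settles.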
Approximation Algorithms
Use-cases
• Identify heavy hitters (Count)
• Cardinality Estimation (Count Distinct)
Algorithms
• Frequent Item Estimation
• HyperLogLog
• Quantile Estimation
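As an illustration of frequent item estimation, here is a sketch of the textbook Misra-Gries summary. It is not the DataSketches implementation linked in the references; the function name `misra_gries` and the sample stream are made up for this example. With k counters and a stream of n elements, each reported count undercounts the true frequency by at most n / (k + 1).

```python
def misra_gries(stream, k):
    """Summarise a stream using at most k counters (Misra-Gries)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream: "C" is the heavy hitter (8 of 12 elements).
stream = ["C"] * 6 + ["A", "B", "D", "E"] + ["C"] * 2
summary = misra_gries(stream, k=2)
```

With k = 2 the summary keeps only "C", with a count of 6 instead of the true 8, which is inside the bound of 12 / 3 = 4.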
Processing
Flink Architecture (Apache Software Foundation, 2018)
Processing
Flink Architecture (Apache Software Foundation, 2018), with Approximate Queries added
Approximate Query API (1)
Approximate Query API (2)
Sketch Capacity
Estimate Deviation - Frequent Items
WikiTrace dataset: top 6000 URLs
Amazon dataset: top 1000 reviewers
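One plausible way to measure estimate deviation for the top-N keys is sketched below. The exact metric used in the evaluation is not given on the slide, so the relative deviation |estimate - exact| / exact is an assumption, as are the function name `relative_deviation` and the sample counts.

```python
from collections import Counter


def relative_deviation(exact, estimated, top_n):
    """Relative deviation of the estimates for the top_n keys by exact count."""
    result = {}
    for key, true_count in Counter(exact).most_common(top_n):
        # A key missing from the sketch counts as an estimate of 0.
        est = estimated.get(key, 0)
        result[key] = abs(est - true_count) / true_count
    return result


# Hypothetical counts, not taken from the WikiTrace or Amazon datasets.
exact = {"/home": 100, "/about": 40, "/faq": 10}
estimated = {"/home": 95, "/about": 42}
dev = relative_deviation(exact, estimated, top_n=2)
```

Both keys in this made-up example deviate by 5 percent from their exact counts.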
Estimate Deviation - HyperLogLog
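For orientation, a compact textbook HyperLogLog is sketched below. It is not the DataSketches implementation linked in the references; the class name, the choice of SHA-1 as hash, and the default precision p = 12 are assumptions made for this example. With m = 2^p registers, the expected relative standard error is roughly 1.04 / sqrt(m), about 1.6 percent for p = 12.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with m = 2**p registers (p >= 7 for this alpha)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")    # duplicates leave the registers unchanged
approx = hll.estimate()
```

The state is a fixed array of 4096 small integers regardless of how many distinct users arrive, which is the trade against the exact count.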
Future Work
Native Implementation
• Potentially increased efficiency
• Reduced overhead
• Stateful processing
Integration with Table API
• CQL with Approximate Queries
• Support for Count Distinct
Queryable State
• Reduced data handling
Queryable State
Diagram: Stream Elements → Sketch Function → Query Results, along a Time axis
Queryable State
Diagram: Stream Elements → Sketch Function → Query Results, along a Time axis, with Query Results also served through a REST API
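The queryable state idea can be sketched in a few lines: the sketch lives as operator state, is updated per stream element, and its current estimate can be read at any time. Everything named here is hypothetical (`QueryableState`, `ExactDistinct`, the JSON field names), and this is plain Python rather than Flink's queryable state API. `ExactDistinct` is only a stand-in for a real sketch such as HyperLogLog.

```python
import json


class ExactDistinct:
    """Stand-in for a real sketch; exact, so its size is not bounded."""

    def __init__(self):
        self._seen = set()

    def update(self, element):
        self._seen.add(element)

    def estimate(self):
        return len(self._seen)


class QueryableState:
    """Holds a sketch as state and serves its current estimate on request."""

    def __init__(self, sketch):
        self._sketch = sketch
        self._elements = 0

    def process(self, element):
        self._sketch.update(element)
        self._elements += 1

    def handle_query(self):
        # Body a REST endpoint could return; the field names are made up.
        return json.dumps({"elements": self._elements,
                           "estimate": self._sketch.estimate()})


state = QueryableState(ExactDistinct())
responses = []
for element in ["A", "B", "C", "D", "C"]:
    state.process(element)
    responses.append(state.handle_query())    # query after every element
```

Only the small query result leaves the operator, never the stream elements themselves, which matches the "reduced data handling" point listed under Future Work.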
Conclusion
CHALLENGES
• Efficient Grouping (HLL)
• Stateful Native Implementation
• Skewed Datasets
LEARNINGS
• Importance of Data Distribution
• Performance Advantages
• Algorithm Parameters
Team
Tobias Lindener
Consultant @ Netlight
Theodore Vasiloudis
Researcher @ RISE
Paris Carbone
PhD candidate @ KTH
Slides
bit.ly/2LULyoZ
References & Links
• https://github.com/tlindener/ApproximateQueries
• https://datasketches.github.io/
• Daniel Anderson, Pryce Bevan, Kevin Lang, Edo Liberty, Lee Rhodes, Justin Thaler. A High-Performance Algorithm for Identifying Frequent Items in Data Streams.
• Kevin Lang. Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.
Evaluation Environment
▪ WikiTrace Dataset (9 GB)
  ▪ Address (6,708,723 distinct URLs)
▪ Amazon Rating Dataset (3 GB)
  ▪ ProductId (9,874,210 distinct items)
  ▪ ReviewerId (21,176,521 distinct users)
▪ Ryzen 1600 (6C/12T)
▪ 16 GB RAM
▪ Ubuntu 18.04
▪ OpenJDK 8
▪ JVM tuned for max 10 GB heap
▪ Flink 1.4.2
▪ Standalone mode
▪ Evaluation through Python scripts

Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on Stream Processing"