The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high-velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high-velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of the major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper, we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput than the baseline approach, which uses only stream processing engines. Moreover, our experimental results show that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
Optimization of Continuous Queries in Federated Database and Stream Processing Systems
1. Optimization of Continuous Queries in Federated Database and Stream Processing Systems
Yuanzhen Ji¹, Zbigniew Jerzak¹, Anisoara Nica¹, Gregor Hackenbroich¹, Christof Fetzer²
¹SAP SE ²TU Dresden
¹firstname.lastname@sap.com ²christof.fetzer@tu-dresden.de
March 16, 2015 BTW 2015
3. Introduction
• Problem: optimizing continuous queries (CQ) for federated execution over
a native stream processing engine (SPE) and a column-oriented in-memory
database (CIMDB).
– operators: select, join, project, aggregate
• Goal: maximize query throughput (amount of data processed in unit time)
[Figure: SPE and CIMDB, with data streams as input and query results as output; arrows denote data flow]
4. Introduction
• Motivation:
– “No one size fits all” (Cyclops [LHB13], [Ji13])
– obtain the best of both worlds (SPE, CIMDB)
• Application Scenario:
– analyzing energy consumption data collected from smart plugs
installed in households (DEBS 2014 Grand Challenge)
• Main contributions:
– a static cost-based optimizer for federated systems
• extends established optimization techniques
• considers the feasibility property of CQ
– showed the potential of federated CQ execution over an SPE and a CIMDB
• throughput up to 8.5x that of pure SPE-based processing
• throughput up to 1.8x that of pure CIMDB-based processing
5. Federated Continuous Query Execution
• send relevant input data from SPE to CIMDB
• trigger re-evaluation of query pieces moved to CIMDB
• take results of query pieces executed in CIMDB back to SPE
[Figure: SPE with two MIG operators and the CIMDB; labels: data streams, query results, SQL query; arrows denote data flow]
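The three steps above can be sketched in a few lines. This is only an illustration under loud assumptions: Python's built-in `sqlite3` stands in for the CIMDB (it is neither column-oriented nor the system used in the talk), the `MigOperator` class, the `readings` table, and its `(plug_id, watts)` schema are hypothetical, and each call simply replaces the table contents with the current window.

```python
import sqlite3
from typing import List, Sequence, Tuple

Row = Tuple[int, float]  # (plug_id, watts): hypothetical schema


class MigOperator:
    """Hypothetical stand-in for a MIG operator on the SPE side of the plan."""

    def __init__(self, conn: sqlite3.Connection, sql: str) -> None:
        self.conn = conn
        self.sql = sql  # the query piece that was moved to the database
        conn.execute(
            "CREATE TABLE IF NOT EXISTS readings (plug_id INTEGER, watts REAL)"
        )

    def process(self, window: Sequence[Row]) -> List[tuple]:
        # 1. Send the relevant input data from the SPE to the database.
        self.conn.execute("DELETE FROM readings")
        self.conn.executemany("INSERT INTO readings VALUES (?, ?)", window)
        # 2. Trigger re-evaluation of the query piece inside the database.
        rows = self.conn.execute(self.sql).fetchall()
        # 3. Take the results back to the SPE side of the plan.
        return rows


conn = sqlite3.connect(":memory:")  # in-memory database as a stand-in
mig = MigOperator(
    conn,
    "SELECT plug_id, AVG(watts) FROM readings GROUP BY plug_id ORDER BY plug_id",
)
result = mig.process([(1, 10.0), (1, 20.0), (2, 5.0)])
print(result)
```

In this sketch the data transfer in both directions happens inside `process`, which is why the cost of a database query piece has to account for more than the query execution itself.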
6. Query Optimization Problem
• Problem: determine the optimal execution
plan for a given CQ
– currently at deployment time
• Feasibility of continuous queries [AN04]:
– feasible execution plan: can keep up
with data arrival rate
– feasible query: has at least one feasible plan
• Feasibility-dependent optimization objective:
– feasible queries: find the feasible plan with least resource consumption
– infeasible queries: find the plan with maximal throughput
• State of the art: either consider feasibility of CQ but not the federation
context, or the federation context but not the feasibility of CQ.
7. Optimization Solution
Cost Model – Operator Cost (1)
• Operator cost: CPU cost caused by the tuples that arrive from the data sources within
one unit of time
For an operator $O$ with $k$ direct upstream operators:
– $l_i$: # tuples produced by the i-th upstream operator as a result of
unit-time source arrivals
– $c_i$: time to process a single tuple from the i-th upstream operator
$$u(O) = \sum_{i=1}^{k} l_i c_i$$
Example with $k = 2$, $l_1 = 300$, $l_2 = 200$, $c_1 = 0.001$, $c_2 = 0.002$:
$$u(O) = l_1 c_1 + l_2 c_2 = 300 \cdot 0.001 + 200 \cdot 0.002 = 0.7$$
$u > 1$ → bottleneck → infeasible plan
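The operator cost formula can be checked with a small sketch. The function names `operator_cost` and `is_bottleneck` are illustrative and not taken from the talk; the numbers are the ones from the example on this slide.

```python
from typing import Sequence, Tuple


def operator_cost(inputs: Sequence[Tuple[float, float]]) -> float:
    """Return u(O) = sum of l_i * c_i over the (l_i, c_i) pairs of all inputs."""
    return sum(l * c for l, c in inputs)


def is_bottleneck(u: float) -> bool:
    """An operator with u > 1 cannot keep up with the arrival rate."""
    return u > 1.0


# Two upstream operators: (l1, c1) = (300, 0.001) and (l2, c2) = (200, 0.002).
u = operator_cost([(300, 0.001), (200, 0.002)])
print(round(u, 3), is_bottleneck(u))
```

With these inputs the operator needs about 0.7 time units of CPU per unit of time, so it is not a bottleneck.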
8. Optimization Solution
Cost Model – Operator Cost (2)
• A query piece executed in CIMDB and its corresponding MIG operator:
– treated as a composite operator and costed as a whole
– cost includes data transfer (in & out) cost and query execution cost
[Figure: SPE with a MIG operator and the CIMDB; labels: data streams, query results, SQL query; arrows denote data flow]
10. Optimization Solution
Optimal Execution Plan
• An execution plan P of a CQ is an optimal plan iff, for any other plan P’ of
the CQ, one of the following conditions is satisfied:
– Condition 1: P is feasible but P’ is infeasible
(Cb(P) ≤ 1 < Cb(P’) )
– Condition 2: Both P and P’ are feasible, and P has no higher Cu
(Cb(P) ≤ 1, Cb(P’) ≤ 1, and Cu(P) ≤ Cu(P’) )
– Condition 3: Both P and P’ are infeasible, and P has no higher Cb
(1 < Cb(P) ≤ Cb(P’) )
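The three inequalities can be turned into a comparison function. The sketch below assumes, since the slide defining the two measures is not part of this transcript, that Cb(P) is the bottleneck cost (the largest operator cost in the plan) and Cu(P) is the total cost (the sum of all operator costs). The `Plan` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Plan:
    name: str
    operator_costs: List[float]  # u(O) of every operator in the plan

    @property
    def cb(self) -> float:
        # Assumption: bottleneck cost = largest operator cost.
        return max(self.operator_costs)

    @property
    def cu(self) -> float:
        # Assumption: total cost = sum of operator costs.
        return sum(self.operator_costs)

    @property
    def feasible(self) -> bool:
        return self.cb <= 1.0


def at_least_as_good(p: Plan, q: Plan) -> bool:
    """True if p satisfies one of the three conditions with respect to q."""
    if p.feasible and not q.feasible:      # Condition 1
        return True
    if p.feasible and q.feasible:          # Condition 2
        return p.cu <= q.cu
    if not p.feasible and not q.feasible:  # Condition 3
        return p.cb <= q.cb
    return False                           # p infeasible, q feasible


def pick_optimal(plans: Sequence[Plan]) -> Plan:
    """Linear scan; valid because the conditions order all plans consistently."""
    best = plans[0]
    for candidate in plans[1:]:
        if not at_least_as_good(best, candidate):
            best = candidate
    return best


a = Plan("A", [0.5, 0.4])    # feasible, Cu = 0.9
b = Plan("B", [0.2, 0.3])    # feasible, Cu = 0.5
c = Plan("C", [1.2, 0.1])    # infeasible, Cb = 1.2
d = Plan("D", [1.5, 0.05])   # infeasible, Cb = 1.5
print(pick_optimal([a, c, b, d]).name, pick_optimal([d, c]).name)
```

Feasible plans are compared by resource consumption, and infeasible plans by their bottleneck, which matches the goal of maximizing throughput when no plan can keep up with the arrival rate.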
11. Optimization Solution
Two-Phase Optimization
• Large search space (# possible plans):
– many semantically equivalent logical plans
– a logical plan with n operators -> 2^n possible placement decisions
• Two-phase optimization:
– Phase One: determine the optimal logical plan (consider join ordering,
etc.)
– Phase Two: determine the placement of each operator in the logical plan
produced in Phase One.
• Bottom-up plan construction following the dynamic programming (DP) model
• The paper proves that DP is applicable to the feasibility-dependent optimization
objective.
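To make the size of the Phase Two search space concrete, the following sketch enumerates every assignment of operators to the two engines. The operator names are placeholders, and exhaustive enumeration is shown only to illustrate the exponential growth, not as the optimizer's method.

```python
from itertools import product
from typing import Dict, Iterator, Sequence

ENGINES = ("SPE", "DB")


def placements(operators: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Yield every possible operator-to-engine assignment."""
    for choice in product(ENGINES, repeat=len(operators)):
        yield dict(zip(operators, choice))


ops = ["SELECT", "WINDOW", "AGGR", "JOIN"]  # placeholder operator names
all_placements = list(placements(ops))
print(len(all_placements))  # 2 ** 4 = 16 placements for 4 operators
```

Each additional operator doubles the number of candidates, which is why a bottom-up construction with pruning is preferable to enumeration.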
12. Optimization Solution
Pruning in Phase Two
• For each operator O in a logical plan, the optimal sub-plan up to O, where
O is placed in the SPE, can be built from the optimal sub-plans up to the direct
upstream operators of O.
• For a large logical plan: divide it into smaller pieces, then optimize and compose
them in post-order.
[Figure: two candidate sub-plans with O2 placed in the SPE: I1 (O1 in the SPE) and I2 (O1 in the DB), with C(I1) < C(I2)]
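A bottom-up construction of this kind can be sketched for the simplest case, a linear chain of operators. Everything here is an assumption made for illustration: the chain shape, the cost numbers, the `db_piece_cost` callback that prices a run of consecutive operators as one composite database operator (transfer included), and the reading of Cb and Cu as bottleneck and total cost. It is not the algorithm from the paper.

```python
from typing import Callable, List, Tuple

Cost = Tuple[float, float]  # (Cb, Cu): assumed bottleneck and total cost


def rank(cost: Cost) -> Tuple[int, float]:
    """Feasible plans (Cb <= 1) come first, ordered by Cu; infeasible by Cb."""
    cb, cu = cost
    return (0, cu) if cb <= 1.0 else (1, cb)


def extend(cost: Cost, u: float) -> Cost:
    """Append an operator with cost u to a sub-plan."""
    return (max(cost[0], u), cost[1] + u)


def place_chain(
    spe_cost: List[float],
    db_piece_cost: Callable[[int, int], float],
) -> Tuple[Cost, List[str]]:
    """Place operators O_0..O_{n-1} of a linear chain.

    spe_cost[i]         cost of O_i when it runs in the SPE
    db_piece_cost(a, b) cost of running O_a..O_{b-1} as one database piece
    """
    n = len(spe_cost)
    # best[i]: best sub-plan for O_0..O_{i-1} whose output is back in the SPE.
    best: List[Tuple[Cost, List[str]]] = [((0.0, 0.0), [])]
    for i in range(1, n + 1):
        prev_cost, prev_plan = best[i - 1]
        # Option 1: O_{i-1} runs in the SPE on top of the best shorter sub-plan.
        candidates = [(extend(prev_cost, spe_cost[i - 1]), prev_plan + ["SPE"])]
        # Option 2: O_j..O_{i-1} run together as one database piece.
        for j in range(i):
            head_cost, head_plan = best[j]
            cost = extend(head_cost, db_piece_cost(j, i))
            candidates.append((cost, head_plan + ["DB"] * (i - j)))
        # Keep only the best sub-plan; all others are pruned.
        best.append(min(candidates, key=lambda cand: rank(cand[0])))
    return best[n]


spe = [0.2, 1.5, 0.3]  # hypothetical SPE costs; O_1 alone would be a bottleneck
db = [0.4, 0.2, 0.5]   # hypothetical per-operator database costs


def piece(a: int, b: int) -> float:
    return 0.3 + sum(db[a:b])  # 0.3 = assumed fixed transfer overhead


cost, placement = place_chain(spe, piece)
print(placement, cost)
```

In this example the all-SPE plan is infeasible because of the middle operator, so the sketch moves only that operator into the database and keeps the cheap operators in the SPE.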
13. Evaluation
Setup
• Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB
RAM, running SUSE Linux.
• Data: real-world energy consumption data from smart plugs installed in
households (DEBS 2014 Grand Challenge).
• Tested queries:
14. Evaluation
Optimizer effectiveness (1)
• Examine 10 source stream data rates picked from
range [1,000, 40,000] (tuples/s)
• measure throughput of devised optimal query
[Figure: tested query with operators PROJECT, INNER JOIN, AGGR (avg), AGGR (cnt), two SELECTs, and two WINDOWs (5 min); devised plan: SELECT in SPE]
[Figure: Actual vs. requested throughput; x-axis: requested throughput (thousand/s), y-axis: actual throughput (thousand/s)]
[Figure: Max. throughput comparison (thousand/s); bars in order: "SELECT in SPE", "All in SPE", "All in DB"; values in order: 26.1, 3.1, 18.7]
15. Evaluation
Optimizer effectiveness (2)
• Examine data rates ranging from 1,000 to 40,000
tuples/s, in increments of 1,000 tuples/s
• measure throughput of devised optimal query
[Figure: tested query with operators PROJECT, INNER JOIN, two AGGR (avg, max), two SELECTs, WINDOW (5 min), and WINDOW (1 min); devised plans: "SELECT in SPE" (P1) and "SEL, JOIN, P in SPE" (P2)]
[Figure: Actual vs. requested throughput for P1 and P2; x-axis: requested throughput (thousand/s), y-axis: actual throughput (thousand/s)]
[Figure: Max. throughput comparison (thousand/s); bars in order: "SELECT in SPE", "SEL, JOIN, P in SPE", "All in SPE", "All in DB"; values in order: 18.1, 28.6, 6.0, 18.0]
18. Conclusion
• Exploits the potential of federated execution of CQ over an SPE and a CIMDB.
• Presents a static optimizer which extends traditional optimization
techniques to consider the feasibility of CQ.
• Evaluation shows promising results.
For the examined queries, the throughput of the devised federated plan is
– up to 8.5 times as high as the throughput of a pure SPE-based plan
– up to 1.8 times as high as the throughput of a pure CIMDB-based plan
19. References
[AN04] Ayad, A. M. & Naughton, J. F., Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams, SIGMOD, 2004
[FKC+09] Franklin, M. J.; Krishnamurthy, S.; Conway, N.; Li, A.; Russakovsky, A. & Thombre, N., Continuous Analytics: Rethinking query processing in a network-effect world, CIDR, 2009
[KS09] Kraemer, J. & Seeger, B., Semantics and implementation of continuous sliding window queries over data streams, ACM TODS, 2009
[BCD+10] Botan, I.; Cho, Y.; Derakhshan, R.; Dindar, N.; Gupta, A.; Haas, L. M.; Kim, K.; Lee, C.; Mundada, G.; Shan, M.-C.; Tatbul, N.; Yan, Y.; Yun, B. & Zhang, J., A demonstration of the MaxStream federated stream processing system, ICDE, 2010
[LMB+10] Liu, M.; Mihaylov, S. R.; Bao, Z.; Jacob, M.; Ives, Z. G.; Loo, B. T. & Guha, S., SmartCIS: integrating digital and physical environments, SIGMOD Record, 2010
[LIM+12] Liarou, E.; Idreos, S.; Manegold, S. & Kersten, M., MonetDB/DataCell: online analytics in a streaming column-store, PVLDB, 2012
[LHB13] Lim, H.; Han, Y. & Babu, S., How to Fit when No One Size Fits, CIDR, 2013
[Ji13] Ji, Y., Database support for processing complex aggregate queries over data streams, EDBT Workshops, 2013
[CDK+14] Çetintemel, U.; Du, J.; Kraska, T.; Madden, S.; Maier, D.; Meehan, J.; Pavlo, A.; Stonebraker, M.; Sutherland, E.; Tatbul, N.; Tufte, K.; Wang, H. & Zdonik, S. B., S-Store: A streaming NewSQL system for big velocity applications, PVLDB, 2014
[DLB+11] Daum, M.; Lauterwald, F.; Baumgärtel, P.; Pollner, N. & Meyer-Wegener, K., Efficient and Cost-aware Operator Placement in Heterogeneous Stream-processing Environments, DEBS, 2011
22. Semantics
• Adopt the abstract semantics defined in [ABW06], which is based on:
– Two data types:
• Stream (S): a possibly infinite bag of elements <s, t>, where s is a
tuple belonging to the schema of S and t is the timestamp of s.
• Time-varying Relation (R): a mapping from the time domain T to a finite but
unbounded bag of tuples belonging to the schema of R.
– Three classes of query operators:
• stream-to-relation (S2R) operators: produce one relation from one
stream (e.g., window operators)
• relation-to-relation (R2R) operators: produce one relation from
one or more relations.
• relation-to-stream (R2S) operators: produce one stream from one
relation.
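The three operator classes can be illustrated with one small operator each. This is a sketch only: the function names, the use of `Counter` as a bag, integer timestamps, the half-open window boundary, and the sample data are all assumptions chosen for illustration, not definitions from [ABW06].

```python
from collections import Counter
from typing import Any, Callable, Iterable, List, Tuple

Element = Tuple[Any, int]  # <s, t>: a tuple s with its timestamp t


def time_window(stream: Iterable[Element], size: int, now: int) -> Counter:
    """S2R: relation at time `now` = bag of tuples with now - size < t <= now."""
    return Counter(s for s, t in stream if now - size < t <= now)


def select(relation: Counter, pred: Callable[[Any], bool]) -> Counter:
    """R2R: keep the tuples of a relation that satisfy the predicate."""
    return Counter({s: n for s, n in relation.items() if pred(s)})


def istream(prev: Counter, curr: Counter, now: int) -> List[Element]:
    """R2S: emit the tuples that are in the relation now but were not before."""
    return [(s, now) for s in (curr - prev).elements()]


# Hypothetical smart-plug readings: (plug, load) tuples with timestamps.
stream = [(("p1", 5.0), 1), (("p2", 7.0), 2), (("p1", 9.0), 4)]


def high_load(s: Tuple[str, float]) -> bool:
    return s[1] > 6.0


r2 = select(time_window(stream, size=3, now=2), high_load)
r4 = select(time_window(stream, size=3, now=4), high_load)
out = istream(r2, r4, now=4)
print(out)
```

A continuous query is then a composition of such operators that is re-evaluated as time advances: the window turns the stream into a relation, relational operators transform it, and the final operator turns the changes back into an output stream.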
23. Introduction
From DBMS to SPE
• Increasing interest in processing high-velocity data streams generated in
real time using continuous queries (CQ).
• Need a new processing paradigm
[Figure: DBMS (one-shot queries over stored data yield query results) and SPE (a continuous query over streaming data yields query results)]
24. Introduction
From DBMS to SPE
• However, many applications require:
– persisting input streaming data/query results for on-demand analysis
– combining streaming data with static data during processing.
[Figure: DBMS (one-shot queries over stored data yield query results) alongside SPE (a continuous query over streaming data yields query results); the SPE stores data in the DBMS and accesses stored data]
25. Introduction
Build SPE on Top of DBMS Kernel
• Exploit and merge technologies from both worlds in an integrated way.
– Truviso Continuous Analytics [FKC+09], HP Lab work [CH10], DataCell
[LIM+12], S-Store [CDK+14]
[Figure: combined SPE + DBMS handling one-shot queries over stored data and a continuous query over streaming data, each yielding query results; labels: in-memory table, buffers in UDFs]