Multistores are data management systems that enable query processing across different database management systems (DBMSs); besides the distribution of data, complexity factors like schema heterogeneity and data replication must be resolved through integration and data fusion activities. In a recent work [2], we have proposed a multistore solution that relies on a dataspace to provide the user with an integrated view of the available data and enables the formulation and execution of GPSJ (generalized projection, selection and join) queries. In this paper, we propose a technique to optimize the execution of GPSJ queries by finding the most efficient execution plan on the multistore. In particular, we devise three different strategies to carry out joins and data fusion, and we build a cost model to enable the evaluation of different execution plans. Through the experimental evaluation, we are able to profile the suitability of each strategy to different multistore configurations, thus validating our multi-strategy approach and motivating further research on this topic.
[ADBIS 2021] - Optimizing Execution Plans in a Multistore
1. ADBIS’2021
Optimizing execution plans in a multistore
Chiara Forresi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
University of Bologna, Cesena, Italy
25th European Conference on Advances in Databases and Information Systems (ADBIS 2021)
2. ADBIS’2021
What is a multistore?
Chiara Forresi – University of Bologna 2
Introduction
Forresi, C., Gallinucci, E., Golfarelli, M. et al. A dataspace-based framework for OLAP analyses in a high-variety multistore. The VLDB Journal (2021)
3. ADBIS’2021
Common problems in a multistore
- Data model heterogeneity: different conventions for representing data (normalization, denormalization, and nesting of the data)
- Some systems handle this kind of heterogeneity
  - A naïve approach consists in transforming all the data into the same reference model [8]
  - Multimodel systems [3] support several data models within the same platform
  - Multistore and polystore systems [21] provide integrated access and querying to several heterogeneous stores through a mediator layer (middleware), e.g., BIGDAWG [10], TATOOINE [6], CloudMDsQL [14]
[8] DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: 2016 ACM SIGMOD Int. Conf. on Management of Data. pp. 295–310. ACM (2016)
[3] Bimonte, S., Gallinucci, E., Marcel, P., Rizzi, S.: Data variety, come as you are in multi-model data warehouses. Information Systems (2021)
[21] Tan, R., Chirkova, R., Gadepally, V., Mattson, T.G.: Enabling query processing across heterogeneous data models: A survey. In: 2017 IEEE Int. Conf. on Big Data. pp. 3211–3220. IEEE Computer Society (2017)
[10] Gadepally, V., et al.: The BigDAWG polystore system and architecture. In: 2016 IEEE High Performance Extreme Computing Conf. pp. 1–6. IEEE (2016)
[6] Bonaque, R., et al.: Mixed-instance querying: a lightweight integration architecture for data journalism. Proc. VLDB Endow. 9(13), 1513–1516 (2016)
[14] Kolev, B., et al.: CloudMDsQL: querying heterogeneous cloud data stores with a common language. Distributed and Parallel Databases 34(4), 463–503 (2016)
4. ADBIS’2021
Common problems in a multistore
- Schema heterogeneity: missing attributes, or attributes with different names or types that represent the same concept
- Record overlapping
Data fusion [5, 16] is not applied directly in related work
- QUEPA [15] does consider it
T1
ID  Name   Age
1   Denis  21
2   Anne   40

T2
ID  Name      Age
1   Philip    20
3   Michelle  25

T1
ID  Name   Age
1   Denis  21
2   Anne   40

T2
ID  FirstName  Creation date
3   Philip     2021-12-09
4   Michelle   2020-02-18
[5] Bleiholder, J., Naumann, F.: Data fusion. ACM Computing Surveys (CSUR) 41(1), 1–41 (2009)
[16] Mandreoli, F., Montangero, M.: Dealing with data heterogeneity in a data fusion perspective: Models, methodologies, and algorithms. In: Data Handling in Science and Technology, vol. 31, pp. 235–270. Elsevier (2019)
[15] Maccioni, A., Torlone, R.: Augmented access for querying and exploring a polystore. In: 34th IEEE Int. Conf. on Data Engineering, ICDE 2018. pp. 77–88. IEEE Computer Society (2018)
5. ADBIS’2021
How to solve schema heterogeneity?
Pay-as-you-go integration [13]
- Overcome traditional integration problems through an iterative, lightweight integration
- Dataspace [9]: a logical view of the available datasets, consisting of mappings, features, and entities
- The computation is done at middleware level
[Figure: attributes ID and Name of T1, and CustId and CustName of T2, are mapped to the features ID and Name of the entity Customer]
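The mapping idea in the figure can be sketched in a few lines of Python. This is only an illustration, not the framework's implementation: the `MAPPINGS` dictionary, the `to_features` helper, and the sample values are hypothetical, and the sketch assumes that each source attribute maps to at most one feature.

```python
# Hypothetical mappings: (collection, source attribute) -> feature of entity Customer.
MAPPINGS = {
    ("T1", "ID"): "ID",
    ("T1", "Name"): "Name",
    ("T2", "CustId"): "ID",
    ("T2", "CustName"): "Name",
}


def to_features(collection, record):
    """Rewrite a source record so that its keys are dataspace features."""
    return {
        MAPPINGS[(collection, attr)]: value
        for attr, value in record.items()
        if (collection, attr) in MAPPINGS  # unmapped attributes are skipped
    }


# Sample values are made up for illustration.
t1_row = to_features("T1", {"ID": 1, "Name": "Denis"})
t2_row = to_features("T2", {"CustId": 3, "CustName": "Michelle"})
print(t1_row)
print(t2_row)
```

Once both records are expressed on the same features, a query formulated on features can treat T1 and T2 uniformly.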
[13] Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: 2008 ACM SIGMOD Int. Conf. on Management of Data. pp. 847–860. ACM (2008)
[9] Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Record 34(4), 27–33 (2005)
Querying a dataspace
6. ADBIS’2021
Querying a dataspace
GPSJ [12] queries formulated on features
[12] Golfarelli, M., Maio, D., Rizzi, S.: The dimensional fact model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7(2-3), 215–247 (1998)
The query:
«For each product name, compute the average of the quantities purchased by female customers starting from 2019»
is equivalent to the GPSJ query with:
- projections: {product name}
- aggregations: {(quantity, avg)}
- selections: {(date, ≥, "2019/01/01"), (gender, =, "F")}
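To make the three components concrete, the following Python sketch evaluates a GPSJ query of this form over a handful of in-memory records. It is a toy evaluator rather than the multistore's query engine: the `gpsj` and `aggregate` functions, the operator table, and all sample rows are assumptions introduced for illustration, and the records are assumed to be already integrated and expressed on features.

```python
from collections import defaultdict
from datetime import date
import operator

# Hypothetical integrated records, already expressed on features.
ROWS = [
    {"product name": "pen", "quantity": 2, "date": date(2019, 3, 1), "gender": "F"},
    {"product name": "pen", "quantity": 4, "date": date(2020, 5, 9), "gender": "F"},
    {"product name": "pen", "quantity": 9, "date": date(2018, 7, 2), "gender": "F"},
    {"product name": "ink", "quantity": 5, "date": date(2021, 1, 4), "gender": "M"},
    {"product name": "ink", "quantity": 3, "date": date(2019, 1, 1), "gender": "F"},
]

OPS = {">=": operator.ge, "=": operator.eq}


def aggregate(fn, values):
    """Apply the named aggregation function to a list of values."""
    if fn == "avg":
        return sum(values) / len(values)
    raise ValueError(f"unsupported aggregation: {fn}")


def gpsj(rows, projections, aggregations, selections):
    """Filter rows, group them by the projected features, then aggregate."""
    groups = defaultdict(list)
    for row in rows:
        if all(OPS[op](row[feat], val) for feat, op, val in selections):
            groups[tuple(row[f] for f in projections)].append(row)
    return {
        key: {
            (feat, fn): aggregate(fn, [m[feat] for m in members])
            for feat, fn in aggregations
        }
        for key, members in groups.items()
    }


answer = gpsj(
    ROWS,
    projections=["product name"],
    aggregations=[("quantity", "avg")],
    selections=[("date", ">=", date(2019, 1, 1)), ("gender", "=", "F")],
)
print(answer)
```

The projections act as the group-by set, which is why the result is keyed by product name.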
7. ADBIS’2021
How to solve record overlapping?
Data fusion operations:
- Merge (⨆) and simultaneous unnest (A̅) operations
- Conflict resolution (at the various levels of heterogeneity exposed)
- Shape of the result
- The simultaneous unnest operator is used to manage data fusion with nested data
T1
ID  Name   Age
1   Denis  21
2   Anne   40

T2
ID  FirstName  Creation date
1   Philip     2021-12-09
3   Michelle   2020-02-18

Result (T1 ⨆ T2)
ID  FirstName  Age  Creation date
1   Philip     21   2021-12-09
2   Anne       40   -
3   Michelle   -    2020-02-18
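The merge of T1 and T2 shown on this slide can be approximated with a short Python sketch. It is a simplified stand-in for the ⨆ operator, not its definition: it assumes that schema alignment has already renamed T1's Name to FirstName, and that conflicts are resolved by preferring the value from the second input, which is the choice that reproduces the Result table (Philip over Denis). The `merge` function and the `prefer_right` resolver are hypothetical names.

```python
def merge(left, right, key, resolve):
    """Full-outer merge of two record lists on `key`, fusing overlapping records."""
    fused = {rec[key]: dict(rec) for rec in left}
    for rec in right:
        target = fused.setdefault(rec[key], {})
        for attr, value in rec.items():
            if attr in target and target[attr] != value:
                target[attr] = resolve(target[attr], value)  # conflicting values
            else:
                target[attr] = value
    # Collect every attribute seen, preserving first-appearance order.
    attrs = []
    for rec in fused.values():
        attrs.extend(a for a in rec if a not in attrs)
    # Pad missing attributes with None so every record has the same shape.
    return [{a: rec.get(a) for a in attrs} for _, rec in sorted(fused.items())]


def prefer_right(left_value, right_value):
    """Assumed conflict-resolution rule: keep the value from the second input."""
    return right_value


# T1 after schema alignment (Name -> FirstName is assumed to be already applied).
t1 = [
    {"ID": 1, "FirstName": "Denis", "Age": 21},
    {"ID": 2, "FirstName": "Anne", "Age": 40},
]
t2 = [
    {"ID": 1, "FirstName": "Philip", "Creation date": "2021-12-09"},
    {"ID": 3, "FirstName": "Michelle", "Creation date": "2020-02-18"},
]

result = merge(t1, t2, key="ID", resolve=prefer_right)
for row in result:
    print(row)
```

Passing the resolver as a parameter keeps conflict resolution separate from the shape of the result, mirroring the two concerns listed above.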
8. ADBIS’2021
How to build a query plan
- The rationale of our previous work was: “First solve heterogeneity at the entity level, then join all entities”
- Drawbacks of the approach
  - It does not take advantage of the original data modeling
  - Limited push-down of local operations
  - Collections containing multiple entities are accessed multiple times
- The goal of this work is to study different strategies that take these factors into account
11. ADBIS’2021
Query plans and optimization
Strategies are based on reference schema representations:
- Normalized (NoS)
- Flat (FlS)
- Nested (NeS)
12. ADBIS’2021
Query plans and optimization
Focus on a scenario with two entities (customers and orders)
Different ways to join collections
- Depending on the schema representation
- Depending on the need to solve record overlapping
16. ADBIS’2021
Query plans and optimization
Execution plans are composed of two steps
1. A schema alignment step
2. A joining step
Several execution plans can be devised depending on
- The chosen join strategy
- The engines (local DBMS or middleware) that carry out schema alignment
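As a rough illustration of how the plan space arises, the sketch below enumerates candidate plans as combinations of a join strategy and an alignment engine. It is deliberately simplified: it assumes a single engine choice for the whole alignment step, whereas a real plan may choose per collection. The strategy abbreviations NoJ, FlJ, and NeJ are the ones used in the evaluation slides; the `candidate_plans` function is a hypothetical name.

```python
from itertools import product

JOIN_STRATEGIES = ["NoJ", "FlJ", "NeJ"]  # normalized, flat, nested
ALIGNMENT_ENGINES = ["local DBMS", "middleware"]


def candidate_plans():
    """Enumerate (join strategy, alignment engine) pairs; a simplification."""
    return [
        {"join": join, "alignment": engine}
        for join, engine in product(JOIN_STRATEGIES, ALIGNMENT_ENGINES)
    ]


plans = candidate_plans()
for plan in plans:
    print(plan)
```

Even under this simplification there are six candidates for a two-entity query, which is why a cost model is needed to pick one.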
17. ADBIS’2021
Our cost model
A cost model is necessary to choose the most efficient plan
Creating a cost model in a multistore context is not trivial
- The query is computed by a heterogeneous set of engines
- Each engine has its own resources and capabilities
Our cost model
- Is a custom cost model, based on mathematical formulas
- Is focused on disk IO (number of pages read and written)
- Enables simulations without the effort of generating data
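A disk-IO-oriented estimate of this kind can be sketched as follows. This is a generic textbook-style illustration, not the formulas of the cost model presented here: the page size, the record sizes, the `scan_pages` and `cheapest` helpers, and the per-plan estimates are all hypothetical.

```python
import math

PAGE_SIZE = 8192  # bytes; hypothetical value


def scan_pages(n_records, record_bytes, page_size=PAGE_SIZE):
    """Estimate the pages read by a full scan of a collection."""
    records_per_page = max(1, page_size // record_bytes)
    return math.ceil(n_records / records_per_page)


def cheapest(estimates):
    """Return the plan name with the lowest estimated number of pages."""
    return min(estimates, key=estimates.get)


# Hypothetical estimates (pages read and written) for three candidate plans.
estimates = {"NoJ": 15_000, "FlJ": 9_500, "NeJ": 12_200}
print(scan_pages(1_000, 100))
print(cheapest(estimates))
```

Because the estimate depends only on cardinalities and record sizes, different configurations can be simulated without materializing any data.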
Cost model
18. ADBIS’2021
Our cost model
Cost calculation:
$$\mathrm{Cost}(E) = \frac{\mathrm{Cost}(E_{i})}{n_{i}} + \max_{j \in \{\ldots\}} \mathrm{Cost}(E_{j})$$
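The cost formula on this slide has the shape of one cost term divided by a factor n, plus the maximum over a set of other cost terms. The sketch below only evaluates that shape. Reading the divided term as work spread over n parallel workers, and the maximum as engines that run concurrently, is an assumption that the slide does not state; the `plan_cost` function, its parameter names, and the numbers are hypothetical.

```python
def plan_cost(shared_cost, n, per_engine_costs):
    """Evaluate a cost of the shape shared_cost / n + max(per-engine costs).

    shared_cost      -- cost of the term that is divided by n (assumed meaning)
    n                -- divisor of that term, must be positive
    per_engine_costs -- mapping from a label to the cost of each other term
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return shared_cost / n + max(per_engine_costs.values(), default=0)


# Hypothetical numbers, in pages.
total = plan_cost(120, 4, {"a": 10, "b": 25, "c": 5})
print(total)
```

Taking the maximum rather than the sum means that only the slowest of the concurrent terms contributes to the total.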
19. ADBIS’2021
Experiments
Focus on a scenario with two entities (customers and orders)
- 3 plans, one for each join strategy
- computation pushed down to local DBMSs as much as possible
- We tested more than 3000 configurations varying:
- Number of records of each entity
- Ordering key
- Record overlapping
- The plan with the lowest actual execution time is also the one with the lowest estimated cost in 83% of cases; the remaining cases incur a 10% execution overhead
- Conversely, always choosing a single join strategy yields an average overhead between 77% and 127%
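One way to read these percentages is to compare, for each tested configuration, the plan with the lowest estimated cost against the plan with the lowest measured time, and to express a wrong choice as a relative overhead. The sketch below assumes overhead = (time of chosen plan - time of best plan) / time of best plan, a definition the slides do not spell out; the `overhead` and `evaluate` functions and all configurations and timings are invented for illustration.

```python
def overhead(chosen_time, best_time):
    """Relative extra time of the chosen plan with respect to the best plan."""
    return (chosen_time - best_time) / best_time


def evaluate(configs):
    """Return (hit rate, mean overhead over the misses) for the cost model."""
    hits = 0
    miss_overheads = []
    for cfg in configs:
        predicted = min(cfg["estimated"], key=cfg["estimated"].get)
        actual = min(cfg["measured"], key=cfg["measured"].get)
        if predicted == actual:
            hits += 1
        else:
            miss_overheads.append(
                overhead(cfg["measured"][predicted], cfg["measured"][actual])
            )
    mean_miss = sum(miss_overheads) / len(miss_overheads) if miss_overheads else 0.0
    return hits / len(configs), mean_miss


# Invented configurations: estimated costs (pages) and measured times (seconds).
CONFIGS = [
    {
        "estimated": {"NoJ": 100, "FlJ": 80, "NeJ": 120},
        "measured": {"NoJ": 10.0, "FlJ": 8.0, "NeJ": 12.0},
    },
    {
        "estimated": {"NoJ": 90, "FlJ": 95, "NeJ": 70},
        "measured": {"NoJ": 10.0, "FlJ": 12.0, "NeJ": 11.0},
    },
]

hit_rate, mean_miss_overhead = evaluate(CONFIGS)
print(hit_rate, mean_miss_overhead)
```

The same `overhead` function can be applied to a fixed strategy across all configurations to obtain the single-strategy baseline.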
Evaluation & Conclusion
20. ADBIS’2021
Alignment evaluation
- NeS is the cheapest to move away from and the most expensive to move towards
- FlS is the most expensive to move away from
- MongoDB joins are too expensive
- Cassandra lacks support for certain operations
[Figure: alignment evaluation; panels: unclustered, clustered]
21. ADBIS’2021
Join/full evaluations
- Execution costs for NoJ are the most stable across the overlap scenarios
- In the no-overlap scenario, FlJ is clearly the winner
- In the partial-overlap scenario, NeJ wins (considering the previous alignment step); in the total-overlap scenario, NoJ wins, but this depends on the imbalance of the data towards a certain schema representation
[Figure: join and full evaluations; panels: full, join]
22. ADBIS’2021
Future developments
- Work with a broader set of entities and compose the final result by combining the join strategies
- Work with a broader set of queries
- Develop a dynamic cost model that takes into account aspects such as:
  - CPU and network
  - Distribution of resources within and across databases
  - Changes in resources over time