[ADBIS 2021] - Optimizing Execution Plans in a Multistore

ADBIS’2021
Optimizing execution plans
in a multistore
Chiara Forresi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
University of Bologna, Cesena, Italy
25th European Conference on Advances in Databases and Information Systems (ADBIS 2021)

What is a multistore?
Introduction
Forresi, C., Gallinucci, E., Golfarelli, M. et al. A dataspace-based framework for OLAP analyses in a high-variety multistore. The VLDB Journal (2021)

Common problems in a multistore
- Data model heterogeneity: different conventions for representing data (normalization, denormalization and nesting of the data)
- Some systems handle this kind of heterogeneity
- A naïve approach is to transform all the data into the same reference model [8]
- Multimodel systems [3] support several data models within the same platform
- Multistore and polystore systems [21] provide integrated access and querying to several heterogeneous stores through a mediator layer (middleware), e.g., BIGDAWG [10], TATOOINE [6], CloudMDsQL [14]
[8] DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: 2016 ACM SIGMOD Int. Conf. on Management of Data. pp. 295–310. ACM (2016)
[3] Bimonte, S., Gallinucci, E., Marcel, P., Rizzi, S.: Data variety, come as you are in multi-model data warehouses. Information Systems (2021)
[21] Tan, R., Chirkova, R., Gadepally, V., Mattson, T.G.: Enabling query processing across heterogeneous data models: A survey. In: 2017 IEEE Int. Conf. on Big Data. pp. 3211–3220. IEEE Computer Society (2017)
[10] Gadepally, V., et al.: The BigDAWG polystore system and architecture. In: 2016 IEEE High Performance Extreme Computing Conf. pp. 1–6. IEEE (2016)
[6] Bonaque, R., et al.: Mixed-instance querying: a lightweight integration architecture for data journalism. Proc. VLDB Endow. 9(13), 1513–1516 (2016)
[14] Kolev, B., et al.: CloudMDsQL: querying heterogeneous cloud data stores with a common language. Distributed and Parallel Databases 34(4), 463–503 (2016)

Common problems in a multistore
- Schema heterogeneity: missing attributes, or attributes with different names or types that represent the same concept
- Record overlapping
- Data fusion [5, 16] is not applied directly in related work
- QUEPA [15] considers it
T1:
ID  Name   Age
1   Denis  21
2   Anne   40
T2:
ID  Name      Age
1   Philip    20
3   Michelle  25

T1:
ID  Name   Age
1   Denis  21
2   Anne   40
T2:
ID  FirstName  Creation date
3   Philip     2021-12-09
4   Michelle   2020-02-18
[5] Bleiholder, J., Naumann, F.: Data fusion. ACM Computing Surveys (CSUR) 41(1), 1–41 (2009)
[16] Mandreoli, F., Montangero, M.: Dealing with data heterogeneity in a data fusion perspective: Models, methodologies, and algorithms. In: Data Handling in Science and Technology, vol. 31, pp. 235–270. Elsevier (2019)
[15] Maccioni, A., Torlone, R.: Augmented access for querying and exploring a polystore. In: 34th IEEE Int. Conf. on Data Engineering, ICDE 2018. pp. 77–88. IEEE Computer Society (2018)

How to solve schema heterogeneity?
Pay-as-you-go integration [13]
- Overcomes traditional integration problems through an iterative, lightweight integration process
- Dataspace [9]: a logical view of the available datasets, consisting of mappings, features and entities
- The computation is done at the middleware level
[Figure: T1 (ID, Name) and T2 (CustId, CustName) mapped to the entity Customer with features ID and Name]
[13] Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: 2008 ACM SIGMOD Int. Conf. on Management of Data. pp. 847–860. ACM (2008)
[9] Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Record 34(4), 27–33 (2005)
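A minimal sketch of how mappings can tie source attributes to the features of an entity, using the T1/T2/Customer names from this slide. The dictionary layout, the `to_features` helper and the sample values are assumptions made for illustration, not the framework's actual data structures.

```python
# Hypothetical dataspace fragment: each mapping ties a (collection, attribute)
# pair to a (entity, feature) pair. Names follow the T1/T2/Customer figure.
MAPPINGS = {
    ("T1", "ID"): ("Customer", "ID"),
    ("T1", "Name"): ("Customer", "Name"),
    ("T2", "CustId"): ("Customer", "ID"),
    ("T2", "CustName"): ("Customer", "Name"),
}


def to_features(collection, record):
    """Rename a record's attributes to the features they are mapped to."""
    aligned = {}
    for attribute, value in record.items():
        _entity, feature = MAPPINGS[(collection, attribute)]
        aligned[feature] = value
    return aligned


# Records with different attribute names end up with the same feature names.
# The values are made up for the example.
r1 = to_features("T1", {"ID": 1, "Name": "Denis"})
r2 = to_features("T2", {"CustId": 3, "CustName": "Philip"})
print(r1)
print(r2)
```

Once both records are expressed on features, a query written against Customer.ID and Customer.Name does not need to know the source attribute names.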
Querying a dataspace

Querying a dataspace
GPSJ [12] queries formulated on features
[12] Golfarelli, M., Maio, D., Rizzi, S.: The dimensional fact model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7(2-3), 215–247 (1998)
The query:
«For each product name, compute the average of the quantities purchased by female customers starting from 2019»
is equivalent to the GPSJ query with:
- projections: {product name}
- aggregations: {(quantity, avg())}
- selections: {(date, ≥, "2019/01/01"), (gender, =, "F")}
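The three components of the GPSJ query can be sketched as a small data structure evaluated over rows that are already expressed on features. The `QUERY` encoding, the `run_gpsj` helper and the sample rows are assumptions for illustration only.

```python
import operator
from collections import defaultdict
from statistics import mean

# Hypothetical encoding of the GPSJ query from the slide.
QUERY = {
    "projections": ["product name"],
    "aggregations": [("quantity", mean)],
    "selections": [
        ("date", operator.ge, "2019/01/01"),
        ("gender", operator.eq, "F"),
    ],
}

# Made-up rows. Dates are YYYY/MM/DD strings, so string comparison
# matches chronological order.
ROWS = [
    {"product name": "pen", "quantity": 2, "date": "2019/03/01", "gender": "F"},
    {"product name": "pen", "quantity": 4, "date": "2020/01/15", "gender": "F"},
    {"product name": "pen", "quantity": 9, "date": "2018/12/31", "gender": "F"},
    {"product name": "ink", "quantity": 5, "date": "2019/06/10", "gender": "M"},
    {"product name": "ink", "quantity": 1, "date": "2021/02/02", "gender": "F"},
]


def run_gpsj(query, rows):
    """Filter by the selections, group by the projections, then aggregate."""
    groups = defaultdict(list)
    for row in rows:
        if all(op(row[f], v) for f, op, v in query["selections"]):
            key = tuple(row[f] for f in query["projections"])
            groups[key].append(row)
    return {
        key: {f: agg([m[f] for m in members]) for f, agg in query["aggregations"]}
        for key, members in groups.items()
    }


result = run_gpsj(QUERY, ROWS)
print(result)
```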

How to solve record overlapping?
Data fusion operations:
- Merge (⨆) and simultaneous unnest (μ̄) operations
- Conflict resolution (at the various levels of heterogeneity exposed)
- Shape of the result
- The simultaneous unnest operator is used to manage data fusion with nested data
T1:
ID  Name   Age
1   Denis  21
2   Anne   40

T2:
ID  FirstName  Creation date
1   Philip     2021-12-09
3   Michelle   2020-02-18

Result (T1 ⨆ T2):
ID  FirstName  Age  Creation date
1   Philip     21   2021-12-09
2   Anne       40   -
3   Michelle   -    2020-02-18
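A minimal sketch of a merge that reproduces the Result table from T1 and T2. Two points are assumptions, since the slide does not state them: T1.Name and T2.FirstName are taken to map to the same feature, and conflicts are resolved by letting the T2 value win. The `merge` helper is hypothetical and does not cover the simultaneous unnest operator.

```python
T1 = [
    {"ID": 1, "Name": "Denis", "Age": 21},
    {"ID": 2, "Name": "Anne", "Age": 40},
]
T2 = [
    {"ID": 1, "FirstName": "Philip", "Creation date": "2021-12-09"},
    {"ID": 3, "FirstName": "Michelle", "Creation date": "2020-02-18"},
]

# Assumed mapping: T1.Name and T2.FirstName represent the same feature.
RENAME_T1 = {"Name": "FirstName"}
FEATURES = ["ID", "FirstName", "Age", "Creation date"]


def merge(left, right, key="ID"):
    """Full-outer merge on `key`; on conflict the right value wins (assumption)."""
    fused = {}
    for record in left + right:
        # Start every fused record with all features set to None (shown as "-").
        target = fused.setdefault(record[key], dict.fromkeys(FEATURES))
        target.update(record)
    return [fused[k] for k in sorted(fused)]


aligned_t1 = [{RENAME_T1.get(a, a): v for a, v in r.items()} for r in T1]
result = merge(aligned_t1, T2)
for row in result:
    print(row)
```

A different conflict-resolution policy (for example, keeping the T1 value) would change only the row with ID 1.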

How to build a query plan
- The rationale of our previous work was: “First solve heterogeneity at the entity level, then join all entities”
- Drawbacks of the approach
- It does not take advantage of the original data modeling
- Limited push-down of local operations
- Multiple accesses are made to collections that contain more than one entity
- The goal of this work is to study different strategies that take these factors into account

Query plans and optimization
Strategies are based on reference schema representations:
- Normalized (NoS)
- Flat (FlS)
- Nested (NeS)
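To make the three representations concrete, here is a small sketch for two entities (customers and orders). It assumes the usual reading of the terms: normalized keeps one collection per entity, flat repeats the customer attributes on every order, and nested embeds the orders inside the customer. Attribute names, values and helper functions are hypothetical.

```python
# Normalized (NoS): one collection per entity, linked by a key.
customers = [{"cid": 1, "name": "Anne"}]
orders = [
    {"oid": 10, "cid": 1, "qty": 2},
    {"oid": 11, "cid": 1, "qty": 5},
]


def to_flat(customers, orders):
    """Flat (FlS): one record per order, customer attributes repeated."""
    by_id = {c["cid"]: c for c in customers}
    return [{**by_id[o["cid"]], **o} for o in orders]


def to_nested(customers, orders):
    """Nested (NeS): one record per customer, orders embedded as an array."""
    nested = []
    for c in customers:
        own = [
            {k: v for k, v in o.items() if k != "cid"}
            for o in orders
            if o["cid"] == c["cid"]
        ]
        nested.append({**c, "orders": own})
    return nested


flat = to_flat(customers, orders)
nested = to_nested(customers, orders)
print(flat)
print(nested)
```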

Focus on a scenario with two entities (customers and orders)
Different ways to join collections
- Depending on the schema representation
- Depending on the need to solve record overlapping

Normalized Join (NoJ)
[Figure: NoJ plan trees over collections CX1, CY1, CX2, CY2 for three cases: (φX = true, φY = true), (φX = true, φY = false), (φX = false, φY = false); the legible operators are ⋈ and ∪]

Nested Join (NeJ)
[Figure: NeJ plan trees over collections CXY1, CXY2 for three cases: (φX = true, φY = true), (φX = true, φY = false), (φX = false, φY = false); the legible operators are µ and ∪]

Flat Join (FlJ)
[Figure: FlJ plan trees over collections CYX1, CYX2 for three cases: (φX = true, φY = true), (φX = true, φY = false), (φX = false, φY = false); the legible operators are ϒ, ⋈ and ∪]

Execution plans are composed of two steps
1. A schema alignment step
2. A joining step
Several execution plans can be devised depending on
- The chosen join strategy
- The engines (local DBMS or middleware) that carry out schema alignment
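The space of candidate plans can be sketched as the combination of a join strategy with an alignment engine for each collection. Choosing the engine per collection, the collection names C1 and C2, and the `candidate_plans` helper are assumptions for illustration.

```python
from itertools import product

JOIN_STRATEGIES = ["NoJ", "NeJ", "FlJ"]
ALIGNMENT_ENGINES = ["local DBMS", "middleware"]
COLLECTIONS = ["C1", "C2"]  # hypothetical collection names


def candidate_plans():
    """Enumerate every (join strategy, per-collection alignment engine) pair."""
    plans = []
    for strategy in JOIN_STRATEGIES:
        for engines in product(ALIGNMENT_ENGINES, repeat=len(COLLECTIONS)):
            plans.append(
                {"join": strategy, "alignment": dict(zip(COLLECTIONS, engines))}
            )
    return plans


plans = candidate_plans()
print(len(plans))  # 3 strategies x 2^2 engine assignments = 12
```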

Our cost model
A cost model is necessary to choose the most efficient plan
Creating a cost model in a multistore context is not trivial
- The query is computed by a heterogeneous set of engines
- Each engine has its own resources and capabilities
Our cost model
- Is a custom cost model, based on mathematical formulas
- Is focused on disk IO (number of pages read and written)
- Enables simulations without the effort of generating data
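As an illustration of page-based IO costing, the sketch below uses the textbook page-count estimate rather than the formulas of this cost model, which are not given on the slide. The page size, record sizes and function names are assumptions. Because only cardinalities and sizes are needed, costs can be simulated without generating data.

```python
import math

PAGE_BYTES = 8192  # assumed page size


def pages(num_records, record_bytes, page_bytes=PAGE_BYTES):
    """Pages needed to store num_records fixed-size records."""
    per_page = page_bytes // record_bytes
    return math.ceil(num_records / per_page)


def scan_and_write_cost(num_records, in_bytes, out_bytes):
    """IO cost of reading a collection once and writing a transformed copy."""
    return pages(num_records, in_bytes) + pages(num_records, out_bytes)


# Simulated cost for one million records that grow from 100 to 200 bytes.
cost = scan_and_write_cost(1_000_000, 100, 200)
print(cost)
```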
Cost model

Our cost model
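Cost calculation:
[Formula partly lost in extraction. Legible parts: a Cost term on the left-hand side; on the right-hand side, a Cost term plus a max over Cost terms. The individual symbols and subscripts did not survive.]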

Experiments
Focus on a scenario with two entities (customers and orders)
- 3 plans, one for each join strategy
- computation pushed down to local DBMSs as much as possible
- We tested more than 3000 configurations varying:
- Number of records of each entity
- Ordering key
- Record overlapping
- The plan with the lowest actual execution time is also the one with the lowest estimated cost in 83% of cases. The remaining cases incur a 10% execution overhead
- Conversely, always choosing a single join strategy yields an average overhead between 77% and 127%
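A sketch of how such figures can be computed from per-configuration estimates and measurements. It assumes that overhead means the relative extra time of the chosen plan over the fastest one; the two configurations and all numbers are made up and are not the experimental data.

```python
def evaluate(configs):
    """Return (share of configs where the cheapest plan is fastest, avg overhead otherwise)."""
    hits = 0
    overheads = []
    for cfg in configs:
        chosen = min(cfg["estimated"], key=cfg["estimated"].get)
        best = min(cfg["measured"], key=cfg["measured"].get)
        if chosen == best:
            hits += 1
        else:
            t_best = cfg["measured"][best]
            overheads.append((cfg["measured"][chosen] - t_best) / t_best)
    hit_rate = hits / len(configs)
    avg_overhead = sum(overheads) / len(overheads) if overheads else 0.0
    return hit_rate, avg_overhead


# Made-up estimated costs and measured times for two configurations.
CONFIGS = [
    {
        "estimated": {"NoJ": 10, "NeJ": 8, "FlJ": 12},
        "measured": {"NoJ": 5.0, "NeJ": 4.0, "FlJ": 6.0},
    },
    {
        "estimated": {"NoJ": 7, "NeJ": 9, "FlJ": 8},
        "measured": {"NoJ": 5.5, "NeJ": 6.0, "FlJ": 5.0},
    },
]
hit_rate, avg_overhead = evaluate(CONFIGS)
print(hit_rate, avg_overhead)
```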
Evaluation & Conclusion

Alignment evaluation
- NeS is the cheapest to move away from and the most expensive to move towards
- FlS is the most expensive to move away from
- MongoDB joins are too expensive
- Cassandra lacks support for certain operations
[Figure panels: unclustered, clustered]

Join/full evaluations
- Execution costs for NoJ are the most stable across overlap scenarios
- In the no-overlap scenario, FlJ is clearly the winner
- In the partial-overlap scenario NeJ wins (as in the previous alignment step), and in the total-overlap scenario NoJ wins, but this depends on the imbalance of data towards a certain schema representation
[Figure panels: full, join]

Future developments
- Work with a broader set of entities and compose the final result by combining the join strategies
- Work with a broader set of queries
- Develop a dynamic cost model that takes into account aspects such as:
- CPU and network
- Distribution of resources within and across databases
- Changes in resources over time

Questions?
Thank you.
