Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing as well as other tasks that can be solved through such queries can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimations can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising the efficiency.
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
1. Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Andreas Wagner, Veli Bicer, and Duc Thanh Tran
EDBT/ICDT’13
Institute of Applied Informatics and Formal Description Methods (AIFB)
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association www.kit.edu
2. Introduction and Motivation
Selectivity Estimation for Text-Rich Data Graphs
Evaluation Results
4. Text-Rich Data-Graphs and Hybrid Queries
Increasing amount of semi-structured, text-rich data:
Structured data with unstructured texts (e.g., [1]).
Unstructured data (text) annotated with structured information (e.g., [2]).
[1] DBpedia – A Crystallization Point for the Web of Data.
[2] http://webdatacommons.org.
5. Text-Rich Data-Graphs and Hybrid Queries (2)
Focus of our work: conjunctive, hybrid queries.
(Figure: query pattern ?x –relation→ ?y –attribute→ „keyword“.)
Structured query predicates vs. unstructured, „string“ (query) predicates.
6. Problem Definition
Problem: Efficiently and effectively estimate the result set size
for a conjunctive, hybrid query Q.
Decompose the problem: sel(Q) = R(Q) * P(Q) [5].
[5] Selectivity estimation using probabilistic models.
R(Q): upper-bound cardinality of the result set.
P(Q): probability that Q has a non-empty result.
Correlations between query predicates (data elements) make the approximation of P(Q) hard.
(Figure: correlations between relation, attribute, and „keyword“ predicates in the query pattern ?x –relation→ ?y –attribute→ „keyword“.)
Correlations make estimations relying on „independence assumptions“ error-prone!
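To see why, consider a toy sketch (our own example, not from the paper): two correlated predicates over hypothetical movie data, where multiplying marginal probabilities under an independence assumption misestimates the true joint probability.

```python
# Toy data (hypothetical): entities with a class-like attribute and a keyword.
rows = [
    {"genre": "horror", "kw": "zombie"},
    {"genre": "horror", "kw": "zombie"},
    {"genre": "horror", "kw": "zombie"},
    {"genre": "comedy", "kw": "wedding"},
    {"genre": "comedy", "kw": "wedding"},
    {"genre": "comedy", "kw": "zombie"},
]

def p(pred):
    """Empirical probability of a predicate over the toy data."""
    return sum(1 for r in rows if pred(r)) / len(rows)

# True joint probability of both predicates holding:
p_joint = p(lambda r: r["genre"] == "horror" and r["kw"] == "zombie")
# Estimate under an independence assumption:
p_indep = p(lambda r: r["genre"] == "horror") * p(lambda r: r["kw"] == "zombie")

print(p_joint)  # 0.5
print(p_indep)  # 0.333... -- independence underestimates the correlated joint
```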
7. Contributions
Previous work focuses either on structured or on unstructured query constraints.
Structured: join samples [3], graph synopses [4], PRMs [5,6], …
Unstructured: fuzzy string matching [7,8], extraction operators [9,10], …
(Figure: correlations between relation, attribute, and „keyword“ predicates.)
We introduce a uniform model (BN+) for hybrid queries:
An instance of template-based BNs, well-suited for graph-structured data.
We extend the BN with string synopses for the estimation of string predicates.
8. SELECTIVITY ESTIMATION FOR
TEXT-RICH DATA GRAPHS
9. Preliminaries (1) – Data and Query Model
(Figure: data and query model.)
Data model: class nodes, entity nodes, and attribute value nodes (bags of n-grams), connected via relation edges and attribute edges.
Query model: relation predicates and string predicates; a string predicate connects a variable to a keyword node via contains.
10. Preliminaries (2) – Bayesian Networks (1)
Recall: sel(Q) = R(Q) * P(Q).
A Bayesian Network (BN) provides means for capturing joint
probability distributions (e.g., P(Q)).
A BN comprises a network structure and parameters:
Nodes = random variables.
Edges = dependencies.
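As a minimal sketch (our own toy example, not the paper's model), a BN factorizes a joint distribution into one conditional probability distribution (CPD) per node given its parents, e.g. P(X_movie, X_title) = P(X_movie) * P(X_title | X_movie):

```python
# CPDs for a two-node BN: X_movie -> X_title (hypothetical numbers).
p_movie = {True: 0.3, False: 0.7}                 # P(X_movie)
p_title_given_movie = {                           # P(X_title | X_movie)
    (True, "matrix"): 0.4, (True, "other"): 0.6,
    (False, "matrix"): 0.01, (False, "other"): 0.99,
}

def joint(movie, title):
    """P(X_movie = movie, X_title = title) via the BN factorization."""
    return p_movie[movie] * p_title_given_movie[(movie, title)]

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(m, t) for m in (True, False) for t in ("matrix", "other"))
print(total)  # 1.0
```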
11. Preliminaries (3) – Bayesian Networks (2)
A BN comprises a network structure and parameters (CPDs).
12. Preliminaries (4) – Bayesian Networks (3)
Template-based BNs: templates and template factors [16].
A template is a function Χ(α1,…,αk); each argument αi is a placeholder to be instantiated to obtain random variables.
Example: Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)}, with entity skeleton for Xperson = {p1, p2, p3}.
Template factors define probability distributions shared by all instantiated random variables of a given template (e.g., shared by all instantiations of XdirectedBy).
13. Template-Based BN for Graph-structured Data
We define a template for each …
Attribute a: Xa(α1). Entity skeleton: all entities having attribute a.
Class c: Xc(α1). Entity skeleton: all entities belonging to class c.
Relation r: Xr(α1,α2). Entity skeleton: all pairs of “source” and “target” entities having relation r.
(Figure: templates for relation spouse, attribute title, and class person.)
Advantages:
Template representation is compact.
Dynamic partitioning based on entity skeletons.
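A minimal sketch of template instantiation (function and variable names are ours, not the paper's API): given its entity skeleton, a template yields one ground random variable per entity (or entity pair, for relations), all sharing the template's factor.

```python
def make_template(name, skeleton):
    """Instantiate a template over its entity skeleton.

    skeleton: entities for attribute/class templates, entity pairs for
    relation templates. All resulting variables share one template factor.
    """
    def label(e):
        return e if isinstance(e, str) else ",".join(e)
    return {"name": name, "skeleton": skeleton,
            "variables": [f"X_{name}({label(e)})" for e in skeleton]}

# Class template over entities p1..p3 (the slide's X_person example):
x_person = make_template("person", ["p1", "p2", "p3"])
# Relation template over a "source"/"target" entity pair:
x_spouse = make_template("spouse", [("p1", "p2")])

print(x_person["variables"])  # ['X_person(p1)', 'X_person(p2)', 'X_person(p3)']
print(x_spouse["variables"])  # ['X_spouse(p1,p2)']
```

This is what makes the representation compact: structure and CPDs are stored once per template, however many ground variables the skeleton induces.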
14. Integration of String Synopses (1)
Problem: Large sample space for attribute-based templates.
Entire n-gram space as Ω.
In order to compactly represent Ω, being a large set of strings, we
use string synopses (e.g., [7,8,9,10]).
Intuitively, for an attribute-based template, a string synopsis does the following:
a) Decide how to “compactly represent” Ω.
b) Compute probabilities for strings given this compact space.
Some synopses even allow one to “guess” probabilities for unknown strings.
15. Integration of String Synopses (2)
In this work, we use n-gram-based synopses [10].
[10] Selectivity estimation for extraction operators over text data.
Consider, e.g., the top-k n-gram synopsis [10]:
Compute n-gram counts and store only the top-k n-grams.
Probabilities for known n-grams are exact.
Omitted n-grams are estimated based on heuristics using known n-grams.
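A sketch of the idea (simplified; the decision criteria and heuristics of [10] are more elaborate): count n-grams over an attribute's values, keep only the k most frequent, and fall back to a heuristic floor for omitted n-grams.

```python
from collections import Counter

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_topk_synopsis(values, k, n=3):
    counts = Counter(g for v in values for g in ngrams(v, n))
    total = sum(counts.values())
    kept = dict(counts.most_common(k))           # store only top-k n-grams
    # Heuristic for omitted n-grams: assume less than the smallest kept count.
    floor = (min(kept.values()) if kept else 1) / (2 * total)
    return kept, total, floor

def estimate(gram, synopsis):
    kept, total, floor = synopsis
    return kept[gram] / total if gram in kept else floor  # exact vs. guessed

syn = build_topk_synopsis(["database", "databases", "data graph"], k=5)
print(estimate("dat", syn))  # exact probability for a stored n-gram
print(estimate("xyz", syn))  # heuristic guess for an omitted n-gram
```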
16. Learning of BN+ (1): Structure (1)
Simplify the structure via product approximation using trees [11,12].
A similar technique has recently been applied for “Lightweight PRMs” [6].
[11] Approximating discrete probability distributions with dependence trees.
Fixed Structure Assumption:
a) Two templates X1 and X2 are conditionally independent given their
parents, if they do not share a common entity in their skeletons.
b) Each class template Xc has no parent.
c) Each relation template Xr is independent of any class template Xc,
given its parents.
17. Learning of BN+ (2): Structure (2)
Template Model
Using the fixed structure allows us to decompose structure learning:
First, learn „local“ correlations between attribute/class templates (e.g., Xmovie → Xtitle).
Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest.
Relation templates connect the different trees.
Overall, the network structure is determined by „overlapping“ entity skeletons and the fixed structure assumption.
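A sketch of the spanning-forest reduction (our simplification of the Chow-Liu idea [11]): score pairwise dependencies between templates, e.g. by mutual information, then keep only a maximum spanning forest so weak correlations are dropped.

```python
def max_spanning_forest(nodes, weighted_edges):
    """Kruskal-style: keep the strongest edges that do not form a cycle."""
    parent = {n: n for n in nodes}  # union-find forest
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    kept = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v, w))
    return kept

# Hypothetical dependency scores (e.g. mutual information) between templates:
edges = [("X_movie", "X_title", 0.9), ("X_movie", "X_info", 0.7),
         ("X_title", "X_info", 0.4)]
forest = max_spanning_forest(["X_movie", "X_title", "X_info"], edges)
print(forest)  # keeps the two strongest edges; drops ('X_title', 'X_info')
```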
18. Learning of BN+ (3): Parameters
Based on the learned structure, parameters are learned by
collecting sufficient statistics (i.e., frequency counts).
Speed up parameter learning via:
Using queries to obtain sufficient statistics.
Using caching during structure/parameter learning.
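A minimal sketch (our toy data, not the paper's learning procedure): a template CPD obtained from sufficient statistics, i.e. frequency counts over observed (class, attribute-value) pairs.

```python
from collections import Counter

# Hypothetical observations of (class, attribute value) pairs in a data graph:
observations = [("movie", "matrix"), ("movie", "other"),
                ("movie", "other"), ("book", "other")]

def learn_cpd(obs):
    """Estimate P(value | class) from joint and marginal frequency counts."""
    joint = Counter(obs)                    # sufficient statistics
    marginal = Counter(c for c, _ in obs)
    return {(c, v): n / marginal[c] for (c, v), n in joint.items()}

cpd = learn_cpd(observations)
print(cpd[("movie", "other")])  # 2/3: 2 of 3 movie observations have "other"
```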
19. Estimating P(Q) using BN+ (1)
At runtime, templates are instantiated to construct a query-specific ground BN.
(Figure: template model + query → query-specific ground BN.)
Assignment is a string synopsis element.
20. Estimating P(Q) using BN+ (2)
Recall: sel(Q) = R(Q) * P(Q).
Given a query-specific ground BN, we use inference to obtain the
joint probability P(Q).
Query-specific Ground BN
“Correction” using string synopsis.
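Schematically (our simplification; real inference multiplies CPD entries along the ground BN's dependency structure), P(Q) becomes a product of per-variable probabilities, with a string-synopsis correction factor for keywords the synopsis only approximates:

```python
def estimate_p(ground_bn_probs, correction=1.0):
    """ground_bn_probs: one CPD entry per instantiated random variable."""
    p = 1.0
    for prob in ground_bn_probs:
        p *= prob
    return p * correction  # "correction" from the string synopsis

# Hypothetical ground BN for a three-predicate query:
print(estimate_p([0.3, 0.5, 0.8], correction=0.9))  # ≈ 0.108
```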
21. EVALUATION
22. Evaluation (1) – Setting
Data: IMDB [14] and DBLP [15].
IMDB featured more correlations than DBLP.
Different results between DBLP and IMDB show „relative benefit“.
Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.
[13] Spark2: Top-k keyword query in relational databases.
[14] A framework for evaluating database keyword search strategies.
Systems: We used n-gram-based string synopses [10]:
random samples of 1-grams,
top-k 1-grams,
stratified bloom filters on 1-grams.
String predicates were integrated via (1) an independence (ind) or (2) a
conditional independence (bn) assumption.
23. Evaluation (2) – Setting (2)
Synopsis size:
Overall synopsis size depends mainly on string synopsis size.
Synopsis sizes ∈ {2, 4, 20, 40} MByte of memory.
Metrics:
Efficiency: selectivity estimation time.
Effectiveness: multiplicative error [17].
[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
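The multiplicative error can be sketched as follows (our reading of [17]: a symmetric factor ≥ 1 by which the estimate deviates from the exact selectivity):

```python
def mult_error(exact, estimated):
    """Factor by which the estimate under-/overestimates the exact selectivity."""
    lo, hi = sorted([max(exact, 1e-9), max(estimated, 1e-9)])  # guard zeros
    return hi / lo

print(mult_error(100, 50))   # 2.0: underestimated by factor 2
print(mult_error(100, 400))  # 4.0: overestimated by factor 4
```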
24. Evaluation (3) – Effectiveness – IMDB
25. Evaluation (4) – Effectiveness – DBLP
26. Evaluation (5) – Efficiency
27. CONCLUSION
28. Conclusion
Tackled the problem of selectivity estimation for conjunctive,
hybrid queries.
We propose a template-based BN, which is well-suited for
graph-structured data.
For string predicates, we further propose the integration of
string synopses into this model.
Experiments showed that:
If there are correlations between un-/structured data elements, the
accuracy of selectivity estimation can be greatly improved via BN+.
BN+ caused no overhead in terms of efficiency.
29. QUESTIONS
30. REFERENCES
31. References
[1] Christian Bizer et al. DBpedia – A Crystallization Point for the Web of Data. Journal of
Web Semantics, 7(3):154–165, 2009.
[2] http://webdatacommons.org/
[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for
approximate query answering. In SIGMOD, pages 275–286, 1999.
[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity
estimation. In SIGMOD, pages 205–216, 2006.
[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models.
In SIGMOD, pages 461–472, 2001.
[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for
selectivity estimation without independence assumptions. PVLDB, 4(11):852–863,
2011.
[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates:
Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.
[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In
VLDB, pages 397–408, 2005.
32. References (2)
[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information
extraction using datalog with embedded extraction predicates. In VLDB, pages
1033–1044, 2007.
[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for
extraction operators over text data. In ICDE, pages 685–696, 2011.
[11] C. Chow and C. Liu. Approximating discrete probability distributions with
dependence trees. IEEE Transactions on Information Theory, 14(3):462–467,1968.
[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine
Learning Research, 1:1–48, 2001.
[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword
query in relational databases. IEEE Transactions on Knowledge and Data
Engineering, 23(12):1763–1780, 2011.
[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword
search strategies. In CIKM, pages 729–738, 2010.
[15] http://knoesis.org/swetodblp/
[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT press, 2009.
[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good:
Dependency-based histogram synopses for high-dimensional data. In SIGMOD,
pages 199-210, 2001.
Editor's notes
Queries contain query predicates for structured and unstructured query constraints, resembling SPARQL queries with a FILTER contains function. Unstructured query predicates = string predicates.
* However, effective estimation of P(Q) is important for query optimizers relying on accurate estimates for intermediate query results.
A BN is a graphical representation of a set of conditional independencies; it expresses a factorization of the joint distribution.
* Given a template X(α1,…, αn), an entity skeleton of X is defined as E(α1, …, αn) ⊆ E(α1) × … × E(αn), where each E(αi) ⊆ VE specifies all possible entity assignments to αi.
In a relational context, data is stored in tables corresponding to relations captured by a conceptual model. Further, relation names are explicitly given in a query – stated in a FROM clause. Correspondingly, previous works [10, 23] employ a PRM to model selection predicates through random variables of the form XR.A, where R is a relational table and A is an attribute. For instance, XPerson.name = “Audrey” is a random variable capturing a selection on table Person where name equals “Audrey”. Analogously, join predicates are modeled as binary random variables that involve two explicitly specified tables. Further, schema information may be queried via class predicates, which are not supported in the relational setting.
Inferencing costs are driven by two factors: (1) the dependency structure of a BN, and (2) sample space sizes. Existing works on PRMs have focused on the former, targeting a lightweight, tree-shaped BN structure [23]. The latter aspect, however, is crucial, as CPD sizes are a mere reflection of sample space sizes. Essentially, for supporting string predicates with all possible keywords, Ω(Xa) must capture all words and phrases which occur in a’s values. In order to compactly represent Ω, being a large set of strings, we propose the use of string synopses such as Markov tables [4], histograms [13] or n-gram synopses [25].
Then, the space Ba is reduced by using a decision criterion to dictate which n-grams ∈ Ba to include in a synopsis sample space Ω(Xa ). That is, a synopsis space represents a subset of “important” n-grams. Note, n-gram synopses are most accurate, as each synopsis element represents exactly one n-gram ∈ Ba – in contrast to, e.g., histograms. Recent work has outlined several such decision criteria [25].
Recently applied for PRMs [6]. We impose that strong correlations among templates only occur if they share some common entities – they need to “talk about the same things” (Def. 2-a). We argue that there is a causal dependence (independence) between a class and an attribute (relation) template (Def. 2-b, -c). In other words, assigning an entity to a given class causally affects the probability of its attribute values, which in turn influences the probability of observing a particular relation.
Using the fixed structure allows us to decompose structure learning: First, learn “local” correlations between attribute/class templates. Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest. Connect the forest of trees via relation templates.
Such a template-based approach has the merit of being compact. The number of templates is far less than the number of random variables in a ground BN. Structure and parameters (CPDs) are learned for templates only. At runtime, templates are instantiated with entities to construct a ground BN. For inferencing, a CPD learned for a template is shared among all random variables in the ground BN that instantiate that template.
Missing synopsis values; multiple value assignments.
DBLP as well as IMDB hold text-rich attributes like name, label, or info. However, IMDB contains more text, with strong correlations in IMDB data between/among text and/or structure. In particular, we noticed strong dependencies during structure learning between values of attributes such as label and info. Our hypothesis is that assuming independence hurts the quality of selectivity estimates, given datasets that exhibit correlations. We also used DBLP, which, on the other hand, shows almost no such correlations. Using DBLP data, we expect accuracy differences to be less significant. Our workload includes queries containing [2, 11] predicates in total: [0, 4] relation, [1, 7] string, and [1, 4] class predicates (cf. Tab. 2).
The key factor driving overall synopsis size was the employed string synopsis. Experiments were run on a Linux server with two Intel Xeon 5140 CPUs (each with 2 cores at 2.33 GHz), 48 GB RAM (with 16 GB assigned to the JVM), and a RAID10 with IBM SAS 148 GB 10k rpm disks. Before query execution, all OS caches were cleared.
sel(Q) and ŝel(Q) denote the exact and estimated selectivity for Q, respectively. Intuitively, the multiplicative error m_e represents the factor by which ŝel(Q) under-/overestimates sel(Q). Best accuracy results were achieved by ind∗ and bn∗ having a size ≥ 20 MByte. Further, the results confirmed our conjecture that the degree of data correlations has a significant impact on the overall accuracy performance differences between ind∗ and bn∗ approaches. That is, a high degree of correlation in the IMDB dataset translated to large accuracy differences, while the improvement bn∗ could achieve over the baseline was small for DBLP. For the IMDB dataset, bnsbf could reduce errors of the indsbf approach by 93%, while improvements were much smaller given DBLP. We noticed the error to increase with the number of predicates. This effect is expected, as more query predicates (hence more “difficult” queries) lead to an increasingly error-prone probability estimation. An interesting observation is that ind∗ outperformed bn∗ for some queries – see IMDB queries with 5 predicates and DBLP queries with 4 predicates (Fig. 4-b and -f). For instance, given IMDB query Q28, indtop-k achieved 13% better results than bntop-k. In such cases, string query predicates were translated to multiple values (1-grams) that are assigned to one single random variable.
For instance, for DBLP queries with string predicates name and label, there are no significant correlations in our BN. Thus, the probabilities obtained by bn∗ were almost identical to those of ind∗. However, while ind∗ led to fairly good estimates for the overall query load on DBLP, we could achieve more accurate selectivity computations via bn∗ for specific “correlated” queries. For instance, for DBLP query Q1 we could achieve a 10% better selectivity estimation.