Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing as well as other tasks that can be solved through such queries can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimations can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising the efficiency.
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
1. Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Andreas Wagner, Veli Bicer, and Duc Thanh Tran
EDBT/ICDT’13
Institute of Applied Informatics and Formal Description Methods (AIFB)
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association www.kit.edu
2. Introduction and Motivation
Selectivity Estimation for Text-Rich Data Graphs
Evaluation Results
4. Text-Rich Data-Graphs and Hybrid Queries
Increasing amount of semi-structured, text-rich data:
Structured data with unstructured texts (e.g., [1]).
Unstructured data (text) annotated with structured information (e.g., [2]).
[1] DBpedia – A Crystallization Point for the Web of Data.
[2] http://webdatacommons.org.
5. Text-Rich Data-Graphs and Hybrid Queries (2)
Focus of our work: conjunctive, hybrid queries.
(Figure: query pattern ?x –relation→ ?y –attribute→ „keyword“.)
Structured query predicates vs. unstructured, „string“ (query) predicates.
6. Problem Definition
Problem: Efficiently and effectively estimate the result set size
for a conjunctive, hybrid query Q.
Decompose the problem: sel(Q) = R(Q) * P(Q) [5].
[5] Selectivity estimation using probabilistic models.
R(Q): upper-bound cardinality of the result set.
P(Q): probability that Q has a non-empty result.
Correlations between query predicates (data elements) make the approximation of P(Q) hard.
(Figure: correlations between relation, attribute, and „keyword“ predicates in the query pattern ?x –relation→ ?y –attribute→ „keyword“.)
Correlations make estimations relying on „independence assumptions“ error-prone!
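To see why, consider a toy sketch (our own example, not from the paper): two correlated predicates over hypothetical movie data, where multiplying marginal probabilities under an independence assumption misestimates the true joint probability.

```python
# Toy data (hypothetical): entities with a class-like attribute and a keyword.
rows = [
    {"genre": "horror", "kw": "zombie"},
    {"genre": "horror", "kw": "zombie"},
    {"genre": "horror", "kw": "zombie"},
    {"genre": "comedy", "kw": "wedding"},
    {"genre": "comedy", "kw": "wedding"},
    {"genre": "comedy", "kw": "zombie"},
]

def p(pred):
    """Empirical probability of a predicate over the toy data."""
    return sum(1 for r in rows if pred(r)) / len(rows)

# True joint probability of both predicates holding:
p_joint = p(lambda r: r["genre"] == "horror" and r["kw"] == "zombie")
# Estimate under an independence assumption:
p_indep = p(lambda r: r["genre"] == "horror") * p(lambda r: r["kw"] == "zombie")

print(p_joint)  # 0.5
print(p_indep)  # 0.333... -- independence underestimates the correlated joint
```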
7. Contributions
Previous work focuses either on structured or on unstructured query constraints.
Structured: join samples [3], graph synopses [4], PRMs [5,6], …
Unstructured: fuzzy string matching [7,8], extraction operators [9,10], …
(Figure: correlations between relation, attribute, and „keyword“ predicates.)
We introduce a uniform model (BN+) for hybrid queries:
An instance of template-based BNs, well-suited for graph-structured data.
We extend the BN with string synopses for the estimation of string predicates.
8. SELECTIVITY ESTIMATION FOR
TEXT-RICH DATA GRAPHS
9. Preliminaries (1) – Data and Query Model
(Figure: data and query model.)
Data model: class nodes, entity nodes, and attribute value nodes (bags of n-grams), connected via relation edges and attribute edges.
Query model: relation predicates and string predicates; a string predicate connects a variable to a keyword node via contains.
10. Preliminaries (2) – Bayesian Networks (1)
Recall: sel(Q) = R(Q) * P(Q).
A Bayesian Network (BN) provides means for capturing joint
probability distributions (e.g., P(Q)).
A BN comprises a network structure and parameters:
Nodes = random variables.
Edges = dependencies.
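As a minimal sketch (our own toy example, not the paper's model), a BN factorizes a joint distribution into one conditional probability distribution (CPD) per node given its parents, e.g. P(X_movie, X_title) = P(X_movie) * P(X_title | X_movie):

```python
# CPDs for a two-node BN: X_movie -> X_title (hypothetical numbers).
p_movie = {True: 0.3, False: 0.7}                 # P(X_movie)
p_title_given_movie = {                           # P(X_title | X_movie)
    (True, "matrix"): 0.4, (True, "other"): 0.6,
    (False, "matrix"): 0.01, (False, "other"): 0.99,
}

def joint(movie, title):
    """P(X_movie = movie, X_title = title) via the BN factorization."""
    return p_movie[movie] * p_title_given_movie[(movie, title)]

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(m, t) for m in (True, False) for t in ("matrix", "other"))
print(total)  # 1.0
```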
11. Preliminaries (3) – Bayesian Networks (2)
A BN comprises a network structure and parameters (CPDs).
12. Preliminaries (4) – Bayesian Networks (3)
Template-based BNs: templates and template factors [16].
A template is a function Χ(α1,…,αk); each argument αi is a placeholder to be instantiated to obtain random variables.
Example: Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)}, with entity skeleton for Xperson = {p1, p2, p3}.
Template factors define probability distributions shared by all instantiated random variables of a given template (e.g., shared by all instantiations of XdirectedBy).
13. Template-Based BN for Graph-structured Data
We define a template for each …
Attribute a: Xa(α1). Entity skeleton: all entities having attribute a.
Class c: Xc(α1). Entity skeleton: all entities belonging to class c.
Relation r: Xr(α1,α2). Entity skeleton: all pairs of “source” and “target” entities having relation r.
(Figure: templates for relation spouse, attribute title, and class person.)
Advantages:
Template representation is compact.
Dynamic partitioning based on entity skeletons.
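A minimal sketch of template instantiation (function and variable names are ours, not the paper's API): given its entity skeleton, a template yields one ground random variable per entity (or entity pair, for relations), all sharing the template's factor.

```python
def make_template(name, skeleton):
    """Instantiate a template over its entity skeleton.

    skeleton: entities for attribute/class templates, entity pairs for
    relation templates. All resulting variables share one template factor.
    """
    def label(e):
        return e if isinstance(e, str) else ",".join(e)
    return {"name": name, "skeleton": skeleton,
            "variables": [f"X_{name}({label(e)})" for e in skeleton]}

# Class template over entities p1..p3 (the slide's X_person example):
x_person = make_template("person", ["p1", "p2", "p3"])
# Relation template over a "source"/"target" entity pair:
x_spouse = make_template("spouse", [("p1", "p2")])

print(x_person["variables"])  # ['X_person(p1)', 'X_person(p2)', 'X_person(p3)']
print(x_spouse["variables"])  # ['X_spouse(p1,p2)']
```

This is what makes the representation compact: structure and CPDs are stored once per template, however many ground variables the skeleton induces.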
14. Integration of String Synopses (1)
Problem: Large sample space for attribute-based templates.
Entire n-gram space as Ω.
In order to compactly represent Ω, being a large set of strings, we
use string synopses (e.g., [7,8,9,10]).
Intuitively, for an attribute-based template, a string synopsis does the following:
a) Decide how to “compactly represent” Ω.
b) Compute probabilities for strings given this compact space.
Some synopses even allow one to “guess” probabilities for unknown strings.
15. Integration of String Synopses (2)
In this work, we use n-gram-based synopses [10].
[10] Selectivity estimation for extraction operators over text data.
Consider, e.g., the top-k n-gram synopsis [10]:
Compute n-gram counts and store only the top-k n-grams.
Probabilities for known n-grams are exact.
Omitted n-grams are estimated based on heuristics using known n-grams.
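A sketch of the idea (simplified; the decision criteria and heuristics of [10] are more elaborate): count n-grams over an attribute's values, keep only the k most frequent, and fall back to a heuristic floor for omitted n-grams.

```python
from collections import Counter

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_topk_synopsis(values, k, n=3):
    counts = Counter(g for v in values for g in ngrams(v, n))
    total = sum(counts.values())
    kept = dict(counts.most_common(k))           # store only top-k n-grams
    # Heuristic for omitted n-grams: assume less than the smallest kept count.
    floor = (min(kept.values()) if kept else 1) / (2 * total)
    return kept, total, floor

def estimate(gram, synopsis):
    kept, total, floor = synopsis
    return kept[gram] / total if gram in kept else floor  # exact vs. guessed

syn = build_topk_synopsis(["database", "databases", "data graph"], k=5)
print(estimate("dat", syn))  # exact probability for a stored n-gram
print(estimate("xyz", syn))  # heuristic guess for an omitted n-gram
```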
16. Learning of BN+ (1): Structure (1)
Simplify the structure via product approximation using trees [11,12].
A similar technique has recently been applied for “Lightweight PRMs” [6].
[11] Approximating discrete probability distributions with dependence trees.
Fixed Structure Assumption:
a) Two templates X1 and X2 are conditionally independent given their
parents, if they do not share a common entity in their skeletons.
b) Each class template Xc has no parent.
c) Each relation template Xr is independent of any class template Xc,
given its parents.
17. Learning of BN+ (2): Structure (2)
Template Model
Using the fixed structure allows us to decompose structure learning:
First, learn „local“ correlations between attribute/class templates (e.g., Xmovie → Xtitle).
Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest.
Relation templates connect the different trees.
Overall, the network structure is determined by „overlapping“ entity skeletons and the fixed structure assumption.
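A sketch of the spanning-forest reduction (our simplification of the Chow-Liu idea [11]): score pairwise dependencies between templates, e.g. by mutual information, then keep only a maximum spanning forest so weak correlations are dropped.

```python
def max_spanning_forest(nodes, weighted_edges):
    """Kruskal-style: keep the strongest edges that do not form a cycle."""
    parent = {n: n for n in nodes}  # union-find forest
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    kept = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v, w))
    return kept

# Hypothetical dependency scores (e.g. mutual information) between templates:
edges = [("X_movie", "X_title", 0.9), ("X_movie", "X_info", 0.7),
         ("X_title", "X_info", 0.4)]
forest = max_spanning_forest(["X_movie", "X_title", "X_info"], edges)
print(forest)  # keeps the two strongest edges; drops ('X_title', 'X_info')
```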
18. Learning of BN+ (3): Parameters
Based on the learned structure, parameters are learned by
collecting sufficient statistics (i.e., frequency counts).
Speed up parameter learning via:
Using queries to obtain sufficient statistics.
Using caching during structure/parameter learning.
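A minimal sketch (our toy data, not the paper's learning procedure): a template CPD obtained from sufficient statistics, i.e. frequency counts over observed (class, attribute-value) pairs.

```python
from collections import Counter

# Hypothetical observations of (class, attribute value) pairs in a data graph:
observations = [("movie", "matrix"), ("movie", "other"),
                ("movie", "other"), ("book", "other")]

def learn_cpd(obs):
    """Estimate P(value | class) from joint and marginal frequency counts."""
    joint = Counter(obs)                    # sufficient statistics
    marginal = Counter(c for c, _ in obs)
    return {(c, v): n / marginal[c] for (c, v), n in joint.items()}

cpd = learn_cpd(observations)
print(cpd[("movie", "other")])  # 2/3: 2 of 3 movie observations have "other"
```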
19. Estimating P(Q) using BN+ (1)
At runtime, templates are instantiated to construct a query-specific ground BN.
(Figure: template model + query → query-specific ground BN.)
Assignment is a string synopsis element.
20. Estimating P(Q) using BN+ (2)
Recall: sel(Q) = R(Q) * P(Q).
Given a query-specific ground BN, we use inference to obtain the
joint probability P(Q).
Query-specific Ground BN
“Correction” using string synopsis.
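Schematically (our simplification; real inference multiplies CPD entries along the ground BN's dependency structure), P(Q) becomes a product of per-variable probabilities, with a string-synopsis correction factor for keywords the synopsis only approximates:

```python
def estimate_p(ground_bn_probs, correction=1.0):
    """ground_bn_probs: one CPD entry per instantiated random variable."""
    p = 1.0
    for prob in ground_bn_probs:
        p *= prob
    return p * correction  # "correction" from the string synopsis

# Hypothetical ground BN for a three-predicate query:
print(estimate_p([0.3, 0.5, 0.8], correction=0.9))  # ≈ 0.108
```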
21. EVALUATION
22. Evaluation (1) – Setting
Data: IMDB [14] and DBLP [15].
IMDB featured more correlations than DBLP.
Different results between DBLP and IMDB show „relative benefit“.
Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.
[13] Spark2: Top-k keyword query in relational databases.
[14] A framework for evaluating database keyword search strategies.
Systems: We used n-gram-based string synopses [10]:
random samples of 1-grams,
top-k 1-grams,
stratified bloom filters on 1-grams.
String predicates were integrated via (1) an independence (ind) or (2) a
conditional independence (bn) assumption.
23. Evaluation (2) – Setting (2)
Synopsis size:
Overall synopsis size depends mainly on string synopsis size.
Synopsis sizes ∈ {2, 4, 20, 40} MByte of memory.
Metrics:
Efficiency: selectivity estimation time.
Effectiveness: multiplicative error [17].
[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
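The multiplicative error can be sketched as follows (our reading of [17]: a symmetric factor ≥ 1 by which the estimate deviates from the exact selectivity):

```python
def mult_error(exact, estimated):
    """Factor by which the estimate under-/overestimates the exact selectivity."""
    lo, hi = sorted([max(exact, 1e-9), max(estimated, 1e-9)])  # guard zeros
    return hi / lo

print(mult_error(100, 50))   # 2.0: underestimated by factor 2
print(mult_error(100, 400))  # 4.0: overestimated by factor 4
```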
24. Evaluation (3) – Effectiveness – IMDB
25. Evaluation (4) – Effectiveness – DBLP
26. Evaluation (5) – Efficiency
27. CONCLUSION
28. Conclusion
Tackled the problem of selectivity estimation for conjunctive,
hybrid queries.
We propose a template-based BN, which is well-suited for
graph-structured data.
For string predicates, we further propose the integration of
string synopses into this model.
Experiments showed that:
If there are correlations between un-/structured data elements, the
accuracy of selectivity estimation can be greatly improved via BN+.
BN+ caused no overhead in terms of efficiency.
29. QUESTIONS
30. REFERENCES
31. References
[1] Christian Bizer et al. DBpedia – A Crystallization Point for the Web of Data. Journal of
Web Semantics, 7(3):154–165, 2009.
[2] http://webdatacommons.org/
[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for
approximate query answering. In SIGMOD, pages 275–286, 1999.
[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity
estimation. In SIGMOD, pages 205–216, 2006.
[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models.
In SIGMOD, pages 461–472, 2001.
[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for
selectivity estimation without independence assumptions. PVLDB, 4(11):852–863,
2011.
[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates:
Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.
[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In
VLDB, pages 397–408, 2005.
32. References (2)
[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information
extraction using datalog with embedded extraction predicates. In VLDB, pages
1033–1044, 2007.
[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for
extraction operators over text data. In ICDE, pages 685–696, 2011.
[11] C. Chow and C. Liu. Approximating discrete probability distributions with
dependence trees. IEEE Transactions on Information Theory, 14(3):462–467,1968.
[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine
Learning Research, 1:1–48, 2001.
[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword
query in relational databases. IEEE Transactions on Knowledge and Data
Engineering, 23(12):1763–1780, 2011.
[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword
search strategies. In CIKM, pages 729–738, 2010.
[15] http://knoesis.org/swetodblp/
[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT press, 2009.
[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good:
Dependency-based histogram synopses for high-dimensional data. In SIGMOD,
pages 199-210, 2001.
Editor's notes
Queries contain query predicates for structured and unstructured query constraints, resembling SPARQL queries with a FILTER contains function. Unstructured query predicates = string predicates.
* However, effective estimation of P(Q) is important for query optimizers relying on accurate estimates for intermediate query results.
A BN is a graphical representation of a set of conditional independencies; it expresses a factorization of the joint distribution.
* Given a template X(α1,…, αn), an entity skeleton of X is defined as E(α1, …, αn) ⊆ E(α1) × … × E(αn), where each E(αi) ⊆ VE specifies all possible entity assignments to αi.
In a relational context, data is stored in tables corresponding to relations captured by a conceptual model. Further, relation names are explicitly given in a query – stated in a FROM clause. Correspondingly, previous works [10, 23] employ a PRM to model selection predicates through random variables of the form XR.A, where R is a relational table and A is an attribute. For instance, XPerson.name = “Audrey” is a random variable capturing a selection on table Person where name equals “Audrey”. Analogously, join predicates are modeled as binary random variables that involve two explicitly specified tables. Further, schema information may be queried via class predicates, which are not supported in the relational setting.
Inferencing costs are driven by two factors: (1) the dependency structure of a BN, and (2) sample space sizes. Existing works on PRMs have focused on the former, targeting a lightweight, tree-shaped BN structure [23]. The latter aspect, however, is crucial, as CPD sizes are a mere reflection of sample space sizes. Essentially, for supporting string predicates with all possible keywords, Ω(Xa) must capture all words and phrases which occur in a’s values. In order to compactly represent Ω, being a large set of strings, we propose the use of string synopses such as Markov tables [4], histograms [13] or n-gram synopses [25].
Then, the space Ba is reduced by using a decision criterion to dictate which n-grams ∈ Ba to include in a synopsis sample space Ω(Xa ). That is, a synopsis space represents a subset of “important” n-grams. Note, n-gram synopses are most accurate, as each synopsis element represents exactly one n-gram ∈ Ba – in contrast to, e.g., histograms. Recent work has outlined several such decision criteria [25].
Recently applied for PRMs [6]. We impose that strong correlations among templates only occur if they share some common entities – they need to “talk about the same things” (Def. 2-a). We argue that there is a causal dependence (independence) between a class and an attribute (relation) template (Def. 2-b, -c). In other words, assigning an entity to a given class causally affects the probability of its attribute values, which in turn influences the probability of observing a particular relation.
Using the fixed structure allows us to decompose structure learning: First, learn “local” correlations between attribute/class templates. Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest. Connect the forest of trees via relation templates.
Such a template-based approach has the merit of being compact. The number of templates is far less than the number of random variables in a ground BN. Structure and parameters (CPDs) are learned for templates only. At runtime, templates are instantiated with entities to construct a ground BN. For inferencing, a CPD learned for a template is shared among all random variables in the ground BN that instantiate that template.
Missing synopsis values; multiple value assignments.
DBLP as well as IMDB hold text-rich attributes like name, label, or info. However, IMDB contains more text, with strong correlations in IMDB data between/among text and/or structure. In particular, we noticed strong dependencies during structure learning between values of attributes such as label and info. Our hypothesis is that assuming independence hurts the quality of selectivity estimates, given datasets that exhibit correlations. We also used DBLP, which, on the other hand, shows almost no such correlations. Using DBLP data, we expect accuracy differences to be less significant. Our workload includes queries containing [2, 11] predicates in total: [0, 4] relation, [1, 7] string, and [1, 4] class predicates (cf. Tab. 2).
The key factor driving overall synopsis size was the employed string synopsis. Experiments were run on a Linux server with two Intel Xeon 5140 CPUs (each with 2 cores at 2.33 GHz), 48 GB RAM (with 16 GB assigned to the JVM), and a RAID10 with IBM SAS 148 GB 10k rpm disks. Before query execution, all OS caches were cleared.
sel(Q) and ŝel(Q) denote the exact and estimated selectivity for Q, respectively. Intuitively, the multiplicative error m_e represents the factor by which ŝel(Q) under-/overestimates sel(Q). Best accuracy results were achieved by ind∗ and bn∗ having a size ≥ 20 MByte. Further, the results confirmed our conjecture that the degree of data correlations has a significant impact on the overall accuracy performance differences between ind∗ and bn∗ approaches. That is, a high degree of correlation in the IMDB dataset translated to large accuracy differences, while the improvement bn∗ could achieve over the baseline was small for DBLP. For the IMDB dataset, bnsbf could reduce errors of the indsbf approach by 93%, while improvements were much smaller given DBLP. We noticed the error to increase with the number of predicates. This effect is expected, as more query predicates (hence more “difficult” queries) lead to an increasingly error-prone probability estimation. An interesting observation is that ind∗ outperformed bn∗ for some queries – see IMDB queries with 5 predicates and DBLP queries with 4 predicates (Fig. 4-b and -f). For instance, given IMDB query Q28, indtop-k achieved 13% better results than bntop-k. In such cases, string query predicates were translated to multiple values (1-grams) that are assigned to one single random variable.
For instance, for DBLP queries with string predicates name and label, there are no significant correlations in our BN. Thus, the probabilities obtained by bn∗ were almost identical to those of ind∗. However, while ind∗ led to fairly good estimates for the overall query load on DBLP, we could achieve more accurate selectivity computations via bn∗ for specific “correlated” queries. For instance, for DBLP query Q1 we could achieve a 10% better selectivity estimation.