Selectivity Estimation for Hybrid Queries over Text-Rich
   Data Graphs

   Andreas Wagner, Veli Bicer, and Duc Thanh Tran
   EDBT/ICDT’13

Institute of Applied Informatics and Formal Description Methods (AIFB)




KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association                    www.kit.edu
Introduction and Motivation



    Selectivity Estimation for Text-Rich Data Graphs



                  Evaluation Results




INTRODUCTION & MOTIVATION


Text-Rich Data-Graphs and Hybrid Queries

      Increasing amount of semi-structured, text-rich data:
          Structured data with unstructured texts (e.g., [1]).
          Unstructured data annotated with structured information (e.g., [2]).

      [Figure: data sources arranged between "Structure" and "Text".]

      [1] DBpedia – A Crystallization Point for the Web of Data.
      [2] http://webdatacommons.org.
Text-Rich Data-Graphs and Hybrid Queries (2)

      Focus of our work: conjunctive, hybrid queries.

          ?x   relation   ?y   attribute   “keyword”

          Structured query predicates: relation predicates (the "Structure" side).
          Unstructured query predicates: “string” (query) predicates over
          attributes and keywords (the "Text" side).
Problem Definition

      Problem: Efficiently and effectively estimate the result set size
      for a conjunctive, hybrid query Q.

          Decompose the problem: sel(Q) = R(Q) * P(Q) [5].
          R(Q): upper-bound cardinality of the result set.
          P(Q): probability that Q has a non-empty result.

          Correlations between query predicates (data elements) make
          the approximation of P(Q) hard.

          [5] Selectivity estimation using probabilistic models.
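
      The decomposition can be illustrated with a minimal sketch. The toy data, the variable names, and the choice R(Q) = number of entities are assumptions made for illustration only; they do not come from the slides. The sketch contrasts an estimate based on the observed joint probability with one based on an independence assumption.

```python
def selectivity(upper_bound, probability):
    """sel(Q) = R(Q) * P(Q)."""
    return upper_bound * probability

# Hypothetical toy data: one (is_movie, title_contains_keyword) pair per entity.
rows = ([(True, True)] * 40 + [(True, False)] * 10
        + [(False, True)] * 5 + [(False, False)] * 45)
n = len(rows)  # used here as the upper bound R(Q) for a single-variable query

p_movie = sum(1 for a, _ in rows if a) / n
p_keyword = sum(1 for _, b in rows if b) / n
p_joint = sum(1 for a, b in rows if a and b) / n

# Estimate using the observed joint probability (captures the correlation).
sel_joint = selectivity(n, p_joint)
# Estimate under an independence assumption: P(Q) ~ P(movie) * P(keyword).
sel_independent = selectivity(n, p_movie * p_keyword)
```

      In this toy data the two predicates are strongly correlated, so the independence-based estimate is roughly half of the count obtained from the joint probability.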

          [Figure: query pattern ?x relation ?y attribute “keyword”, annotated
          with correlations among relation predicates, among attribute
          predicates and keywords, and across both parts.]

          Correlations make estimations relying on
          “independence assumptions” error-prone!
Contributions

      Previous work focuses either on structured or on unstructured
      query constraints.


      Structured query constraints:
          - Graph synopses [4]
          - Join samples [3]
          - PRMs [5,6]
          - …
      Unstructured query constraints:
          - Fuzzy string matching [7,8]
          - Extraction operators [9,10]
          - …

      [Figure: query pattern ?x relation ?y attribute “keyword”, annotated
      with correlations within and across the structured and unstructured parts.]



      We introduce a uniform model (BN+) for hybrid queries:
          An instance of a template-based BN, well-suited for graph-structured data.
          Extends the BN with string synopses for the estimation of string predicates.




SELECTIVITY ESTIMATION FOR
    TEXT-RICH DATA GRAPHS

Preliminaries (1) – Data and Query Model

       Data (figure labels): class node, entity node, attribute value node
       (a bag of n-grams), relation edge, attribute edge.

       Query (figure labels): relation predicate, string predicate
       ("contains"), keyword node.
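
       A minimal sketch of how an attribute value could be turned into a bag of n-grams and matched by a "contains" string predicate. The slides do not state whether n-grams are word-level or character-level; word-level n-grams, the lowercasing, and the function names are assumptions made here for illustration.

```python
from collections import Counter

def ngram_bag(text, n=1):
    """Bag (multiset) of word-level n-grams of an attribute value."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def contains(bag, keyword):
    """String predicate: does the attribute value contain the keyword n-gram?"""
    return bag[keyword.lower()] > 0

# Hypothetical attribute value of a movie entity.
title_bag = ngram_bag("The Dark Knight")
title_bigrams = ngram_bag("The Dark Knight", n=2)
```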

Preliminaries (2) – Bayesian Networks (1)
       Recall: sel(Q) = R(Q) * P(Q)


       A Bayesian network (BN) provides a means for capturing joint
       probability distributions (e.g., P(Q)).
       A BN comprises a network structure and parameters.

     Nodes = random variables.
     Edges = dependencies.
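
     As a small illustration of structure plus parameters, the sketch below encodes a hypothetical three-node chain A -> B -> C over binary variables and evaluates the joint distribution via the chain rule, i.e., as a product of one conditional probability per node. The network and all numbers are invented for this example.

```python
from itertools import product

# Structure: each node lists its parents (hypothetical chain A -> B -> C).
parents = {"A": (), "B": ("A",), "C": ("B",)}

# Parameters: P(node = 1 | parent values); numbers are illustrative only.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.1, (1,): 0.5},
}

def joint(assignment):
    """P(A, B, C) as the product of P(node | parents) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[x] for x in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

p_all_true = joint({"A": 1, "B": 1, "C": 1})  # 0.3 * 0.9 * 0.5
total = sum(joint(dict(zip("ABC", values)))
            for values in product((0, 1), repeat=3))
```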




Preliminaries (3) – Bayesian Networks (2)

       A BN comprises a network structure and parameters.




Preliminaries (4) – Bayesian Networks (3)

       Template-based BNs: templates and template factors [16].
          A template is a function X(α1,…,αk); each argument αi is a
          placeholder to be instantiated to obtain random variables.
          Example: the entity skeleton {p1, p2, p3} for Xperson yields
          Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)}.

          Template factors define probability distributions shared by all
          instantiated random variables of a given template.
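
          The instantiation step can be sketched as follows. The `Template` class, the `instantiate` helper, and the factor values are hypothetical names and numbers chosen for illustration; the point is only that one random variable is created per skeleton entry and that all of them share a single template factor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    name: str
    arity: int

def instantiate(template, skeleton):
    """Create one ground random variable per tuple in the entity skeleton."""
    for entities in skeleton:
        if len(entities) != template.arity:
            raise ValueError("skeleton entry does not match template arity")
    return [f"{template.name}({','.join(entities)})" for entities in skeleton]

x_person = Template("Xperson", 1)
person_skeleton = [("p1",), ("p2",), ("p3",)]
ground_variables = instantiate(x_person, person_skeleton)

# A single template factor (illustrative numbers) is shared by every
# ground variable instead of being stored once per entity.
person_factor = {True: 0.3, False: 0.7}
factors = {rv: person_factor for rv in ground_variables}
```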




          (Figure annotation: shared by all instantiations of XdirectedBy.)
Template-Based BN for Graph-structured Data

       We define a template for each:
          Attribute a, Xa(α1). Entity skeleton: all entities having attribute a.
          Class c, Xc(α1). Entity skeleton: all entities belonging to class c.
          Relation r, Xr(α1,α2). Entity skeleton: all pairs of “source” and “target”
          entities having relation r.
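
          A minimal sketch of how the three kinds of entity skeletons could be collected from a data graph given as (subject, predicate, object) triples. The triple encoding, the "type" predicate for class membership, and the toy entities are assumptions for illustration, not the authors' data format.

```python
from collections import defaultdict

# Hypothetical data graph as (subject, predicate, object) triples.
triples = [
    ("p1", "type", "person"),
    ("p2", "type", "person"),
    ("m1", "type", "movie"),
    ("m1", "title", "Some Title"),
    ("p1", "spouse", "p2"),
]

def entity_skeletons(triples, attributes, relations):
    """Collect skeletons for attribute, class, and relation templates."""
    attr, cls, rel = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == "type":
            cls[o].add(s)          # Xc: entities belonging to class c
        elif p in attributes:
            attr[p].add(s)         # Xa: entities having attribute a
        elif p in relations:
            rel[p].add((s, o))     # Xr: (source, target) pairs of relation r
    return attr, cls, rel

attr_sk, class_sk, rel_sk = entity_skeletons(
    triples, attributes={"title"}, relations={"spouse"})
```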




       [Figure: template for relation spouse, template for attribute title,
       template for class person.]

       Related: PRMs [5,6], …

       Advantages:
          Dynamic partitioning based on entity skeletons.
          Template representation is compact.

Integration of String Synopses (1)
       Related: fuzzy string matching [7,8], extraction operators [9,10], …

       Problem: Large sample space for attribute-based templates,
       with the entire n-gram space as Ω.

       To compactly represent Ω, which is a large set of strings, we
       use string synopses (e.g., [7,8,9,10]).

       Intuitively, for an attribute-based template, a string synopsis:
        a) Decides how to “compactly represent” Ω.
        b) Computes probabilities for strings given its compact space.

       Some synopses even allow “guessing” probabilities for unknown strings.


Integration of String Synopses (2)
       In this work, we use n-gram-based synopses [10].
       Consider, e.g., the top-k n-gram synopsis [10]:
           Compute n-gram counts and store only the top-k n-grams.
           Probabilities for known n-grams are exact.
           Probabilities for omitted n-grams are estimated via heuristics
           based on the known n-grams.

       [10] Selectivity estimation for extraction operators over text data.
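
       A minimal sketch of a top-k n-gram synopsis. The slides do not specify the heuristic for omitted n-grams; spreading the leftover probability mass uniformly over the omitted distinct n-grams is an assumption made here, and the class name and toy counts are hypothetical.

```python
from collections import Counter

class TopKSynopsis:
    """Keep exact counts for the k most frequent n-grams only."""

    def __init__(self, counts, k):
        self.total = sum(counts.values())
        self.top = dict(counts.most_common(k))
        # Summary of what was dropped, used by the fallback heuristic.
        self.omitted_distinct = len(counts) - len(self.top)
        self.omitted_mass = self.total - sum(self.top.values())

    def probability(self, ngram):
        if ngram in self.top:
            return self.top[ngram] / self.total  # exact for stored n-grams
        if self.omitted_distinct == 0:
            return 0.0
        # Assumed heuristic: uniform share of the omitted mass.
        return self.omitted_mass / self.omitted_distinct / self.total

# Hypothetical 1-gram counts for a title attribute.
counts = Counter({"the": 5, "dark": 3, "knight": 1, "rises": 1})
synopsis = TopKSynopsis(counts, k=2)
```

       With this fallback, an n-gram that was never observed receives the same "guessed" probability as an omitted one, which is one simple way to handle unknown strings.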




Learning of BN+ (1): Structure (1)
       Simplify structure via product approximation using trees [11,12].
       A similar technique has recently been applied for “Lightweight
       PRMs” [6].

       Fixed Structure Assumption:
        a) Two templates X1 and X2 are conditionally independent given their
           parents if they do not share a common entity in their skeletons.
        b) Each class template Xc has no parent.
        c) Each relation template Xr is independent of any class template Xc,
           given its parents.

       [11] Approximating discrete probability distributions with
       dependence trees.




Learning of BN+ (2): Structure (2)
       [Figure: template model.]

       Using a fixed structure allows decomposing structure learning:
           “Local” correlations between attribute and class templates
           (e.g., Xmovie → Xtitle).
           Reduce the network structure to capture only the “most important”
           correlations via a maximal spanning forest.
           Relation templates connect different trees.

       Overall, the network structure is determined by “overlapping” entity
       skeletons and the fixed structure assumption.
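
       The maximal spanning forest step can be sketched in the style of the dependence-tree approach of [11]: weight each pair of variables by its empirical mutual information and keep the heaviest edges that do not close a cycle. The toy columns, the variable names, and the `min_weight` cutoff are assumptions for illustration; the slides do not give the exact weighting used for BN+.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two aligned columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def max_spanning_forest(columns, min_weight=1e-9):
    """Kruskal-style forest that keeps the most informative edges."""
    edges = sorted(((mutual_information(columns[a], columns[b]), a, b)
                    for a, b in combinations(sorted(columns), 2)),
                   reverse=True)
    root = {v: v for v in columns}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]  # path compression
            v = root[v]
        return v

    forest = []
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if weight > min_weight and ra != rb:
            root[ra] = rb
            forest.append((a, b))
    return forest

# Hypothetical binary observations per entity.
columns = {
    "movie": [1, 1, 1, 0, 0, 0, 1, 0],
    "title": [1, 1, 1, 0, 0, 0, 1, 0],  # perfectly correlated with "movie"
    "year":  [0, 1, 0, 1, 0, 1, 1, 0],  # independent of both in this sample
}
forest = max_spanning_forest(columns)
```

       Because "year" carries no information about the other two columns in this sample, it stays isolated, so the result is a forest rather than a single tree.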


Learning of BN+ (3): Parameters




       Based on the learned structure, parameters are learned via
       collecting sufficient statistics (i.e., frequency counts).

       Speed up parameter learning via:
           Using queries to obtain sufficient statistics.
           Using caching during structure / parameter learning.
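
       A minimal sketch of parameter learning from frequency counts with cached count queries. The in-memory `DATA` tuple stands in for whatever store answers the count queries, and the function names and values are hypothetical.

```python
from functools import lru_cache

# Hypothetical observations: (class of entity, 1-gram in its title).
DATA = (
    ("movie", "dark"),
    ("movie", "dark"),
    ("movie", "knight"),
    ("person", "dark"),
)

@lru_cache(maxsize=None)
def count(parent_value, child_value=None):
    """Stand-in for a count query; results are cached across calls."""
    return sum(1 for parent, child in DATA
               if parent == parent_value
               and (child_value is None or child == child_value))

def conditional(child_value, parent_value):
    """Maximum-likelihood estimate of P(child | parent) from counts."""
    denominator = count(parent_value)
    return count(parent_value, child_value) / denominator if denominator else 0.0

p_dark_given_movie = conditional("dark", "movie")
p_knight_given_movie = conditional("knight", "movie")
```

       The second call reuses the cached count for "movie", which mirrors the idea of avoiding repeated statistics queries during learning.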




Estimating P(Q) using BN+ (1)

       At runtime, templates are instantiated to construct a
       query-specific ground BN.

       [Figure: the template model and the query yield a query-specific
       ground BN; each assignment is a string synopsis element.]
Estimating P(Q) using BN+ (2)
       Recall: sel(Q) = R(Q) * P(Q)
       Given a query-specific ground BN, we use inference to obtain the
       joint probability P(Q).
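
       A minimal sketch of such inference by enumeration: variables fixed by the query are treated as evidence, and any remaining variable is summed out. The two-node ground BN, its numbers, and the upper bound R(Q) are invented for illustration; BN+ itself may use a different inference procedure, and the string synopsis correction is not modeled here.

```python
from itertools import product

# Hypothetical ground BN for one query variable m: Xmovie(m) -> Xtitle(m).
PARENTS = {"Xmovie": (), "Xtitle": ("Xmovie",)}
# P(node = 1 | parent values); Xtitle = 1 means "title contains the keyword".
CPT = {
    "Xmovie": {(): 0.5},
    "Xtitle": {(0,): 0.1, (1,): 0.4},
}

def joint(assignment):
    """Probability of a full assignment via the chain rule."""
    p = 1.0
    for node, pa in PARENTS.items():
        p_one = CPT[node][tuple(assignment[x] for x in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

def query_probability(evidence):
    """P(evidence): sum the joint over all unconstrained variables."""
    hidden = [v for v in PARENTS if v not in evidence]
    return sum(joint({**evidence, **dict(zip(hidden, values))})
               for values in product((0, 1), repeat=len(hidden)))

p_both = query_probability({"Xmovie": 1, "Xtitle": 1})
p_title_only = query_probability({"Xtitle": 1})
estimated_size = 1000 * p_both  # sel(Q) = R(Q) * P(Q) with an assumed R(Q)
```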
       [Figure: query-specific ground BN, with a “correction” using the
       string synopsis.]
EVALUATION


Evaluation (1) – Setting
       Data: IMDB [14] and DBLP [15].
          IMDB featured more correlations than DBLP.
          Different results between DBLP and IMDB show the “relative benefit”.

       Queries: recent keyword search benchmarks [13,14]. We
       employed 54 DBLP queries and 46 IMDB queries.

       Systems:
          We used n-gram-based string synopses [10]:
              random samples of 1-grams,
              top-k 1-grams,
              stratified bloom filters on 1-grams.
          String predicates were integrated via (1) an independence (ind) or
          (2) a conditional independence (bn) assumption.

       [13] Spark2: Top-k keyword query in relational databases.
       [14] A framework for evaluating database keyword search strategies.



Evaluation (2) – Setting (2)

       Synopsis size:
          Overall synopsis size depends mainly on string synopsis size.
          Synopsis sizes ∈ {2, 4, 20, 40} MByte of memory.

       Metrics:
          Efficiency: selectivity estimation time.
          Effectiveness: multiplicative error [17].

       [17] Independence is good: Dependency-based histogram synopses
       for high-dimensional data.
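
       The multiplicative error is commonly computed as the ratio of the larger to the smaller of the true and estimated selectivity, so a perfect estimate scores 1 and over- and underestimation are penalized symmetrically. The sketch below follows that common definition; the `floor` parameter that guards against division by zero is an assumption added here, not something stated on the slides.

```python
def multiplicative_error(true_sel, estimated_sel, floor=1.0):
    """max(true, est) / min(true, est), with both values clamped to `floor`."""
    t = max(true_sel, floor)
    e = max(estimated_sel, floor)
    return max(t, e) / min(t, e)

# Hypothetical values: underestimating 40 results as 20 gives an error of 2.
error_under = multiplicative_error(40, 20)
error_over = multiplicative_error(20, 40)
```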


Evaluation (3) – Effectiveness – IMDB




Evaluation (4) – Effectiveness – DBLP




Evaluation (5) – Efficiency




CONCLUSION


Conclusion

      We tackled the problem of selectivity estimation for conjunctive,
      hybrid queries.
      We propose a template-based BN, which is well-suited for
      graph-structured data.
      For string predicates, we further propose the integration of
      string synopses into this model.
      Experiments showed that:
          If there are correlations between structured and unstructured data
          elements, the accuracy of selectivity estimation can be greatly
          improved via BN+.
          BN caused no overhead in terms of efficiency.




QUESTIONS


REFERENCES


References
     [1] C. Bizer et al. DBpedia – A Crystallization Point for the Web of Data. Journal
         of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7,
         pages 154–165, 2009.
     [2] http://webdatacommons.org/
     [3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for
         approximate query answering. In SIGMOD, pages 275–286, 1999.
     [4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity
         estimation. In SIGMOD, pages 205–216, 2006.
     [5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models.
         In SIGMOD, pages 461–472, 2001.
     [6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for
         selectivity estimation without independence assumptions. PVLDB, 4(11):852–863,
         2011.
     [7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates:
         Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.
     [8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In
         VLDB, pages 397–408, 2005.



References (2)
     [9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information
         extraction using datalog with embedded extraction predicates. In VLDB, pages
         1033–1044, 2007.
     [10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for
         extraction operators over text data. In ICDE, pages 685–696, 2011.
     [11] C. Chow and C. Liu. Approximating discrete probability distributions with
         dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
     [12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine
         Learning Research, 1:1–48, 2001.
     [13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword
         query in relational databases. IEEE Transactions on Knowledge and Data
         Engineering, 23(12):1763–1780, 2011.
     [14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword
         search strategies. In CIKM, pages 729–738, 2010.
     [15] http://knoesis.org/swetodblp/
     [16] D. Koller and N. Friedman. Probabilistic graphical models. MIT press, 2009.
     [17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good:
         Dependency-based histogram synopses for high-dimensional data. In SIGMOD,
         pages 199–210, 2001.
32                                                                          Institute of Applied Informatics and Formal
                                                                                            Description Methods (AIFB)

Weitere ähnliche Inhalte

Was ist angesagt?

RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization
RELIN: Relatedness and Informativeness-based Centrality for Entity SummarizationRELIN: Relatedness and Informativeness-based Centrality for Entity Summarization
RELIN: Relatedness and Informativeness-based Centrality for Entity SummarizationGong Cheng
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
IRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET Journal
 
Hyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyHyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyIJwest
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
A Graph-based Model for Multimodal Information Retrieval
A Graph-based Model for Multimodal Information RetrievalA Graph-based Model for Multimodal Information Retrieval
A Graph-based Model for Multimodal Information Retrievalserwah_S_gh
 
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...IOSR Journals
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...Hiroshi Ono
 
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Thanh Tran
 
Multilabel Classification by BCH Code and Random Forests
Multilabel Classification by BCH Code and Random ForestsMultilabel Classification by BCH Code and Random Forests
Multilabel Classification by BCH Code and Random ForestsIDES Editor
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...ijaia
 
A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...Seokhwan Kim
 
MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)STI International
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSPATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSijnlc
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015RIILP
 

Was ist angesagt? (20)

RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization
RELIN: Relatedness and Informativeness-based Centrality for Entity SummarizationRELIN: Relatedness and Informativeness-based Centrality for Entity Summarization
RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
IRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP Techniques
 
Hyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyHyponymy extraction of domain ontology
Hyponymy extraction of domain ontology
 
E43022023
E43022023E43022023
E43022023
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
A Graph-based Model for Multimodal Information Retrieval
A Graph-based Model for Multimodal Information RetrievalA Graph-based Model for Multimodal Information Retrieval
A Graph-based Model for Multimodal Information Retrieval
 
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categor...
 
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...
 
Multilabel Classification by BCH Code and Random Forests
Multilabel Classification by BCH Code and Random ForestsMultilabel Classification by BCH Code and Random Forests
Multilabel Classification by BCH Code and Random Forests
 
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
 
A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...
 
MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)MOST (Newsfromthefront 2010)
MOST (Newsfromthefront 2010)
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSPATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
 

Andere mochten auch

MOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPMOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPAnkita Jadhao
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs

  • 1. Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs Andreas Wagner, Veli Bicer, and Duc Thanh Tran EDBT/ICDT’13 Institute of Applied Informatics and Formal Description Methods (AIFB) KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu
  • 2. Introduction and Motivation; Selectivity Estimation for Text-Rich Data Graphs; Evaluation Results.
  • 3. INTRODUCTION & MOTIVATION
  • 4. Text-Rich Data-Graphs and Hybrid Queries. Increasing amount of semi-structured, text-rich data: structured data with unstructured texts (e.g., [1]), and unstructured data annotated with structured information (e.g., [2]). [1] DBpedia – A Crystallization Point for the Web of Data. [2] http://webdatacommons.org.
  • 5. Text-Rich Data-Graphs and Hybrid Queries (2). Focus of our work: conjunctive, hybrid queries. They combine structured query predicates (relation and attribute predicates over variables such as ?x and ?y) with unstructured query predicates, i.e., "string" (query) predicates over a "keyword".
  • 6. Problem Definition. Problem: efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q. Decompose the problem using probabilistic models [5]: sel(Q) = R(Q) * P(Q), where R(Q) is the upper-bound cardinality for the result set and P(Q) is the probability of Q having a non-empty result. Correlations between query predicates (data elements) make the approximation of P(Q) hard. Correlations make estimations relying on "independence assumptions" error-prone!
  • 7. Contributions. Previous work focuses either on structured query constraints (graph synopses [3], join samples [4], PRMs [5,6], …) or on unstructured query constraints (fuzzy string matching [7,8], extraction operators [9,10], …). We introduce a uniform model (BN+) for hybrid queries: an instance of a template-based BN, well-suited for graph-structured data, extended with string synopses for the estimation of string predicates.
  • 8. SELECTIVITY ESTIMATION FOR TEXT-RICH DATA GRAPHS
  • 9. Preliminaries (1) – Data and Query Model. [Figure: a data graph with entity nodes, class nodes, and attribute value nodes (bags of n-grams), connected by relation edges and attribute edges; a query with relation predicates and string predicates ("contains") over keyword nodes.]
  • 10. Preliminaries (2) – Bayesian Networks (1). Recall: sel(Q) = R(Q) * P(Q). A Bayesian network (BN) provides a means for capturing joint probability distributions (e.g., P(Q)). A BN comprises a network structure and parameters. Nodes = random variables. Edges = dependencies.
  • 11. Preliminaries (3) – Bayesian Networks (2). A BN comprises a network structure and parameters.
  • 12. Preliminaries (4) – Bayesian Networks (3). Template-based BNs: templates and template factors [16]. A template is a function X(α1,…,αk), where each argument αi is a placeholder to be instantiated to obtain random variables, e.g., Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)} with the entity skeleton {p1, p2, p3} for Xperson. Template factors define probability distributions shared by all instantiated random variables of a given template (e.g., shared by all instantiations of XdirectedBy).
  • 13. Template-Based BN for Graph-structured Data. We define a template for each: attribute a, Xa(α1), with entity skeleton: all entities having attribute a; class c, Xc(α1), with entity skeleton: all entities belonging to class c; relation r, Xr(α1,α2), with entity skeleton: all pairs of "source" and "target" entities having relation r. Examples: template for relation spouse, template for attribute title, template for class person. Advantages over PRMs [5,6]: dynamic partitioning based on entity skeletons; the template representation is compact.
  • 14. Integration of String Synopses (1). Problem: large sample space for attribute-based templates, with the entire n-gram space as Ω. To compactly represent Ω, which is a large set of strings, we use string synopses (e.g., fuzzy string matching [7,8], extraction operators [9,10]). Intuitively, for an attribute-based template a string synopsis: a) decides how to "compactly represent" Ω; b) computes probabilities for strings given its compact space. Some synopses can even "guess" probabilities for unknown strings.
  • 15. Integration of String Synopses (2). In this work, we use n-gram-based synopses [10]. Consider, e.g., the top-k n-gram synopsis [10]: compute n-gram counts and store only the top-k n-grams. Probabilities for known n-grams are exact. Omitted n-grams are estimated based on heuristics using the known n-grams.
  • 16. Learning of BN+ (1): Structure (1). Simplify the structure via product approximation using trees [11,12]. A similar technique has recently been applied for "Lightweight PRMs" [6]. Fixed structure assumption: a) two templates X1 and X2 are conditionally independent given their parents if they do not share a common entity in their skeletons; b) each class template Xc has no parent; c) each relation template Xr is independent of any class template Xc, given its parents.
  • 17. Learning of BN+ (2): Structure (2). Using a fixed structure allows us to decompose structure learning: capture "local" correlations between attributes/classes (e.g., Xmovie → Xtitle); reduce the network structure to capture only the "most important" correlations via a maximal spanning forest; relation templates connect different trees. Overall, the network structure is determined by "overlapping" entity skeletons and the fixed structure assumption.
  • 18. Learning of BN+ (3): Parameters. Based on the learned structure, parameters are learned by collecting sufficient statistics (i.e., frequency counts). Parameter learning is sped up by using queries to obtain sufficient statistics and by caching during structure/parameter learning.
  • 19. Estimating P(Q) using BN+ (1). At runtime, templates are instantiated to construct a query-specific ground BN (template model + query → query-specific ground BN). Each assignment is a string synopsis element.
  • 20. Estimating P(Q) using BN+ (2). Recall: sel(Q) = R(Q) * P(Q). Given a query-specific ground BN, we use inference to obtain the joint probability P(Q), with a "correction" using the string synopsis.
  • 21. EVALUATION
  • 22. Evaluation (1) – Setting. Data: IMDB [14] and DBLP [15]. IMDB featured more correlations than DBLP; the different results between DBLP and IMDB show the "relative benefit". Queries: recent keyword search benchmarks [13,14]; we employed 54 DBLP queries and 46 IMDB queries. Systems: we used n-gram-based string synopses [10]: random samples of 1-grams, top-k 1-grams, and stratified bloom filters on 1-grams. String predicates were integrated via (1) an independence (ind) or (2) a conditional independence (bn) assumption.
  • 23. Evaluation (2) – Setting (2). Synopsis size: the overall synopsis size depends mainly on the string synopsis size; synopsis sizes ∈ {2, 4, 20, 40} MByte of memory. Metrics: efficiency: selectivity estimation time; effectiveness: multiplicative error [17].
  • 24. Evaluation (3) – Effectiveness – IMDB
  • 25. Evaluation (4) – Effectiveness – DBLP
  • 26. Evaluation (5) – Efficiency
  • 27. CONCLUSION
  • 28. Conclusion. We tackled the problem of selectivity estimation for conjunctive, hybrid queries. We propose a template-based BN, which is well-suited for graph-structured data. For string predicates, we further propose the integration of string synopses into this model. Experiments showed that: if there are correlations between un-/structured data elements, the accuracy of selectivity estimation can be greatly improved via BN+; BN caused no overhead in terms of efficiency.
  • 29. QUESTIONS
  • 30. REFERENCES
  • 31. References. [1] C. Bizer et al. DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, pages 154–165, 2009. [2] http://webdatacommons.org/ [3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999. [4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006. [5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001. [6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011. [7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004. [8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.
  • 32. References (2). [9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007. [10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011. [11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968. [12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001. [13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011. [14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729–738, 2010. [15] http://knoesis.org/swetodblp/ [16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009. [17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.

Editor's Notes

  1. Queries contain predicates for both structured and unstructured query constraints. They resemble SPARQL queries that use the contains function inside FILTER. Unstructured query predicates = string predicates.
  2. However, effective estimation of P(Q) is important for query optimizers, which rely on accurate estimates for intermediate query results.
  3. Graphical representation of a set of conditional independencies. Expresses a factorization of the joint distribution.
  4. Given a template X(α1, …, αn), an entity skeleton of X is defined as E(α1, …, αn) ⊆ E(α1) × … × E(αn), where each E(αi) ⊆ V_E specifies all possible entity assignments to αi.
  5. In a relational context, data is stored in tables corresponding to relations captured by a conceptual model. Further, relation names are explicitly given in a query, stated in a FROM clause. Correspondingly, previous works [10, 23] employ a PRM to model selection predicates through random variables of the form XR.A, where R is a relational table and A is an attribute. For instance, XPerson.name = “Audrey” is a random variable capturing a selection on table Person where name equals “Audrey”. Analogously, join predicates are modeled as binary random variables that involve two explicitly specified tables. Further, schema information may be queried via class predicates, which are not supported in the relational setting.
  6. Inferencing costs are driven by two factors: (1) the dependency structure of a BN, and (2) sample space sizes. Existing works on PRMs have focused on the former, targeting a lightweight, tree-shaped BN structure [23]. The latter aspect, however, is crucial, as CPD sizes are a mere reflection of sample space sizes. Essentially, to support string predicates with all possible keywords, Ω(Xa) must capture all words and phrases that occur in a’s values. To compactly represent Ω, which is a large set of strings, we propose the use of string synopses such as Markov tables [4], histograms [13], or n-gram synopses [25].
  7. Then, the space Ba is reduced by using a decision criterion to dictate which n-grams ∈ Ba to include in a synopsis sample space Ω(Xa). That is, a synopsis space represents a subset of “important” n-grams. Note that n-gram synopses are most accurate, as each synopsis element represents exactly one n-gram ∈ Ba, in contrast to, e.g., histograms. Recent work has outlined several such decision criteria [25].
  8. Recently applied for PRMs [6]. We impose that strong correlations among templates only occur if they share some common entities: they need to “talk about the same things” (Def. 2-a). We argue that there is a causal dependence (independence) between a class and an attribute (relation) template (Def. 2-b, -c). In other words, assigning an entity to a given class causally affects the probability of its attribute values, which in turn influences the probability of observing a particular relation.
  9. Using a fixed structure allows structure learning to be decomposed: (1) First, learn “local” correlations between attribute/class templates. (2) Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest. (3) Connect the forest of trees via relation templates.
  10. Such a template-based approach has the merit of being compact. The number of templates is far less than the number of random variables in a ground BN. Structure and parameters (CPDs) are learned for templates only. At runtime, templates are instantiated with entities to construct a ground BN. For inferencing, a CPD learned for a template is shared among all random variables in the ground BN that instantiate that template.
  11. Missing Synopsis Values. Multiple Value Assignments.
  12. Both DBLP and IMDB hold text-rich attributes like name, label, or info. However, IMDB contains more text. The IMDB data exhibits strong correlations between and among text and structure. In particular, we noticed strong dependencies during structure learning between values of attributes such as label and info. Our hypothesis is that assuming independence hurts the quality of selectivity estimates on datasets that exhibit correlations. We also used DBLP, which, on the other hand, shows almost no such correlations. Using DBLP data, we expect accuracy differences to be less significant. Our workload includes queries containing [2, 11] predicates in total: [0, 4] relation, [1, 7] string, and [1, 4] class predicates (cf. Tab. 2).
  13. The key factor driving overall synopsis size was the employed string synopsis. Experiments were run on a Linux server with two Intel Xeon 5140 CPUs (each with 2 cores at 2.33GHz), 48GB RAM (with 16GB assigned to the JVM), and a RAID10 with IBM SAS 148GB 10k rpm disks. Before query execution, all OS caches were cleared.
  14. sel(Q) and ŝel(Q) denote the exact and estimated selectivity for Q, respectively. Intuitively, me represents the factor by which ŝel(Q) under-/overestimates sel(Q). Best accuracy results were achieved by ind∗ and bn∗ with a size ≥ 20 MByte. Further, the results confirmed our conjecture that the degree of data correlations has a significant impact on the accuracy differences between the ind∗ and bn∗ approaches. That is, a high degree of correlation in the IMDB dataset translated to large accuracy differences, while the improvement bn∗ could achieve over the baseline was small for DBLP. For the IMDB dataset, bn_sbf could reduce the errors of the ind_sbf approach by 93%, while improvements were much smaller given DBLP. We noticed that the error increases with the number of predicates. This effect is expected, as more query predicates (hence more “difficult” queries) lead to increasingly error-prone probability estimation. An interesting observation is that ind∗ outperformed bn∗ for some queries; see IMDB queries with 5 predicates and DBLP queries with 4 predicates (Fig. 4-b and -f). For instance, given IMDB query Q28, ind_top-k achieved 13% better results than bn_top-k. In such cases, string query predicates were translated to multiple values (1-grams) that are assigned to one single random variable.
  15. For instance, for DBLP queries with string predicates name and label, there are no significant correlations in our BN. Thus, the probabilities obtained by bn∗ were almost identical to those obtained by ind∗. However, while ind∗ led to fairly good estimates for the overall query load on DBLP, we could achieve more accurate selectivity computations via bn∗ for specific “correlated” queries. For instance, for DBLP query Q1 we could achieve a 10% better selectivity estimate.
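
Note 3 describes a BN as expressing a factorization of the joint distribution. A minimal sketch of that idea, assuming a hypothetical three-variable chain (class → attribute → relation) with made-up CPD values that do not come from the slides:

```python
# Hypothetical toy BN: C -> A -> R, all variables binary.
# All probabilities below are illustrative assumptions, not values from the talk.
p_c = {True: 0.3, False: 0.7}
# p_a_given_c[c][a] = P(A=a | C=c)
p_a_given_c = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}
# p_r_given_a[a][r] = P(R=r | A=a)
p_r_given_a = {True: {True: 0.5, False: 0.5},
               False: {True: 0.05, False: 0.95}}


def joint(c, a, r):
    """Joint probability via the BN factorization P(c) * P(a | c) * P(r | a)."""
    return p_c[c] * p_a_given_c[c][a] * p_r_given_a[a][r]


def marginal_r(r):
    """Marginal P(R=r), obtained by summing the joint over C and A."""
    return sum(joint(c, a, r) for c in (True, False) for a in (True, False))
```

The three small CPDs replace one table over all eight joint assignments, which is why the dependency structure matters for synopsis size.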
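
Note 7 describes reducing the n-gram space to a synopsis sample space through a decision criterion. A minimal sketch, assuming word-level n-grams and a simple frequency-based top-k criterion; the function names and sample values are hypothetical, and the criteria used in the actual work may differ:

```python
from collections import Counter


def ngrams(text, n=1):
    """Word-level n-grams of a string (lower-cased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_topk_synopsis(values, k, n=1):
    """Keep only the k most frequent n-grams of an attribute's values.

    Returns a dict mapping each retained n-gram to the fraction of
    attribute values that contain it.
    """
    counts = Counter()
    for value in values:
        counts.update(set(ngrams(value, n)))  # count each n-gram once per value
    return {gram: c / len(values) for gram, c in counts.most_common(k)}


names = ["Audrey Hepburn", "Audrey Tautou", "Katharine Hepburn", "Audrey Hepburn"]
synopsis = build_topk_synopsis(names, k=2)
```

n-grams outside the synopsis (here "tautou" and "katharine") are exactly the "missing synopsis values" case that an estimator would need to handle separately.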
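
Note 9 mentions reducing the network structure to the "most important" correlations via a maximal spanning forest. A minimal sketch, assuming edge weights are pairwise mutual information estimated from observed values (in the spirit of Chow and Liu [11]) and using Kruskal's algorithm; the column names and data are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two value sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


def max_spanning_forest(columns, min_weight=1e-9):
    """Kruskal's algorithm on edges sorted by descending mutual information.

    columns: dict mapping a variable name to its list of observed values.
    Edges with weight <= min_weight are dropped, so uncorrelated variables
    stay disconnected and the result is a forest rather than a single tree.
    """
    edges = sorted(((mutual_information(columns[a], columns[b]), a, b)
                    for a, b in combinations(sorted(columns), 2)), reverse=True)
    parent = {v: v for v in columns}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for weight, a, b in edges:
        if weight <= min_weight:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding the edge does not create a cycle
            parent[root_a] = root_b
            forest.append((a, b))
    return forest


data = {
    "class": ["Movie", "Movie", "Person", "Person"],
    "label": ["x", "x", "y", "y"],   # perfectly correlated with class
    "info": [0, 1, 0, 1],            # independent of both
}
forest = max_spanning_forest(data)
```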
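
Note 10 states that a CPD learned for a template is shared among all ground random variables that instantiate it. A minimal sketch of that sharing, assuming a hypothetical template name, entity identifiers, and CPD values:

```python
# One CPD per template; the values are illustrative assumptions only.
template_cpds = {
    "X_name": {"audrey": 0.02, "hepburn": 0.01},
}


def ground(template, entities):
    """Instantiate a template with entities to obtain ground random variables.

    Every ground variable points at the same CPD object as its template,
    so no per-entity parameters are stored.
    """
    shared_cpd = template_cpds[template]
    return {(template, entity): shared_cpd for entity in entities}


ground_bn = ground("X_name", ["p1", "p2", "p3"])
```

Storage for parameters grows with the number of templates, not with the number of entities in the ground BN.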
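
Note 14 describes the error measure me as the factor by which the estimated selectivity under- or overestimates the exact one. A minimal sketch, assuming me is the ratio of the larger to the smaller of the two values (the exact formula is not given in the notes, so this definition is an assumption):

```python
def multiplicative_error(sel_exact, sel_estimated):
    """Factor by which the estimate is off, regardless of direction.

    Assumed definition: max(exact, estimate) / min(exact, estimate).
    A value of 1.0 means a perfect estimate.
    """
    if sel_exact <= 0 or sel_estimated <= 0:
        raise ValueError("selectivities must be positive")
    return max(sel_exact, sel_estimated) / min(sel_exact, sel_estimated)
```

Under this definition an estimate of 25 for an exact selectivity of 100 and an estimate of 400 for the same query both yield an error of 4.0, so under- and overestimation are penalized symmetrically.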