Semantic Web Search
         Searching Documents and Semantic Data on the Web
         Presentation at Information Sciences Institute, USC
Semantic Search Group at the AIFB Institute
Thanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner,
Veli Bicer, Yongtao Ma and Rudi Studer.

http://sites.google.com/site/kimducthanh




    KIT: the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
Structure


         • Motivation
         • Previous and current work
         • Keyword query processing
         • Keyword query result ranking
         • Conclusion




Besides documents, there is an increasing amount of structured data on
          the Web such as RDF, RDFa and Linked Data! How can we leverage this
          for enhancing the search experience?

          MOTIVATION


RDFa
     …
     <div about="/alice/posts/trouble_with_bob">
         <h2 property="dc:title">The trouble with Bob</h2>
         <h3 property="dc:creator">Alice</h3>

                             Bob is a good friend of mine. We went to the same university, and
                             also shared an apartment in Berlin in 2008. The trouble with Bob is
                             that he takes much better photos than I do:

         <div about="http://example.com/bob/photos/sunset.jpg">
          <img src="http://example.com/bob/photos/sunset.jpg" />
          <span property="dc:title">Beautiful Sunset</span>
          by <span property="dc:creator">Bob</span>.
         </div>
     </div>
     …
     adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
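A minimal sketch of how the annotations in the snippet above turn into structured data. It is not a conforming RDFa processor: it only reads `about` and `property` attributes and treats the element text as a literal. The class name `RdfaSketch`, the helper `extract_triples` and the shortened `PAGE` string are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}  # elements without an end tag


class RdfaSketch(HTMLParser):
    """Collects (subject, property, literal) triples from about/property attributes."""

    def __init__(self):
        super().__init__()
        self.subjects = [""]    # stack of subjects; "" stands for the page itself
        self.pushed = []        # per open element: did it push a new subject?
        self.capture = None     # [subject, property, text parts] while inside a property element
        self.capture_depth = 0
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        has_about = "about" in attributes
        if has_about:
            self.subjects.append(attributes["about"])
        self.pushed.append(has_about)
        if "property" in attributes:
            self.capture = [self.subjects[-1], attributes["property"], []]
            self.capture_depth = len(self.pushed)

    def handle_data(self, data):
        if self.capture is not None:
            self.capture[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.pushed:
            return
        if self.capture is not None and len(self.pushed) == self.capture_depth:
            subject, prop, parts = self.capture
            self.triples.append((subject, prop, "".join(parts).strip()))
            self.capture = None
        if self.pushed.pop():
            self.subjects.pop()


def extract_triples(html):
    parser = RdfaSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


# Shortened copy of the slide's example (prose omitted).
PAGE = """
<div about="/alice/posts/trouble_with_bob">
  <h2 property="dc:title">The trouble with Bob</h2>
  <h3 property="dc:creator">Alice</h3>
  <div about="http://example.com/bob/photos/sunset.jpg">
    <img src="http://example.com/bob/photos/sunset.jpg" />
    <span property="dc:title">Beautiful Sunset</span>
    by <span property="dc:creator">Bob</span>.
  </div>
</div>
"""

for triple in extract_triples(PAGE):
    print(triple)
```

Each printed triple pairs a subject (the post or the photo) with a Dublin Core property and its text value, which is the kind of structured data the search approach in this talk builds on.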



RDFa

[Figure: rendered view of the post ("Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:"); two regions of the page are labeled "content".]

     adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Semantic Data




                                                                                                source: http://linkeddata.org/
Linked Data




     adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Addressing Complex Information Needs
     “Information about a friend of Alice, who shared an apartment with
      her in Berlin and knows someone in the field of Semantic Search
      working at KIT”.


     [Figure: the three parts of the information need, <friend of Alice>, <shared apartment in Berlin with Alice> and <knows someone in the field of Semantic Search working at KIT>, mapped onto a graph that combines the text of the post "trouble with bob" ("Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:") with data nodes: Alice, Bob, sunset.jpg, Beautiful Sunset, Thanh, Peter, KIT, FluidOps, Semantic Search, Germany, 2009, 34.]
Data Sources in SemanticSearch@AIFB Demo

      English Wikipedia

      Data from Linked Open Data
                 DBpedia
                 YAGO
                 Many more


      Live data from Data.gov (US Government)
                 E.g. live data about earthquakes


Search Intent Interpretation, Refinement and Exploration

     [Screenshot of the search interface, with its parts labeled: Keywords, Query Completions, Term Completions, Facets]
Result Inspection, Analysis and Browsing




OVERVIEW OF WORK


Search Concepts
      Hybrid Search: Structured queries combined with
       keywords on structured and unstructured data in
       possibly remote (Linked Data) sources
                                                                                                 BACK-END


      Query interpretation: Translation of keywords to
       hybrid queries

      Keyword search (translated hybrid query)
       combined with faceted search: starting with
       keywords and then iterative refinement process
       based on operations on facets
                                                                                                 FRONT-END
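A minimal sketch of the hybrid search concept above: structured triple patterns are matched first, then keyword constraints are checked against text attached to the bound resources. The tiny dataset (`TRIPLES`, `TEXTS`), the predicate names `knows` and `worksAt`, and both function names are hypothetical; the actual system works on indexes and possibly remote sources.

```python
def match_patterns(triples, patterns, binding=None):
    """Yield variable bindings that satisfy all triple patterns (variables start with '?')."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        matches = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    matches = False
                    break
            elif term != value:
                matches = False
                break
        if matches:
            yield from match_patterns(triples, rest, candidate)


def hybrid_search(triples, texts, patterns, keyword_constraints):
    """Keep structured matches whose bound resources also contain the required keywords."""
    results = []
    for binding in match_patterns(triples, patterns):
        if all(keyword.lower() in texts.get(binding[var], "").lower()
               for var, keywords in keyword_constraints.items()
               for keyword in keywords):
            results.append(binding)
    return results


# Hypothetical data, loosely following the Alice/Bob example of this talk.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "peter"),
    ("bob", "knows", "thanh"),
    ("peter", "knows", "thanh"),
    ("thanh", "worksAt", "KIT"),
]
TEXTS = {
    "bob": "We went to the same university, and also shared an apartment in Berlin in 2008.",
    "peter": "Works at FluidOps.",
}

query_patterns = [("alice", "knows", "?friend"),
                  ("?friend", "knows", "?person"),
                  ("?person", "worksAt", "KIT")]
query_keywords = {"?friend": ["apartment", "Berlin"]}
print(hybrid_search(TRIPLES, TEXTS, query_patterns, query_keywords))
```

Both Bob and Peter satisfy the structured part; only Bob also satisfies the keyword part, which is the point of combining the two.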

Previous and Current Work

      Semi-structured RDF data management [ISWC09] [TKDE12]
                 Inverted index for RDF data management
                 Structure index
      Linked data management [ESWC10][ISWC10] [ESWC11][ISWC11]
                 Keyword query routing to find relevant sources / relevant
                  combination of sources
                 “Explorative” query processing and adaptive query optimization
                 Combining local and remote Linked Data
      Search frontends [ICDE09][CIKM11] [SIGIR11][ISWC2011] [Dexa11]
                 Ontology and entity result summarization
                 Faceted and keyword search
      Current work: hybrid data search

KEYWORD QUERY PROCESSING
           [ICDE09]
DB-style Keyword Search
     Keyword query processing / translation
     [Figure: the information need “Articles of researchers at Stanford with Turing Award” is specified (Specification) as the keyword query “Stanford Article Turing Award”; this leads to a Set of Queries (1) Query 1, 2) Query 2) and, via Selection, to a Set of Results (1) Result 1, 2) Result 2).]

                     Keywords might produce a large number of
                      matching elements in the data graph
                     The data graph might be large in size
                     Search complexity increases substantially with
                      the size of the graph
                     Large number of results
Query Space
     [Figure: Schema graph (left) and Query space (right)]

           Main Idea
       Exploration on a much reduced summary model
        called query space
       Substantially decrease complexity
       Top-k procedure for graph exploration to compute
        only top-k results

       Query space: more compact representation of the data graph
       Online construction of query space out of schema graph
          Match keywords against labels of resources to find keyword elements
          Connect keyword elements with elements of schema to obtain query space
       Online top-k query graph exploration
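A minimal sketch of the online query space construction described on this slide, assuming a schema graph given as undirected edges between class names and a label lookup for resources. All names (`build_query_space`, `SCHEMA_EDGES`, `LABELS`, `RESOURCE_CLASS`) and the toy data are hypothetical; keyword matching is reduced to a substring test.

```python
def build_query_space(schema_edges, labels, resource_class, keywords):
    """Return (adjacency sets, keyword elements) of the schema graph plus matched elements."""
    graph = {}

    def connect(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for a, b in schema_edges:
        connect(a, b)

    keyword_elements = {}
    for keyword in keywords:
        # Step 1: match the keyword against labels to find keyword elements.
        hits = [element for element, label in labels.items()
                if keyword.lower() in label.lower()]
        keyword_elements[keyword] = hits
        # Step 2: connect matched data resources to their schema element.
        for element in hits:
            if element in resource_class:
                connect(element, resource_class[element])
    return graph, keyword_elements


# Hypothetical schema and data for the "Stanford Article Turing Award" example.
SCHEMA_EDGES = [("Article", "Researcher"), ("Researcher", "University"),
                ("Researcher", "Award")]
LABELS = {"Article": "Article", "stanford": "Stanford University",
          "turing": "Turing Award", "mit": "MIT"}
RESOURCE_CLASS = {"stanford": "University", "turing": "Award", "mit": "University"}

space, elements = build_query_space(SCHEMA_EDGES, LABELS, RESOURCE_CLASS,
                                    ["Stanford", "Article", "Turing Award"])
print(elements)
print(sorted(space))
```

Only resources that match a keyword are added, so the query space stays close to the size of the schema graph rather than the data graph.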
Top-k Query Graph Exploration on Query Space
[Figure: paths and their costs (left) and the resulting query graph (right)]




     •       Cost-directed exploration of Steiner graphs
     •       Explore all possible distinct paths starting from keyword elements
     •       At each exploration, take current path with lowest cost
     •       When a connecting element is found, merge paths to construct the query
             graph and add it to candidate list
     •       Top-k terminates when the highest cost in the candidate list (the cost of the k-th
             ranked query graph) is found to be lower than the lowest possible cost that can be
             achieved with paths in the queues yet to be explored
     •       Result: best k query interpretations to be shown to the user
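The exploration steps above can be sketched as follows. This is a simplified variant, not the algorithm of [ICDE09]: it keeps only the cheapest path per keyword and element (so each connecting element yields one candidate), uses edge costs, and scores a candidate by the sum of its path costs. The function name, the graph encoding and the toy data are assumptions.

```python
import heapq


def top_k_query_graphs(edges, keyword_elements, k):
    """edges: {node: [(neighbor, cost), ...]}; keyword_elements: {keyword: [start nodes]}."""
    keywords = list(keyword_elements)
    queue = [(0.0, i, start, (start,))
             for i, keyword in enumerate(keywords)
             for start in keyword_elements[keyword]]
    heapq.heapify(queue)
    best = {}        # (keyword index, node) -> (cost, path); cheapest path only
    candidates = []  # (total cost, connecting element, paths)
    while queue:
        if len(candidates) >= k:
            candidates.sort(key=lambda c: c[0])
            # Any future candidate contains a path at least as costly as the queue head,
            # so it cannot beat the current k-th candidate once this holds.
            if candidates[k - 1][0] <= queue[0][0]:
                break
        cost, i, node, path = heapq.heappop(queue)  # current path with lowest cost
        if (i, node) in best:
            continue
        best[(i, node)] = (cost, path)
        if all((j, node) in best for j in range(len(keywords))):
            # Connecting element found: merge one path per keyword into a candidate.
            paths = [best[(j, node)] for j in range(len(keywords))]
            candidates.append((sum(c for c, _ in paths), node, [p for _, p in paths]))
        for neighbor, edge_cost in edges.get(node, []):
            if (i, neighbor) not in best:
                heapq.heappush(queue, (cost + edge_cost, i, neighbor, path + (neighbor,)))
    candidates.sort(key=lambda c: c[0])
    return candidates[:k]


# Hypothetical query space with unit edge costs.
undirected = [("Article", "Researcher"), ("Researcher", "University"),
              ("Researcher", "Award"), ("stanford", "University"), ("turing", "Award")]
EDGES = {}
for a, b in undirected:
    EDGES.setdefault(a, []).append((b, 1.0))
    EDGES.setdefault(b, []).append((a, 1.0))

top = top_k_query_graphs(
    EDGES, {"Stanford": ["stanford"], "Article": ["Article"], "Turing Award": ["turing"]}, k=1)
print(top[0][0], top[0][1])
```

In this toy graph the cheapest connecting element is `Researcher`, which corresponds to the interpretation "articles of researchers at Stanford who received the Turing Award".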

Evaluation – Performance
     • Comparison with bidirectional search [V. Kacholia et al.] and
       search based on graph indexing (1000 BFS, 1000 METIS, 300
       BFS, 300 METIS in [H. He et al.])
     • Query computation + processing time until finding 10 answers
     • Outperforms bidirectional search by at least one order of magnitude
     • Performance comparable with indexing based approaches, but
       requires less space
     [Figure: Query Performance on DBLP Data. Log-scale chart (axis from 1 to 100000) over queries Q1 to Q10, comparing Our Solution, Bidirect, 1000 BFS, 1000 METIS, 300 BFS and 300 METIS.]
KEYWORD QUERY RESULT RANKING
           [CIKM11]
IR-based Ranking Schemes
      TF*IDF based:
                 Discover, EASE, SPARK
                 [Liu et al, SIGMOD06]

     $Score(JRT) = \sum_{r \in JRT} Score(r)$

     $Score(r) = \sum_{v \in r, Q} Weight(v, r) \cdot Weight(v, Q)$

     $Weight(v, r) = \frac{ntf}{ndl} \cdot nidf$

        $ntf = 1 + \ln(1 + \ln(tf))$

        $ndl = (1 - s) + s \cdot \frac{dl}{avdl}$

        $nidf = \ln \frac{N + 1}{df}$
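A small sketch of the TF*IDF weighting above, with ntf = 1 + ln(1 + ln(tf)), ndl = (1 - s) + s * dl / avdl and nidf = ln((N + 1) / df). The default s = 0.2, the choice Weight(v, Q) = query term frequency, and all function and parameter names are assumptions for illustration.

```python
import math


def term_weight(tf, dl, avdl, df, n_docs, s=0.2):
    """Weight(v, r) = ntf / ndl * nidf for one term v in one result tuple r."""
    if tf <= 0:
        return 0.0
    ntf = 1 + math.log(1 + math.log(tf))
    ndl = (1 - s) + s * dl / avdl
    nidf = math.log((n_docs + 1) / df)
    return ntf / ndl * nidf


def score_tuple(term_freqs, query_freqs, dl, avdl, doc_freqs, n_docs):
    """Score(r): sum over terms shared by r and Q of Weight(v, r) * Weight(v, Q)."""
    return sum(term_weight(term_freqs[v], dl, avdl, doc_freqs[v], n_docs) * qf
               for v, qf in query_freqs.items() if v in term_freqs)


def score_jrt(tuple_scores):
    """Score(JRT): sum of the scores of the tuples in the joined result tree."""
    return sum(tuple_scores)


# A term occurring once in an average-length tuple, found in 1 of 9 tuples.
print(term_weight(tf=1, dl=10, avdl=10, df=1, n_docs=9))
```

With tf = 1 and dl = avdl, both ntf and ndl equal 1, so the weight reduces to nidf.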


Proximity-based Ranking Schemes

      EASE, XRANK, BLINKS, etc.
      EASE
                 Proximity between a pair of keywords




                 Overall score of a JRT is aggregation on the score of keyword pairs
      XRANK
                 Ranking of XML documents / elements
                 Proximity here is defined based on w, the smallest text window in
                  n that contains all search keywords
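The window w used by XRANK-style proximity, the smallest text window that contains all search keywords, can be found with a sliding window over the tokens. A minimal sketch; the function name and the example tokens are assumptions, and how the window size is turned into a score is not shown here.

```python
from collections import Counter


def smallest_window(tokens, keywords):
    """Return (start, end) token indices of the shortest span with every keyword, or None."""
    needed = {keyword.lower() for keyword in keywords}
    if not needed:
        return None
    counts = Counter()
    covered = 0
    best = None
    left = 0
    for right, token in enumerate(tokens):
        word = token.lower()
        if word in needed:
            counts[word] += 1
            if counts[word] == 1:
                covered += 1
        # Shrink from the left while the window still contains all keywords.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            dropped = tokens[left].lower()
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return best


tokens = "bob shared an apartment in berlin with alice in berlin".split()
print(smallest_window(tokens, ["apartment", "Berlin"]))
```

The span covers "apartment in berlin"; a smaller window indicates that the keywords occur closer together.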



Prestige-based Ranking Schemes

      Based on graph structure, i.e. PageRank-like
       methods to determine node prestige
                 XRank [Guo et al, SIGMOD03]
                 ObjectRank [Balmin et al, VLDB04] : considers both
                  global ObjectRank and keyword-specific ObjectRank
                 The probability that edges of different types will be
                  visited is not uniform: requires manual fine-tuning to
                  set the importance of different types of edges
                 Naive: indegree
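A sketch contrasting the naive indegree measure with a PageRank-like computation in which edge types carry different weights. It is a generic illustration, not ObjectRank itself; the edge types, the weights in `TYPE_WEIGHT` (the manually tuned part), the damping factor and the toy graph are assumptions.

```python
def indegree(edges):
    """Naive prestige: number of incoming edges per node."""
    counts = {}
    for source, target, _ in edges:
        counts.setdefault(source, 0)
        counts[target] = counts.get(target, 0) + 1
    return counts


def weighted_pagerank(edges, type_weight, damping=0.85, iterations=50):
    """edges: list of (source, target, edge_type); rank flows in proportion to type weights."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_total = {n: 0.0 for n in nodes}
    for source, _, edge_type in edges:
        out_total[source] += type_weight[edge_type]
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank of nodes without outgoing weight is redistributed uniformly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for source, target, edge_type in edges:
            if out_total[source] > 0:
                share = type_weight[edge_type] / out_total[source]
                new_rank[target] += damping * rank[source] * share
        rank = new_rank
    return rank


# Hypothetical bibliographic graph: papers cite papers and link to their author.
EDGES = [("p1", "a1", "authoredBy"), ("p2", "a1", "authoredBy"),
         ("p2", "p1", "cites"), ("p3", "p1", "cites")]
TYPE_WEIGHT = {"cites": 0.7, "authoredBy": 0.3}

print(indegree(EDGES))
print(weighted_pagerank(EDGES, TYPE_WEIGHT))
```

Indegree treats all edges alike, while the weighted variant lets "cites" edges transfer more prestige than "authoredBy" edges; choosing these weights is the manual tuning the slide refers to.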




Introduction
      A recent study shows that the effectiveness of most
       existing work is below expectations (Coffman and Weaver,
           CIKM 2010)
      Problems:
               Proximity does not directly model relevance
               Ad-hoc TF/IDF normalization does not capture the nature
                of keyword search results well (small document length,
                skewed word occurrence statistics)
               PageRank not directly applicable




Overview of the Approach

      Keyword queries are short and ambiguous, while data
       (and results) provide rich structure information
       that can be exploited!
      Principled approach to relevance based on
       language models and pseudo-relevance feedback (PRF): estimate the model from
       content and structure of PRF results
      Adopt relevance model as a fine-grained model
       representing both content and structure of
       relevant documents and queries (relevance class)


Relevance Models [SIGIR 01]
      Explicit notion of relevance
      Queries and documents are samples from a latent
       representation space, i.e. the relevance model underlying
       the information need




Relevance Models
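     [Figure: a model M generates the query terms q1 = "Israeli", q2 = "Palestinian", q3 = "raids"; the question is which word w is likely to be sampled next.]

     Sample probabilities:
        P(w|Q)   w
        .077     palestinian
        .055     israel
        .034     jerusalem
        .033     protest
        .027     raid
        .011     clash
        .010     bank
        .010     west
        .010     troop
        …

     $P(w \mid R) \approx P(w \mid q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$

     $P(w, q_1 \ldots q_k) = \sum_{M \in U_M} P(M) \, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$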
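A minimal sketch of the relevance model estimate: the joint probability P(w, q1...qk) sums P(M) P(w|M) times the product of P(qi|M) over a set of models M, and dividing by P(q1...qk) gives P(w|q1...qk). The uniform prior, the unsmoothed toy models and the function name are assumptions; real estimates would use smoothed models of the feedback results.

```python
def relevance_model(query, models, prior=None):
    """models: {name: {word: P(word|M)}}. Returns {word: P(word | q1..qk)}."""
    prior = prior or {name: 1.0 / len(models) for name in models}
    joint = {}
    for name, model in models.items():
        # Likelihood of the whole query under this model: product of P(q_i | M).
        query_likelihood = 1.0
        for q in query:
            query_likelihood *= model.get(q, 0.0)
        for word, p in model.items():
            joint[word] = joint.get(word, 0.0) + prior[name] * p * query_likelihood
    # Because each model sums to 1 over words, the total equals P(q1..qk).
    total = sum(joint.values())
    if total == 0:
        return {}
    return {word: value / total for word, value in joint.items()}


# Hypothetical unsmoothed models of two feedback results.
MODELS = {
    "m1": {"israeli": 0.5, "raid": 0.5},
    "m2": {"cat": 1.0},
}
print(relevance_model(["israeli"], MODELS))
```

Words that co-occur with the query terms in likely models (here "raid") receive probability mass even though they are not in the query.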


Ranking with Relevance Models

      Probability ranking principle
        $\frac{P(D \mid R)}{P(D \mid N)} \approx \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}$

      See relevance model as query expansion
                 Rank of document is based on the cross-entropy of its
                  model and the relevance model

        $H(R \,\|\, D) = -\sum_{w \in V} P(w \mid R) \log P(w \mid D)$

        $P(w \mid D) = \lambda_D \frac{n(w, D)}{|D|} + (1 - \lambda_D) \, P(w \mid C)$
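A sketch of ranking by cross-entropy between a relevance model and smoothed document models, where P(w|D) interpolates the relative frequency n(w, D) / |D| with the collection model P(w|C). The value 0.8 for the interpolation weight, the toy documents and all names are assumptions; the collection model is assumed to cover every word of the relevance model so that the logarithm is defined.

```python
import math
from collections import Counter


def document_model(tokens, collection_model, lam=0.8):
    """Return a function w -> P(w|D), smoothed with the collection model."""
    counts = Counter(tokens)
    size = len(tokens)
    return lambda w: lam * counts[w] / size + (1 - lam) * collection_model.get(w, 0.0)


def cross_entropy(relevance_model, p_doc):
    """H(R || D) = -sum_w P(w|R) log P(w|D); lower means a better match."""
    return -sum(p * math.log(p_doc(w)) for w, p in relevance_model.items() if p > 0)


# Hypothetical documents and relevance model.
DOCS = {
    "d1": "israeli raid palestinian protest".split(),
    "d2": "cat sat on mat".split(),
}
all_tokens = [t for tokens in DOCS.values() for t in tokens]
COLLECTION = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}
RM = {"israeli": 0.5, "raid": 0.5}

scores = {name: cross_entropy(RM, document_model(tokens, COLLECTION))
          for name, tokens in DOCS.items()}
ranked = sorted(scores, key=scores.get)  # ascending cross-entropy
print(ranked, scores)
```

Smoothing keeps P(w|D) above zero for words missing from a document, so every document gets a finite score.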

Edge-Specific Relevance Models
            Given a query Q={q1,…,qn}, a set of PRF resources is retrieved from an inverted
             keyword index:
                       E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4, m2, p2m2, m3}
            Based on the PRF results, an edge-specific relevance model is constructed for each unique
             edge e based on:




Edge Specific Resource Models

      Edge-specific resource model:


                 Smoothing with model for the entire resource
      The score of a resource is calculated based on the cross-entropy
       of edge-specific RM and edge-specific ResM:




                 Alpha controls the importance of edges




Ranking JRTs
      Ranking aggregated JRTs:
                 The cross entropy between the edge-specific RM (Query Model) and
                  geometric mean of combined edge-specific ResM:




      The proposed ranking function is monotonic with respect to the
       individual resource scores (a necessary property for using top-k
       algorithms)
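A numerical illustration of why combining resource models by a geometric mean gives a monotonic score. Assuming one word distribution per resource and an unnormalized geometric mean (both simplifications; the edge-specific models and the alpha weights of the slides are left out), the logarithm of the geometric mean is the mean of the logarithms, so the JRT cross-entropy equals the average of the per-resource cross-entropies. All names and values are hypothetical.

```python
import math


def cross_entropy(rm, model):
    """H(RM || model) = -sum_w P(w|RM) log P(w|model)."""
    return -sum(p * math.log(model[w]) for w, p in rm.items() if p > 0)


def geometric_mean_model(models):
    """Word-wise geometric mean of the resource models (not renormalized)."""
    n = len(models)
    return {w: math.prod(m[w] for m in models) ** (1.0 / n) for w in models[0]}


def jrt_score(rm, resource_models):
    """Higher is better: negative cross-entropy against the combined model."""
    return -cross_entropy(rm, geometric_mean_model(resource_models))


RM = {"a": 0.6, "b": 0.4}
R1 = {"a": 0.5, "b": 0.5}
R2 = {"a": 0.9, "b": 0.1}
R2_BETTER = {"a": 0.6, "b": 0.4}  # closer to RM than R2

combined = jrt_score(RM, [R1, R2])
average = -(cross_entropy(RM, R1) + cross_entropy(RM, R2)) / 2
print(combined, average)
print(jrt_score(RM, [R1, R2_BETTER]) > combined)
```

Because the combined score is an average of individual scores, improving any single resource score can only improve the JRT score, which is the property top-k algorithms rely on.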


Experiments
      Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases
      Queries: 50 queries for each dataset including “TREC style” queries and
       “single resource” queries
      Metrics: Three metrics are used: (1) the number of top-1 relevant
       results, (2) Reciprocal rank and (3) Mean Average Precision (MAP)
      Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK,
       CoveredDensity (TF-IDF).
      RM-S: Our approach
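For reference, the three metrics can be computed as in the following sketch, using their standard definitions. The function names and the example ranking are assumptions; the evaluation framework used in the experiments may differ in details such as cut-offs.

```python
def top1_relevant(ranked, relevant):
    """1 if the first result is relevant, else 0 (summed over queries for metric 1)."""
    return 1 if ranked and ranked[0] in relevant else 0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked results, set of relevant results), one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


ranking = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(top1_relevant(ranking, relevant),
      reciprocal_rank(ranking, relevant),
      average_precision(ranking, relevant))
```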




Experiments – Single Resource Queries
     -       Proximity-based approaches perform well
     -       Minimizing compactness results in single resources being ranked high
     -       TF-IDF normalization not as aggressive, not as effective




                                          Reciprocal rank for single resource queries
Experiments – TREC-style Queries
     -       TF-IDF based approaches performed better
     -       Our approach outperformed existing approaches also in this category,
             providing more stable performance over the entire precision-recall curve




                                         Precision-recall for TREC-style queries on Wikipedia
Experiment – All Queries

     - Our approach consistently shows superior performance
     - Encouraging, given that this is the first study that uses a general
       framework for evaluating keyword search ranking




                                                                    MAP scores for all queries
Conclusions / Future Work

      Front-to-backend work on using structured data for
       enhancing the search experience
      From backend data management to frontend search
       concepts
      Current work / future directions
                 Managing hybrid data
                 Hybrid query processing / interfaces
                 Ranking hybrid results




References (1)
            Günter Ladwig, Thanh Tran
             SIHJoin: Querying Remote and Local Linked Data
             In 8th Extended Semantic Web Conference (ESWC'11). Heraklion, Greece, June, 2011 (full
             research paper, 23% acceptance rate).
            Thanh Tran, Lei Zhang, Rudi Studer
             Summary Models for Routing Keywords to Linked Data Sources
             In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
             China, November, 2010 (full research paper, 20% acceptance rate).
            Günter Ladwig, Thanh Tran
             Linked Data Query Processing Strategies
             In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
             China, November, 2010 (full research paper, 20% acceptance rate).
            Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer
             Ontology-based Interpretation of Keywords for Semantic Search
             In Proceedings of the 6th International Semantic Web Conference (ISWC'07), pp. 523-
             536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).




References (2)
            Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
             Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
             In Proceedings of the 25th International Conference on Data Engineering
             (ICDE'09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).
            Haofen Wang, Duc Thanh Tran, Chang Liu
             CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
             In Proceedings of the 17th Conference on Information and Knowledge Management
             (CIKM'08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).
            Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu,
             Yue Pan
             Semplore: A Scalable IR Approach to Search the Web of Data
             In Journal of Web Semantics, 2009 (Impact Factor 3.4).
            Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu
             Snippet Generation for Semantic Web Search Engines
             In Proceedings of the 3rd Asian Semantic Web Conference (ASWC'08). December
             2008 (full research paper, 31% acceptance rate).
            Thanh Tran, Günter Ladwig
             Structure Index for RDF
             In SemData@VLDB Workshop (SemData'10). Singapore, September, 2010.

Thanks!




                                                                                                                            Tran Duc Thanh
                                                                                                                     ducthanh.tran@kit.edu
                                                                                                 http://sites.google.com/site/kimducthanh/


Backups




        Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search
         over relational databases. In ICDE, pages 5-16.
        Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and
         opportunities. In VLDB, page 1368.
        Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective XML keyword search with relevance
         oriented ranking. In ICDE, pages 517-528.
        Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching
         and Browsing in Databases using BANKS. In ICDE, pages 431-440.
        Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using
         Relevance Models. In CIKM.
        Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S. (2009):
         DBpedia - A crystallization point for the Web of Data. J. Web Sem. (WS) 7(3):154-165
        Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data
         graphs. PVLDB, 1(1):1189-1204.
        Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost
         connected trees in databases. In ICDE, pages 836-845.
        Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data
         graphs. In SIGMOD, pages 927-940.
        Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search
         over XML documents. In SIGMOD.
        He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In
         SIGMOD, pages 305-316.
        Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in
         databases. ACM Trans. Database Syst., 33(1):1-40

        Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases.
         In VLDB.
        Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005).
         Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.
        Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword
         proximity search. In PODS, pages 173-182.
        Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native Keyword
         Search Databases. In CIKM.
        Lavrenko, V., Croft, W. B. (2001): Relevance-Based Language Models. In SIGIR, pages 120-127.
        Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search
         method for unstructured, semi-structured and structured data. In SIGMOD.
        Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational
         databases. In SIGMOD, pages 563-574.
        Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational
         databases. In SIGMOD, pages 115-126.
        Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS. In SIGMOD,
         pages 681-694.
        Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across
         heterogeneous relational databases. In ICDE, pages 346-355.
        Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the Search
         Process. In Journal of Web Semantics, 2011.
        Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph Candidates
         for Efficient Keyword Search on RDF. In ICDE.
        Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient IR-style keyword search over
         relational databases. In VLDB.


Editor's notes

  1. Web data: text, Linked Data, semi-structured RDF, and hybrid data, all of which can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more, so we build a Semantic Web search engine. Complex information needs are addressed by exploiting Web data: the information need is interpreted as a set of constraints, some matched against structured data and some against text.
  2. To give an impression of where we are towards accomplishing this goal: demo first. Our current system supports the process of addressing complex information needs. It starts with keyword search, interpreting the query intent, followed by browsing, exploration, and refinement of the result set via faceted search.
  3. Upon selecting a specific result: resource-based navigation (instead of facet-based).
  4. TF-IDF is used to deal with the textual part of the data. We propose to also exploit the structure of keyword search results. Proximity-based ranking employs minimal-distance heuristics to maximize the structural compactness of results: when a JRT is more compact, it is assumed to be more meaningful and relevant. Intuition: keywords specified by the user are closely related and thus should be connected over relatively short paths. Compactness is measured in terms of the length of paths between nodes, i.e. the proximity; the longer the paths, the less relevant the overall result. ni and nj are nodes in the graph. sim(ni,nj) denotes the compactness between any two nodes. sim(ki,kj) denotes the compactness between two keywords (taking into account the compactness of all pairs of nodes matching the two keywords), where Cki denotes the set of all nodes that match ki. The overall score of a JRT is an aggregation of the scores of its
  5. Schemas = summaries
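
The proximity heuristic described in note 4 can be illustrated with a minimal sketch. It is not the formula used by EASE or any other cited system. The decay 1 / (1 + d) for sim(ni,nj), the averaging over matching node pairs for sim(ki,kj), the summation over keyword pairs, and all function and variable names are assumptions made for illustration only.

```python
from collections import deque
from itertools import combinations


def path_lengths(graph, source):
    """BFS hop counts from source in an unweighted graph (adjacency dict)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def node_sim(graph, ni, nj):
    """Compactness of two nodes. Assumption: decays as 1 / (1 + path length)."""
    d = path_lengths(graph, ni).get(nj)
    return 0.0 if d is None else 1.0 / (1 + d)


def keyword_sim(graph, matches, ki, kj):
    """Compactness of two keywords. Assumption: mean over all matching node pairs."""
    pairs = [(a, b) for a in matches[ki] for b in matches[kj]]
    if not pairs:
        return 0.0
    return sum(node_sim(graph, a, b) for a, b in pairs) / len(pairs)


def jrt_score(graph, matches, keywords):
    """Score of a result tree. Assumption: sum over all keyword pairs."""
    return sum(keyword_sim(graph, matches, ki, kj)
               for ki, kj in combinations(keywords, 2))


# Two hypothetical result trees for the keywords "alice" and "bob".
matches = {"alice": ["alice"], "bob": ["bob"]}
compact = {"alice": ["post"], "post": ["alice", "bob"], "bob": ["post"]}
loose = {"alice": ["a"], "a": ["alice", "b"], "b": ["a", "c"],
         "c": ["b", "bob"], "bob": ["c"]}

compact_score = jrt_score(compact, matches, ["alice", "bob"])
loose_score = jrt_score(loose, matches, ["alice", "bob"])
```

Under these assumptions the tree that connects the two keyword nodes over two hops scores higher than the one that needs four, which is the behaviour the note attributes to proximity-based ranking.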