SlideShare ist ein Scribd-Unternehmen logo
1 von 77
Semantic Search
   Focus: IR on Structured Data

8th European Summer School on Information Retrieval
Duc Thanh Tran
Institute AIFB, KIT, Germany
Tran@aifb.uni-karlsruhe.de
http://sites.google.com/site/kimducthanh
Agenda

 Why Semantic Search?
 What is Semantic Search?
 A Semantic Search direction - IR on structured data
   Matching
   Ranking
 Conclusions
Why Semantic Search?
Why Semantic Search?
                                              Many of these queries would not
     Solve main classes of queries, e.g. navigational
                                              be asked by users, who learned
                                              over time what search
     But long tail queries…                  technology can and can not do.

       “teacher math class Goethe”
                                           These queries require precise
     Several problematic cases            understanding of the underlying
                                           information needs and data, and
       Ambiguous / imprecise        queries
                                           aggregating results.
         “Paris Hilton”
         “strong adventures people from Germany”
       Specific, complex queries (factual, aggregated)
         “32 year old computer scientist living in Karlsruhe”
         “digital camera under 300 dollars produced by canon in 1992”




4
Why Semantic Search?
     Towards a Semantic Web
     Large number of Web data vocabularies published in
     RDFS and OWL
       Schema.org
       Dbpedia ontology
     Large amounts of data published in RDF / RDFa
       Linked Data
       Embedded metadata       Semantics captured by
                                taxonomies, ontologies, structur
                                ed metadata can help to obtain
                                precise understanding, to
                                aggregate information from
                                different sources, and to
                                retrieve relevant results!

5
Vocabularies
     DBpedia ontology




         from : http://wiki.dbpedia.org/Ontology




6
Vocabularies
    DBpedia   [Bizer et al, JWS02]




                                     from : http://wiki.dbpedia.org/Ontology
7
from : http://wiki.dbpedia.org/Ontology




8
Structured Data
       Resource Description Framework (RDF)

     Each resource (thing, entity) is identified by a URI
     Entity descriptions as sets of facts
       Triples of (subject, predicate, object)
     A set of triples is published together in an RDF
     document (forming an RDF graph)




                                adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
9
Structured Data
     Linked Data




                       source: http://linkeddata.org/
10
Metadata
     RDFa on the rise


              510% increase
              between
              March, 2009 and
              October, 2010




     Percentage of URLs with embedded metadata in various formats
11                   from : http://www.slideshare.net/pmika/semtech-2011-semantic-search-tutorial
Metadata
        RDFa
     …
     <div about="/alice/posts/trouble_with_bob">
         <h2 property="dc:title">The trouble with Bob</h2>
         <h3 property="dc:creator">Alice</h3>

             Bob is a good friend of mine. We went to the same university, and
             also shared an apartment in Berlin in 2008. The trouble with Bob is
             that he takes much better photos than I do:

         <div about="http://example.com/bob/photos/sunset.jpg">
          <img src="http://example.com/bob/photos/sunset.jpg" />
          <span property="dc:title">Beautiful Sunset</span>
          by <span property="dc:creator">Bob</span>.
         </div>
     </div>
     …
                                     adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/




12
Metadata
       RDFa


Bob is a good friend of mine. We         content
went to the same university, and
also shared an apartment in Berlin
in 2008. The trouble with Bob is
that he takes much better photos
than I do:
                                 content




                                       adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/

13
What is Semantic Search?
Structure
 Semantics
 Search tasks
   Document, data, social media, multimedia
 Core search problems
 Semantic search exploits semantics
   For search tasks
   For search problems
 Many Semantic Search directions
Semantics
      Semantics is concerned with the meaning of query,
       data and background knowledge
      Distributional hypothesis / statistical semantics
        “a word is characterized by the company it keeps”
        Based on word patterns (co-occurrence frequency of the
        context words near a given target word)
      Explicit semantics
        Various explicit representations of meaning




16
Explicit Semantics
      Linguistic models: relationships among terms
        Term relationships: synonymous, hyponymous, broader, narrower…
        Taxonomies, thesauri, dictionaries of entity names
        Examples: WordNet, Roget’s Thesaurus
      Conceptual models: relationships among classes of objects
        Abstract and conceptual representation of data
        Concepts, RDFS classes, associations, relationships, attributes…
        Terminological part (T-Box) of ontologies, DB schema e.g. relational
         model
        Examples: SUMO, DBpedia
      Structured data: relationships among objects
        Description of concrete objects
        Tuples, instances, entities, RDF resources, foreign keys, relationships,
         attributes,…
        Assertional part of ontologies (A-Box), DB instance
        Examples: Linked Data, metadata
17
Search tasks – document retrieval
      Search on textual data (documents, Web pages)
      Mainly studied in the IR community
      Data and queries
        Term-based representation
      Search algorithms
        Retrieve documents relevant for query keywords
        Match query term against terms / content of documents
        Leverage statistical semantics for dealing with
         ambiguity and for ranking
        Optimized, work well for navigational, topical search
        Less so for complex information needs
        Web scale


18
Search tasks – data retrieval
      Focus on structured data and retrieve direct answers
      Data and queries
        Structured models
      Search algorithms
        Retrieve direct answers that match structured queries
        Structure matching: term / content based relevance
         less the focus, but structure filtering based on joins
        Use relational semantics in structured data
        Optimized for complex structured information needs /
         queries, less so for text-based relevance
        More complex processing  efficiency, scalability


19
Search tasks
         Addressing complex information needs
                  Combination of data and document retrieval

      Movies directed by Stephen       Structured data with
       Spielberg where synopsis         textual attribute values
       mentions dinosaurs.              (content, description)
      Publications authored by 32
       year old computer scientist      Documents with
       living in Karlsruhe, which       metadata
       mention Semantic Search
      Information about a friend of
       Alice, who shared an apartment
       with her in Berlin and knows
       someone in the field of
       Semantic Search working at KIT
20
Search tasks
        e.g. combination of data and document retrieval
  “Information about a friend of Alice, who shared an apartment
     with her in Berlin and knows someone in the field of Semantic
     Search working at KIT”.


                  <shared apartment in Berlin with Alice>           <knows someone in
                                                                    the field of Semantic
                         <friend of Alice>                          Search working at KIT>
                 trouble with bob                           FluidOps                     34
                                                                         Peter
                                         sunset.jpg
                 Bob is a good friend
                                                        Beautiful
                 of mine. We went to                    Sunset
                 the same                                            Germany     Semantic
                                         Alice                                   Search
                 university, and also
                 shared an
                 apartment in Berlin
                 in 2008. The trouble                                          Germany    2009
                                                  Bob
                 with Bob is that he                           Thanh
                 takes much better
                 photos than I do:                                      KIT

21
Core search problems
        Term ambiguity
 apartment shared Berlin Alice                                   knows someone works at KIT



                     trouble with bob                           FluidOps                    34
                                                                            Peter
                                             sunset.jpg
                     Bob is a good friend
                                                            Beautiful
                     of mine. We went to                    Sunset
                     the same university,                               Germany     Semantic
                                             Alice                                  Search
                     and also shared an
                     apartment in Berlin
                     in 2008. The trouble
                     with Bob is that he                                          Germany    2009
                                                      Bob
                     takes much better                             Thanh
                     photos than I do:
                                                                           KIT




                                        Syntax / Semantic


               Is “BerllinNN” same as “Berlin”?      What is meant by “KIT”?
22
Core search problems
         Structure ambiguity
 apartment shared Berlin Alice                      knows someone works at KIT



                     trouble with bob                           FluidOps                    34
                                                                            Peter
                                            sunset.jpg
                     Bob is a good friend                   Beautiful
                     of mine. We went to                    Sunset                  Semantic
                     the same university,                               Germany
                                            Alice                                   Search
                     and also shared an
                     apartment in Berlin
                     in 2008. The trouble
                                                                                  Germany    2009
                     with Bob is that he              Bob
                                                                   Thanh
                     takes much better
                     photos than I do:
                                                                           KIT

                                                            Explicit semantics in
                                                            structured data reduces
                                                            structure ambiguity

     What is the connection between “Berlin”        What is the relationship between
     and “Alice”?                                   “someone” and KIT?
23
Core search problems
          Content ambiguity
 apartment shared Berlin Alice                                         knows someone works at KIT



                       trouble with bob                               FluidOps                    34
                                                                                  Peter
                                                   sunset.jpg
                       Bob is a good friend
                                                                  Beautiful
                       of mine. We went to                        Sunset
                       the same                                               Germany     Semantic
                                                   Alice                                  Search
                       university, and also
                       shared an
                       apartment in Berlin
                       in 2008. The trouble                                             Germany    2009
                                                            Bob
                       with Bob is that he                               Thanh
                       takes much better
                       photos than I do:                                         KIT

                                                                  Understanding of
                                                                  term and structure
                                                                  in content helps!

     Is the document about Berlin (as a city)?        Is the graph about “apartment shared Berlin
     Is the element’s content / label about KIT?      Alice knows someone works at KIT”?
24
Core search problems
      Dealing with ambiguities: matching and ranking
 2 scenarios:
                                              Ranked: ambiguities
 ambiguity (IR) vs.    Exact
 no ambiguity (DB)                            in query and data
                           Complete          representation 
                           Sound             results cannot be
     Query                                    guaranteed to exactly
                                              match the query (i.e.
                         Approximate         multiple interpretations
                            Not complete
       Matching




                                              lead to multiple non-
                            Not sound        equivalent matches).
                            Both the above

                                              Ambiguities at level of
                         Ranked              elements (term, content)
      Data                  Matching +       and relationships
                             ranking          between elements
                            Top-k            (structure)

     Matching mainly focuses on efficiency of computing matches
25   whereas ranking deals with degree of matching (relevance)!
Search
  “Information about a friend of Alice, who shared an apartment with her in Berlin
     and knows someone in the field of Semantic Search working at KIT”.



    Term: which Berlin?                                          Distributional semantics /
    Content: which documents are about Berlin?                   statistical reasoning over
                                                                  topic models, language
                                                                  models




                                                  Bob is a good friend of mine. We
                                                  went to the same university, and
                                                  also shared an apartment in Berlin
                                                  in 2008. The trouble with Bob is
                                                  that he takes much better photos
                                                  than I do:


26
Semantic Search
  “Information about a friend of Alice, who shared an apartment with her in Berlin
     and knows someone in the field of Semantic Search working at KIT”.



      Structure: friend of, knows, shares                      Relational semantics of
                                                                structured data in
                                                                various datasets




                                                                Bob
                                                Bob is a good friend of mine. We
                                                went to the same university, and
                                                also shared an apartment in Berlin
                                                in 2008. The trouble with Bob is
                                                that he takes much better photos
                                                than I do:


27
Semantic Search
  “Information about a friend of Alice, who shared an apartment with her in Berlin
      and knows someone in the field of Semantic Search working at KIT”.
      Term: creator is a subclass of person                    Semantics captured in
                                                                conceptual models, e.g.
                                                                class subsumption,
                                                                instance classification
                                                                (logic-based reasoning)



                                                                            picture
                                                                      creator
                                                       person


                                                                Bob
                                                Bob is a good friend of mine. We
                                                went to the same university, and
                                                also shared an apartment in Berlin
                                                in 2008. The trouble with Bob is
                                                that he takes much better photos
                                                than I do:


28
Semantic Search
  “Information about a friend of Alice, who shared an apartment with her in Berlin
      and knows someone in the field of Semantic Search working at KIT”.
      Term: image is synonym of picture                 Semantics captured in
                                                         linguistic models, e.g.
                                                         reasoning over term relationships

                                                     drawing
                                                                      image          picture

                                                                                               poster
                                                                            picture
                                                                      creator
                                                       person



                                                                Bob

                                                Bob is a good friend of mine. We
                                                went to the same university, and
                                                also shared an apartment in Berlin
                                                in 2008. The trouble with Bob is
                                                that he takes much better photos
                                                than I do:

29
Semantic Search
      A retrieval paradigm that exploits semantics of
        data, query, background knowledge                 [Tran et al, JWS11]
       to interpret and incorporate the intent of query and the meaning of data
         into the search algorithms (more generally: search process)
      Different directions, employing various models of semantics of
        Terms (linguistic models)
        Concepts (conceptual models)
       to deal with ambiguous queries
        Relational information (structured data) in
        Different datasets
       to produce complex structured, aggregated results to answer complex
         information needs
      Orthogonal to retrieval tasks / specific type of approaches
        Document retrieval
        Data retrieval
        Multimedia retrieval
30      Social media retrieval, multilingual retrieval…
Semantic Search                                                          ????
                   Semantic Search   Information      Data Retrieval   Multimedia
                                     Retrieval (IR)   (DB)             Retrieval
Keyword query            X                   X                x

Structured query          x                                  X

Textual data             X                   X                x

Structured data          X                   x               X

Conceptual model         X                                   X

Linguistic model         X                   x

Term matching            X                   X                x

Structure                X                                   X
matching
Content matching         X                   X

Ranking                  X                   X                x
Focus of the following technical parts
    IR on structured data

• Motivation
  • IR is user-centric!
  • Text-based querying paradigms more intuitive for end-users!
  • Keyword search widely adopted!
• Focus
  • Keyword query on structured data, i.e. “a direction” of
    semantic search, which employs semantics of
     Relational information (structured data) in
     Different datasets
    to produce complex structured, aggregated results to answer
    complex information needs
 Similar, complementary to DB keyword search tutorial,
  emphasizes      [Chen et al, SIGMOD09]
   The role of textual data: data graphs with textual content nodes
   The role of semantics
   Ranking
Matching
Structure
      Keyword search: keywords over data graphs
       Term matching
       Content matching
       Structure matching
      Schema-based keyword search
      Schema-agnostic keyword search
       Online search algorithms
       Index-based approaches




34
Keyword search approaches
      Finding “substructures” matching keyword nodes
      Different result semantics for different types of data
        Textual data (Web pages connected via hyperlink)
        DB (tuple connected via foreign keys)
        XML (elements/attributes via parent-child edges)
      Commonly used results: Steiner tree / subgraph
        Connect keyword matching elements
        Contain one keyword matching element for every query keyword
        Minimal substructures: closely connected keyword nodes
      Query is ambiguous, lacks explicit structure constraints
        NP-hard, thus efficiency of matching is a problem
        Large amounts on candidate matches, thus ranking is a problem


35
Keyword search on hybrid data graphs
     Example information need
     “Information about a friend of Alice, who shared an apartment with
     her in Berlin and knows someone working at KIT.”


 apartment shared Berlin Alice                     knows someone works at KIT
                                        Term
                                        matching
                     trouble with bob                            FluidOps                    34
                                                                             Peter
                                              sunset.jpg
                     Bob is a good friend
                                                             Beautiful
                     of mine. We went to                     Sunset
                     the same university,                                Germany     Semantic
                                              Alice                                  Search
                     and also shared an
                     apartment in Berlin
                          Content
                     in 2008. The trouble
                          matching                                                 Germany
                     with Bob is that he               Bob                                    2009
                     takes much better                              Thanh
                     photos than I do:
                                                             Structure      KIT
                                                             matching
36
Term matching
      Distance-based (syntax)
        Levenshtein distance (edit distance)
        Hamming distance
        Jaro-Winkler distance
      Dictionary-based (semantics)
        Taxonomy                         Term matching via reasoning
        Dictionary of similar words      over term relationships, e.g.
                                          image is synonym of picture
        Translation memory
        Ontologies
                        Term matching via reasoning
                        over concepts, e.g. creator is
                        a subclass of person



37
Content matching
     • Retrieve partial matches
        • Inverted list (inverted index)
               ki  {< d1, pos, score, ...>,
                     < d2, pos, score, ...>, ...}
     • Combine partial matches: union or join

shared            Berlin         Alice              shared   Berlin   Alice


 D1               D1            D1


     shared   =    berlin   =    alice




     shared




38
Structure matching                                      Structure not explicitly
                                                                  given in query  exploration
   • Retrieve structured data given patterns                     (e.g. triple patterns)
                                                                  / other kinds of join
      • Index on tables
      • Multiple “redundant” indexes to cover different access patterns
   • Combine: union or join
      • Blocking, e.g. linear merge join (required sorted input)
      • Non-blocking, e.g. symmetric hash-join
      • Materialized join indexes
                                                                 ?x ns:knows ?y. ?x ns:knows ?z.
Per1 ns:works ?v ?v ns:name “KIT”
      SP-index PO-index                                          ?z ns: works ?v. ?v ns:name “KIT”




                                                 =
                                       =                 =

                           Per1 ns:works Ins1 Ins1 ns:name KIT
Per1 ns:works Ins1 Ins1 ns:name KIT
 39
Structure
      Keyword search: keywords over data graphs
        Term matching
        Content matching
        Structure matching
      Schema-based keyword search
      Schema-agnostic keyword search
        Online search algorithms
        Index-based approaches




40
Matching in keyword search – schema-based
      Operate on schema graph [Hristidis et al, VLDB02]          [Agrawal et al, ICDE02]
                                                        Linguistic semantics for
      Query interpretation    [Tran et al, ICDE09]
                                                        term matching, conceptual
        Compute queries instead of results             and relational semantics for
                                                        interpreting structure
        Query presentation
                                                        matches.
        Query processing by DB engine
                                                              [Qin et al, SIGMOD09]
      Leverage the power of underlying DB query engine

                 Alice   Bob                      KIT




                                                                              Result 1

                                                                              Result 2
41
Structure
      Keyword search: keywords over data graphs
        Term matching
        Content matching
        Structure matching
      Schema-based keyword search
      Schema-agnostic keyword search
        Online search algorithms
        Index-based approaches




42
Matching in keyword search – schema-agnostic
      Operate on data graph
        No schema needed
        Flexibly support different types of data e.g. hybrid data
         graphs
        Native tailored optimization
      Online in-memory graph search [Kacholia et al, VLDB05]
                                        [He et al, SIGMOD07] [Li et al, SIGMOD08]
      Using materialized indexes                     [Tran et al, CIKM11]

                Alice   Bob                  KIT    Linguistic semantics for
                                                    term matching, relational
                                                    semantics in the data for
                                                    interpreting structure
                                                    matches.
                                                                        Result 1

                                                                        Result 2
43
Structure
      Keyword search: keywords over data graphs
        Term matching
        Content matching
        Structure matching
      Schema-based keyword search
      Schema-agnostic keyword search
        Online search algorithms
        Index-based approaches




44
Online search – top-k exploration
      Compute Steiner tree with distinct roots
      Backward expansion strategy
      Run Dijkstra’s single-source-shortest-path algorithms
          Explore shortest keyword-root paths       [Bhalotia et al, ICDE02]
          To find root (an answer)
          Until k answers are found
          Approximate: no top-k guarantee, i.e. further answers found later
           from other expansion paths may have higher score
      Complete top-k: terminate safely when lower bound of top-k
       candidate is higher than upper bound of what can be achieved
       with remaining inputs [Tran et al, ICDE09]
                  Alice   Bob                     KIT



                                                                        Result 1



45
Online search – dynamic programming
                                                                     [Ding et al, ICDE07]
      Search problem has optimal substructure, i.e.
      optimal solutions constructed from optimal solutions of
      subproblems
        T(v,p,h): rooted at node v ∈ V, height ≤ h, containing a set
        of keywords p, minimum cost
      Result computation formulated as a recursive series of
      simpler calculations
        Tree grow:       v p h         min v u             u p, h
                                        u N (v)


        Tree merge:      v p1     p2 h            min      v p1 , h         v p2 , h
                                                  p1   p1

                                  Bob
                      Alice             KIT                  Alice               Alice

                              v                             v
46                                                                      u                   v
Structure
      Taxonomy of matching approaches
      Keyword search: keywords over data graphs
        Term matching
        Content matching
        Structure matching
      Schema-based keyword search
      Schema-agnostic keyword search
        Online search algorithms
        Index-based approaches




47
Index-based
       • Retrieve keyword elements
          • Using inverted index
              ki  {< n1, score, ...>, < n1, score, ...>,…}
       • Retrieve parts of results using materialized index (paths up to graphs)
       • Combine via path “join” or graph pruning




          Alice     Bob        KIT                  Alice   Bob          KIT

               ↔           ↔


                              =
                   =
Alice ns:knows Bob              Inst1 ns:name KIT
                Bob ns:works Inst1


  48
Index-based – path index
           Path index (retrieval) + selection top-k (combine)
                                                     [He et al, SIGMOD07]
      Results: Steiner trees with distinct root semantics
      Index shortest node-node path (distance) for all
       possible pairs of nodes
      Results computed via selection top-k (TA algorithm)
         Each candidate ri is an object with |q| attributes, i.e.
          shortest distance (minimal cost) from ri to all keyword
          nodes k1, k2, …, k|q|
         Score is aggregation of attribute score (cost)


      Root     Alice   Bob     KIT                          Bob
                                             Alice     r1         KIT   r2
      r1 = 3   1       1       1
      r2 = 6   3       2       1
                                                       r3
      r3 = 7   1       3       3
49
Index-based
         Selection top-k (combine)
      TA algorithm
        |q| inputs, sorted according to cost (using index)
        While |results| < k
          In round-robin fashion, select keyword node ki
          Using node-to-node index (keyword-to-root), retrieve root ri
          Retrieving other attribute nodes of ri via root-to-keyword lookup
          Add to candidate list



       Alice    Bob       KIT                                     Bob
                                                     Alice   r1         KIT   r2
                                   1) r1 = 1+1+1=3
       1  r1   1  r1    1  r1   2) r2 = 2+1+3=6
       1 r3    2  r2    1  r2   3) r1 = 3+1+3=7
                                                             r3
       3  r2   3  r3    3  r3


50
Index-based
         Graph index (retrieval) + graph pruning (combine)
                                                   [Li et al, SIGMOD08]
      Index r-radius graphs
        Subgraphs with radius r
        Maximal  pruning redundant overlapping graphs
        ki Gki
      Compute r-Radius Steiner graphs
        Retrieved Gki for every ki in q
        Union Gki , i.e. the set of r-radius graphs that contain all or a
         portion of the keywords in q
        Pruning: extract non-Steiner nodes, i.e. those that do not
         participate in paths connecting keyword elements
              Alice   Bob                    KIT



                                                                          Result 1



51
Index-based
         2-hop cover graph index (retrieval)     [Ladwig et al, CIKM11]

      Use d-length 2-hop cover for graph indexing, i.e. a set of
       neighbourhood labels NBn:
        If there is a path of length d or less between u and v then

                    NBu         NBv        empty
        All paths of length d or less between u and v are (w is the hop
         node)

                 u,...,w,...,v , w NBu                   NBv
      Trivially, set of d-length neighbourhoods is a d-length 2-hop
       cover, fined grained pruning a path level reduces that size!
                       Alice   Bob




52
Index-based
         Join top-k (combine)
      Result: subgraphs with query-specific online scores
      Use neighborhoods of cover to find paths between
       every pair of keyword elements and join them until
       they are all connected
      Process
        Data access to retrieve keyword
         neighborhoods
        Neighborhood join to obtain a
         keyword graph
                                               Alice      Bob      KIT
        Graph joins to combine keyword
         neighborhood with a keyword graph
        Join = RankJoin (i.e. use existing join top-k techniques)

53
Index-based
        Join top-k (combine) – neighborhood join

       Join two keyword neighborhoods
       Two path entries are joined when same hop node

                                   o1        o1                o3        hop node

                                                                     center node
             l1        p4                    p4        p2


                                   p3        p3                l2


     Result: keyword graphs
                              p4        o1        p2        all paths of length d
                                                            between p4 and p2
                              p4                  p2        through o1

                              p4        p3        p2
54
Index-based
          Join top-k (combine) – graph join

      Expand keyword graphs to keyword graph
      neighborhoods
     Keyword Graph                   Keyword Graph Neighborhood
     p4       o1      p2                   p4       o1       p2   o3

                                           p4       o1       p2   l2

                                l1         p4       o1       p2

                                                     ...

      Graph Join: joins keyword graph neighborhood with
      keyword neighborhood


55
Taxonomy of matching approaches
         Conceptual semantics

      Schema-based vs. schema-agnostic
      Online search
                                Relational semantics
        Complete top-k         in the data
        Approximate top-k
        Backward expansion, bidirectional search, undirected
        subgraph exploration, dynamic programming
      Indexing for retrieval + join for combine
        Path retrieval, then path join
        Graph retrieval, then graph pruning
        Graph retrieval, then neighborhood / graph join
        (neighborhood indexed as a set of paths)
56
Ranking
Structure
      Ranking paradigms
        Explicit model of relevance
        No notion of relevance
      Features
        Content-based
        Structure-based               Relational
        Structured-content-based      semantics




58
Ranking paradigms
      No explicit notion of relevance: similarity between the
      query and the document model
        Vector space model (cosine similarity)
        Sim(q, d )         Cos(( w1,d ,..., wt ,d ), ( w1,q ,..., wk ,q ))
        Language models (KL divergence)
                                                                               P(t |   q   )
 Sim(q, d )           KL(       q   ||      d   )         P(t |   q   ) log(
                                                    t V                        P(t |   d   )
      Explicit relevance model
        Foundation: probability ranking principle
        Ranking results by the posterior probability (odds) of
        being observed in the relevant class:
         P( D | R)         P(t | R)         (1 P(t | N ))
                     t D              t D
59
Features
      Features are orthogonal to retrieval models
        Weights for query / document vectors?
        Language models for document / queries?
        Relevance models?
        What to use for learning to rank?




60
Features
       Dealing with ambiguities

      Content features                 Term
                                        ambiguity
        Co-occurrences
          Terms K that often co-occur form a contextual interpretation, i.e.
           topics (cluster hypothesis, distributional semantics)     Content
          “Berlin” and “apartment”  geographic context  Berlin as city
                                                                     ambiguity
        Frequencies: d more likely to be “about” a query term k
         when d more often, mentions k (probabilistic IR)
      Structure features
        Structured-content-based: consider relevance at fine-
         grained level of attributes
                                          Structure       Relational
        Link-based popularity            ambiguity       semantics
        Proximity-based
      Semantics captured in conceptual and linguistic models?
        Only exploited for matching to generate candidates so far
61
Content-based features – frequency
      Document statistics, e.g.        • An object is more likely
        Term frequency                 about “Berlin”?
        Document length                   • When it contains a
   Collection statistics, e.g.            relatively high
      Inverse document frequency          number of mentions
                 tf                        of the term “Berlin”
        wt , d       idf                   • When number of
                |d |
                                           mentions of term in
   Background language models
                                           the overall collection
                tf                         is relatively low
P(t | d )            (1     ) P(t | C )
               |d |



62
Structure-based features – links
                                                     How to incorporate it
       PageRank
                                                     into a content-based
           Link analysis algorithm                  retrieval model?
           Measuring relative importance of nodes
           Link counts as a vote of support
           The PageRank of a node recursively depends on the
            number and PageRank of all nodes that link to it
            (incoming links)
       ObjectRank      [Hristidis et al, TDS08]
         Types and semantics of links vary in structured data
         Authority transfer schema graph specifies connection
          strengths
         Recursively compute authority transfer data graph

     • An object (about “Berlin”) is more important?
        • When a relatively large number of objects are linked to it
63
Structure-based features – proximity
                                         adopted from: [Chen et al, SIGMOD09]

  EASE, XRANK, BLINKS, etc.                             How to incorporate it
  EASE [Li et al, SIGMOD08]                             into a content-based
                                                         retrieval model?
    Proximity between a pair of keywords




     Overall score of a JRT is aggregation on the score of keyword pairs
  XRANK [Guo et al, SIGMOD03]
    Ranking of XML documents / elements
    Proximity of n is defined based on w, the smallest text window in n
     that contains all search keywords


   • A structured result (e.g. Steiner tree) is more relevant?
64
       • When it is more compact s.t. elements are closely related
Structured-content-based model
 Consider structure of objects during content-based
  modeling, i.e., to obtain structured content-based
  model
   Content-based model for structured objects, structured
    documents, database tuples…
       P(t |   d   )          f   P(t |   f   )
                       f Fd


• An object is more likely about “Berlin”?
   • When its (important) fields / attributes contain a
   relatively high number of mentions of the term “Berlin”
Structured-content-based model
          Relevance model           [Lavrenko et al, SIGIR01]


                              q1      Israeli
                                                                       sample probabilities
                                                                       P(w|Q)           w
                 M            q2      Palestinian                         .077 palestinian
                                                                          .055 israel
                 M            q3      raids                               .034 jerusalem
                 M                                                        .033 protest
                              w       ???                                 .027 raid
                                                                          .011 clash
                                                                          .010 bank
                                     P( w, q1...qk )                      .010 west
     P( w | R)   P( w | q1...qk )                                         .010 troop
                                      P(q1...qk )
                                                                                …

                                                    k
 P ( w, q1...qk )            P( M ) P( w | M )           P (qi | M )
                      M UM                         i 1



66
Structured-content-based model
          Edge-specific relevance model      [Bicer et al, CIKM11]


      Given a query Q={q1,…,qn}, a set of resources (FR) are retrieved
        E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}
      Based on FR results, an edge specific RMFR is constructed for each
       unique edge e:




67
Structured-content-based model
          Edge-specific resource model
      Edge-specific resource model:



        Smoothing with model for the entire resource
      Use RM for query expansion: the score of a resource
      calculated based on cross-entropy of edge-specific RMFR
      and edge-specific RMr:




        Alpha allows to control the importance of edges




68
Ranking Steiner tree / join result tuples (JRT)
      Ranking aggregated JRTs:
        The cross entropy between the edge-specific RMFR (query model) and
         geometric mean of combined edge-specific RMJRT:




      The proposed ranking function is monotonic with respect to the
       individual resource scores (a necessary property for using top-k
       keyword search algorithms)



69
Taxonomy of ranking approaches
      Explicitly vs. non-explicitly relevance-based
      Different approaches for model construction
        Content-based ranking
        Structure-based ranking
                                               Relational semantics
        Structured-content-based ranking      in the data




70
Conclusions
Conclusions
      Semantic search is about using semantics in
      structured data, conceptual and linguistic models
        Interpreting term and content, inferring relationships
        Deal with ambiguities at term, content and structure level
        Mainly used during matching to generate candidates
        Semantics in structured data can improve ranking
      Keyword search on structure data as semantic search
        Support complex information needs (long tail)
        Exploit linguistic models to deal with ambiguity of
         keywords and conceptual / relational semantics to
         interpret their possible connections (structure)
        Complexity requires specialized indexes and efficient
         exploration algorithms
72
…Selected challenges
      Richer, high quality model of text
        Large-scale knowledge extraction / linking
        New models for interacting with and maintaining hybrid content
      Hybrid content management
        Indexing hybrid content
        Processing hybrid queries
        Ranking hybrid results (facts combined with text)
      Querying paradigm for complex retrieval tasks
        Keywords?
        Keywords + facets?
      Rich retrieval process: from querying to browsing to intuitive
       presentation, supporting complex analysis of data / results

73
References
   Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based
         search over relational databases. In ICDE, pages 5-16.
        Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges
         and opportunities. In VLDB, page 1368.
        Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with
         relevance oriented ranking. In ICDE, pages 517-528.
        Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword
         Searching and Browsing in Databases using BANKS. In ICDE, pages 431-440.
        Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using
         Relevance Models. In CIKM.
        Bizer, G., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.
         (2009): DBpedia - A crystallization point for the Web of Data. J. Web Sem. (WS)
         7(3):154-165
        Chen, Y., Wang, W., Liu, Z., Lin, X.(2009): Keyword search on structured and semi-
         structured data. SIGMOD 2009:1005-1010
        Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-
         cost connected trees in databases. In ICDE, pages 836-845.
        Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex
         data graphs. In SIGMOD, pages 927-940.
        Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked
         keyword search over XML documents. In SIGMOD.
        He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on
         graphs. In SIGMOD, pages 305-316.
        Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword
         search in databases. ACM Trans. Database Syst., 33(1):1-40
75
   Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational
         databases. In VLDB.
        Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H.
         (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pages
         505-516.
        Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword
         proximity search. In PODS, pages 173-182.
        Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native
         Keyword Search Databases. In CIKM.
        Lavrenko, V. Croft, W.B. (2001): Relevance-Based Language Models. In SIGIR, pages
         120-127.
        Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1
         keyword search method for unstructured, semi-structured and structured data. In
         SIGMOD.
        Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in
         relational databases. In SIGMOD, pages 563-574.
        Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in
         relational databases. In SIGMOD, pages 115-126.
        Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS.
         In SIGMOD, pages 681-694.
        Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the
         Search Process. In Journal of Web Semantics, 2011.
        Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph
         Candidates for Efficient Keyword Search on RDF. In ICDE.
        Vagelis Hristidis, L. G. and Papakonstantinou, Y. (2003). Efficient ir-style keyword search
76       over relational databases. In VLDB.
Thanks!




                                   Tran Duc Thanh
                              ducthanh.tran@kit.edu
          http://sites.google.com/site/kimducthanh/

Weitere ähnliche Inhalte

Ähnlich wie ESSIR 2011 Semantic Search Tutorial

Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...Benjamin Heitmann
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Thanh Tran
 
Dagstuhl FOAF history talk
Dagstuhl FOAF history talkDagstuhl FOAF history talk
Dagstuhl FOAF history talkDan Brickley
 
Similarity on DBpedia
Similarity on DBpediaSimilarity on DBpedia
Similarity on DBpediaSamantha Lam
 
ESWC 2011 BLOOMS+
ESWC 2011 BLOOMS+ ESWC 2011 BLOOMS+
ESWC 2011 BLOOMS+ Prateek Jain
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
NCompass Live: Linked Data and Libraries: What? Why? How?
NCompass Live: Linked Data and Libraries: What? Why? How?NCompass Live: Linked Data and Libraries: What? Why? How?
NCompass Live: Linked Data and Libraries: What? Why? How?Nebraska Library Commission
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overviewAmit Sheth
 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsAndre Freitas
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 
Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWC
Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWCFueling the future with Semantic Web patterns - Keynote at WOP2014@ISWC
Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWCValentina Presutti
 
Digital Archiving, The Semantic Web, and Modern AI
Digital Archiving, The Semantic Web, and Modern AIDigital Archiving, The Semantic Web, and Modern AI
Digital Archiving, The Semantic Web, and Modern AIJames Hendler
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesPrateek Jain
 
Connections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystifiedConnections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystifiedJakob .
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
 

Ähnlich wie ESSIR 2011 Semantic Search Tutorial (20)

NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
 
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...Enabling Case-Based Reasoning  on the Web of Data (How to create a Web of Exp...
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
 
Dagstuhl FOAF history talk
Dagstuhl FOAF history talkDagstuhl FOAF history talk
Dagstuhl FOAF history talk
 
Similarity on DBpedia
Similarity on DBpediaSimilarity on DBpedia
Similarity on DBpedia
 
ESWC 2011 BLOOMS+
ESWC 2011 BLOOMS+ ESWC 2011 BLOOMS+
ESWC 2011 BLOOMS+
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
NCompass Live: Linked Data and Libraries: What? Why? How?
NCompass Live: Linked Data and Libraries: What? Why? How?NCompass Live: Linked Data and Libraries: What? Why? How?
NCompass Live: Linked Data and Libraries: What? Why? How?
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
Effective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP SystemsEffective Semantics for Engineering NLP Systems
Effective Semantics for Engineering NLP Systems
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 
PhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek JainPhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek Jain
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWC
Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWCFueling the future with Semantic Web patterns - Keynote at WOP2014@ISWC
Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWC
 
Digital Archiving, The Semantic Web, and Modern AI
Digital Archiving, The Semantic Web, and Modern AIDigital Archiving, The Semantic Web, and Modern AI
Digital Archiving, The Semantic Web, and Modern AI
 
NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BC...
NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BC...NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BC...
NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BC...
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
 
Connections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystifiedConnections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystified
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 

Kürzlich hochgeladen

Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 

Kürzlich hochgeladen (20)

Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 

ESSIR 2011 Semantic Search Tutorial

  • 1. Semantic Search Focus: IR on Structured Data 8th European Summer School on Information Retrieval Duc Thanh Tran Institute AIFB, KIT, Germany Tran@aifb.uni-karlsruhe.de http://sites.google.com/site/kimducthanh
  • 2. Agenda  Why Semantic Search?  What is Semantic Search?  A Semantic Search direction - IR on structured data  Matching  Ranking  Conclusions
  • 4. Why Semantic Search? Many of these queries would not  Solve main classes of queries, e.g. navigational be asked by users, who learned over time what search  But long tail queries… technology can and can not do.  “teacher math class Goethe” These queries require precise  Several problematic cases understanding of the underlying information needs and data, and  Ambiguous / imprecise queries aggregating results.  “Paris Hilton”  “strong adventures people from Germany”  Specific, complex queries (factual, aggregated)  “32 year old computer scientist living in Karlsruhe”  “digital camera under 300 dollars produced by canon in 1992” 4
  • 5. Why Semantic Search?  Towards a Semantic Web  Large number of Web data vocabularies published in RDFS and OWL  Schema.org  Dbpedia ontology  Large amounts of data published in RDF / RDFa  Linked Data  Embedded metadata Semantics captured by taxonomies, ontologies, structur ed metadata can help to obtain precise understanding, to aggregate information from different sources, and to retrieve relevant results! 5
  • 6. Vocabularies  DBpedia ontology from : http://wiki.dbpedia.org/Ontology 6
  • 7. Vocabularies DBpedia [Bizer et al, JWS02] from : http://wiki.dbpedia.org/Ontology 7
  • 9. Structured Data Resource Description Framework (RDF)  Each resource (thing, entity) is identified by a URI  Entity descriptions as sets of facts  Triples of (subject, predicate, object)  A set of triples is published together in an RDF document (forming an RDF graph) adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/ 9
  • 10. Structured Data Linked Data source: http://linkeddata.org/ 10
  • 11. Metadata RDFa on the rise 510% increase between March, 2009 and October, 2010 Percentage of URLs with embedded metadata in various formats 11 from : http://www.slideshare.net/pmika/semtech-2011-semantic-search-tutorial
  • 12. Metadata RDFa … <div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div> </div> … adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/ 12
  • 13. Metadata RDFa Bob is a good friend of mine. We content went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: content adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/ 13
  • 14. What is Semantic Search?
  • 15. Structure  Semantics  Search tasks  Document, data, social media, multimedia  Core search problems  Semantic search exploits semantics  For search tasks  For search problems  Many Semantic Search directions
  • 16. Semantics  Semantics is concerned with the meaning of query, data and background knowledge  Distributional hypothesis / statistical semantics  “a word is characterized by the company it keeps”  Based on word patterns (co-occurrence frequency of the context words near a given target word)  Explicit semantics  Various explicit representations of meaning 16
  • 17. Explicit Semantics  Linguistic models: relationships among terms  Term relationships: synonymous, hyponymous, broader, narrower…  Taxonomies, thesauri, dictionaries of entity names  Examples: WordNet, Roget’s Thesaurus  Conceptual models: relationships among classes of objects  Abstract and conceptual representation of data  Concepts, RDFS classes, associations, relationships, attributes…  Terminological part (T-Box) of ontologies, DB schema e.g. relational model  Examples: SUMO, DBpedia  Structured data: relationships among objects  Description of concrete objects  Tuples, instances, entities, RDF resources, foreign keys, relationships, attributes,…  Assertional part of ontologies (A-Box), DB instance  Examples: Linked Data, metadata 17
  • 18. Search tasks – document retrieval  Search on textual data (documents, Web pages)  Mainly studied in the IR community  Data and queries  Term-based representation  Search algorithms  Retrieve documents relevant for query keywords  Match query term against terms / content of documents  Leverage statistical semantics for dealing with ambiguity and for ranking  Optimized, work well for navigational, topical search  Less so for complex information needs  Web scale 18
  • 19. Search tasks – data retrieval  Focus on structured data and retrieve direct answers  Data and queries  Structured models  Search algorithms  Retrieve direct answers that match structured queries  Structure matching: term / content based relevance less the focus, but structure filtering based on joins  Use relational semantics in structured data  Optimized for complex structured information needs / queries, less so for text-based relevance  More complex processing  efficiency, scalability 19
  • 20. Search tasks Addressing complex information needs Combination of data and document retrieval  Movies directed by Stephen Structured data with Spielberg where synopsis textual attribute values mentions dinosaurs. (content, description)  Publications authored by 32 year old computer scientist Documents with living in Karlsruhe, which metadata mention Semantic Search  Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT 20
  • 21. Search tasks e.g. combination of data and document retrieval  “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT”. <shared apartment in Berlin with Alice> <knows someone in the field of Semantic <friend of Alice> Search working at KIT> trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same Germany Semantic Alice Search university, and also shared an apartment in Berlin in 2008. The trouble Germany 2009 Bob with Bob is that he Thanh takes much better photos than I do: KIT 21
  • 22. Core search problems Term ambiguity apartment shared Berlin Alice knows someone works at KIT trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same university, Germany Semantic Alice Search and also shared an apartment in Berlin in 2008. The trouble with Bob is that he Germany 2009 Bob takes much better Thanh photos than I do: KIT Syntax / Semantic Is “BerllinNN” same as “Berlin”? What is meant by “KIT”? 22
  • 23. Core search problems Structure ambiguity apartment shared Berlin Alice knows someone works at KIT trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset Semantic the same university, Germany Alice Search and also shared an apartment in Berlin in 2008. The trouble Germany 2009 with Bob is that he Bob Thanh takes much better photos than I do: KIT Explicit semantics in structured data reduces structure ambiguity What is the connection between “Berlin” What is the relationship between and “Alice”? “someone” and KIT? 23
  • 24. Core search problems Content ambiguity apartment shared Berlin Alice knows someone works at KIT trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same Germany Semantic Alice Search university, and also shared an apartment in Berlin in 2008. The trouble Germany 2009 Bob with Bob is that he Thanh takes much better photos than I do: KIT Understanding of term and structure in content helps! Is the document about Berlin (as a city)? Is the graph about “apartment shared Berlin Is the element’s content / label about KIT? Alice knows someone works at KIT”? 24
  • 25. Core search problems Dealing with ambiguities: matching and ranking 2 scenarios: Ranked: ambiguities ambiguity (IR) vs.  Exact no ambiguity (DB) in query and data  Complete representation   Sound results cannot be Query guaranteed to exactly match the query (i.e.  Approximate multiple interpretations  Not complete Matching lead to multiple non-  Not sound equivalent matches).  Both the above Ambiguities at level of  Ranked elements (term, content) Data  Matching + and relationships ranking between elements  Top-k (structure) Matching mainly focuses on efficiency of computing matches 25 whereas ranking deals with degree of matching (relevance)!
  • 26. Search  “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT”.  Term: which Berlin? Distributional semantics /  Content: which documents are about Berlin? statistical reasoning over topic models, language models Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: 26
  • 27. Semantic Search  “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT”.  Structure: friend of, knows, shares Relational semantics of structured data in various datasets Bob Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: 27
  • 28. Semantic Search  “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT”.  Term: creator is a subclass of person Semantics captured in conceptual models, e.g. class subsumption, instance classification (logic-based reasoning) picture creator person Bob Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: 28
  • 29. Semantic Search  “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT”.  Term: image is synonym of picture Semantics captured in linguistic models, e.g. reasoning over term relationships drawing image picture poster picture creator person Bob Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: 29
  • 30. Semantic Search  A retrieval paradigm that exploits semantics of  data, query, background knowledge [Tran et al, JWS11] to interpret and incorporate the intent of query and the meaning of data into the search algorithms (more generally: search process)  Different directions, employing various models of semantics of  Terms (linguistic models)  Concepts (conceptual models) to deal with ambiguous queries  Relational information (structured data) in  Different datasets to produce complex structured, aggregated results to answer complex information needs  Orthogonal to retrieval tasks / specific type of approaches  Document retrieval  Data retrieval  Multimedia retrieval 30  Social media retrieval, multilingual retrieval…
  • 31. Semantic Search ???? Semantic Search Information Data Retrieval Multimedia Retrieval (IR) (DB) Retrieval Keyword query X X x Structured query x X Textual data X X x Structured data X x X Conceptual model X X Linguistic model X x Term matching X X x Structure X X matching Content matching X X Ranking X X x
  • 32. Focus of the following technical parts IR on structured data • Motivation • IR is user-centric! • Text-based querying paradigms more intuitive for end-users! • Keyword search widely adopted! • Focus • Keyword query on structured data, i.e. “a direction” of semantic search, which employs semantics of  Relational information (structured data) in  Different datasets to produce complex structured, aggregated results to answer complex information needs  Similar, complementary to DB keyword search tutorial, emphasizes [Chen et al, SIGMOD09]  The role of textual data: data graphs with textual content nodes  The role of semantics  Ranking
  • 34. Structure  Keyword search: keywords over data graphs  Term matching  Content matching  Structure matching  Schema-based keyword search  Schema-agnostic keyword search  Online search algorithms  Index-based approaches 34
  • 35. Keyword search approaches  Finding “substructures” matching keyword nodes  Different result semantics for different types of data  Textual data (Web pages connected via hyperlink)  DB (tuple connected via foreign keys)  XML (elements/attributes via parent-child edges)  Commonly used results: Steiner tree / subgraph  Connect keyword matching elements  Contain one keyword matching element for every query keyword  Minimal substructures: closely connected keyword nodes  Query is ambiguous, lacks explicit structure constraints  NP-hard, thus efficiency of matching is a problem  Large amounts on candidate matches, thus ranking is a problem 35
  • 36. Keyword search on hybrid data graphs Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” apartment shared Berlin Alice knows someone works at KIT Term matching trouble with bob FluidOps 34 Peter sunset.jpg Bob is a good friend Beautiful of mine. We went to Sunset the same university, Germany Semantic Alice Search and also shared an apartment in Berlin Content in 2008. The trouble matching Germany with Bob is that he Bob 2009 takes much better Thanh photos than I do: Structure KIT matching 36
  • 37. Term matching  Distance-based (syntax)  Levenshtein distance (edit distance)  Hamming distance  Jaro-Winkler distance  Dictionary-based (semantics)  Taxonomy Term matching via reasoning  Dictionary of similar words over term relationships, e.g. image is synonym of picture  Translation memory  Ontologies Term matching via reasoning over concepts, e.g. creator is a subclass of person 37
  • 38. Content matching • Retrieve partial matches • Inverted list (inverted index) ki  {< d1, pos, score, ...>, < d2, pos, score, ...>, ...} • Combine partial matches: union or join shared Berlin Alice shared Berlin Alice D1 D1 D1 shared = berlin = alice shared 38
  • 39. Structure matching Structure not explicitly given in query  exploration • Retrieve structured data given patterns (e.g. triple patterns) / other kinds of join • Index on tables • Multiple “redundant” indexes to cover different access patterns • Combine: union or join • Blocking, e.g. linear merge join (required sorted input) • Non-blocking, e.g. symmetric hash-join • Materialized join indexes ?x ns:knows ?y. ?x ns:knows ?z. Per1 ns:works ?v ?v ns:name “KIT” SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT” = = = Per1 ns:works Ins1 Ins1 ns:name KIT Per1 ns:works Ins1 Ins1 ns:name KIT 39
  • 40. Structure  Keyword search: keywords over data graphs  Term matching  Content matching  Structure matching  Schema-based keyword search  Schema-agnostic keyword search  Online search algorithms  Index-based approaches 40
  • 41. Matching in keyword search – schema-based  Operate on schema graph [Hristidis et al, VLDB02] [Agrawal et al, ICDE02] Linguistic semantics for  Query interpretation [Tran et al, ICDE09] term matching, conceptual  Compute queries instead of results and relational semantics for interpreting structure  Query presentation matches.  Query processing by DB engine [Qin et al, SIGMOD09]  Leverage the power of underlying DB query engine Alice Bob KIT Result 1 Result 2 41
  • 42. Structure  Keyword search: keywords over data graphs  Term matching  Content matching  Structure matching  Schema-based keyword search  Schema-agnostic keyword search  Online search algorithms  Index-based approaches 42
  • 43. Matching in keyword search – schema-agnostic  Operate on data graph  No schema needed  Flexibly support different types of data e.g. hybrid data graphs  Native tailored optimization  Online in-memory graph search [Kacholia et al, VLDB05] [He et al, SIGMOD07] [Li et al, SIGMOD08]  Using materialized indexes [Tran et al, CIKM11] Alice Bob KIT Linguistic semantics for term matching, relational semantics in the data for interpreting structure matches. Result 1 Result 2 43
  • 44. Structure  Keyword search: keywords over data graphs  Term matching  Content matching  Structure matching  Schema-based keyword search  Schema-agnostic keyword search  Online search algorithms  Index-based approaches 44
  • 45. Online search – top-k exploration  Compute Steiner tree with distinct roots  Backward expansion strategy  Run Dijkstra’s single-source-shortest-path algorithms  Explore shortest keyword-root paths [Bhalotia et al, ICDE02]  To find root (an answer)  Until k answers are found  Approximate: no top-k guarantee, i.e. further answers found later from other expansion paths may have higher score  Complete top-k: terminate safely when lower bound of top-k candidate is higher than upper bound of what can be achieved with remaining inputs [Tran et al, ICDE09] Alice Bob KIT Result 1 45
  • 46. Online search – dynamic programming [Ding et al, ICDE07]  Search problem has optimal substructure, i.e. optimal solutions constructed from optimal solutions of subproblems  T(v,p,h): rooted at node v ∈ V, height ≤ h, containing a set of keywords p, minimum cost  Result computation formulated as a recursive series of simpler calculations  Tree grow: v p h min v u u p, h u N (v)  Tree merge: v p1 p2 h min v p1 , h v p2 , h p1 p1 Bob Alice KIT Alice Alice v v 46 u v
  • 47. Structure  Taxonomy of matching approaches  Keyword search: keywords over data graphs  Term matching  Content matching  Structure matching  Schema-based keyword search  Schema-agnostic keyword search  Online search algorithms  Index-based approaches 47
  • 48. Index-based • Retrieve keyword elements • Using inverted index ki  {< n1, score, ...>, < n1, score, ...>,…} • Retrieve parts of results using materialized index (paths up to graphs) • Combine via path “join” or graph pruning Alice Bob KIT Alice Bob KIT ↔ ↔ = = Alice ns:knows Bob Inst1 ns:name KIT Bob ns:works Inst1 48
  • 49. Index-based – path index Path index (retrieval) + selection top-k (combine) [He et al, SIGMOD07]  Results: Steiner trees with distinct root semantics  Index shortest node-node path (distance) for all possible pairs of nodes  Results computed via selection top-k (TA algorithm)  Each candidate ri is an object with |q| attributes, i.e. shortest distance (minimal cost) from ri to all keyword nodes k1, k2, …, k|q|  Score is aggregation of attribute score (cost) Root Alice Bob KIT Bob Alice r1 KIT r2 r1 = 3 1 1 1 r2 = 6 3 2 1 r3 r3 = 7 1 3 3 49
  • 50. Index-based Selection top-k (combine)  TA algorithm  |q| inputs, sorted according to cost (using index)  While |results| < k  In round-robin fashion, select keyword node ki  Using node-to-node index (keyword-to-root), retrieve root ri  Retrieving other attribute nodes of ri via root-to-keyword lookup  Add to candidate list Alice Bob KIT Bob Alice r1 KIT r2 1) r1 = 1+1+1=3 1  r1 1  r1 1  r1 2) r2 = 2+1+3=6 1 r3 2  r2 1  r2 3) r1 = 3+1+3=7 r3 3  r2 3  r3 3  r3 50
  • 51. Index-based Graph index (retrieval) + graph pruning (combine) [Li et al, SIGMOD08]  Index r-radius graphs  Subgraphs with radius r  Maximal  pruning redundant overlapping graphs  ki Gki  Compute r-Radius Steiner graphs  Retrieved Gki for every ki in q  Union Gki , i.e. the set of r-radius graphs that contain all or a portion of the keywords in q  Pruning: extract non-Steiner nodes, i.e. those that do not participate in paths connecting keyword elements Alice Bob KIT Result 1 51
  • 52. Index-based 2-hop cover graph index (retrieval) [Ladwig et al, CIKM11]  Use d-length 2-hop cover for graph indexing, i.e. a set of neighbourhood labels NBn:  If there is a path of length d or less between u and v then NBu NBv empty  All paths of length d or less between u and v are (w is the hop node) u,...,w,...,v , w NBu NBv  Trivially, set of d-length neighbourhoods is a d-length 2-hop cover, fined grained pruning a path level reduces that size! Alice Bob 52
  • 53. Index-based Join top-k (combine)  Result: subgraphs with query-specific online scores  Use neighborhoods of cover to find paths between every pair of keyword elements and join them until they are all connected  Process  Data access to retrieve keyword neighborhoods  Neighborhood join to obtain a keyword graph Alice Bob KIT  Graph joins to combine keyword neighborhood with a keyword graph  Join = RankJoin (i.e. use existing join top-k techniques) 53
  • 54. Index-based Join top-k (combine) – neighborhood join  Join two keyword neighborhoods  Two path entries are joined when same hop node o1 o1 o3 hop node center node l1 p4 p4 p2 p3 p3 l2 Result: keyword graphs p4 o1 p2 all paths of length d between p4 and p2 p4 p2 through o1 p4 p3 p2 54
  • 55. Index-based Join top-k (combine) – graph join  Expand keyword graphs to keyword graph neighborhoods Keyword Graph Keyword Graph Neighborhood p4 o1 p2 p4 o1 p2 o3 p4 o1 p2 l2 l1 p4 o1 p2 ...  Graph Join: joins keyword graph neighborhood with keyword neighborhood 55
  • 56. Taxonomy of matching approaches Conceptual semantics  Schema-based vs. schema-agnostic  Online search Relational semantics  Complete top-k in the data  Approximate top-k  Backward expansion, bidirectional search, undirected subgraph exploration, dynamic programming  Indexing for retrieval + join for combine  Path retrieval, then path join  Graph retrieval, then graph pruning  Graph retrieval, then neighborhood / graph join (neighborhood indexed as a set of paths) 56
  • 58. Structure  Ranking paradigms  Explicit model of relevance  No notion of relevance  Features  Content-based  Structure-based Relational  Structured-content-based semantics 58
  • 59. Ranking paradigms  No explicit notion of relevance: similarity between the query and the document model  Vector space model (cosine similarity) Sim(q, d ) Cos(( w1,d ,..., wt ,d ), ( w1,q ,..., wk ,q ))  Language models (KL divergence) P(t | q ) Sim(q, d ) KL( q || d ) P(t | q ) log( t V P(t | d )  Explicit relevance model  Foundation: probability ranking principle  Ranking results by the posterior probability (odds) of being observed in the relevant class: P( D | R) P(t | R) (1 P(t | N )) t D t D 59
  • 60. Features  Features are orthogonal to retrieval models  Weights for query / document vectors?  Language models for document / queries?  Relevance models?  What to use for learning to rank? 60
  • 61. Features Dealing with ambiguities  Content features Term ambiguity  Co-occurrences  Terms K that often co-occur form a contextual interpretation, i.e. topics (cluster hypothesis, distributional semantics) Content  “Berlin” and “apartment”  geographic context  Berlin as city ambiguity  Frequencies: d more likely to be “about” a query term k when d more often, mentions k (probabilistic IR)  Structure features  Structured-content-based: consider relevance at fine- grained level of attributes Structure Relational  Link-based popularity ambiguity semantics  Proximity-based  Semantics captured in conceptual and linguistic models?  Only exploited for matching to generate candidates so far 61
  • 62. Content-based features – frequency  Document statistics, e.g. • An object is more likely  Term frequency about “Berlin”?  Document length • When it contains a  Collection statistics, e.g. relatively high  Inverse document frequency number of mentions tf of the term “Berlin” wt , d idf • When number of |d | mentions of term in  Background language models the overall collection tf is relatively low P(t | d ) (1 ) P(t | C ) |d | 62
  • 63. Structure-based features – links How to incorporate it  PageRank into a content-based  Link analysis algorithm retrieval model?  Measuring relative importance of nodes  Link counts as a vote of support  The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links)  ObjectRank [Hristidis et al, TDS08]  Types and semantics of links vary in structured data  Authority transfer schema graph specifies connection strengths  Recursively compute authority transfer data graph • An object (about “Berlin”) is more important? • When a relatively large number of objects are linked to it 63
  • 64. Structure-based features – proximity adopted from: [Chen et al, SIGMOD09]  EASE, XRANK, BLINKS, etc. How to incorporate it  EASE [Li et al, SIGMOD08] into a content-based retrieval model?  Proximity between a pair of keywords  Overall score of a JRT is aggregation on the score of keyword pairs  XRANK [Guo et al, SIGMOD03]  Ranking of XML documents / elements  Proximity of n is defined based on w, the smallest text window in n that contains all search keywords • A structured result (e.g. Steiner tree) is more relevant? 64 • When it is more compact s.t. elements are closely related
  • 65. Structured-content-based model  Consider structure of objects during content-based modeling, i.e., to obtain structured content-based model  Content-based model for structured objects, structured documents, database tuples… P(t | d ) f P(t | f ) f Fd • An object is more likely about “Berlin”? • When its (important) fields / attributes contain a relatively high number of mentions of the term “Berlin”
  • 66. Structured-content-based model Relevance model [Lavrenko et al, SIGIR01] q1 Israeli sample probabilities P(w|Q) w M q2 Palestinian .077 palestinian .055 israel M q3 raids .034 jerusalem M .033 protest w ??? .027 raid .011 clash .010 bank P( w, q1...qk ) .010 west P( w | R) P( w | q1...qk ) .010 troop P(q1...qk ) … k P ( w, q1...qk ) P( M ) P( w | M ) P (qi | M ) M UM i 1 66
  • 67. Structured-content-based model Edge-specific relevance model [Bicer et al, CIKM11]  Given a query Q={q1,…,qn}, a set of resources (FR) are retrieved  E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}  Based on FR results, an edge specific RMFR is constructed for each unique edge e: 67
  • 68. Structured-content-based model Edge-specific resource model  Edge-specific resource model:  Smoothing with model for the entire resource  Use RM for query expansion: the score of a resource calculated based on cross-entropy of edge-specific RMFR and edge-specific RMr:  Alpha allows to control the importance of edges 68
  • 69. Ranking Steiner tree / join result tuples (JRT)  Ranking aggregated JRTs:  The cross entropy between the edge-specific RMFR (query model) and geometric mean of combined edge-specific RMJRT:  The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k keyword search algorithms) 69
  • 70. Taxonomy of ranking approaches  Explicitly vs. non-explicitly relevance-based  Different approaches for model construction  Content-based ranking  Structure-based ranking Relational semantics  Structured-content-based ranking in the data 70
  • 72. Conclusions  Semantic search is about using semantics in structured data, conceptual and linguistic models  Interpreting term and content, inferring relationships  Deal with ambiguities at term, content and structure level  Mainly used during matching to generate candidates  Semantics in structured data can improve ranking  Keyword search on structure data as semantic search  Support complex information needs (long tail)  Exploit linguistic models to deal with ambiguity of keywords and conceptual / relational semantics to interpret their possible connections (structure)  Complexity requires specialized indexes and efficient exploration algorithms 72
  • 73. …Selected challenges  Richer, high quality model of text  Large-scale knowledge extraction / linking  New models for interacting with and maintaining hybrid content  Hybrid content management  Indexing hybrid content  Processing hybrid queries  Ranking hybrid results (facts combined with text)  Querying paradigm for complex retrieval tasks  Keywords?  Keywords + facets?  Rich retrieval process: from querying to browsing to intuitive presentation, supporting complex analysis of data / results 73
  • 75. Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5-16.  Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and opportunities. In VLDB, page 1368.  Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with relevance oriented ranking. In ICDE, pages 517-528.  Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching and Browsing in Databases using BANKS. In ICDE, pages 431-440.  Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using Relevance Models. In CIKM.  Bizer, G., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S. (2009): DBpedia - A crystallization point for the Web of Data. J. Web Sem. (WS) 7(3):154-165  Chen, Y., Wang, W., Liu, Z., Lin, X.(2009): Keyword search on structured and semi- structured data. SIGMOD 2009:1005-1010  Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min- cost connected trees in databases. In ICDE, pages 836-845.  Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data graphs. In SIGMOD, pages 927-940.  Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search over XML documents. In SIGMOD.  He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pages 305-316.  Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in databases. ACM Trans. Database Syst., 33(1):1-40 75
  • 76. Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In VLDB.  Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.  Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173-182.  Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In CIKM.  Lavrenko, V. Croft, W.B. (2001): Relevance-Based Language Models. In SIGIR, pages 120-127.  Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD.  Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational databases. In SIGMOD, pages 563-574.  Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational databases. In SIGMOD, pages 115-126.  Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS. In SIGMOD, pages 681-694.  Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the Search Process. In Journal of Web Semantics, 2011.  Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In ICDE.  Vagelis Hristidis, L. G. and Papakonstantinou, Y. (2003). Efficient ir-style keyword search 76 over relational databases. In VLDB.
  • 77. Thanks! Tran Duc Thanh ducthanh.tran@kit.edu http://sites.google.com/site/kimducthanh/

Hinweis der Redaktion

  1. says something about summer rhine tour vineyard nature --&gt; beautiful great to be hereRelated to Web Science / development of the Web  exploit the increasingly richt structure and semantics captured by the WebNigel mentioned chalanges facing IR community when it comes to dealing with these Web dataPresent solutions hereNotion is still vaugesMany different directions of semantic searchI will try to provide a systematic account of various approaches and discuss the main / common ideas behind semantic search
  2. 5-10% of webpages contain some explicit metadataDepending on how you count…Too many competing approachesToo many formats: microformatsvsRDFavsMicrodataRDFa on the rise: 5 fold increase from 2009 to 2010
  3. - In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval) different search technologies target different tasksThe technical differences between these paradigms can be broken down into three main dimensions, i.e. the representation of the user need (query model), the underlying resources (data model) and the matching technique. The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, Can deal with topical relevanceSolve ambiguity in the sense of topical ambiguitythey usually fail to address more complex information needs.
  4. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.
  5. - Query as a set of constrains Match structured data Match text
  6. Term might be not correctly spelled, or may have many spelling variants  syntactic matchesA meaning intended by the user keyword may have different syntactic representation (can be associated with different terms) / keyword has different meanings, different interpretations multiple semantic matches
  7. Term level: find out meaning intended by query termsStructure level: find out structural relationships between entities / meanings behind these terms  in Berlin, Alice shared an apartment with a friend, Thanh works at KIT! Terms do notprecisely capture structure information  more interpretations, more structure ambiguityStructured data capture structural relationships among entities  less interpretations, less structure ambiguity
  8. (Even when term ambiguity is resolved) i.e. Berlin is known to be a city, KIT known to be an institute, Ambiguity still exists at content level in fact, ambiguity at term level is resolved using content, as content provided the “context”Then understanding of term (and structure) in content used to deal with content ambiguity!Complex content: not a document with term but a data graph with nodes representing resources (KIT) as well as content elements (documents), contain term and structure information!
  9. precise semantics of query and data Exact: precise semantics of query and data  exact matches according to semantics can be produced  all results, everyone of them correct Even when precise semantics is available: matching can be approximate for reasons of efficiency, however, not all results may be produced, and some of them may not be correctRanked matching = matching and ranking  matching may be exact and approximate Ranked matching may also be needed in that case, when queries too general such that we have to deal with too many resultsRanked more interesting when no precise semantics of query and data  are ambiguities in the sense of multiple interpretations then these interpretations have to be ranked!
  10. Semantic search can be seen as a retrieval paradigm Centered on the use of semanticsIncorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
  11. Hybrid data graph with content nodes
  12. Content matching: not only one single term but several query terms (predicate)  not only one matching operations but also combining results of matches for parts of the query produced by several operationsInstead of online matching  index is needed for managing last amount of data and fast access to matches Matching can be decomposed into two operations: matching and combine Join : dictionary posting lists  intersect posting lists
  13. Assume given structure patterns in the query, i.e. structured queries, e.g. graph patterns (a popular fragment of widely used languages SQL and SPARQL)Blocking: iterator-based approachesNon-blocking: good for streaming, good we cannot wait for some parts of the results to be completely worked-offLink data: cannot wait for sources, (some are slower then other) thus better to push data into query processing as the they come instead of pulling data and wait (busy waiting)This structure matching based on given structure patterns demonstrate the idea behind keyword  however query structure provide guidances as to what structure elements in the data are relevant, given keywords, all possible structured have to be explored, other kinds of join
  14. &gt;&gt; &gt;&gt; Intro Session: motivation, overview etc.: 10 min &gt;&gt; &gt;&gt;1)Representation of the Search Space (M) 45&gt;&gt; &gt;&gt;2)Offline Preprocessing: Crawling and Indexing (P) 60&gt;&gt; &gt;&gt;3)Query Processing (T) 4)Matching (T) 90&gt;&gt; &gt;&gt;5)Ranking 60 &gt;&gt; &gt;&gt;6)Result Presentation (45)&gt;&gt; &gt;&gt;7)Evaluation (45)&gt;&gt; &gt;&gt;Demo Session (30)&gt;&gt; &gt;&gt; Wrap-up Session (10)
  15. Followed from the excurse what about semantic features? Not directly incorporated into ranking models yet but only to generate candidate matches during the matching stepNot straightforward when using stastitical ranking models
  16. Proximity-based ranking employ minimal distance heuristics to maximize structural compactness of results When JRT is more compact, it is assumed to be more meaningful and relevant Intuition: keyword specified by the users are closely related and thus should be connected over relatively short paths I.e. Compactness measured in terms of the length of paths between nodes, i.e. The proximity The larger the length of paths, the less relevant is the overall resultThe proximity of two keywords defined based on proximity of elements matches these keywordsNi and nj are nodes in the graph sim(ni,nj) denotes the compactness between two any nodessim(ki,kj) denotes the compactness between two keywords (taking account the compactness of all pairs of nodes matching the two keywords), i.e. Cki denotes the set of all nodes that match kiOverall score of a JRT is an aggregation on the score of its n is a keyword search result  matching the query keywords
  17. Relational semantics instructured data is increasingly availableBUT QUALITYSome also result from manual data collection effortAutomatic methods can helpBut do not scale effectively to the large scale hetereogenous settingCombine structure with content? Ad-hoc aggregeation No principled approaches! Different semantics! Adding data after text is created, instead of subsequent enrichment; supporting creation, editing and maitaning hybrid content right at the start and throughout the process?