Semantic Web Search - Searching Documents and Semantic Data on the Web

Semantic Web Search
Searching Documents and Semantic Data on the Web
Presentation at Information Sciences Institute, USC
Semantic Search Group at the AIFB Institute
Thanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner,
Veli Bicer, Yongtao Ma and Rudi Studer.

http://sites.google.com/site/kimducthanh

KIT – the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
1

Structure

• Motivation
• Previous and current work
• Keyword query processing
• Keyword query result ranking
• Conclusion

2

Besides documents, there is an increasing amount of structured data on
the Web such as RDF, RDFa and Linked Data! How can we leverage this
for enhancing the search experience?

MOTIVATION

3

RDFa
…
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>

Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:

<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
…
Adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
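
To make the embedded statements concrete, the following is a minimal sketch that reads (subject, property, value) triples out of the markup above. It assumes a deliberately simplified reading of RDFa that looks only at the `about` and `property` attributes; the class name `SimpleRDFaExtractor` is made up for this illustration and a real RDFa processor handles many more cases (prefixes, `rel`, `resource`, `content`, and so on).

```python
from html.parser import HTMLParser

SNIPPET = """
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>
Bob is a good friend of mine.
<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
"""


class SimpleRDFaExtractor(HTMLParser):
    """Toy extractor (assumption): handles only `about` and `property`."""

    def __init__(self):
        super().__init__()
        self.subjects = [""]  # stack of subjects; "" stands for the page itself
        self.elements = []    # one (sets_subject, property, text) entry per open element
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        sets_subject = "about" in attributes
        if sets_subject:
            self.subjects.append(attributes["about"])
        self.elements.append((sets_subject, attributes.get("property"), []))

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for _, prop, text in self.elements:
            if prop is not None:
                text.append(data)

    def handle_endtag(self, tag):
        sets_subject, prop, text = self.elements.pop()
        if prop is not None:
            self.triples.append((self.subjects[-1], prop, "".join(text).strip()))
        if sets_subject:
            self.subjects.pop()


extractor = SimpleRDFaExtractor()
extractor.feed(SNIPPET)
extractor.close()
for triple in extractor.triples:
    print(triple)
```

The nested `div` switches the subject to the photo, so the two `dc:title` values end up attached to different resources.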

4

RDFa

[Figure: the page text, labeled as content]

Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:

5

Semantic Data

source: http://linkeddata.org/
6

Linked Data

7

Addressing Complex Information Needs
 “Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone in the field of Semantic Search
working at KIT”.

<friend of Alice>
<shared apartment in Berlin with Alice>
<knows someone in the field of Semantic Search working at KIT>

[Figure: data graph connecting the text of "The trouble with Bob" with linked entities; node labels include trouble with bob, sunset.jpg, Beautiful Sunset, Alice, Bob, Peter, Thanh, KIT, FluidOps, Semantic Search, Germany, 34, 2009]
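
As an illustration of how such an information need combines structured constraints with keywords over text, here is a minimal sketch of a hybrid query over toy data. The predicate names (`friendOf`, `knows`, `worksAt`, `field`), the exact edges between the entities and the helper functions are assumptions made for this example; they are not taken from the demo system.

```python
# Hypothetical toy data: the edges and predicate names are invented for illustration.
TRIPLES = [
    ("Bob", "friendOf", "Alice"),
    ("Peter", "friendOf", "Alice"),
    ("Bob", "knows", "Thanh"),
    ("Peter", "knows", "Thanh"),
    ("Thanh", "worksAt", "KIT"),
    ("Thanh", "field", "Semantic Search"),
    ("Peter", "worksAt", "FluidOps"),
]
# Unstructured text attached to a resource.
TEXT = {
    "Bob": "Bob is a good friend of mine. We went to the same university, "
           "and also shared an apartment in Berlin in 2008.",
}


def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy all triple patterns (terms starting with '?')."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, triples, candidate)


def hybrid_query(patterns, keyword_filters, triples, text):
    """Keep structured matches whose bound resources also contain the given keywords."""
    answers = []
    for binding in match(patterns, triples):
        if all(keyword in text.get(binding[variable], "").lower()
               for variable, keywords in keyword_filters.items()
               for keyword in keywords):
            answers.append(binding)
    return answers


PATTERNS = [
    ("?x", "friendOf", "Alice"),
    ("?x", "knows", "?y"),
    ("?y", "field", "Semantic Search"),
    ("?y", "worksAt", "KIT"),
]
KEYWORDS = {"?x": ["shared", "apartment", "berlin"]}
ANSWERS = hybrid_query(PATTERNS, KEYWORDS, TRIPLES, TEXT)
print(ANSWERS)
```

In this toy data Peter satisfies the structured part of the query, but only Bob also satisfies the keyword part, which is the point of combining both kinds of constraints.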
9

Data Sources in SemanticSearch@AIFB Demo

 English Wikipedia

 Data from Linked Open Data
 DBpedia
 YAGO
 Many more

 Live data from Data.gov (US Government)
 E.g. live data about earthquakes

10

Search Intent Interpretation, Refinement and Exploration

[Screenshot of the demo interface; labeled parts: Keywords, Query Completions, Term Completions, Facets]

13

Result Inspection, Analysis and Browsing

14

OVERVIEW OF WORK

15

Search Concepts
 Hybrid Search: Structured queries combined with
keywords on structured and unstructured data in
possibly remote (Linked Data) sources
BACK-END

 Query interpretation: Translation of keywords to
hybrid queries

 Keyword search (translated hybrid query)
combined with faceted search: starting with
keywords and then iterative refinement process
based on operations on facets
FRONT-END

16

Previous and Current Work

 Semi-structured RDF data management [ISWC09] [TKDE12]
 Inverted index for RDF data management
 Structure index
 Linked data management [ESWC10][ISWC10] [ESWC11][ISWC11]
 Keyword query routing to find relevant sources / relevant
combination of sources
 “Explorative” query processing and adaptive query optimization
 Combining local and remote Linked Data
 Search frontends [ICDE09][CIKM11] [SIGIR11][ISWC2011] [Dexa11]
 Ontology and entity result summarization
 Faceted and keyword search
 Current work: hybrid data search

17

KEYWORD QUERY PROCESSING
[ICDE09]
18

DB-style Keyword Search
Keyword query processing / translation
Information need: “Articles of researchers at Stanford with Turing Award”
Keyword query: “Stanford Article Turing Award”

 Keywords might produce a large number of
matching elements in the data graph
 The data graph might be large in size
 Search complexity increases substantially with
the size of the graph
 Large number of results

[Figure: the steps Specification and Selection, a ranked Set of Queries (Query 1, Query 2) and a ranked Set of Results (Result 1, Result 2)]

19

Query Space
[Figure: schema graph and query space]

 Main Idea
 Exploration on a much reduced summary model called query space
 Query space: more compact representation of the data graph
 Substantially decreases complexity
 Top-k procedure for graph exploration to compute only top-k results
 Online construction of query space out of schema graph
 Match keywords against labels of resources to find keyword elements
 Connect keyword elements with elements of schema to obtain query space
 Online top-k query graph exploration
20

Top-k Query Graph Exploration on Query Space
[Figures: paths and their costs; the resulting query graph]

• Cost-directed exploration of Steiner graphs
• Explore all possible distinct paths starting from keyword elements
• At each exploration, take current path with lowest cost
• When a connecting element is found, merge paths to construct the query
graph and add it to candidate list
• Top-k terminates when the highest cost in the candidate list (the cost of the
k-ranked query graph) is found to be lower than the lowest possible cost that can
be achieved with paths in the queues yet to be explored
• Result: best k query interpretations to be shown to the user
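
The exploration steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of [ICDE09] itself: the query space is an undirected graph with one keyword element per keyword, the cost of a query graph is taken to be the sum of its path costs, and the function name, node labels and edge costs are all hypothetical.

```python
import heapq
from itertools import product


def top_k_query_graphs(graph, keyword_elements, k):
    """graph: {node: {neighbor: cost}}, undirected; one keyword element per keyword."""
    n = len(keyword_elements)
    queue = [(0.0, i, (start,)) for i, start in enumerate(keyword_elements)]
    heapq.heapify(queue)
    paths_at = {}    # node -> per keyword, the explored (cost, path) pairs ending there
    candidates = {}  # edge set of a query graph -> lowest cost found for it

    while queue:
        ranked = sorted(candidates.values())
        if len(ranked) >= k and ranked[k - 1] < queue[0][0]:
            break  # every unexplored path already costs more than the k-th candidate
        cost, i, path = heapq.heappop(queue)  # current path with the lowest cost
        node = path[-1]
        per_keyword = paths_at.setdefault(node, [[] for _ in range(n)])
        per_keyword[i].append((cost, path))
        # If paths of all other keywords also end here, node is a connecting element.
        options = [[(cost, path)] if j == i else per_keyword[j] for j in range(n)]
        for combo in product(*options):
            edges = frozenset(tuple(sorted(e)) for _, p in combo for e in zip(p, p[1:]))
            total = sum(c for c, _ in combo)
            if total < candidates.get(edges, float("inf")):
                candidates[edges] = total
        for neighbor, weight in graph[node].items():
            if neighbor not in path:  # distinct (simple) paths only
                heapq.heappush(queue, (cost + weight, i, path + (neighbor,)))

    best = sorted(candidates.items(), key=lambda item: item[1])[:k]
    return [(total, sorted(edges)) for edges, total in best]


def undirected(edges):
    """Build a symmetric adjacency dict from (a, b, cost) tuples."""
    graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


# Hypothetical query space for the keywords "Stanford Article Turing Award".
QUERY_SPACE = undirected([
    ("Stanford", "University", 1), ("University", "Researcher", 1),
    ("Researcher", "Article", 1), ("Researcher", "Prize", 1),
    ("Prize", "Turing Award", 1), ("Stanford", "Prize", 10),
])
BEST = top_k_query_graphs(QUERY_SPACE, ["Stanford", "Article", "Turing Award"], k=1)
print(BEST)
```

The expensive edge between Stanford and Prize is never expanded: once the best candidate is cheaper than the cheapest queued path, the loop stops, which mirrors the termination condition described above.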

21

Evaluation – Performance
• Comparison with bidirectional search [V. Kacholia et al.] and
search based on graph indexing (1000 BFS, 1000 METIS, 300
BFS, 300 METIS in [H. He et al.])
• Query computation + processing time until finding 10 answers
• Outperforms bidirectional search by at least one order of magnitude
• Performance comparable with indexing-based approaches, but
requires less space
[Figure: Query Performance on DBLP Data. Logarithmic axis from 1 to 100000, queries Q1 to Q10; series: Our Solution, Bidirect, 1000 BFS, 1000 METIS, 300 BFS, 300 METIS]
22

KEYWORD QUERY RESULT RANKING
[CIKM11]
23

IR-based Ranking Schemes
 TF*IDF based:
 Discover, EASE, SPARK
 [Liu et al, SIGMOD06]

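$$\mathrm{Score}(JRT) = \sum_{r \in JRT} \mathrm{Score}(r)$$

$$\mathrm{Score}(r) = \sum_{v \in r, Q} \mathrm{Weight}(v, r) \cdot \mathrm{Weight}(v, Q)$$

$$\mathrm{Weight}(v, r) = \frac{ntf}{ndl} \cdot nidf$$

$$ntf = 1 + \ln(1 + \ln(tf))$$

$$ndl = (1 - s) + s \cdot \frac{dl}{avdl}$$

$$nidf = \ln \frac{N + 1}{df}$$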
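
A small worked example may help to read the weighting scheme: normalized term frequency 1 + ln(1 + ln(tf)), length normalization (1 - s) + s · dl/avdl, and normalized inverse document frequency ln((N + 1)/df). The sketch below is only an illustration; the slope value s = 0.2 and the use of the query term frequency as Weight(v, Q) are assumptions, not values from the slides.

```python
import math


def ntf(tf):
    """Normalized term frequency; requires tf >= 1."""
    return 1 + math.log(1 + math.log(tf))


def ndl(dl, avdl, s=0.2):
    """Length normalization; s = 0.2 is an assumed slope, not taken from the slides."""
    return (1 - s) + s * dl / avdl


def nidf(n_docs, df):
    """Normalized inverse document frequency."""
    return math.log((n_docs + 1) / df)


def weight(tf, dl, avdl, n_docs, df, s=0.2):
    """Weight(v, r) = ntf / ndl * nidf."""
    return ntf(tf) / ndl(dl, avdl, s) * nidf(n_docs, df)


def score_resource(term_stats, query_tf, avdl, n_docs):
    """Score(r): sum over terms shared by r and Q of Weight(v, r) * Weight(v, Q).

    term_stats maps a term to (tf, dl, df); Weight(v, Q) is assumed to be the
    frequency of the term in the query.
    """
    return sum(
        weight(tf, dl, avdl, n_docs, df) * query_tf[term]
        for term, (tf, dl, df) in term_stats.items()
        if term in query_tf
    )


def score_jrt(resource_scores):
    """Score(JRT): sum of the scores of the resources in the joined result."""
    return sum(resource_scores)


# A term occurring once in a resource of average length, found in 10 of 99 resources.
print(weight(tf=1, dl=10, avdl=10, n_docs=99, df=10))
```

With tf = 1 and dl = avdl both normalizations equal 1, so the weight reduces to nidf = ln(100/10).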


24

Proximity-based Ranking Schemes

 EASE, XRANK, BLINKS, etc.
 EASE
 Proximity between a pair of keywords

 Overall score of a JRT is aggregation on the score of keyword pairs
 XRANK
 Ranking of XML documents / elements
 Proximity here is defined based on w, the smallest text window in
n that contains all search keywords
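
The smallest text window w that contains all search keywords can be found with a standard sliding-window scan. The sketch below shows only that step, under the assumption that the element text is already tokenized; how the window size is turned into a proximity score is not covered here, and the function name is made up for this example.

```python
def smallest_window(tokens, keywords):
    """Return (start, end) token indices of the smallest window covering all keywords."""
    needed = set(keywords)
    if not needed:
        return None
    counts = {}
    covered = 0   # number of distinct keywords inside the current window
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
            if counts[token] == 1:
                covered += 1
        # Shrink from the left while the window still contains every keyword.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            leaving = tokens[left]
            if leaving in needed:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    covered -= 1
            left += 1
    return best


TOKENS = "bob shared an apartment in berlin with alice and bob".split()
WINDOW = smallest_window(TOKENS, {"bob", "berlin"})
print(WINDOW, TOKENS[WINDOW[0]:WINDOW[1] + 1])
```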

25

Prestige-based Ranking Schemes

 Based on graph structure, i.e. PageRank-like
methods to determine node prestige
 XRank [Guo et al, SIGMOD03]
 ObjectRank [Balmin et al, VLDB04] : considers both
global ObjectRank and keyword-specific ObjectRank
 The probability that edges of different types will be
visited is not uniform: requires manual fine-tuning to
set the importance of different types of edges
 Naive: indegree

26

Introduction
 A recent study shows that the effectiveness of most
approaches is below expectations (Coffman and Weaver,
CIKM 2010)
 Problems:
 Proximity does not directly model relevance
 Ad-hoc TF/IDF normalization does not capture the nature
of keyword search results well (small document length,
skewed word occurrence statistics)
 PageRank not directly applicable

27

Overview of the Approach

 Keyword queries are short and ambiguous, while data
(and results) provide rich structure information
that can be exploited!
 Principled approach to relevance based on
language models and pseudo-relevance feedback (PRF): estimate the model from
content and structure of PRF results
 Adopt relevance model as a fine-grained model
representing both content and structure of
relevant documents and queries (relevance class)

28

Relevance Models [SIGIR 01]
 Explicit notion of relevance
 Queries and documents are samples from a latent
representation space, i.e. the relevance model underlying
the information need

29

Relevance Models
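[Figure: the query terms q1 = Israeli, q2 = Palestinian and q3 = raids are samples from the model M; the probability of a further word w is sought]

Sample probabilities P(w|Q):

w            P(w|Q)
palestinian  .077
israel       .055
jerusalem    .034
protest      .033
raid         .027
clash        .011
bank         .010
west         .010
troop        .010
…

$$P(w \mid R) \approx P(w \mid q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$$

$$P(w, q_1 \ldots q_k) = \sum_{M \in U_M} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$$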
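
The estimate can be computed directly from a handful of models: the joint probability of w and the query is a sum over models M of P(M) · P(w|M) · ∏ P(q_i|M), which is then divided by the query probability. The sketch below assumes a uniform prior P(M) and uses two made-up word distributions; the numbers are illustrative and unrelated to the sample probabilities on the slide.

```python
def relevance_model(query, models, prior=None):
    """Estimate P(w|R) from word distributions `models` = {name: {word: P(word|M)}}."""
    vocabulary = set().union(*models.values())
    joint = {}
    for word in vocabulary:
        total = 0.0
        for name, model in models.items():
            p_model = prior[name] if prior else 1.0 / len(models)  # assumed uniform prior
            likelihood = 1.0
            for term in query:
                likelihood *= model.get(term, 0.0)
            total += p_model * model.get(word, 0.0) * likelihood
        joint[word] = total
    # Summing the joint over all words gives P(q1...qk) when each model sums to 1.
    p_query = sum(joint.values())
    if p_query == 0:
        return {}
    return {word: p / p_query for word, p in joint.items()}


# Two hypothetical models with made-up probabilities.
MODELS = {
    "M1": {"israeli": 0.4, "palestinian": 0.4, "raids": 0.2},
    "M2": {"israeli": 0.1, "raids": 0.1, "bank": 0.8},
}
RM = relevance_model(["israeli", "raids"], MODELS)
for word, p in sorted(RM.items(), key=lambda item: -item[1]):
    print(f"{p:.3f} {word}")
```

Words that co-occur with the query terms in models that explain the query well (here "palestinian") receive high probability even though they are not part of the query, which is why the relevance model acts as a query expansion.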

30

Ranking with Relevance Models

 Probability ranking principle
$$\frac{P(D \mid R)}{P(D \mid N)} \approx \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}$$

 See relevance model as query expansion
 Rank of document is based on the cross-entropy of its
model and the relevance model

$$H(R \,\|\, D) = -\sum_{w \in V} P(w \mid R) \log P(w \mid D)$$

$$P(w \mid D) = \lambda_D \frac{n(w, D)}{|D|} + (1 - \lambda_D)\, P(w \mid C)$$
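
A minimal sketch of ranking by cross-entropy follows: each document model mixes the relative frequency n(w, D)/|D| with the collection model P(w|C), and documents are sorted by ascending cross-entropy with the relevance model. The documents, the relevance model values and the mixing weight 0.5 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def document_model(tokens, collection_model, lam=0.5):
    """P(w|D) = lam * n(w, D)/|D| + (1 - lam) * P(w|C); lam = 0.5 is an assumed value."""
    counts = Counter(tokens)
    return {
        word: lam * counts[word] / len(tokens) + (1 - lam) * p_collection
        for word, p_collection in collection_model.items()
    }


def cross_entropy(relevance_model, doc_model):
    """H(R || D) = - sum_w P(w|R) log P(w|D); lower values mean a better match."""
    return -sum(p * math.log(doc_model[word])
                for word, p in relevance_model.items() if p > 0)


# Hypothetical toy collection.
DOCS = {
    "d1": "israeli raids west bank".split(),
    "d2": "beautiful sunset photo bank".split(),
}
all_tokens = [token for tokens in DOCS.values() for token in tokens]
COLLECTION = {word: n / len(all_tokens) for word, n in Counter(all_tokens).items()}

# Made-up relevance model over words that occur in the collection.
RELEVANCE = {"israeli": 0.5, "raids": 0.3, "bank": 0.2}

SCORES = {name: cross_entropy(RELEVANCE, document_model(tokens, COLLECTION))
          for name, tokens in DOCS.items()}
RANKING = sorted(SCORES, key=SCORES.get)
print(RANKING, SCORES)
```

Smoothing with the collection model keeps P(w|D) above zero for every word of the vocabulary, so the logarithm is always defined.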

31

Edge-Specific Relevance Models
 Given a query Q={q1,…,qn}, a set of PRF resources is retrieved from an inverted
keyword index:
 E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}
 Based on PRF results, an edge specific relevance model is constructed for each unique
edge e based on:

32

Edge Specific Resource Models

 Edge-specific resource model:

 Smoothing with model for the entire resource
 The score of a resource calculated based on cross-entropy
of edge-specific RM and edge-specific ResM:

 The parameter alpha controls the importance of edges

33

Ranking JRTs
 Ranking aggregated JRTs:
 The cross entropy between the edge-specific RM (Query Model) and
geometric mean of combined edge-specific ResM:

 The proposed ranking function is monotonic with respect to the
individual resource scores (a necessary property for using top-k
algorithms)
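
The monotonicity can be made plausible with a short derivation. It is a sketch under an assumption: the combined model of a JRT with resources $D_1, \ldots, D_n$ is taken to be the unnormalized geometric mean $P(w \mid JRT) = \prod_{i=1}^{n} P(w \mid D_i)^{1/n}$, and the edge index is dropped for readability. With the cross-entropy $H(R \,\|\, D) = -\sum_{w \in V} P(w \mid R) \log P(w \mid D)$:

$$
\begin{aligned}
H(R \,\|\, JRT) &= -\sum_{w \in V} P(w \mid R) \log \prod_{i=1}^{n} P(w \mid D_i)^{1/n} \\
&= \frac{1}{n} \sum_{i=1}^{n} \Bigl( -\sum_{w \in V} P(w \mid R) \log P(w \mid D_i) \Bigr) \\
&= \frac{1}{n} \sum_{i=1}^{n} H(R \,\|\, D_i)
\end{aligned}
$$

Under this assumption the score of a JRT is the average of the scores of its resources, so improving the score of any single resource never worsens the score of the JRT.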

34

Experiments
 Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases
 Queries: 50 queries for each dataset including “TREC style” queries and
“single resource” queries
 Metrics: Three metrics are used: (1) the number of top-1 relevant
results, (2) Reciprocal rank and (3) Mean Average Precision (MAP)
 Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK,
CoveredDensity (TF-IDF).
 RM-S: Our approach
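
For reference, reciprocal rank and (mean) average precision can be computed as in the sketch below. These are the standard textbook definitions, given here as an assumption about how the metrics were computed; the function names and the example ranking are made up.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, result in enumerate(ranked, start=1):
        if result in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits = 0
    precision_sum = 0.0
    for position, result in enumerate(ranked, start=1):
        if result in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked results, set of relevant results), one pair per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


RANKED = ["a", "b", "c", "d"]
RELEVANT = {"b", "d"}
print(reciprocal_rank(RANKED, RELEVANT))    # first relevant result at rank 2
print(average_precision(RANKED, RELEVANT))  # (1/2 + 2/4) / 2
```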

35

Experiments – Single Resource Queries
- Proximity-based approaches perform well
- Minimizing compactness results in single resources being ranked high
- TF-IDF normalization not as aggressive, not as effective

Reciprocal rank for single resource queries
36

Experiments – TREC-style Queries
- TF-IDF based approaches performed better
- Our approach outperformed existing approaches also in this category,
providing more stable performance over the entire precision-recall curve

Precision-recall for TREC-style queries on Wikipedia
37

Experiment – All Queries

- Our approach consistently shows superior performance
- Encouraging, given that this is the first study that uses a general
framework for evaluating keyword search ranking

MAP scores for all queries
38

Conclusions / Future Work

 Front-to-backend work on using structured data for
enhancing the search experience
 From backend data management to frontend search
concepts
 Current work / future directions
 Managing hybrid data
 Hybrid query processing / interfaces
 Ranking hybrid results

39

References (1)
 Günter Ladwig, Thanh Tran
SIHJoin: Querying Remote and Local Linked Data
In 8th Extended Semantic Web Conference (ESWC'11). Heraklion, Greece, June, 2011 (full
research paper, 23% acceptance rate).
 Thanh Tran, Lei Zhang, Rudi Studer
Summary Models for Routing Keywords to Linked Data Sources
In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
China, November, 2010 (full research paper, 20% acceptance rate).
 Günter Ladwig, Thanh Tran
Linked Data Query Processing Strategies
In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
China, November, 2010 (full research paper, 20% acceptance rate).
 Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer
Ontology-based Interpretation of Keywords for Semantic Search
In Proceedings of the 6th International Semantic Web Conference (ISWC'07), pp. 523-
536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).

40

References (2)
 Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
In Proceedings of the 25th International Conference on Data Engineering
(ICDE'09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).
 Haofen Wang, Duc Thanh Tran, Chang Liu
CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
In Proceedings of the 17th Conference on Information and Knowledge Management
(CIKM'08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).
 Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu,
Yue Pan
Semplore: A Scalable IR Approach to Search the Web of Data
In Journal of Web Semantics, 2009 (Impact Factor 3.4).
 Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu
Snippet Generation for Semantic Web Search Engines
In Proceedings of the 3rd Asian Semantic Web Conference (ASWC'08). December
2008 (full research paper, 31% acceptance rate).
 Thanh Tran, Günter Ladwig
Structure Index for RDF
In SemData@VLDB Workshop (SemData'10). Singapore, September, 2010.

41

Thanks!

Tran Duc Thanh
ducthanh.tran@kit.edu
http://sites.google.com/site/kimducthanh/

42

Backups

43

 Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search
over relational databases. In ICDE, pages 5-16.
 Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and
opportunities. In VLDB, page 1368.
 Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with relevance
oriented ranking. In ICDE, pages 517-528.
 Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching
and Browsing in Databases using BANKS. In ICDE, pages 431-440.
 Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using
Relevance Models. In CIKM.
 Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S. (2009):
DBpedia - A crystallization point for the Web of Data. J. Web Sem. (WS) 7(3):154-165
 Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data
graphs. PVLDB, 1(1):1189-1204.
 Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost
connected trees in databases. In ICDE, pages 836-845.
 Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data
graphs. In SIGMOD, pages 927-940.
 Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search
over XML documents. In SIGMOD.
 He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In
SIGMOD, pages 305-316.
 Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in
databases. ACM Trans. Database Syst., 33(1):1-40


 Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases.
In VLDB.
 Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005).
Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.
 Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword
proximity search. In PODS, pages 173-182.
 Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native Keyword
Search Databases. In CIKM.
 Lavrenko, V., Croft, W. B. (2001): Relevance-Based Language Models. In SIGIR, pages 120-127.
 Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search
method for unstructured, semi-structured and structured data. In SIGMOD.
 Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational
databases. In SIGMOD, pages 563-574.
 Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational
databases. In SIGMOD, pages 115-126.
 Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS. In SIGMOD,
pages 681-694.
 Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across
heterogeneous relational databases. In ICDE, pages 346-355.
 Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the Search
Process. In Journal of Web Semantics, 2011.
 Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph Candidates
for Efficient Keyword Search on RDF. In ICDE.
 Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient IR-style keyword search over
relational databases. In VLDB.
