Semantic Web Search - Searching Documents and Semantic Data on the Web
1. Semantic Web Search
Searching Documents and Semantic Data on the Web
Presentation at Information Sciences Institute, USC
Semantic Search Group at the AIFB Institute
Thanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner,
Veli Bicer, Yongtao Ma and Rudi Studer.
http://sites.google.com/site/kimducthanh
KIT – the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
2. Structure
• Motivation
• Previous and current work
• Keyword query processing
• Keyword query result ranking
• Conclusion
3. MOTIVATION
Besides documents, there is an increasing amount of structured data on
the Web, such as RDF, RDFa and Linked Data. How can we leverage this
to enhance the search experience?
4. RDFa
…
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>
Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:
<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
…
adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
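To illustrate how such markup turns into structured data, here is a minimal sketch that reads `about` and `property` attributes and emits (subject, predicate, literal) triples. It is not a conforming RDFa processor: prefix resolution, `rel`/`resource`, datatypes and nested properties are ignored, and the class name `RdfaLiteExtractor` and the `base_subject` parameter are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never enter the subject stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class RdfaLiteExtractor(HTMLParser):
    """Collects (subject, predicate, literal) triples from about/property attributes."""

    def __init__(self, base_subject):
        super().__init__()
        # Each stack entry is (tag, subject in scope inside that tag).
        self.stack = [(None, base_subject)]
        self.current = None  # [subject, predicate, text parts, stack depth]
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An "about" attribute sets a new subject; otherwise inherit it.
        subject = attrs.get("about", self.stack[-1][1])
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, subject))
        if "property" in attrs:
            self.current = [subject, attrs["property"], [], len(self.stack)]

    def handle_data(self, data):
        if self.current is not None:
            self.current[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Closing the element that opened the property completes the triple.
        if self.current is not None and len(self.stack) == self.current[3]:
            subject, predicate, parts, _ = self.current
            self.triples.append((subject, predicate, "".join(parts).strip()))
            self.current = None
        if len(self.stack) > 1:
            self.stack.pop()


def extract_triples(html, base_subject=""):
    parser = RdfaLiteExtractor(base_subject)
    parser.feed(html)
    parser.close()
    return parser.triples


EXAMPLE = """
<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2>
<h3 property="dc:creator">Alice</h3>
<div about="http://example.com/bob/photos/sunset.jpg">
<img src="http://example.com/bob/photos/sunset.jpg" />
<span property="dc:title">Beautiful Sunset</span>
by <span property="dc:creator">Bob</span>.
</div>
</div>
"""

if __name__ == "__main__":
    for triple in extract_triples(EXAMPLE):
        print(triple)
```

The unannotated text between the elements is not captured as a triple; it stays document content, which is what makes the page a mix of text and structured data.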
5. RDFa
[Figure: the RDFa page shown as a graph; its text is attached as content:]
Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:
adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
6. Semantic Data
source: http://linkeddata.org/
7. Linked Data
adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
8. Addressing Complex Information Needs
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone in the field of Semantic Search
working at KIT”.
• <friend of Alice>
• <shared apartment in Berlin with Alice>
• <knows someone in the field of Semantic Search working at KIT>
[Figure: data graph combining text and structured data. The post "trouble with bob" (text: "Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:") is linked to nodes including Alice, Bob, Peter, Thanh, sunset.jpg ("Beautiful Sunset"), FluidOps, KIT, Semantic Search, Germany, 34 and 2009.]
9. Data Sources in SemanticSearch@AIFB Demo
English Wikipedia
Data from Linked Open Data
DBpedia
YAGO
Many more
Live data from Data.gov (US Government)
E.g. live data about earthquakes
10. Search Intent Interpretation, Refinement and Exploration
[Screenshot of the demo; labeled parts: Keywords, Query Completions, Term Completions, Facets]
11. Result Inspection, Analysis and Browsing
12. OVERVIEW OF WORK
13. Search Concepts
Hybrid Search: Structured queries combined with
keywords on structured and unstructured data in
possibly remote (Linked Data) sources
BACK-END
Query interpretation: Translation of keywords to
hybrid queries
Keyword search (translated into a hybrid query)
combined with faceted search: start with
keywords, then refine iteratively through
operations on facets
FRONT-END
14. Previous and Current Work
Semi-structured RDF data management [ISWC09] [TKDE12]
Inverted index for RDF data management
Structure index
Linked data management [ESWC10][ISWC10] [ESWC11][ISWC11]
Keyword query routing to find relevant sources / relevant
combination of sources
“Explorative” query processing and adaptive query optimization
Combining local and remote Linked Data
Search frontends [ICDE09][CIKM11] [SIGIR11][ISWC2011] [Dexa11]
Ontology and entity result summarization
Faceted and keyword search
Current work: hybrid data search
15. KEYWORD QUERY PROCESSING
[ICDE09]
16. DB-style Keyword Search
Keyword query processing / translation
Information need: “Articles of researchers at Stanford with Turing Award”
Keyword query: “Stanford Article Turing Award”
• Keywords might produce a large number of matching elements in the data graph
• The data graph might be large
• Search complexity increases substantially with the size of the graph
• Large number of results
[Figure: Specification → Set of Queries (1) Query 1, 2) Query 2) → Selection → Set of Results (1) Result 1, 2) Result 2)]
17. Query Space
[Figure panels: Schema graph; Query space]
Main idea
• Exploration on a much reduced summary model, called the query space
• Query space: a more compact representation of the data graph
• Substantially decreases complexity
• Top-k procedure for graph exploration to compute only the top-k results
Online construction of the query space out of the schema graph
• Match keywords against labels of resources to find keyword elements
• Connect keyword elements with elements of the schema to obtain the query space
Online top-k query graph exploration
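The two construction steps named on this slide can be sketched as follows. This is an illustration under assumptions, not the implementation from [ICDE09]: the function names, the `has_value` edge label, the whole-word matching rule and the toy schema are all hypothetical.

```python
def find_keyword_elements(keywords, labels):
    """Map each keyword to the graph elements whose label contains it as a word."""
    return {
        kw: sorted(
            element
            for element, label in labels.items()
            if kw.lower() in label.lower().split()
        )
        for kw in keywords
    }


def build_query_space(schema_edges, element_class, keyword_elements):
    """Return the schema edges plus one edge from each matched value to its class."""
    space = set(schema_edges)
    for elements in keyword_elements.values():
        for element in elements:
            # Only data values need connecting; matched schema elements
            # (classes) are already part of the schema graph.
            if element in element_class:
                space.add((element_class[element], "has_value", element))
    return space


# Toy schema graph as (source, edge label, target) triples.
SCHEMA_EDGES = {
    ("Researcher", "works_at", "University"),
    ("Researcher", "author_of", "Article"),
    ("Researcher", "has_award", "Award"),
}
# Labels of schema elements and data values (hypothetical identifiers).
LABELS = {
    "Article": "Article",
    "v_stanford": "Stanford University",
    "v_turing": "Turing Award",
}
# Class of each data value.
ELEMENT_CLASS = {"v_stanford": "University", "v_turing": "Award"}

if __name__ == "__main__":
    matches = find_keyword_elements(["Stanford", "Article", "Turing"], LABELS)
    for edge in sorted(build_query_space(SCHEMA_EDGES, ELEMENT_CLASS, matches)):
        print(edge)
```

Only the matched values are added to the schema, so the resulting graph stays close to schema size regardless of how large the data graph is.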
18. Top-k Query Graph Exploration on Query Space
[Figure panels: Paths and their costs; The resulting query graph]
• Cost-directed exploration of Steiner graphs
• Explore all possible distinct paths starting from keyword elements
• At each exploration, take current path with lowest cost
• When a connecting element is found, merge paths to construct the query
graph and add it to candidate list
• Top-k terminates when the highest cost in the candidate list (the cost of the k-th
ranked query graph) is lower than the lowest possible cost that can be
achieved with the paths in the queues yet to be explored
• Result: best k query interpretations to be shown to the user
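The exploration loop described above can be sketched in a simplified form. Assumptions that are not from the slides: costs sit on undirected edges, only the cheapest path per keyword is kept at each element (so every connecting element yields one candidate), a candidate's cost is the sum of its path costs, and the cost of the cheapest unexplored path serves as the lower bound. The function name and the toy graph are hypothetical.

```python
import heapq
from itertools import count


def top_k_query_graphs(edges, keyword_elements, k):
    """Return up to k candidates as (cost, connecting element, paths)."""
    adjacency = {}
    for u, v, cost in edges:
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))

    n_keywords = len(keyword_elements)
    reached = {}      # element -> {keyword index: (cost, path)}
    candidates = []   # kept sorted by cost
    queue = []        # entries: (path cost, tie breaker, keyword index, path)
    tie = count()
    for i, elements in enumerate(keyword_elements):
        for element in elements:
            heapq.heappush(queue, (0.0, next(tie), i, (element,)))

    while queue:
        cost, _, i, path = heapq.heappop(queue)
        # Every future candidate contains a path at least this expensive,
        # so stop once the k-th candidate is no more expensive than it.
        if len(candidates) >= k and candidates[k - 1][0] <= cost:
            break
        element = path[-1]
        seen = reached.setdefault(element, {})
        if i in seen:
            continue  # a cheaper path for this keyword already ends here
        seen[i] = (cost, path)
        if len(seen) == n_keywords:
            # Connecting element: merge one path per keyword into a candidate.
            total = sum(c for c, _ in seen.values())
            paths = [seen[j][1] for j in range(n_keywords)]
            candidates.append((total, element, paths))
            candidates.sort(key=lambda candidate: candidate[0])
        for neighbor, edge_cost in adjacency.get(element, []):
            if neighbor not in path:
                new_path = path + (neighbor,)
                heapq.heappush(queue, (cost + edge_cost, next(tie), i, new_path))
    return candidates[:k]


# Toy query space with unit edge costs.
EDGES = [
    ("Stanford", "University", 1.0),
    ("University", "Researcher", 1.0),
    ("Researcher", "Article", 1.0),
    ("Researcher", "Turing Award", 1.0),
]
KEYWORD_ELEMENTS = [["Stanford"], ["Article"], ["Turing Award"]]

if __name__ == "__main__":
    for cost, element, paths in top_k_query_graphs(EDGES, KEYWORD_ELEMENTS, k=2):
        print(cost, element, paths)
```

The bound used here is deliberately loose; a tighter bound would stop earlier but needs more bookkeeping per queue.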
19. Evaluation – Performance
• Comparison with bidirectional search [V. Kacholia et al.] and
search based on graph indexing (1000 BFS, 1000 METIS, 300
BFS, 300 METIS in [H. He et al.])
• Query computation + processing time until finding 10 answers
• Outperforms bidirectional search by at least one order of magnitude
• Performance comparable with indexing based approaches, but
requires less space
[Figure: Query Performance on DBLP Data. Queries Q1 to Q10 on the horizontal axis; logarithmic vertical axis from 1 to 100000; series: Our Solution, Bidirect, 1000 BFS, 1000 METIS, 300 BFS, 300 METIS]
20. KEYWORD QUERY RESULT RANKING
[CIKM11]
21. IR-based Ranking Schemes
TF*IDF based:
Discover, EASE, SPARK
[Liu et al, SIGMOD06]
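$$\mathrm{Score}(JRT) = \sum_{r \in JRT} \mathrm{Score}(r)$$
$$\mathrm{Score}(r) = \sum_{v \in r \cap Q} \mathrm{Weight}(v, r) \cdot \mathrm{Weight}(v, Q)$$
$$\mathrm{Weight}(v, r) = \frac{ntf}{ndl} \cdot nidf$$
$$ntf = 1 + \ln(1 + \ln(tf))$$
$$ndl = (1 - s) + s \cdot \frac{dl}{avdl}$$
$$nidf = \ln\frac{N + 1}{df}$$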
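The weighting scheme above translates directly into code. The sketch below assumes that each tuple r is a list of tokens and that Weight(v, Q) is supplied per keyword; the default s = 0.2 and all function names are assumptions for illustration.

```python
import math


def ntf(tf):
    """Normalized term frequency: 1 + ln(1 + ln(tf)), defined for tf >= 1."""
    return 1.0 + math.log(1.0 + math.log(tf))


def ndl(dl, avdl, s=0.2):
    """Document length normalization: (1 - s) + s * dl / avdl."""
    return (1.0 - s) + s * dl / avdl


def nidf(n_docs, df):
    """Normalized inverse document frequency: ln((N + 1) / df)."""
    return math.log((n_docs + 1) / df)


def weight(tf, dl, avdl, n_docs, df, s=0.2):
    """Weight(v, r) = ntf / ndl * nidf; zero when the term does not occur."""
    if tf <= 0:
        return 0.0
    return ntf(tf) / ndl(dl, avdl, s) * nidf(n_docs, df)


def score_jrt(jrt, query_weights, avdl, n_docs, df, s=0.2):
    """Score(JRT): sum of Score(r) over all tuples r in the JRT.

    jrt: list of token lists, one per tuple r
    query_weights: keyword -> Weight(v, Q)
    df: keyword -> document frequency
    """
    total = 0.0
    for tokens in jrt:
        for keyword, query_weight in query_weights.items():
            tf = tokens.count(keyword)
            if tf:
                total += weight(tf, len(tokens), avdl, n_docs, df[keyword], s) * query_weight
    return total


if __name__ == "__main__":
    jrt = [["turing", "award"], ["stanford"]]
    print(score_jrt(jrt, {"turing": 1.0, "stanford": 1.0},
                    avdl=2, n_docs=99, df={"turing": 10, "stanford": 10}))
```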
22. Proximity-based Ranking Schemes
EASE, XRANK, BLINKS, etc.
EASE
Proximity between a pair of keywords
Overall score of a JRT is aggregation on the score of keyword pairs
XRANK
Ranking of XML documents / elements
Proximity here is defined based on w, the smallest text window in
n that contains all search keywords
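The smallest text window w that contains all search keywords can be found with a sliding window over the tokens of an element. This is a generic sketch, not XRANK's implementation; the function names are hypothetical, and turning the window length into a score via 1 / length is an assumption made only for illustration.

```python
def smallest_window(tokens, keywords):
    """Return (start, end) token indices of the smallest window containing
    every keyword at least once, or None if no such window exists."""
    needed = {k.lower() for k in keywords}
    if not needed:
        return None
    counts = {}
    covered = 0
    best = None
    left = 0
    for right, token in enumerate(tokens):
        token = token.lower()
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
            if counts[token] == 1:
                covered += 1
        # Shrink from the left while the window still covers all keywords.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            left_token = tokens[left].lower()
            if left_token in needed:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    covered -= 1
            left += 1
    return best


def window_proximity(tokens, keywords):
    """Hypothetical proximity score: 1 / window length, 0.0 if no window."""
    window = smallest_window(tokens, keywords)
    if window is None:
        return 0.0
    return 1.0 / (window[1] - window[0] + 1)


if __name__ == "__main__":
    text = "bob shared an apartment in berlin with alice".split()
    print(smallest_window(text, ["apartment", "berlin"]))
    print(window_proximity(text, ["apartment", "berlin"]))
```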
23. Prestige-based Ranking Schemes
Based on graph structure, i.e. PageRank-like
methods to determine node prestige
XRank [Guo et al, SIGMOD03]
ObjectRank [Balmin et al, VLDB04] : considers both
global ObjectRank and keyword-specific ObjectRank
The probability that edges of different types will be
visited is not uniform: requires manual fine-tuning to
set the importance of different types of edges
Naive: indegree
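To make the role of edge-type importance concrete, here is a sketch of a PageRank-style power iteration in which the random surfer follows an edge with probability proportional to a manually chosen weight for its type. It is a simplification and does not reproduce ObjectRank's authority transfer rates or its keyword-specific variant; the function name, the edge types and the weights are hypothetical.

```python
def weighted_pagerank(edges, type_weight, damping=0.85, iterations=50):
    """edges: list of (source, target, edge type); returns node -> prestige."""
    nodes = sorted({node for u, v, _ in edges for node in (u, v)})
    outgoing = {node: [] for node in nodes}
    for u, v, edge_type in edges:
        outgoing[u].append((v, type_weight[edge_type]))

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for u in nodes:
            total = sum(w for _, w in outgoing[u])
            if total == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for node in nodes:
                    new_rank[node] += damping * rank[u] / n
            else:
                # Follow each edge in proportion to the weight of its type.
                for v, w in outgoing[u]:
                    new_rank[v] += damping * rank[u] * w / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    edges = [
        ("p1", "p3", "cites"),
        ("p2", "p3", "cites"),
        ("p3", "p1", "cites"),
        ("p3", "a1", "written_by"),
    ]
    # Manually tuned importance per edge type (illustrative values).
    weights = {"cites": 0.7, "written_by": 0.3}
    for node, value in sorted(weighted_pagerank(edges, weights).items()):
        print(node, round(value, 4))
```

Changing the values in `weights` shifts prestige between papers and authors, which is the manual tuning the slide refers to.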
24. Introduction
A recent study shows that the effectiveness of most
approaches is below expectations (Coffman and Weaver,
CIKM 2010)
Problems:
Proximity does not directly model relevance
Ad-hoc TF/IDF normalization does not capture the nature
of keyword search results well (small document length,
skewed word occurrence statistics)
PageRank not directly applicable
25. Overview of the Approach
Keyword queries are short and ambiguous, while data
(and results) provide rich structure information
that can be exploited!
Principled approach to relevance based on
language models and pseudo-relevance feedback (PRF):
estimate the model from the content and structure of PRF results
Adopt relevance models as fine-grained models
representing both content and structure of
relevant documents and queries (relevance class)
26. Relevance Models [SIGIR 01]
Explicit notion of relevance
Queries and documents are samples from a latent
representation space, i.e. the relevance model underlying
the information need
27. Relevance Models
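[Figure: the query words q1 = Israeli, q2 = Palestinian and q3 = raids are sampled from models M; which word w is sampled next?]
Sample probabilities:
P(w|Q)   w
.077     palestinian
.055     israel
.034     jerusalem
.033     protest
.027     raid
.011     clash
.010     bank
.010     west
.010     troop
…
$$P(w \mid R) \approx P(w \mid q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$$
$$P(w, q_1 \ldots q_k) = \sum_{M \in U_M} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$$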
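The estimate above can be computed directly when each model M is a word distribution. The sketch below assumes that every model is a dictionary summing to 1 and that priors are uniform unless given; under that assumption, dividing by the sum of the joint probabilities equals dividing by P(q1…qk). The function name and the toy models are hypothetical.

```python
import math


def relevance_model(query, models, priors=None):
    """Estimate P(w | q1...qk) from a set of models.

    query: list of query words q1...qk
    models: list of dicts, each mapping word -> P(w | M)
    priors: optional list of P(M); uniform if omitted
    """
    if priors is None:
        priors = [1.0 / len(models)] * len(models)
    joint = {}
    for model, prior in zip(models, priors):
        # Likelihood of the whole query under this model (words i.i.d.).
        query_likelihood = math.prod(model.get(q, 0.0) for q in query)
        for word, p_word in model.items():
            joint[word] = joint.get(word, 0.0) + prior * p_word * query_likelihood
    normalizer = sum(joint.values())  # equals P(q1...qk) for normalized models
    if normalizer == 0.0:
        return {}
    return {word: value / normalizer for word, value in joint.items()}


if __name__ == "__main__":
    models = [
        {"raid": 0.5, "israeli": 0.5},
        {"bank": 0.5, "loan": 0.5},
    ]
    print(relevance_model(["israeli"], models))
```

Models that do not generate the query contribute nothing, so the estimate concentrates on words that co-occur with the query words.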
28. Ranking with Relevance Models
Probability ranking principle
$$\frac{P(D \mid R)}{P(D \mid N)} \approx \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}$$
See relevance model as query expansion
Rank of document is based on the cross-entropy of its
model and the relevance model
$$H(R \,\|\, D) = -\sum_{w \in V} P(w \mid R) \log P(w \mid D)$$
$$P(w \mid D) = \lambda_D \frac{n(w, D)}{|D|} + (1 - \lambda_D)\, P(w \mid C)$$
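A small sketch of cross-entropy ranking with the smoothed document model. Assumptions: documents are token lists, the collection model P(w|C) is given as a dictionary, a single smoothing weight `lam` plays the role of λ_D for all documents, and a lower cross-entropy means a better rank. All function names are hypothetical.

```python
import math
from collections import Counter


def smoothed_prob(word, counts, doc_len, collection_model, lam):
    """P(w|D) = lam * n(w,D)/|D| + (1 - lam) * P(w|C)."""
    maximum_likelihood = counts[word] / doc_len if doc_len else 0.0
    return lam * maximum_likelihood + (1.0 - lam) * collection_model.get(word, 0.0)


def cross_entropy(relevance_model, doc_tokens, collection_model, lam=0.5):
    """H(R||D) = -sum_w P(w|R) log P(w|D), over words with P(w|R) > 0."""
    counts = Counter(doc_tokens)
    entropy = 0.0
    for word, p_rel in relevance_model.items():
        if p_rel == 0.0:
            continue
        p_doc = smoothed_prob(word, counts, len(doc_tokens), collection_model, lam)
        if p_doc == 0.0:
            return math.inf  # word unseen in both document and collection
        entropy -= p_rel * math.log(p_doc)
    return entropy


def rank_documents(relevance_model, docs, collection_model, lam=0.5):
    """Return document names ordered by ascending cross-entropy."""
    return sorted(
        docs,
        key=lambda name: cross_entropy(relevance_model, docs[name], collection_model, lam),
    )


if __name__ == "__main__":
    collection = {"a": 0.5, "b": 0.5}
    docs = {"d1": ["a", "a"], "d2": ["b", "b"]}
    print(rank_documents({"a": 1.0}, docs, collection))
```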
29. Edge-Specific Relevance Models
Given a query Q={q1,…,qn}, a set of PRF resources is retrieved from an inverted
keyword index:
E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}
Based on the PRF results, an edge-specific relevance model is constructed for each unique
edge e based on:
30. Edge Specific Resource Models
Edge-specific resource model:
Smoothing with model for the entire resource
The score of a resource calculated based on cross-entropy
of edge-specific RM and edge-specific ResM:
Alpha controls the importance of edges
31. Ranking JRTs
Ranking aggregated JRTs:
The cross entropy between the edge-specific RM (Query Model) and
geometric mean of combined edge-specific ResM:
The proposed ranking function is monotonic with respect to the
individual resource scores (a necessary property for using top-k
algorithms)
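The monotonicity claim can be made concrete with a short derivation. It is a sketch under assumptions not stated on the slide: the geometric mean is taken unnormalized over the n resources r_1, …, r_n of a JRT, P(w | RM) denotes the query model and P(w | r_i) the model of resource r_i, and the edge weights are omitted. The symbols are illustrative.

```latex
\begin{align*}
H\bigl(RM \,\|\, JRT\bigr)
  &= -\sum_{w \in V} P(w \mid RM)\,
     \log\Bigl(\prod_{i=1}^{n} P(w \mid r_i)\Bigr)^{1/n} \\
  &= -\frac{1}{n}\sum_{i=1}^{n}\sum_{w \in V} P(w \mid RM)\,\log P(w \mid r_i) \\
  &= \frac{1}{n}\sum_{i=1}^{n} H\bigl(RM \,\|\, r_i\bigr)
\end{align*}
```

Under these assumptions the JRT score is the average of the per-resource cross-entropies, so improving any individual resource score can only improve the aggregate, which is the property top-k algorithms rely on.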
32. Experiments
Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases
Queries: 50 queries for each dataset including “TREC style” queries and
“single resource” queries
Metrics: Three metrics are used: (1) the number of top-1 relevant
results, (2) Reciprocal rank and (3) Mean Average Precision (MAP)
Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK,
CoveredDensity (TF-IDF).
RM-S: Our approach
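Two of the metrics listed above, reciprocal rank and mean average precision, can be computed as in the sketch below. It uses the standard textbook definitions with binary relevance judgments; the function names are hypothetical and this is not the evaluation framework used in the study.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked results, set of relevant results), one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["a", "b", "c", "d"], {"b", "d"}),
        (["x"], {"x"}),
    ]
    print(reciprocal_rank(*runs[0]), mean_average_precision(runs))
```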
33. Experiments – Single Resource Queries
- Proximity-based approaches perform well
- Maximizing compactness (minimal path lengths) results in single resources being ranked high
- TF-IDF normalization not as aggressive, not as effective
Reciprocal rank for single resource queries
34. Experiments – TREC-style Queries
- TF-IDF based approaches performed better
- Our approach outperformed existing approaches also in this category,
providing more stable performance over the entire precision-recall curve
Precision-recall for TREC-style queries on Wikipedia
35. Experiment – All Queries
- Our approach consistently shows superior performance
- Encouraging, given that this is the first study that uses a general
framework for evaluating keyword search ranking
MAP scores for all queries
36. Conclusions / Future Work
Front-to-backend work on using structured data for
enhancing the search experience
From backend data management to frontend search
concepts
Current work / future directions
Managing hybrid data
Hybrid query processing / interfaces
Ranking hybrid results
37. References (1)
Günter Ladwig, Thanh Tran
SIHJoin: Querying Remote and Local Linked Data
In 8th Extended Semantic Web Conference (ESWC'11). Heraklion, Greece, June, 2011 (full
research paper, 23% acceptance rate).
Thanh Tran, Lei Zhang, Rudi Studer
Summary Models for Routing Keywords to Linked Data Sources
In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
China, November, 2010 (full research paper, 20% acceptance rate).
Günter Ladwig, Thanh Tran
Linked Data Query Processing Strategies
In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
China, November, 2010 (full research paper, 20% acceptance rate).
Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer
Ontology-based Interpretation of Keywords for Semantic Search
In Proceedings of the 6th International Semantic Web Conference (ISWC'07), pp. 523-
536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).
38. References (2)
Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
In Proceedings of the 25th International Conference on Data Engineering
(ICDE'09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).
Haofen Wang, Duc Thanh Tran, Chang Liu
CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
In Proceedings of the 17th Conference on Information and Knowledge Management
(CIKM'08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).
Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu,
Yue Pan
Semplore: A Scalable IR Approach to Search the Web of Data
In Journal of Web Semantics, 2009 (Impact Factor 3.4).
Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu
Snippet Generation for Semantic Web Search Engines
In Proceedings of the 3rd Asian Semantic Web Conference (ASWC'08). December
2008 (full research paper, 31% acceptance rate).
Thanh Tran, Günter Ladwig
Structure Index for RDF
In SemData@VLDB Workshop (SemData'10). Singapore, September, 2010.
39. Thanks!
Tran Duc Thanh
ducthanh.tran@kit.edu
http://sites.google.com/site/kimducthanh/
40. Backups
41. Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search
over relational databases. In ICDE, pages 5-16.
Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and
opportunities. In VLDB, page 1368.
Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with relevance
oriented ranking. In ICDE, pages 517-528.
Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching
and Browsing in Databases using BANKS. In ICDE, pages 431-440.
Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using
Relevance Models. In CIKM.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009).
DBpedia - A crystallization point for the Web of Data. J. Web Sem., 7(3):154-165.
Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data
graphs. PVLDB, 1(1):1189-1204.
Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost
connected trees in databases. In ICDE, pages 836-845.
Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data
graphs. In SIGMOD, pages 927-940.
Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search
over XML documents. In SIGMOD.
He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In
SIGMOD, pages 305-316.
Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in
databases. ACM Trans. Database Syst., 33(1):1-40
42. Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases.
In VLDB.
Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005).
Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.
Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword
proximity search. In PODS, pages 173-182.
Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native Keyword
Search Databases. In CIKM.
Lavrenko, V. and Croft, W. B. (2001). Relevance-Based Language Models. In SIGIR, pages 120-127.
Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search
method for unstructured, semi-structured and structured data. In SIGMOD.
Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational
databases. In SIGMOD, pages 563-574.
Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational
databases. In SIGMOD, pages 115-126.
Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS. In SIGMOD,
pages 681-694.
Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across
heterogeneous relational databases. In ICDE, pages 346-355.
Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the Search
Process. In Journal of Web Semantics, 2011.
Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph Candidates
for Efficient Keyword Search on RDF. In ICDE.
Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient IR-style keyword search over
relational databases. In VLDB.
Editor's notes
Web data: text + Linked Data + semi-structured RDF + hybrid data that can be conceived as forming data graphs.
We hear about Bob and Alice all the time (in the computer science literature) and want to find out more… build a Semantic Web search engine to address complex information needs by exploiting Web data:
- Information need interpreted as a set of constraints
- Match structured data
- Match text
To give an impression of where we are towards accomplishing this goal: demo first. Our current system supports the process of addressing complex information needs: it starts with keyword search (interpreting the query intent), followed by browsing / exploration / refinement of the result set via faceted search.
- Upon selecting a specific result: resource-based navigation (instead of facet-based)
TF-IDF is used to deal with the textual part of the data. Propose to also exploit the structure of keyword search results.
Proximity-based ranking employs minimal-distance heuristics to maximize the structural compactness of results.
- When a JRT is more compact, it is assumed to be more meaningful and relevant.
- Intuition: keywords specified by the user are closely related and thus should be connected over relatively short paths.
- I.e. compactness is measured in terms of the length of paths between nodes, i.e. the proximity.
- The longer the paths, the less relevant the overall result.
ni and nj are nodes in the graph; sim(ni,nj) denotes the compactness between any two nodes.
sim(ki,kj) denotes the compactness between two keywords (taking into account the compactness of all pairs of nodes matching the two keywords), i.e. Cki denotes the set of all nodes that match ki.
Overall score of a JRT is an aggregation on the score of its