1. Keyword Search on Structured Data using
Relevance Models*
Veli Bicer
INFORMATIK
FZI Research Center for Information Technology
Karlsruhe, Germany
FZI FORSCHUNGSZENTRUM
Joint work with Thanh Tran from Semantic Search Group, AIFB
Institute, KIT
* based on the papers @ 20th ACM Conference on Information and Knowledge
Management (CIKM’11) and @ 10th International Semantic Web Conference (ISWC’11)
© FZI Forschungszentrum Informatik 1
2. About the presenter
Veli Bicer
Research Scientist at FZI Research Center for Information
Technology, Karlsruhe, Germany
Associated Researcher at Karlsruhe Service Research Institute (KSRI)
KSRI founded by IBM Germany
Research Interests
Semantic Data Management/Search
Relational Learning
Software Engineering (for Services)
Projects
German Internet Research Programme THESEUS
KOIOS Semantic Search in Core Technology Cluster
TEXO Internet-of-Services Use-case
Previously, EU ICT Artemis, Satine, Saphire and Ride
10.04.2012 © FZI Forschungszentrum Informatik 2
3. Agenda
Introduction
Keyword search on structured data
Relevance models
Approach
Ranking scheme using relevance models
Top-k Query processing
Experiments
Application
Search on environmental data
Conclusion
4. Introduction
5. Keyword Search on Structured Data
Rationale
4 billion web searches daily
Data-driven websites have relational database backend
Predefined search forms constrain retrieval
SQL difficult to learn
simplify data retrieval by not using SQL
6. Keyword Search on Structured Data
Example
Who is the character played by Audrey Hepburn in Roman Holiday?
Query result
A tree of tuples that is reduced with respect to the query.
Which would you rather write?
SELECT Character.name
FROM Person, Character, Movie
WHERE Person.id = Character.pid
AND Character.mid = Movie.id
AND Person.name = 'Audrey Hepburn'
AND Movie.title = 'Roman Holiday';
or “Hepburn Holiday”
Person
id   name
p1   Audrey Hepburn
p3   Kate Winslet
…    …
Character
id   name            pid   mid
c1   Princess Ann    p1    m1
c3   Iris Simpkins   p3    m2
…    …               …     …
Movie
id   title           plot
m1   Roman Holiday   Princess Ann is a royal princess of unknow of an …
m2   The Holiday     Iris swaps her cottage for the holiday along the next two …
m3   The Aviator     Hughes and Hepburn go to a holiday and fly together ..
…    …               …
7. Keyword Search on Structured Data
Many approaches have been proposed recently
Performance focus
Less consideration of ranking
Recent study (Coffman and Weaver, CIKM 2010)
effectiveness of previous works is below expectations
the problem lies in ranking strategies, not performance
Two major types of ranking schemes:
IR-inspired TF-IDF ranking
(Liu et al, 2006) (SPARK, 2007)
Proximity based approaches
(Banks, 2002) (Bidirectional, 2005)
Problem:
Missing a robust and principled approach!!
8. Relevance Models
Proposed by Lavrenko and Croft (SIGIR 01)
Assumes that queries and documents are
samples from a hidden representation space and
generated from the same generative model
Initial representation of relevance is unknown
Estimated from query
[Figure: three diagrams relating query Q, document D and relevance R: Classical Model, Language Model, Relevance Model]
9. Approach
10. Overview of Approach
Steps: 1 Query, 2 PRF, 3 Query RM, 4 Res. RM, 5 Res. Score, 6 Query Generation, 7 Structured Queries, 8 Top-k Query Proc., 9 Result Ranking
Word probabilities p for the query, the Query RM and the Res. RM:
words       Query   Query RM   Res. RM
hepburn     0.5     0.21       0.12
holiday     0.5     0.15       0.18
audrey      -       0.13       0.11
katharine   -       0.09       0.05
princess    -       0.01       0.00
roman       -       0.01       0.06
…           -       …          …
Res. Score: D(RMQ||RMR)
Example results:
Title                Name
Roman Holiday        Audrey Hepburn
Breakfast at Tiff.   Audrey Hepburn
The Aviator          Katharine Hepburn
The Holiday          Kate Winslet
11. Data Model
Different kinds of data
e.g. relational, XML and RDF data
Data Graph of nodes and edges (G=(V,E))
Resource nodes, attribute nodes
Every resource is typed
Resources have unique ids, (e.g. primary keys)
12. Edge-Specific Relevance Models 1 2 3
A set of feedback resources FR is retrieved from an inverted keyword index:
E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}
Edge-specific relevance model for each unique edge e:
Probability of word at resource
Importance of resource w.r.t. query
Inverted Index
keyword     resources
princess    m1, c1
breakfast   m3
hepburn     m3, p1, p4, c2
melbourne   p2
iris        c3
holiday     m1, m2, m3
ann         m1, c2
…           …
[Figure: Inverted Index, FR, Edge-specific Relevance Models. Example feedback resources: p1 with name "Audrey Hepburn" and birthplace "Ixelles Belgium"; m3 with title "The Holiday" and plot "Iris swaps her cottage for the holiday along the next two …"]
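The retrieval of feedback resources can be illustrated with a short sketch. It assumes the inverted index is a plain mapping from keyword to resource ids (the entries are copied from the slide); the function name `feedback_resources` is hypothetical and only shows that FR is the union of the posting lists of the query keywords.

```python
# Toy inverted index copied from the slide (keyword -> resource ids).
INVERTED_INDEX = {
    "princess": ["m1", "c1"],
    "breakfast": ["m3"],
    "hepburn": ["m3", "p1", "p4", "c2"],
    "melbourne": ["p2"],
    "iris": ["c3"],
    "holiday": ["m1", "m2", "m3"],
    "ann": ["m1", "c2"],
}


def feedback_resources(query, index):
    """Return the union of the posting lists of all query keywords (first-seen order)."""
    seen = []
    for keyword in query:
        for resource in index.get(keyword.lower(), []):
            if resource not in seen:
                seen.append(resource)
    return seen


# Q = {Hepburn, Holiday} yields the six resources listed on the slide.
fr = feedback_resources(["Hepburn", "Holiday"], INVERTED_INDEX)
```

With this index, `fr` contains the same six resources as the FR set given above, only in posting-list order.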
13. Edge Specific Resource Models 4 5
Each resource (a tuple) is also represented as an RM
…as final results (joint tuples) are obtained by combining resources
Edge-specific resource model:
The score of resource: cross-entropy of edge-specific RM and
ResM:
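The formula itself is not preserved in this transcript. As a rough sketch only, assuming the textbook cross-entropy H(p, q) = -Σ p(w) log q(w) between the query-side model p and the resource model q for a single edge, the comparison could look as follows. The function name, the probability floor for unseen words and the toy distributions are illustrative assumptions, not values from the slides.

```python
import math


def cross_entropy(query_rm, resource_rm, floor=1e-9):
    """H(query_rm, resource_rm) = -sum_w p(w) * log q(w); lower means a closer match.

    `floor` is an assumed stand-in for smoothing: it avoids log(0) for words
    that the resource model does not contain.
    """
    return -sum(
        p * math.log(max(resource_rm.get(word, 0.0), floor))
        for word, p in query_rm.items()
        if p > 0
    )


# Toy models for one edge (hypothetical numbers).
query_rm = {"hepburn": 0.5, "holiday": 0.5}
close_resource = {"hepburn": 0.4, "holiday": 0.4, "audrey": 0.2}
distant_resource = {"hepburn": 0.1, "iris": 0.9}

close = cross_entropy(query_rm, close_resource)
distant = cross_entropy(query_rm, distant_resource)
```

In this toy setting the resource whose model covers both query words receives the lower cross-entropy, which is the behaviour a ranking based on this measure relies on.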
14. Smoothing
Well-known technique to address data sparseness and improve
accuracy of RMs (and LMs)
The probability of a word at a resource is the core probability for both query and resource RMs
Local smoothing
Neighborhood of attribute a is another attribute a’:
a and a’ share the same resources
resources of a and a’ are of the same type
resources of a and a’ are connected over a FK
Neighborhood of a
15. Smoothing
Example: P_name(v | p1), four columns of values as shown on the slide
words         P_name(v | p1)
audrey        0.5   0.4   0.37   0.36
hepburn       0.5   0.4   0.39   0.38
ixelles       -     0.1   0.09   0.08
belgium       -     0.1   0.09   0.08
katharine     -     -     0.02   0.01
connecticut   -     -     0.02   0.01
usa           -     -     0.02   0.01
princess      -     -     -      0.035
ann           -     -     -      0.035
[Figure: data graph. p1 (type Person): name "Audrey Hepburn", birthplace "Ixelles Belgium"; p4 (type Person): name "Katharine Hepburn", birthplace "Connecticut USA"; c1 (type Character): name "Princess Ann", connected via pid_fk]
Smoothing of each type is controlled
by weights:
where γ1, γ2, γ3 are control parameters
set in experiments
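The weighting formula is missing from this transcript. One common way to combine a model with its neighbours is linear interpolation, and the sketch below assumes exactly that; it is not necessarily the formula used in the papers. The function `smooth` and the weight 0.2 are hypothetical. With that assumed weight, mixing the name model of p1 with its birthplace model reproduces the second column of values in the example (0.4, 0.4, 0.1, 0.1).

```python
def smooth(base, neighbours, weights):
    """Linearly interpolate a word distribution with neighbour distributions.

    `weights` holds one weight per neighbour (playing the role of the control
    parameters); the base model keeps the remaining probability mass.
    """
    remaining = 1.0 - sum(weights)
    vocabulary = set(base)
    for neighbour in neighbours:
        vocabulary.update(neighbour)
    return {
        word: remaining * base.get(word, 0.0)
        + sum(g * nb.get(word, 0.0) for g, nb in zip(weights, neighbours))
        for word in vocabulary
    }


# Name and birthplace values of p1, each as a uniform word distribution.
name_model = {"audrey": 0.5, "hepburn": 0.5}
birthplace_model = {"ixelles": 0.5, "belgium": 0.5}

# Assumed weight 0.2 for the "same resource" neighbourhood.
smoothed = smooth(name_model, [birthplace_model], [0.2])
```

Further neighbourhoods (resources of the same type, resources connected over a foreign key) would be added as extra entries in `neighbours` with their own weights.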
16. Ranking JRTs 9
Ranking aggregated JRTs:
Cross entropy between edge-specific RM (Query Model) and geometric
mean of combined edge-specific ResM:
The proposed score is monotonic w.r.t. individual resource scores
…a desired property for most of top-k algorithms
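The monotonicity claim can be checked numerically. The sketch below assumes the textbook cross-entropy and an unnormalised geometric mean of the resource models; under these assumptions the score of a combined result equals the average of the individual cross-entropies, so improving one resource can only improve the combined score. All names, the probability floor and the toy models are illustrative.

```python
import math

FLOOR = 1e-9  # assumed stand-in for smoothing, avoids log(0)


def cross_entropy(query_rm, model):
    """-sum_w p(w) * log q(w), with a floor for words missing in `model`."""
    return -sum(
        p * math.log(max(model.get(w, 0.0), FLOOR))
        for w, p in query_rm.items()
        if p > 0
    )


def geometric_mean_model(models, vocabulary):
    """Word-wise geometric mean of several resource models (not renormalised)."""
    n = len(models)
    return {
        w: math.prod(max(m.get(w, 0.0), FLOOR) for m in models) ** (1.0 / n)
        for w in vocabulary
    }


def jrt_score(query_rm, resource_models):
    """Cross-entropy between the query model and the combined resource model."""
    combined = geometric_mean_model(resource_models, query_rm.keys())
    return cross_entropy(query_rm, combined)


query_rm = {"hepburn": 0.5, "holiday": 0.5}
person = {"hepburn": 0.5, "audrey": 0.5}
movie = {"holiday": 0.6, "roman": 0.4}

combined_score = jrt_score(query_rm, [person, movie])
average_score = (cross_entropy(query_rm, person) + cross_entropy(query_rm, movie)) / 2
```

Because the logarithm of a geometric mean is the mean of the logarithms, `combined_score` and `average_score` agree up to floating-point error.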
17. Query Translation* 6 7
Mapping of keywords to data elements
Result in a set of keyword elements
Data Graph exploration
Search for substructures (query graph) connecting keyword elements
Bi-directional exploration of query graphs operates on summary of data graph only
Top-k computation
Search guided by a scoring function to output only the top-k queries
Query graphs to be processed
Free vs. Non-free variables
[Figure: keyword elements for Hepburn (name) and Holiday (title) at resources p4, p1, m1 and m3; Summary Graph with classes Person, Character, Movie, Location, Producer and Studio and edges pid_fk, mid_fk, bornIn, Is-a, hasDist, hasLoc and worksFor; resulting query graph ?p (type Person, name Hepburn) -pid_fk- ?c (type Character) -mid_fk- ?m (type Movie, title Holiday)]
*[Tran et al. ICDE’09]
18. Top-k Query Processing 8
Top-k query processing (TQP) is highly common in Web-
accessible databases
return K highest-ranked answers
avoid unnecessary accesses to database
TQP assumes
Scoring function and attribute values to be known a-priori (e.g. RankJoin)
Combine attribute values by aggregation function
Sorted access (SA), random access (RA) probes
How to adapt TQP to return top-k relevant results?
Results are joined sets of resources
Scores are query-dependent
No indexing is possible
Idea:
Retrieve resources for non-free variables and rank
Use SA on those initially retrieved resources
Use RA to find other resources
19. Top-k Query Processing
Result candidate c = <(x1,…,xk), score>
complete when all variables are bound to some resources
xi = * indicates that xi is unbound
Binding operator
c’ = (c, xi → ri)
Threshold determines upper bound for unseen resources
Scheduling between SA and RA
Tight bound is desired
Query graph: ?p (type Person, name Hepburn) -pid_fk- ?c (type Character) -mid_fk- ?m (type Movie, title Holiday)
Example with Output K=1
Threshold: 0.50
Priority Queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>
Person (id, name, S(r)): p1 Audrey Hepburn 0.20; p3 Katharine Hepburn 0.18; p5 Philip Hepburn 0.13; p6 Anna Hepburn 0.12
Character (id, name, S(r)), value shown above the table: 0.11: c1 Princess Ann; c2 Katharine Hepburn; c3 Iris Simpkins; c4 Louise (no scores known yet)
Movie (id, title, S(r)): m2 The Holiday 0.19; m1 Roman Holiday 0.18; m3 Holiday Blues 0.09; m4 Family Holiday 0.08
20. Top-k Query Processing (continued)
Threshold: 0.48
Priority Queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
Person (id, name, S(r)): p1 Audrey Hepburn 0.20; p3 Katharine Hepburn 0.18; p5 Philip Hepburn 0.13; p6 Anna Hepburn 0.12
Character (id, name, S(r)), value shown above the table: 0.11: c1 Princess Ann; c2 Katharine Hepburn; c3 Iris Simpkins; c4 Louise (no scores known yet)
Movie (id, title, S(r)): m2 The Holiday 0.19; m1 Roman Holiday 0.18; m3 Holiday Blues 0.09; m4 Family Holiday 0.08
21. Top-k Query Processing (continued)
Threshold: 0.47
Priority Queue: <(*,*,m2),0.50>, <(p1,c1,*),0.49>, <(p3,*,*),0.48>
Person (id, name, S(r)): p1 Audrey Hepburn 0.20; p3 Katharine Hepburn 0.18; p5 Philip Hepburn 0.13; p6 Anna Hepburn 0.12
Character (id, name, S(r)), value shown above the table: 0.10: c1 Princess Ann 0.10; c2 Katharine Hepburn; c3 Iris Simpkins; c4 Louise
Movie (id, title, S(r)): m2 The Holiday 0.19; m1 Roman Holiday 0.18; m3 Holiday Blues 0.09; m4 Family Holiday 0.08
22. Top-k Query Processing (continued)
Threshold: 0.46
Priority Queue: <(p1,c1,*),0.49>, <(p3,*,*),0.48>, <(*,c3,m2),0.44>
Person (id, name, S(r)): p1 Audrey Hepburn 0.20; p3 Katharine Hepburn 0.18; p5 Philip Hepburn 0.13; p6 Anna Hepburn 0.12
Character (id, name, S(r)), value shown above the table: 0.09: c1 Princess Ann 0.10; c2 Katharine Hepburn; c3 Iris Simpkins 0.05; c4 Louise
Movie (id, title, S(r)): m2 The Holiday 0.19; m1 Roman Holiday 0.18; m3 Holiday Blues 0.09; m4 Family Holiday 0.08
23. Top-k Query Processing (continued)
Threshold: 0.46
Priority Queue: <(p3,*,*),0.48>, <(*,c3,m2),0.44>
Output K=1: <(p1,c1,m1),0.48>
Person (id, name, S(r)): p1 Audrey Hepburn 0.20; p3 Katharine Hepburn 0.18; p5 Philip Hepburn 0.13; p6 Anna Hepburn 0.12
Character (id, name, S(r)), value shown above the table: 0.09: c1 Princess Ann 0.10; c2 Katharine Hepburn; c3 Iris Simpkins 0.05; c4 Louise
Movie (id, title, S(r)): m2 The Holiday 0.19; m1 Roman Holiday 0.18; m3 Holiday Blues 0.09; m4 Family Holiday 0.08
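The candidate scores in this example are consistent with a simple bookkeeping rule: add the scores of the bound resources and, for every unbound variable, an upper bound (0.20 for ?p, the value shown above the Character table for ?c, 0.19 for ?m). The sketch below replays a few candidates under that assumption. Summation is inferred from the numbers on the slides, and the helper names are hypothetical; the approach itself derives scores from relevance models.

```python
import heapq

# Resource scores S(r) copied from the example tables; the Character scores
# are the two values that become known during the example.
SCORES = (
    {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12},  # ?p (Person)
    {"c1": 0.10, "c3": 0.05},                          # ?c (Character)
    {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08},  # ?m (Movie)
)


def candidate_score(candidate, bounds):
    """Sum scores of bound variables plus the upper bound of each unbound ('*') one."""
    total = 0.0
    for binding, table, bound in zip(candidate, SCORES, bounds):
        total += bound if binding == "*" else table[binding]
    return round(total, 2)


def bind(candidate, position, resource):
    """Binding operator: return a new candidate with one more variable bound."""
    updated = list(candidate)
    updated[position] = resource
    return tuple(updated)


def is_complete(candidate):
    """A candidate is complete when no variable is left unbound."""
    return "*" not in candidate


# Replay three candidates; the queue is a max-heap via negated scores.
queue = []
start = bind(("*", "*", "*"), 0, "p1")
heapq.heappush(queue, (-candidate_score(start, (0.20, 0.11, 0.19)), start))
step = bind(start, 1, "c1")
heapq.heappush(queue, (-candidate_score(step, (0.20, 0.10, 0.19)), step))
final = bind(step, 2, "m1")
heapq.heappush(queue, (-candidate_score(final, (0.20, 0.10, 0.19)), final))
best_negated_score, best = heapq.heappop(queue)
```

The three candidates score 0.50, 0.49 and 0.48, matching the queue entries and the final output shown in the example.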
24. Experiments
25. Experiments
Datasets: Subsets of Wikipedia, IMDB and Mondial Web
databases
Queries: 50 queries for each dataset including “TREC style”
queries and “single resource” queries
Metrics: Three metrics are used: (1) the number of top-1 relevant
results, (2) Reciprocal rank and (3) Mean Average Precision
(MAP)
Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK,
CoveredDensity (TF-IDF).
RM-S: Our approach
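For reference, the two rank-based metrics can be computed as in the sketch below, which uses the standard definitions of reciprocal rank and (mean) average precision. The function names and the toy rankings are illustrative and unrelated to the reported results.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks at which relevant results appear."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy rankings (hypothetical result ids).
rr = reciprocal_rank(["a", "b", "c"], {"b"})
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
map_score = mean_average_precision([(["a", "b"], {"a"}), (["a", "b"], {"b"})])
```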
26. Experiments
MAP scores for all queries
Reciprocal rank for single
resource queries
27. Experiments
Precision-recall for TREC-style queries on Wikipedia
28. Application
29. Large amount of environmental data
Environmental issues stir public interest
Increase transparency, awareness, responsibility, protection
Growing amount of data
Public access through EU directive 2003/4/EC
PortalU (Germany) http://www.portalu.de/
EDP (UK) http://www.edp.nerc.ac.uk
Envirofacts (USA) http://www.epa.gov/enviro/index.html
Linking data in international context
Local government environmental databases as part of the LOD cloud
Linked environment data for the life sciences
30. Opportunity: mass dissemination and
consumption of environmental data
The percentage of people who actively find environmental
information is significantly lower than the percentage of those with
frequent access to it!
Complex results
CO emission values around Karlsruhe area in Germany
Analytics
CO emission values around Karlsruhe area in Germany
Sorted by year
Bar chart
Emission values of US and Germany
Compare average
Timeline visualization
31. KOIOS – Overview
A semantic search system
Exploit semantics in the data for keyword interpretation to hide
complexity of query languages and data representation
Keyword search for searching structured data
Lower access barriers while enabling richness of data to be fully
harnessed
Contribution
Transfer research results to commercial EIS
Selector mechanism
Process
Input: keywords
Facet-based refinement
Selector (result and view template) initialization
Output: query results embedded in specific views
33. Facets generation
Derive facets from query results (not from query!) for refinement
Attributes serve as facet categories
Attribute values as facet values
E.g. for ?s
Statistics.description: “CO-Emission , PKW”, “CO-Emission , LKW”…
Value.year: 2005,2006,…
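As an illustration of deriving facets from query results, the sketch below treats each result as a mapping from attribute to value and collects the distinct values per attribute. The row layout and the function name `generate_facets` are assumptions made for the example; only the attribute names and sample values come from the slide.

```python
from collections import defaultdict


def generate_facets(results):
    """Attributes become facet categories, their distinct values become facet values."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    return {attribute: sorted(values) for attribute, values in facets.items()}


# Hypothetical result rows using the attributes named on the slide.
results = [
    {"Statistics.description": "CO-Emission , PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission , LKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission , PKW", "Value.year": 2006},
]

facets = generate_facets(results)
```

Because the facets are computed from the rows that were actually returned, every facet value offered for refinement is guaranteed to match at least one result.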
34. Selectors
Selector: parameterized, predefined result and view templates
Data parameters: specify scope of information need, initialized to
particular values based on facet categories and values
Query parameter: additional data processing for analysis tasks
(GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
Presentation parameter: visualization types (data value, data series,
data table, map-based, specific diagram type, etc.)
35. Selector initialization
Selectors
capture templates for information needs and presentation of their
results
Map facets to selectors and initialize them
Applicable selectors: cover facet categories
Initialize selectors based on facet values
Initialized values are captured in the WHERE clause
Non-initialized parameters are included in the SELECT clause
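The last two points can be sketched as a small query builder: parameters that received a facet value go into the WHERE clause, the remaining ones into the SELECT clause. The table name, parameter names and the function `selector_to_sql` are hypothetical, and the use of `?` placeholders is an assumption about how such a query might be issued.

```python
def selector_to_sql(table, parameters, facet_values):
    """Build a parameterised SQL string from a selector's data parameters.

    Initialized parameters (those with a facet value) become WHERE conditions;
    non-initialized parameters become SELECT columns.
    """
    select_columns = [p for p in parameters if p not in facet_values]
    conditions = [(p, facet_values[p]) for p in parameters if p in facet_values]
    sql = f"SELECT {', '.join(select_columns) or '*'} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column, _ in conditions)
    return sql, [value for _, value in conditions]


# Hypothetical selector over an emissions table, initialized from one facet value.
sql, params = selector_to_sql(
    "emissions",
    ["description", "year", "value"],
    {"description": "CO-Emission , PKW"},
)
```

Passing the facet values as separate parameters rather than inlining them keeps the generated statement safe to execute with any DB-API style driver.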
36. Deployment
Hippolytos project (Theseus)
Easy access to spatial data warehouse (disy Cadenza) built for the domain of environmental administration
Data about emission and waste from Baden-Württemberg
Provided by: Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg and Statistisches Landesamt Baden-Württemberg
40. Conclusions
Keyword search on structured data is a popular problem for
which various solutions exist.
We focus on the aspect of result ranking, providing a principled
approach that employs relevance models.
Experiments show that RMs are promising for searching
structured data.
Top-k query processing is proposed to retrieve only the most relevant
results
Application on environmental data enables intuitive
Access
Visualization
Analysis of environmental information!
41. Thank you for your attention!
Questions?
42. Opportunity: mass dissemination and
consumption of environmental data
Increase transparency, awareness, responsibility, protection
43. Challenges: intuitive access and visualization of
structured environmental data and analytics
The percentage of people who actively find environmental
information is significantly lower than the percentage of those
with frequent access to it!
Complex structured queries
Knowledge of the underlying data /
query language
Complex structured data
Heterogeneity and distribution of
environmental data is overwhelming
Complex structured results
Understanding results and
extracting relevant information /
analytics are difficult tasks
44. KOIOS
Semantic search system, KOIOS, for intuitive access, analysis,
and visualization of structured environmental information
Overview and architecture
Structured query generation
from keywords
Facet-based browsing and
refinement
Selector initialization for final
result and view construction
Implementation and deployment
Conclusions
45. Conclusions
Replace predefined forms and hard-coded visualization
Semantic search using lightweight semantics in data and
schema to dynamically
Translate keywords to queries
Generate facets for results
Initialize result and presentation templates
Enables intuitive
Access
Visualization
Analysis of environmental information!
46. Inverted Index
princess m1, c1
breakfast m3
hepburn m3,p1,p4,c2
melbourne p2
iris c3
holiday m1,m2,m3
breakfast m3
ann m1,c2
………. … …….
04.04.2011 © FZI Forschungszentrum Informatik 49
47. Ranking Schemes
Proximity between keyword nodes
EASE:
XRank:
w is the smallest text window in n that contains all search keywords
2012-4-10
SIGMOD09 Tutorial 50
48. Ranking Schemes
Based on graph structure
BANKS
Nodes:
Edges :
PageRank-like methods
XRank [Guo et al, SIGMOD03]
ObjectRank [Balmin et al, VLDB04] : considers both
Global ObjectRank and Keyword-specific
ObjectRank
49. Ranking Schemes
TF*IDF based:
$$\mathrm{Score}(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}$$
Discover/EASE
[Liu et al, SIGMOD06]
SPARK
but not at the node level
50. Relevance Models
[Figure: a Relevance Model with models M generating the sample q1 = israeli, q2 = palestinian, q3 = raids and an unknown word w = ???]
Probabilities
P(w|Q)   w
.077     palestinian
.055     israel
.034     jerusalem
.033     protest
.027     raid
.011     clash
.010     bank
.010     west
.010     troop
…        …
$$P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$$
$$P(q_1 \ldots q_k \mid w) = \prod_{q} \sum_{M} P(q \mid M)\, P(M \mid w)$$
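The estimate above can be traced in a few lines of code. The sketch assumes a small set of document language models M with uniform priors, obtains P(M | w) via Bayes' rule, and normalises over the vocabulary instead of computing P(q1...qk) explicitly. The function name and the toy models are illustrative; without smoothing, words that never co-occur with a query word receive probability zero.

```python
def relevance_model(query, doc_models):
    """Estimate P(w | q1...qk) proportional to P(w) * prod_q sum_M P(q | M) P(M | w).

    `doc_models` is a non-empty list of word distributions, one per model M.
    Uniform priors P(M) are an assumption of this sketch.
    """
    prior = 1.0 / len(doc_models)
    vocabulary = set()
    for model in doc_models:
        vocabulary.update(model)

    scores = {}
    for w in vocabulary:
        p_w = sum(model.get(w, 0.0) * prior for model in doc_models)
        if p_w == 0.0:
            continue
        score = p_w
        for q in query:
            # P(M | w) = P(w | M) P(M) / P(w)
            score *= sum(
                model.get(q, 0.0) * (model.get(w, 0.0) * prior / p_w)
                for model in doc_models
            )
        scores[w] = score

    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total > 0 else {}


# Toy document models (hypothetical numbers).
docs = [
    {"hepburn": 0.4, "holiday": 0.3, "audrey": 0.3},
    {"hepburn": 0.5, "katharine": 0.5},
    {"holiday": 0.5, "iris": 0.5},
]
rm = relevance_model(["hepburn", "holiday"], docs)
```

In this toy collection "audrey" gains probability mass because it occurs in the only model that contains both query words, while "katharine" and "iris" drop to zero.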
Editor's notes: Top-K queries are a long-studied topic in the database and information retrieval communities. The main objective of these queries is to return the K highest-ranked answers quickly and efficiently. A Top-K query returns the subset of most relevant answers, instead of ALL answers, for two reasons: i) to minimize the cost metric that is associated with the retrieval of all answers (e.g., disk, network, etc.); ii) to maximize the quality of the answer set, such that the user is not overwhelmed with irrelevant results.