Effective XML Keyword Search with Relevance Oriented Ranking
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu
Introduction
• XML Keyword search
– Inspired by IR-style keyword search on the web
– Enables users to access information in an XML database
– XML data modeled as a rooted, labeled tree
– Recent research efforts
• Efficiency
• Effectiveness
Effectiveness
• Capture the user’s search intention
– Identify the target that the user intends to search for
– Infer the predicate constraint that the user intends to search via
• Result ranking
– Rank the query results according to their objective relevance to the user’s search intention
State of the Art
• Search semantics design
– LCA (Lowest Common Ancestor)
• Node v is an LCA of keyword set K={w1, w2,…,wk} if the sub-tree rooted at v contains at least one occurrence of all keywords in K, after excluding the sub-elements that already contain all keywords in K
– SLCA (Smallest LCA)
• Node v is an SLCA of keyword set K={w1, w2,…,wk} if
– (1) v is an LCA of K
– (2) no proper descendant of v is an LCA of K
– XSeek
• Infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes
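To make the SLCA semantics concrete, here is a minimal brute-force Python sketch (not the efficient XKSearch or Multiway-SLCA algorithms) that computes SLCAs over Dewey-style node paths; the Dewey ids in the example are hypothetical.

```python
from itertools import product

def lca(p1, p2):
    """Longest common prefix of two Dewey-style paths = lowest common ancestor."""
    common = []
    for a, b in zip(p1, p2):
        if a != b:
            break
        common.append(a)
    return tuple(common)

def slca(match_lists):
    """match_lists: for each keyword, the Dewey paths of nodes containing it.
    Returns the LCAs that have no proper descendant which is also an LCA."""
    lcas = set()
    for combo in product(*match_lists):      # brute force over keyword-match combinations
        node = combo[0]
        for p in combo[1:]:
            node = lca(node, p)
        lcas.add(node)
    def is_anc(a, d):
        return len(a) < len(d) and d[:len(a)] == a
    return {v for v in lcas if not any(is_anc(v, u) for u in lcas if u != v)}

# "rock" and "music" both occur in the value under customer C3's interest element
# (hypothetical Dewey id), so the SLCA is that interest node itself.
rock, music = [(0, 1, 2, 3, 0)], [(0, 1, 2, 3, 0)]
print(slca([rock, music]))                   # -> {(0, 1, 2, 3, 0)}
```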
State of the Art (cont)
• Efficient result retrieval
– Designed based on a certain search semantics
– XKSearch, Multiway SLCA etc.
• Result ranking
– XRANK, XKSearch, EASE
– They only consider
• Structural compactness of matching results
• Keyword proximity
• Similarity at node level
Problems Unaddressed
• Existing approaches do not address the user’s search intention adequately!
– Meaningfulness of query result
• SLCA is less meaningful in many cases
– Keyword Ambiguity Problems
1. A keyword can appear both as an XML node type and as the text value of some other nodes
2. A keyword can appear in the text values of different XML node types and carry different meanings
– Neither SLCA nor XSeek can address keyword ambiguity well
Meaningfulness
• Keyword query “rock music”
– Search intention: find customers interested in “rock music”, i.e. customer C3
– SLCA returns only the interest node of C3
[Figure: sample XML tree of storeDB — customers C1–C4 (each with ID, name, interests; C1 also with address and contact, plus purchases) and books B1 “Art of Customer Interest Care”, B2 (ID, title, authors, publisher)]
Keyword Ambiguity
• Q = “customer, interest, art”
– Ambiguity 1: customer, interest; Ambiguity 2: art
– Intention: find customers whose interest is art
– Less relevant or irrelevant results are also returned: C1, C3, and B1’s title
[Figure: the same sample XML tree of storeDB as on the previous slide]
Keyword Ambiguity (cont)
• Q = “customer, art”
– “art” can be the value of an interest node (C2, C4), a name node (C3), a street node of a customer (C1), or the title node of a book (B1)
– “customer” can be the tag name of a customer node, or (part of) the value of B1’s title
– How to rank C1 to C4 and B1?
[Figure: the same sample XML tree of storeDB as on the previous slides]
Objectives & Challenges
• Challenges
I. How to decide which sub-tree(s) with appropriate node types can
capture user desired information
II. How to return sub-trees of an appropriate size (i.e. contain enough
but non-overwhelming information)
III. How to rank those sub-trees by their relevance
• Address the following as a single problem
– Search intention identification
– Query result retrieval
– Result ranking
– Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
Challenges
Difficulty in applying TF*IDF to XML
XML DB carries semantic information while text DB
contains pure text information. XML TF*IDF must be
aware of the underlying semantics.
All contents of XML data are stored in leaf nodes only
What is the analogy of a “flat document” in XML?
o Sub-trees classified according to their prefix paths
The normalization factor is not simply the size of the sub-tree
o The structure of sub-trees may also affect the ranking
TF*IDF Recap
• Rule 1: A keyword appearing in many documents should
not be regarded as more important than a keyword
appearing in a few. --- IDF
• Rule 2: A document with more occurrences of a query
keyword should not be regarded as less important for
that keyword than a document that has fewer. --- TF
• Rule 3: A normalization factor is needed to balance
between long and short documents
– as Rule 2 discriminates against short documents which may
have less chance to contain more occurrences of keywords.
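As a reference point for the XML extension that follows, here is a minimal Python sketch of one common TF*IDF variant embodying the three rules; the exact weighting differs across IR systems, so this is an illustration, not the paper’s formula.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Score docs for a query: IDF damps frequent keywords (Rule 1), TF rewards
    repeated occurrences (Rule 2), the document norm balances lengths (Rule 3)."""
    N = len(docs)
    dfreq = Counter(w for d in docs for w in set(d.lower().split()))
    scores = []
    for d in docs:
        tf = Counter(d.lower().split())
        w = {t: (1 + math.log(tf[t])) * math.log(1 + N / (1 + dfreq[t])) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0      # Rule 3
        scores.append(sum(w.get(t, 0.0) for t in query.lower().split()) / norm)
    return scores

print(tfidf_rank("rock music", ["rock music fan", "art street", "music music music"]))
```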
Our Approach
– Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, in order to capture the hierarchical structure of XML documents
• by analyzing statistics of the underlying XML data
– Major Contributions
1. Identify user’s desired search-for node and search-via node(s) in
a heuristic way
 Define XML TF (term frequency) and XML DF (document frequency)
 Confidence Formulas for search for/via candidates
2. Define XML TF*IDF Similarity
 Propose 3 guidelines specifically for XML keyword search
 Take the keyword ambiguity problems into account
3. Design a keyword search engine, XReal
Data Model
• Node type - Two nodes are of the same node type if they share the same prefix path
/storeDB/customers/customer/name vs.
/storeDB/books/book/publisher/name
[Figure: the same sample XML tree of storeDB, used to illustrate node types, value nodes and structural nodes]
• Value node – the text value contained in a leaf node
• Structural node – an XML node labeled with a tag name
 Single-valued node type, multi-valued node type
 Grouping type – all its children are of the same multi-valued type
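A minimal sketch (using Python’s standard xml.etree.ElementTree; the tiny document below is a cut-down, hypothetical version of the storeDB example) of how nodes can be grouped by their prefix path, i.e. by node type:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

xml = """
<storeDB>
  <customers>
    <customer><ID>C1</ID><name>Mary Smith</name>
      <interests><interest>fashion</interest></interests></customer>
    <customer><ID>C3</ID><name>Art Smith</name>
      <interests><interest>rock music</interest></interests></customer>
  </customers>
</storeDB>
"""

def nodes_by_type(root):
    """Group every element by its prefix path, i.e. by node type."""
    groups = defaultdict(list)
    def walk(elem, path):
        path = path + "/" + elem.tag
        groups[path].append(elem)
        for child in elem:
            walk(child, path)
    walk(root, "")
    return groups

groups = nodes_by_type(ET.fromstring(xml))
for node_type, elems in groups.items():
    print(node_type, len(elems))
# /storeDB/customers/customer/name and /storeDB/books/book/publisher/name would be
# different node types because their prefix paths differ, even though the tag is "name".
```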
XML TF and IDF
• XML DF (document frequency) $f_k^T$
– The number of T-typed nodes that contain keyword k in their sub-trees in the XML database
• Granularity of similarity measurement is sub-trees of a certain node type T
• XML TF (term frequency) $f_{a,k}$
– The number of occurrences of a keyword k in a given value node a in the XML database
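Continuing the sketch above, the two statistics could be collected roughly as follows (keywords are naively taken as whitespace-separated lowercase words; this is an illustration, not the paper’s index-building code):

```python
from collections import Counter, defaultdict

def xml_tf_df(groups):
    """groups: node_type -> list of elements, as built by nodes_by_type() above.
    Returns
      tf[id(a)][k] = f_{a,k}: occurrences of keyword k in value node a
      df[T][k]     = f_k^T  : number of T-typed nodes containing k in their sub-tree"""
    tf = defaultdict(Counter)
    df = defaultdict(Counter)
    for node_type, elems in groups.items():
        for elem in elems:
            if len(elem) == 0:                 # leaf element: its text is a value node
                tf[id(elem)] = Counter((elem.text or "").lower().split())
            keywords_in_subtree = set()
            for desc in elem.iter():           # every keyword anywhere under this node
                keywords_in_subtree.update((desc.text or "").lower().split())
            for k in keywords_in_subtree:
                df[node_type][k] += 1
    return tf, df
```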
Infer the desired search-for node
• Guidelines: A node type T is considered as a desired search-for node if
1. T is intuitively related to every query keyword
2. XML nodes of type T are informative enough to contain enough relevant information
3. XML nodes of type T are not so overwhelming as to contain too much irrelevant information
• Confidence of T as the search-for node w.r.t. query q:

$$C_{for}(T, q) = \log_e\Big(1 + \prod_{k \in q} f_k^T\Big) \cdot r^{\,depth(T)}$$

• The product (instead of a sum) is used to follow the 1st guideline
• The log part is designed to follow the 3rd guideline
• The exponential part is designed to follow the 2nd guideline
• r is a decay factor in (0, 1]
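A direct Python transcription of C_for, assuming the df statistics sketched earlier; the default r = 0.8 follows the value mentioned in the speaker notes:

```python
import math

def c_for(T, q, df, depth_of_T, r=0.8):
    """Confidence of node type T as the search-for node for query q.
    df[T][k] = f_k^T; depth_of_T = depth of node type T in the tree."""
    prod = 1
    for k in q:
        prod *= df[T][k]      # becomes 0 if any keyword never occurs under T (guideline 1)
    return math.log(1 + prod) * r ** depth_of_T
```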
Infer the Search-Via Nodes
• Infer structural node to search via
– Structural node n is a good candidate if it is related to as many
(but not necessarily all) keywords as possible
• The search-via node type is normally not unique
• Infer individual value nodes to search via
– Statistics alone is not adequate to infer the likelihood of a value node being (part of) a search-via node
– Capture keyword co-occurrence
$$C_{via}(T, q) = \log_e\Big(1 + \sum_{k \in q} f_k^T\Big)$$
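The structural search-via confidence is the same statistic with a sum in place of the product, so a node type related to many (but not necessarily all) keywords still scores well. A one-line sketch under the same assumptions as before:

```python
import math

def c_via_struct(T, q, df):
    """Confidence of structural node type T as a search-via node for query q."""
    return math.log(1 + sum(df[T][k] for k in q))
```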
[Figure: the same sample XML tree of storeDB; customer C3 has name “Art Smith” and interest “rock music”, while customer C4 has name “Rock Davis” and interest “art”]
• E.g. Q = “customer, name, rock, interest, art”
 It is easy to find that name and interest have high confidence to be the search-via nodes
 But it is hard to know whether rock is a value of name or of interest, and whether art is a value of interest or of name
 How to differentiate customer C4 from C3? Capture keyword co-occurrence.
Capture keyword co-occurrence
• Proximity factors for a value node v containing keyword k, w.r.t. a node type kt
– Given a query q and a certain value node v, suppose there are two keywords kt and k in q such that kt matches the type of an ancestor node of v and k matches a keyword in v
– In-Query distance
• Distance between keyword k and node type kt in query q
• Favors the case where kt appears before k
– Structural distance
• Depth distance between v and the nearest kt-typed ancestor node of v
– Value-Type distance
• Max of the above two
$$C_{via}(q, v, k) = 1 + \sum_{k_t \in q \,\cap\, ancType(v)} \frac{1}{Dist(q, v, k, k_t)}$$
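A hedged Python sketch of the distances and the value-node confidence. The slide does not specify the exact penalty when kt appears after k, so the doubling below is an assumption, and anc_types is a hypothetical precomputed map from q ∩ ancType(v) to depth distances:

```python
def in_query_dist(q, kt, k):
    """Token distance between node-type keyword kt and value keyword k in query q.
    Assumption: a kt occurring after k is penalized by doubling the distance."""
    i, j = q.index(kt), q.index(k)
    return j - i if i < j else 2 * (i - j)

def value_type_dist(q, kt, k, structural_dist):
    """Value-Type distance = max(in-query distance, structural depth distance)."""
    return max(in_query_dist(q, kt, k), structural_dist)

def c_via_value(q, k, anc_types):
    """C_via(q, v, k): anc_types maps each ancestor node type of value node v that
    also appears in q to its depth distance from v."""
    return 1.0 + sum(1.0 / value_type_dist(q, kt, k, d) for kt, d in anc_types.items())

# Example: q = "customer, interest, art"; value node "art" under .../interests/interest
q = ["customer", "interest", "art"]
print(c_via_value(q, "art", {"interest": 1, "customer": 3}))
```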
Principles of XML keyword search
• Principle 1
– When searching for D-typed nodes via a single-valued type V,
ideally only the values and structures nested in V-typed nodes
can affect the relevance, regardless of the size of other typed
nodes nested in D-typed nodes.
• However, TF*IDF similarity in IR normalizes the relevance score of
each document w.r.t. its size
• Principle 2 – address keyword Ambiguity 2
– When searching for nodes of type D via a multi-valued type V’,
the relevance of a D-typed node which contains a query
relevant V’-typed node should not be affected (i.e. normalized)
too much by other query-irrelevant V’-typed nodes.
• Example: query “art” - C4 should not be less relevant than C1
Principles of XML keyword search
• Principle 1 and 2
– Especially useful for interpreting pure keyword queries, i.e. for finding the search-via node correctly
• Principle 3
– The order of keywords in a query is important to
indicate the search intention
• Incorporate the search via confidence Cvia we defined
before
XML TF*IDF Similarity
• To calculate the similarity between the search for
node and the query q
– Base case: similarity between value node a and q
• Apply original TF*IDF directly since a contains keywords
only without any structure
– Recursive case: similarity between structural node n
and q
• Based on similarities of its children c and the confidence
level of c as the node type to search via
– Base-case formula:

$$similarity(q, a) = \frac{\sum_{k \in q \cap a} W_{q,k}^{T_a} \cdot W_{a,k}}{W_q^{T_a} \cdot W_a}$$

where the IDF-like query weight, the TF-like node weight, and the normalization factors are

$$W_{q,k}^{T_a} = C_{via}(q, a, k) \cdot \ln\big(1 + N^{T_a} / (1 + f_k^{T_a})\big), \qquad W_{a,k} = 1 + \ln(f_{a,k}),$$

$$W_q^{T_a} = \sqrt{\sum_{k \in q} \big(W_{q,k}^{T_a}\big)^2}, \qquad W_a = \sqrt{\sum_{k \in a} W_{a,k}^2}$$
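Under the same assumptions as the earlier sketches (the tf/df statistics, a per-keyword confidence cvia(q, a, k) as above, and a hypothetical N[Ta] giving the number of Ta-typed nodes), the base case could look like this:

```python
import math

def similarity_value(q, a, Ta, tf, df, N, cvia):
    """Base case: TF*IDF-style similarity between query q and value node a of type Ta."""
    w_ak = {k: 1 + math.log(f) for k, f in tf[id(a)].items()}            # TF weight W_{a,k}
    w_qk = {k: cvia(q, a, k) * math.log(1 + N[Ta] / (1 + df[Ta][k]))     # IDF-style W_{q,k}^{Ta}
            for k in q}
    num = sum(w_qk[k] * w_ak[k] for k in q if k in w_ak)
    w_q = math.sqrt(sum(w * w for w in w_qk.values()))
    w_a = math.sqrt(sum(w * w for w in w_ak.values()))
    return num / (w_q * w_a) if num else 0.0
```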
XML TF*IDF Similarity (cont.)
• Recursive Case
– Intuition 2. An internal node n is relevant to q, if n has a
child c such that the type of c has high confidence to be
a search via node w.r.t. q (i.e. large Cvia(Tc , q)), and c is
highly relevant to q (i.e. large sim(q, c)).
– Intuition 3. An internal node n is more relevant to q if n has more query-relevant children, all else being equal.
$$similarity(q, n) = \frac{\sum_{c \in chd(n)} sim(q, c) \cdot C_{via}(T_c, q)}{W_{q,n}}$$

– Numerator: weighted sum of all of n’s children’s similarities and their confidence to be the search-via node
– Denominator: overall weight $W_{q,n}$ of node n w.r.t. query q, which essentially plays the role of a normalization factor
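A minimal sketch of the recursive case; children, node_type, sim, c_via and w_qn are hypothetical helpers standing in for the structures and formulas above:

```python
def similarity_structural(q, n, children, node_type, sim, c_via, w_qn):
    """Recursive case: each child's similarity is weighted by the search-via confidence
    of its node type (Intuition 2); summing over children follows Intuition 3; the
    overall weight W_{q,n} acts as the normalization factor."""
    total = sum(sim(q, c) * c_via(node_type(c), q) for c in children(n))
    return total / w_qn(q, n) if total else 0.0
```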
Flowchart of answering a query
1. Identify user search intention
– Compute the confidence of all possible candidate
node types and choose desired search for node Tfor
2. Relevance-oriented ranking
– Compute XML TF*IDF similarity in a bottom-up
approach from value nodes containing keywords up to
nodes of type Tfor
– Return a ranked list of sub-trees rooted at nodes of
type Tfor
• If more than one search-for node type has comparable confidence, a ranked list for each of them is returned
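Putting the two steps together, a sketch of the overall flow; rank_subtrees stands for the bottom-up XML TF*IDF computation above, and the 10% tolerance used to decide which confidences are “comparable” is an assumption, not from the paper:

```python
def answer_query(q, candidate_types, c_for, rank_subtrees, tolerance=0.9):
    """Step 1: pick the search-for node type(s) with the highest (comparable) C_for.
    Step 2: return a ranked list of sub-trees rooted at nodes of each chosen type."""
    scores = {T: c_for(T, q) for T in candidate_types}
    best = max(scores.values())
    chosen = [T for T, s in scores.items() if s >= tolerance * best]
    return {T: rank_subtrees(q, T) for T in chosen}
```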
Experimental Result
• Data set
– DBLP, XMark, WSU, eBay
• Comparison
– Compare XReal with SLCA, Xseek
• Equipment
– Implemented in Java
– Run on a 3.6 GHz Pentium IV PC with 1 GB memory and Windows XP
– Berkeley DB Java Edition used to store the keyword inverted lists and the keyword frequency table
Search Effectiveness
• Accuracy in inferring the search for node
– Conducted by user survey
– Tested queries contain at least one of the two
ambiguity problems
– Conclusion
• XReal works well, especially when the search for
node is not given explicitly in the query
Search Effectiveness
• Result effectiveness
– Measured by precision, recall, F-measure
– Observations
• XReal achieves higher precision than SLCA and XSeek for queries that contain ambiguities
• XReal performs as well as XSeek when queries have no ambiguity in the XML data
• XReal: Top-100 precision is higher than overall precision
• F-measure also shows good overall effectiveness of both XReal and XSeek
Ranking Effectiveness
• Metrics
– Number of Top-1 answers that are relevant
– Reciprocal Rank (R-Rank)
– Mean Average Precision (MAP)
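For reference, the two ranking metrics can be computed as follows (standard definitions; the relevance judgments themselves come from the user survey):

```python
def reciprocal_rank(ranked, relevant):
    """R-Rank: 1 / position of the first relevant result (0 if none is relevant)."""
    for i, r in enumerate(ranked, 1):
        if r in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Precision@k averaged over positions of relevant results; MAP is its mean over queries."""
    hits, total = 0, 0.0
    for i, r in enumerate(ranked, 1):
        if r in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```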
Efficiency & Scalability
• Compare three index variants for XReal, and SLCA
– Dup
• Stores only the Dewey id and the XML TF $f_{a,k}$
– DupType
• Stores an extra node type (i.e. its prefix path)
– DupTypeNorm
• Stores an extra normalization factor $W_a$ for each value node
[Figure: efficiency and scalability results on XMark and DBLP]
Q&A
Thank You
[Backup slide: the sample XML tree of storeDB, repeated]
Editor's notes

1. The underlying reason is that all these SLCA-based approaches do not address the user's search intention adequately. Here, in order not to confuse the reader, you can understand it as the node type specified in the DTD.
2. Note that the purpose of introducing "value node" is just to simplify the explanation and formula design in later sections, as leaf nodes play dual roles in an XML document: (1) they contain values; (2) they carry a tag name which can be viewed as a structural node. An XML node with a labeled name is called a "structural node".
3. 1. Note that the 2nd and 3rd guidelines restrict each other, which is analogous to a dilemma. An internal node at an appropriate height is most preferred. 2. In the above formula, r is a reduction factor with range (0,1]; it is chosen to be 0.8 and shows good performance in our experiments.
4. Mention: Actually most queries contain only value nodes without structural nodes. So after locating the appropriate structural node SN (even when it does not occur in the keywords) as the search-via node, it is also important to find the corresponding value node (if specified in the query keywords) associated with each SN (or find which one is more matchable to a given structural node). However, statistics alone cannot handle this matching job. That is to say, the search engine cannot differentiate C4 and C3 in this case. In order to let it differentiate these two, we take keyword co-occurrence into account when designing a well-formed confidence formula for value nodes; this confidence formula will be incorporated into the TF*IDF similarity formula for the base case.
5. The pattern of keyword co-occurrence in a query provides a micro way to measure the likelihood of an individual value node to search via, as a complement to statistics.
6. Given a keyword query q and a certain value node v, if there are two keywords kt and k in q, such that kt matches the type of an ancestor node of v and k matches a keyword in v, then we define the following distances.
7. These principles clarify the difference in designing ranking functions between text databases and XML databases. Principle 1, in other words, denotes that the size of the subtree rooted at a D-typed node d (except the subtree rooted at the search-via node) should not affect d's relevance to the query.
8. Principles 1 and 2 look trivial when the search-via node is explicit. We will incorporate all three principles into the design of the XML TF*IDF formulas.
9. Explain the reason we incorporate Cvia(q,a,k). Normalization factors play the role of balancing between XML leaf nodes containing many keywords and those with a few keywords.
10. 1. The weighted sum in the numerator part follows closely Intuitions 2 and 3. 2. Besides, since Intuition 3 usually favors internal nodes with more children, we need to normalize the relevance of a to q. That naturally leads to the use of Wq,a (computed via Formula 14) as the denominator.