This talk introduces the problem of matching web-scale entity graphs, such as multilingual name graphs and social network graphs, to solve difficult problems such as name translation and social ID finding. Existing approaches rely on either textual (or phonetic) similarity or Web co-occurrences; this approach combines the strengths of both and significantly outperforms the state of the art. We present evaluation results on real-life entity graphs.
Seungwon Hwang: Entity Graph Mining and Matching
1. Information & Database Systems Lab
Entity Graph Mining and Matching
Seung-won Hwang
Associate Professor
Department of Computer Science and Engineering
POSTECH, Korea
2. Mining Human Intelligence from the Web: Click Graph
Language-agnostic/data-intensive: e.g., Arabic corpus?
Are q1 and q2 similar?
Are u3 and u4 similar?
3. Mining at Finer Granularity: Named Entity (NE) Graph
Person name, Place name, Organization name, Product name
Newspapers, Web sites, TV programs, …
[Figure: example NE graph with the labels Apple, MS, Jobs, Gates, Mac, co-founder, tenure, complicated]
4. Case I: Matching names with Twitter accounts [EDBT11]
5. Case II: Entity Translation [EMNLP10,CIKM11]
What are the features?
How are the features combined?
(using translation as an application scenario)
[Figure: English NE graph Ge=(Ve, Ee) extracted from the English corpus and Chinese NE graph Gc=(Vc, Ec) extracted from the Chinese corpus]
6. NE Translation
Goal
Translate a NE in the source language into the corresponding NE in the target language
Ex) “Obama” (English) → “奥巴马” (Chinese)
Resources: comparable corpora
[Figure: English NEs (NEE) and their features in Xinhua News Agency (English); the task is to find the matching Chinese NEs (NEC) and their features in Xinhua News Agency (Chinese)]
7. NE Translation Similarity Features
Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
Pronunciation similarity between named entities
Ex) “Obama” and “奥巴马” (pronounced Aobama)
Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
Contextual word similarity between named entities
Ex) The president (总统) Obama (奥巴马)
“As president, Obama signed economic stimulus legislation …”
Relationship Similarity (R): G.-w. You [7]
Co-occurrence similarity between pairs of named entities
Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)
8. Motivation
Taxonomy Table
                        Entity          Relationship
Using Entity Names      E [1,2,3]       R You [7]
Using Textual Context   EC [4,5,6]      ?
Shao [8]: E and EC combined
Research questions:
Why is RC not used?
Can all four categories be combined?
9. In this paper…
We propose a new NE translation similarity feature
Relationship Context similarity (RC)
Contextual word similarity between pairs of named entities
Ex) pair (“Barack”, “Michelle”) → “Spouse”
We propose new holistic approaches
Combining all E, EC, R, and RC
We validate our proposed approach using extensive experiments
10. Our Framework
We abstract this problem as…
Graph matching of two NE relationship graphs extracted from comparable corpora
Populate a decision matrix R, a |Ve|-by-|Vc| matrix
[Figure: English NE graph Ge=(Ve, Ee) extracted from the English corpus and Chinese NE graph Gc=(Vc, Ec) extracted from the Chinese corpus]
11. Our Framework
Overview – 3 Steps
Initialization
Construct NE relationship graphs
Build an initial pairwise similarity matrix R0
Use Entity (E) and Entity Context (EC) similarities
Iterative reinforcement
Build a final pairwise similarity matrix R∞
Use Relationship (R) and Relationship Context (RC) similarities
Matching
Find 1:1 matching from R∞
Build a binary hard decision matrix R*
[Figure: two example similarity matrices with rows Obama, Jackie Chan and columns 奥巴马, 成龙, …; the Obama row reads .99, .1, .2 in both, and the Jackie Chan row shows .1 in the first matrix and .99 in the second]
12. Initialization
Constructing NE relationship graphs G = (N, E)
Extract NEs using entity tagger for each document in each corpus
Regard NEs that appear more than δ times as nodes
Connect two nodes when they co-occur more than δ times
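The graph construction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_ne_graph`, the input format (one list of tagged NE strings per document), and the choice to count co-occurrence at document level are all assumptions.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, delta):
    """Build (nodes, edges) from per-document NE lists.

    docs  : iterable of lists of NE strings, one list per document
            (assumed to come from an entity tagger).
    delta : frequency threshold used for both nodes and edges.
    """
    node_counts = Counter()
    pair_counts = Counter()
    for nes in docs:
        node_counts.update(nes)  # count every mention of each NE
        # Assumption: two NEs co-occur once if they share a document.
        unique = sorted(set(nes))
        pair_counts.update(combinations(unique, 2))

    nodes = {ne for ne, count in node_counts.items() if count > delta}
    edges = {
        pair
        for pair, count in pair_counts.items()
        if count > delta and pair[0] in nodes and pair[1] in nodes
    }
    return nodes, edges


docs = [["Obama", "Clinton"], ["Obama", "Clinton"], ["Obama", "Gates"]]
nodes, edges = build_ne_graph(docs, delta=1)
```

With `delta=1`, "Gates" appears only once and is dropped, so the graph keeps the two nodes "Obama" and "Clinton" and the single edge between them.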
Initializing R0
Computing entity similarity matrix SE
Use Edit-Distance (ED) between ‘ei’ and Pinyin representation of ‘cj’
Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)
$$S^{E}_{ij} = 1 - \frac{ED(e_i, PY_{c_j})}{Len(e_i) + Len(PY_{c_j})}$$
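A small sketch of the entity name similarity may help make the edit-distance step concrete. It assumes the Pinyin string is already available (no romanization library is used here), that comparison is case-insensitive, and that the distance is normalized by the sum of the two lengths; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(english, pinyin):
    """1 minus edit distance normalized by summed lengths (assumed form)."""
    e, p = english.lower(), pinyin.lower()  # assumption: ignore case
    if not e and not p:
        return 1.0
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))


score = name_similarity("Obama", "Aobama")  # one insertion apart
```

For the slide's example, "Obama" and "Aobama" differ by a single insertion, so the score stays close to 1.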
13. Initialization
Initializing R0
Computing entity context similarity matrix SEC
Context word
ex) “As president, Obama signed economic stimulus legislation …”
Context window
$$CW(NE, d) = \{w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i\,(NE), \ldots, w_{i+l/2-1}, w_{i+l/2}\}$$
Correlation between a NE and a context word: log-odds ratio
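The context window can be illustrated with a few lines of code. This sketch assumes the document is already tokenized, that the NE occupies a single token at position `index`, and that only the surrounding words (not the NE itself) are returned; `context_window` is a hypothetical helper name.

```python
def context_window(tokens, index, l):
    """Return up to l/2 tokens on each side of the NE at tokens[index]."""
    half = l // 2
    start = max(0, index - half)          # clip at the document start
    left = tokens[start:index]
    right = tokens[index + 1:index + half + 1]  # slicing clips at the end
    return left + right


tokens = "As president , Obama signed economic stimulus legislation".split()
window = context_window(tokens, tokens.index("Obama"), l=4)
```

With a window size of 4, the context words of "Obama" in the slide's sentence are the two tokens on either side, which include "president" and "signed".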
14. Initialization
Initializing R0
Computing entity context similarity matrix SEC
Projected Context Association Vector
Context association vectors:
Obama: …, President 0.9, …
奥巴马: …, 总统 0.85, …
Dictionary entries shown: USA, 美國, (President, 总统), …
Projection through the dictionary: president → 总统
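The projection step can be sketched as a dictionary lookup over the English context association vector. The representation (a dict from context word to score), the lowercase dictionary keys, and the rule of keeping the highest score when several English words map to the same Chinese word are assumptions made for illustration.

```python
def project_vector(ca_english, dictionary):
    """Map an English context vector into the Chinese vocabulary.

    ca_english : dict of English context word -> association score
    dictionary : dict of lowercase English word -> list of Chinese words
    Words missing from the dictionary are dropped.
    """
    projected = {}
    for word, score in ca_english.items():
        for target in dictionary.get(word.lower(), ()):
            # Assumption: keep the strongest score per target word.
            projected[target] = max(projected.get(target, 0.0), score)
    return projected


obama_vector = {"President": 0.9, "signed": 0.4}
bilingual = {"president": ["总统"]}
projected = project_vector(obama_vector, bilingual)
```

After projection, the vector for "Obama" lives in the same vocabulary as the vector for "奥巴马", so the two can be compared directly.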
15. Initialization
Initializing R0
Computing entity context similarity matrix SEC
Context Similarity between ‘ei’ and ‘cj’
Compute cosine similarity between two vectors
$$S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}$$
Merging SE and SEC
Min-Max normalization of both matrices to the range [0, 1]
Merge: $R^0_{ij}$ is obtained by combining $S^{E}_{ij}$ and $S^{EC}_{ij}$
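The cosine similarity and the Min-Max normalization can be sketched as below. Sparse vectors are represented as dicts and matrices as lists of lists, which is an assumption for illustration; the merge of the two normalized matrices is left out because its exact form is not given here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def min_max(matrix):
    """Rescale all entries of a matrix into [0, 1]."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


s_ec = cosine({"总统": 0.9}, {"总统": 0.85})
normalized = min_max([[1, 3], [5, 9]])
```

Normalizing both similarity matrices first puts the name-based and context-based scores on a common scale before they are merged.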
16. Reinforcement
Intuition
Two NEs with a strong relationship
Co-occur frequently → have an edge
Share similar context → have similar relationship context
[Figure: English NE Graph and Chinese NE Graph, each showing two NEs joined by an edge (X and Y respectively) annotated with context words]
1. Align neighbors
using relationship (R) and relationship context (RC) similarity
2. Update the similarity score
17. Reinforcement
Iterative Approach
$$R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^t(i,j,\theta)} \frac{R^t_{uv}\,(S^{RC}_{iu,jv})}{2^k} + (1-\lambda)\,R^0_{ij}$$
First term: Relationship-based Similarity (R & RC); second term: Entity-based Similarity (E & EC)
$B^t(i,j,\theta)$: ordered set of aligned neighbor pairs of (i, j) at iteration t
$R^t_{uv}$: Relationship (R) Similarity of i’s neighbor u and j’s neighbor v
$S^{RC}_{iu,jv}$: Relationship Context (RC) Similarity between relation pair (i, u) and (j, v)
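The iterative update can be sketched in code under several stated assumptions. The sketch covers only the relationship (R) part: the RC term is omitted because its exact form is not given here. Neighbors are aligned greedily 1:1 in descending order of current similarity, pairs below a threshold `theta` are ignored, and the rank `k` starts at 1. The function names, the adjacency-list inputs, and the defaults for `theta` and `iterations` are hypothetical; `lam=0.15` is the value reported in the experiments.

```python
def aligned_neighbors(R, nbrs_e, nbrs_c, theta):
    """Greedy 1:1 alignment of neighbor pairs (u, v), best first."""
    candidates = sorted(
        ((R[u][v], u, v) for u in nbrs_e for v in nbrs_c if R[u][v] >= theta),
        key=lambda item: -item[0],
    )
    used_u, used_v, aligned = set(), set(), []
    for _, u, v in candidates:
        if u not in used_u and v not in used_v:
            aligned.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return aligned


def reinforce(R0, adj_e, adj_c, lam=0.15, theta=0.5, iterations=5):
    """Propagate similarity from aligned neighbors (R term only)."""
    R = [row[:] for row in R0]
    for _ in range(iterations):
        new = [[0.0] * len(R0[0]) for _ in R0]
        for i in range(len(R0)):
            for j in range(len(R0[0])):
                pairs = aligned_neighbors(R, adj_e[i], adj_c[j], theta)
                # The k-th best aligned pair is discounted by 2**k.
                rel = sum(R[u][v] / 2 ** k
                          for k, (u, v) in enumerate(pairs, start=1))
                new[i][j] = lam * rel + (1 - lam) * R0[i][j]
        R = new
    return R


R0 = [[0.9, 0.1], [0.2, 0.3]]
adj = [[1], [0]]  # node 0 and node 1 are neighbors in both graphs
R1 = reinforce(R0, adj, adj, iterations=1)
```

In this toy input, the pair (1, 1) gains similarity because its neighbors (0, 0) are already strongly matched, which is the intuition behind the reinforcement step.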
18. Matching
Finding 1:1 matching using greedy algorithm
Steps
1. Find a translation pair with the highest final similarity score
2. Select the pair and remove the corresponding row and column from R∞
3. Repeat 1. and 2. until the similarity score < threshold
[Figure: final similarity matrix R∞]
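The three greedy steps can be sketched as follows. This is a straightforward, unoptimized illustration that rescans the remaining cells on every round; the function name `greedy_match` and the list-of-lists matrix format are assumptions.

```python
def greedy_match(R, threshold):
    """Greedy 1:1 matching on a similarity matrix.

    Repeatedly picks the highest remaining score, records the pair, and
    removes its row and column, until the best score drops below threshold.
    Returns a list of (row, column, score) tuples.
    """
    matches = []
    free_rows = set(range(len(R)))
    free_cols = set(range(len(R[0]))) if R else set()
    while free_rows and free_cols:
        score, i, j = max(
            (R[i][j], i, j) for i in free_rows for j in free_cols
        )
        if score < threshold:
            break
        matches.append((i, j, score))
        free_rows.remove(i)
        free_cols.remove(j)
    return matches


# Rows: Obama, Jackie Chan; columns: 奥巴马, 成龙, and one more candidate.
R_final = [[0.99, 0.1, 0.2],
           [0.1, 0.99, 0.3]]
pairs = greedy_match(R_final, threshold=0.5)
```

Each row and column is used at most once, so the result corresponds to the non-zero entries of a binary decision matrix.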
19. Experiments
Dataset
English Gigaword Corpus
Xinhua News Agency 2008.01~2008.12
100,746 news documents
Chinese Gigaword Corpus
Xinhua News Agency 2008.01~2008.12
88,029 news documents
Approaches
EC : consider Entity context similarity feature only
E : consider Entity name similarity feature only
Shao (E+EC) : combine Entity name & Entity Context similarities
You (E+R) : combine Entity name & Relationship similarities
Ours
E+EC+R (when γ = 0)
E+EC+R+RC
Measure
Precision, Recall, and F1-score
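For reference, the three measures can be computed from a set of predicted translation pairs and a set of gold pairs as sketched below. The function name and the example pairs are illustrative only and are not taken from the evaluation data.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = {("Obama", "奥巴马"), ("Jackie Chan", "比尔·盖茨")}
gold = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Bill Gates", "比尔·盖茨")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Here one of the two predicted pairs is correct and one of the three gold pairs is found, so precision is 1/2 and recall is 1/3.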
20. Experiments
Effectiveness of overall framework
500 person named entities
Set λ = 0.15
5-fold cross-validation for threshold parameter learning
Other types of NE (100 location named entities)
21. Directions
Graph matching
Graph cleansing [VLDB11]
Scalable entity search
US Presidents
Bill Clinton
William J Clinton
George W. Bush
George H.W. Bush
Dubya
22. Thanks
Questions?
Visit: www.postech.ac.kr/~swhwang for these papers