Information & Database Systems Lab




                                     Entity Graph Mining and Matching
                                                                          Seung-won Hwang
                                                                         Associate Professor
                                             Department of Computer Science and Engineering
                                                                            POSTECH, Korea
Mining Human Intelligence from the Web: Click Graph
                                      Language-agnostic/data-intensive: e.g., arabic Corpus?
                                                                  Are q1 and q2 similar?




                                                                  Are u3 and u4 similar?
Mining at Finer Granularity: Named Entity (NE) Graph
                                      Person name, Place name, Organization name, Product name
                                        Newspapers, Web sites, TV programs, …
                                      [Figure: example NE graph with nodes Apple, MS, Jobs, Gates, and Mac; edge labels include tenure, co-founder, and complicated]
Case I: Matching names with Twitter accounts [EDBT11]
Case II: Entity Translation [EMNLP10,CIKM11]
                                      What are the features?
                                      How are the features combined?
                                     (using translation as an application scenario)
                                      [Figure: NE relationship graphs extracted from the English corpus, Ge=(Ve, Ee), and from the Chinese corpus, Gc=(Vc, Ec); each node is an NE]
NE Translation
                                      Goal
                                        Map an NE in the source language to its counterpart NE in the target language
                                        Ex) “Obama” (English) → “奥巴马” (Chinese)
                                      Resources: comparable corpora
                                        [Figure: NEs and their features are extracted from Xinhua News Agency (English) articles (NE_E) and Xinhua News Agency (Chinese) articles (NE_C); the task is to find the matching NE_E / NE_C pairs]
NE Translation Similarity Features
                                      Entity Name Similarity (E): S.Wan [1], L. Haizhou [2], K. Knight [3]
                                          Pronunciation similarity between named entities
                                          Ex) “Obama” and “奥巴马” (pronounced Aobama)
                                      Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
                                          Contextual word similarity between named entities
                                          Ex) The president (总统) Obama (奥巴马)
                                              “As president, Obama signed economic stimulus legislation …”



                                      Relationship Similarity (R): G.-w.You [7]
                                          Co-occurrence similarity between pairs of named entities
                                          Ex) (“Jackie Chan”, “Bill Gates” ) vs. (“成龙”, “比尔·盖茨 ”)
Motivation
                                      Taxonomy Table

                                                                      Entity        Relationship
                                        Using Entity Names            E [1,2,3]     R (You [7])
                                        Using Textual Context         EC [4,5,6]    ?
                                        Shao [8] combines E and EC




                                     Research questions:
                                        Why is RC not used?
                                        Can all four categories be combined?
In this paper…
                                      We propose a new NE translation similarity feature
                                         Relationship Context similarity (RC)
                                            Contextual word similarity between pairs of named entities
                                            Ex) pair (“Barack”, “Michelle”) → Spouse
                                      We propose new holistic approaches
                                            Combining all E, EC, R, and RC




                                      We validate our proposed approach using extensive
                                       experiments
Our Framework
                                      We abstract this problem as…
                                      Graph Matching of two NE relationship graphs extracted from
                                       comparable corpora
                                      Populate a decision matrix R, a |Ve|-by-|Vc| matrix
                                      [Figure: the English NE graph Ge=(Ve, Ee) and the Chinese NE graph Gc=(Vc, Ec) to be matched]
Our Framework
                                      Overview – 3 Steps
                                        Initialization
                                            Construct NE relationship graphs
                                            Build an initial pairwise similarity matrix R0
                                            Use Entity (E) and Entity Context (EC) similarities
                                        Iterative reinforcement
                                            Build a final pairwise similarity matrix R∞
                                            Use Relationship (R) and Relationship Context (RC) similarities
                                        Matching
                                            Find 1:1 matching from R∞
                                            Build a binary hard decision matrix R*

                                        [Figure: two example similarity matrices with rows Obama and Jackie Chan and columns including 奥巴马 and 成龙. Row Obama reads .99, .1, .2 in both matrices; the Jackie Chan entry shown rises from .1 in the first matrix to .99 in the second]
Initialization
                                      Constructing NE relationship graphs G = (N, E)
                                         Extract NEs using an entity tagger for each document in each corpus
                                         Regard NEs that appear more than δ times as nodes
                                         Connect two nodes when they co-occur more than δ times
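
The graph construction above can be sketched as follows. This is only an illustration under stated assumptions: the entity tagger is not shown (each document is assumed to arrive as a list of already-extracted NE strings), mentions are counted per occurrence, co-occurrence is counted once per document, and the function name `build_ne_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, delta):
    """Build (nodes, edges) from per-document NE lists.

    docs  : list of lists of NE strings, one list per document
            (ASSUMPTION: output of some entity tagger, not shown here).
    delta : frequency threshold; the slide uses the same symbol for
            both the node and the edge threshold.
    """
    node_freq = Counter()
    pair_freq = Counter()
    for nes in docs:
        node_freq.update(nes)  # count every mention
        # ASSUMPTION: a pair co-occurs when both NEs appear in the same
        # document; each unordered pair is counted once per document.
        for a, b in combinations(sorted(set(nes)), 2):
            pair_freq[(a, b)] += 1

    nodes = {ne for ne, n in node_freq.items() if n > delta}
    edges = {
        pair
        for pair, n in pair_freq.items()
        if n > delta and pair[0] in nodes and pair[1] in nodes
    }
    return nodes, edges


# Toy corpus (hypothetical data).
docs = [
    ["Obama", "Michelle", "Obama"],
    ["Obama", "Michelle"],
    ["Obama", "Gates"],
]
nodes, edges = build_ne_graph(docs, delta=1)
print(nodes, edges)
```

With δ = 1, "Gates" is dropped as a node because it appears only once, and the only surviving edge links "Michelle" and "Obama".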
                                      Initializing R0
                                         Computing entity similarity matrix SE
                                             Use Edit-Distance (ED) between ‘ei’ and Pinyin representation of ‘cj’
                                             Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)


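                                             S^{E}_{ij} = 1 - \frac{ED(e_i, PYC_j)}{Len(e_i) \;\Box\; Len(PYC_j)}
                                             (\Box marks an operator between the two length terms that is not legible in the slide text)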
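
A minimal sketch of the name similarity, assuming the Pinyin form of the Chinese name is already available as a string (no romanization library is used here). The denominator is an assumption: the operator between the two length terms is not legible in the slide, so this sketch normalizes by the sum of the two lengths, which keeps the score in [0, 1]. Lowercasing both strings is also an assumption, and `name_similarity` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english, pinyin):
    """1 minus a length-normalized edit distance.

    ASSUMPTION: normalization by Len(e) + Len(PY); the slide's operator
    between the two lengths is not recoverable.
    ASSUMPTION: case-insensitive comparison.
    """
    e, p = english.lower(), pinyin.lower()
    if not e and not p:
        return 1.0
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))


# The slide's example: ED("Obama", "奥巴马") = ED("Obama", "Aobama")
score = name_similarity("Obama", "Aobama")
print(round(score, 3))
```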
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Context word
                                               ex) “As president, Obama signed economic stimulus legislation …”




                                             Context window

                                               CW(NE, d) = \{ w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i(NE), \ldots, w_{i+l/2-1}, w_{i+l/2} \}




                                             Correlation between an NE and a context word: log-odds ratio
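
The context window can be illustrated with a short sketch, assuming the document is already tokenized and the NE occupies a single token at position `ne_index` (both assumptions; `context_window` is a hypothetical name). The log-odds scoring step is not sketched because its formula is not given on the slide.

```python
def context_window(tokens, ne_index, l):
    """Return the tokens within l/2 positions on each side of the NE.

    The window is clipped at the document boundaries, so it can be
    shorter than l + 1 tokens near the start or end of the text.
    """
    half = l // 2
    start = max(0, ne_index - half)
    end = min(len(tokens), ne_index + half + 1)
    return tokens[start:end]


# Simplified, punctuation-free version of the slide's example sentence.
tokens = "As president Obama signed economic stimulus legislation".split()
window = context_window(tokens, tokens.index("Obama"), l=4)
print(window)
```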
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Projected Context Association Vector
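                                             [Figure: the context association vector of “Obama” (e.g., President: 0.9) is projected through a bilingual dictionary (e.g., the entry (President, 总统)) and compared with the context association vector of “奥巴马” (e.g., 总统: 0.85); other labels shown include USA and 美國]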
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Context Similarity between ‘ei’ and ‘cj’
                                             Compute cosine similarity between two vectors
                                             S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}
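
A rough sketch of the projection and cosine steps, assuming context association vectors are stored as word-to-score dictionaries and the bilingual dictionary maps an English word to a list of Chinese translations. Keeping the maximum score when two English words project onto the same Chinese word is an assumption, as are the function names and every value other than President: 0.9 and 总统: 0.85.

```python
import math


def project(vector, dictionary):
    """Project an English context vector onto Chinese context words.

    Words without a dictionary entry are dropped.
    ASSUMPTION: on collisions the larger score is kept.
    """
    projected = {}
    for word, score in vector.items():
        for translation in dictionary.get(word, []):
            projected[translation] = max(projected.get(translation, 0.0), score)
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# "senator" and "白宫" with their scores are hypothetical filler entries.
ca_obama = {"president": 0.9, "senator": 0.4}
ca_aobama = {"总统": 0.85, "白宫": 0.3}
dictionary = {"president": ["总统"]}

sim = cosine(project(ca_obama, dictionary), ca_aobama)
print(round(sim, 3))
```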


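                                         Merging SE and SEC
                                             Min-Max normalization to the range [0, 1]
                                             Merge

                                             R_{ij} = S^{E}_{ij} \;\Box\; S^{EC}_{ij}
                                             (\Box marks the symbol joining the two scores, which is not legible in the slide text)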
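
The normalization step can be sketched as below. Two assumptions are made: min and max are taken over the whole matrix rather than per row, and, because the merge operator is not legible on the slide, the combining function is left as a caller-supplied argument. The averaging used in the example is purely a placeholder and is not claimed to be the authors' choice.

```python
def min_max(matrix):
    """Rescale all entries of a matrix to [0, 1].

    ASSUMPTION: one global min and max per matrix.
    """
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec, combine):
    """Normalize both matrices, then combine them entry by entry.

    combine : function of two floats; the slide's actual operator is
              unknown, so the caller has to supply one.
    """
    n_e, n_ec = min_max(s_e), min_max(s_ec)
    return [
        [combine(a, b) for a, b in zip(row_e, row_ec)]
        for row_e, row_ec in zip(n_e, n_ec)
    ]


# Hypothetical scores for two English and two Chinese NEs.
s_e = [[0.9, 0.3], [0.1, 0.5]]
s_ec = [[0.8, 0.2], [0.4, 0.6]]
r0 = merge(s_e, s_ec, combine=lambda a, b: (a + b) / 2)  # placeholder operator
print(r0)
```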
Reinforcement
                                      Intuition
                                         Two NEs with a strong relationship
                                            Co-occur frequently → have an edge
                                            Share similar context → have similar relationship context
                                           [Figure: an English NE graph and a Chinese NE graph; NE X in the English graph and NE Y in the Chinese graph are each linked to neighboring NEs through edges labeled with context]
                                           1. Align neighbors
                                               using relationship (R) and relationship context (RC) similarity
                                           2. Update the similarity score
Reinforcement
                                      Iterative Approach

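                                                 R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\cdot)} \frac{R^{t}_{uv} \;\Box\; S^{RC}_{(i,u),(j,v)}}{2^{k}} + (1-\lambda)\, R^{0}_{ij}

                                                 (\Box and the third argument of B^{t} stand for symbols that are not legible in the slide text)

                                       First term: Relationship-based Similarity (R & RC)
                                       Second term: Entity-based Similarity (E & EC)
                                       B^{t}(i,j,\cdot): ordered set of aligned neighbor pairs of (i, j) at iteration t
                                       R^{t}_{uv}: Relationship (R) Similarity of i’s neighbor u and j’s neighbor v
                                       S^{RC}_{(i,u),(j,v)}: Relationship Context (RC) Similarity between relation pairs (i, u) and (j, v)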
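
One way to picture the iterative update is the sketch below. It is deliberately narrower than the slide's formula: the RC term is left out entirely (the slide text does not show how it is combined with the R term), so this corresponds at most to the R-only setting. The neighbor alignment is also an assumption: neighbors are paired greedily, best score first, keeping only pairs whose score reaches a hypothetical cutoff `theta`; the k-th aligned pair is discounted by 2^k. All names are illustrative.

```python
def aligned_neighbors(R, nbrs_e, nbrs_c, i, j, theta):
    """Greedy 1:1 alignment of i's neighbors with j's neighbors.

    ASSUMPTION: pairs are taken in decreasing order of R[u][v] and
    pairs scoring below theta are ignored.
    """
    candidates = sorted(
        ((R[u][v], u, v) for u in nbrs_e[i] for v in nbrs_c[j]),
        reverse=True,
    )
    used_u, used_v, pairs = set(), set(), []
    for score, u, v in candidates:
        if score < theta:
            break
        if u not in used_u and v not in used_v:
            pairs.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return pairs


def reinforce(R0, nbrs_e, nbrs_c, lam, theta, iterations):
    """Iterate R <- lam * (discounted neighbor scores) + (1 - lam) * R0."""
    R = [row[:] for row in R0]
    for _ in range(iterations):
        R_next = [row[:] for row in R]
        for i in range(len(R0)):
            for j in range(len(R0[0])):
                pairs = aligned_neighbors(R, nbrs_e, nbrs_c, i, j, theta)
                neighbor_term = sum(
                    R[u][v] / 2 ** k for k, (u, v) in enumerate(pairs, start=1)
                )
                R_next[i][j] = lam * neighbor_term + (1 - lam) * R0[i][j]
        R = R_next
    return R


# Two NEs per language, each the other's only neighbor (hypothetical data).
R0 = [[0.9, 0.1], [0.1, 0.2]]
nbrs_e = {0: [1], 1: [0]}
nbrs_c = {0: [1], 1: [0]}
R1 = reinforce(R0, nbrs_e, nbrs_c, lam=0.15, theta=0.0, iterations=1)
print(R1)
```

In this toy run the weakly matched pair (1, 1) gains score because its neighbors (0, 0) are strongly matched, which is the intuition behind the reinforcement step.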
Matching
                                      Finding 1:1 matching using a greedy algorithm

                                      Steps
                                       1.    Find the translation pair with the highest final similarity score
                                       2.    Select the pair and remove the corresponding row and column from R∞
                                       3.    Repeat steps 1 and 2 until the highest remaining similarity score falls below the threshold
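
The three steps above can be sketched as follows. Sorting all entries once and skipping rows and columns that are already taken gives the same result as repeatedly picking the maximum and deleting its row and column. The matrix values and the threshold are hypothetical, as are the function names.

```python
def greedy_match(R, threshold):
    """Greedy 1:1 matching over a similarity matrix (list of rows)."""
    candidates = sorted(
        ((R[i][j], i, j) for i in range(len(R)) for j in range(len(R[i]))),
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # every remaining score is lower still
        if i in used_rows or j in used_cols:
            continue  # row or column already "removed"
        matches.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return matches


def hard_decision_matrix(R, matches):
    """Binary matrix R*: 1 for selected pairs, 0 elsewhere."""
    chosen = set(matches)
    return [
        [1 if (i, j) in chosen else 0 for j in range(len(R[i]))]
        for i in range(len(R))
    ]


# Rows: two English NEs; columns: three Chinese NEs (hypothetical scores).
R_final = [
    [0.99, 0.10, 0.20],
    [0.30, 0.10, 0.95],
]
matches = greedy_match(R_final, threshold=0.5)
R_star = hard_decision_matrix(R_final, matches)
print(matches, R_star)
```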




Experiments
                                      Dataset
                                        English Gigaword Corpus
                                            Xinhua News Agency 2008.01~2008.12
                                            100,746 news documents
                                        Chinese Gigaword Corpus
                                            Xinhua News Agency 2008.01~2008.12
                                            88,029 news documents


                                      Approaches
                                          EC                              : consider Entity context similarity feature only
                                          E                               : consider Entity name similarity feature only
                                          Shao (E+EC)                     : combine Entity name & Entity Context similarities
                                          You (E+R)                       : combine Entity name & Relationship similarities
                                          Ours
                                            E+EC+R (when γ = 0)
                                            E+EC+R+RC


                                      Measure
                                        Precision, Recall, and F1-score
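
For reference, the three measures can be computed over sets of translation pairs as in this small sketch. The predicted and gold pairs are made up for illustration and are not results from the talk; `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical system output and gold standard.
predicted = {("Obama", "奥巴马"), ("Jackie Chan", "比尔·盖茨")}
gold = {
    ("Obama", "奥巴马"),
    ("Jackie Chan", "成龙"),
    ("Bill Gates", "比尔·盖茨"),
}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```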
Experiments
                                      Effectiveness of overall framework
                                         500 person named entities
                                         Set λ = 0.15
                                         5-fold cross-validation for threshold parameter learning
                                      Other type of NE (100 Location named entities)
Directions
                                      Graph matching
                                      Graph cleansing [VLDB11]
                                      Scalable entity search
                                                                  US Presidents
                                                                  Bill Clinton
                                                                  William J Clinton
                                                                  George W. Bush
                                                                  George H.W. Bush
                                                                  Dubya
Thanks
                                      Questions?
                                     Visit: www.postech.ac.kr/~swhwang for these papers

Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 

Seungwon Hwang: Entity Graph Mining and Matching

  • 1. Entity Graph Mining and Matching. Seung-won Hwang, Associate Professor, Department of Computer Science and Engineering, POSTECH, Korea. Information & Database Systems Lab.
  • 2. Mining Human Intelligence from the Web: Click Graph. Language-agnostic and data-intensive, e.g., an Arabic corpus? Are q1 and q2 similar? Are u3 and u4 similar?
  • 3. Mining at Finer Granularity: Named Entity (NE) Graph. Person name, place name, organization name, product name; newspapers, Web sites, TV programs, … [Figure: example NE graph with the labels Apple, MS, tenure, Co-founder, jobs, gates, complicated, Mac]
  • 4. Case I: Matching names with Twitter accounts [EDBT11]
  • 5. Case II: Entity Translation [EMNLP10, CIKM11]. What are the features? How are the features combined? (Using translation as an application scenario.) [Figure: NE graph Ge = (Ve, Ee) from the English corpus and NE graph Gc = (Vc, Ec) from the Chinese corpus]
  • 6. NE Translation. Goal: given an NE in the source language, find its counterpart NE in the target language. Ex) “Obama” (English) → “奥巴马” (Chinese). Resources: comparable corpora. [Figure: English NEs and Chinese NEs with their features, drawn from Xinhua News Agency (English) and Xinhua News Agency (Chinese)]
  • 7. NE Translation Similarity Features. Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]; pronunciation similarity between named entities; Ex) “Obama” and “奥巴马” (pronounced Aobama). Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]; contextual word similarity between named entities; Ex) the president (总统) Obama (奥巴马), as in “As president, Obama signed economic stimulus legislation …”. Relationship Similarity (R): G.-w. You [7]; co-occurrence similarity between pairs of named entities; Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”).
  • 8. Motivation. Taxonomy table (columns: Entity, Relationship). Using entity names: E [1,2,3] | R, You [7]. Using textual context: EC [4,5,6] | ?. Shao [8] also appears in the table. Research questions: Why is RC not used? Can all four categories be combined?
  • 9. In this paper… We propose a new NE translation similarity feature: Relationship Context similarity (RC), the contextual word similarity between named entities; Ex) pair (“Barack”, “Michelle”) → Spouse. We propose new holistic approaches combining all of E, EC, R, and RC. We validate our proposed approach with extensive experiments.
  • 10. Our Framework. We abstract this problem as graph matching of two NE relationship graphs extracted from comparable corpora. Populate a decision matrix R, a |Ve|-by-|Vc| matrix. [Figure: NE graph Ge = (Ve, Ee) from the English corpus and NE graph Gc = (Vc, Ec) from the Chinese corpus]
  • 11. Our Framework: Overview, 3 Steps. (1) Initialization: construct NE relationship graphs; build an initial pairwise similarity matrix R0; use Entity (E) and Entity Context (EC) similarities. (2) Iterative reinforcement: build a final pairwise similarity matrix R∞; use Relationship (R) and Relationship Context (RC) similarities. (3) Matching: find a 1:1 matching from R∞; build a binary hard decision matrix R*. [Figure: example similarity matrices with rows Obama, Jackie Chan and columns 奥巴马, 成龙; entries shown include .99, .1, and .2]
  • 12. Initialization. Constructing NE relationship graphs G = (N, E): extract NEs using an entity tagger for each document in each corpus; regard NEs that appear more than δ times as nodes; connect two nodes when they co-occur more than δ times. Initializing R0: compute the entity similarity matrix $S^E$ using the edit distance (ED) between $e_i$ and the Pinyin representation of $c_j$, written $PYC_j$. Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”). $S^E_{ij} = 1 - \frac{ED(e_i, PYC_j)}{Len(e_i) + Len(PYC_j)}$ (the operator between the two lengths is lost in the transcript; a sum is assumed here).
  • 13. Initialization: initializing R0, computing the entity context similarity matrix $S^{EC}$. Context word: ex) “As president, Obama signed economic stimulus legislation …”. Context window: $CW(NE, d) = \{w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i\ (= NE), \ldots, w_{i+l/2-1}, w_{i+l/2}\}$. Correlation between an NE and a context word: log-odds ratio.
  • 14. Initialization: initializing R0, computing the entity context similarity matrix $S^{EC}$. Projected Context Association Vector. [Figure: context association vectors of (word, score) entries for Obama (e.g., President, 0.9) and 奥巴马 (e.g., 总统, 0.85), linked through dictionary entries such as (President, 总统) and (USA, 美國)]
  • 15. Initialization: initializing R0, computing the entity context similarity matrix $S^{EC}$. Context similarity between $e_i$ and $c_j$: the cosine similarity between the two context association vectors, $S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}$. Merging $S^E$ and $S^{EC}$: min-max normalization to the range [0, 1], then merge $S^E_{ij}$ and $S^{EC}_{ij}$ into $R_{ij}$ (the combining operator is lost in the transcript).
  • 16. Reinforcement: Intuition. Two NEs with a strong relationship co-occur frequently → they have an edge; they share similar context → they have similar relationship context. [Figure: NE pairs with their contexts, X in the English NE graph and Y in the Chinese NE graph] 1. Align neighbors using relationship (R) and relationship context (RC) similarity. 2. Update the similarity score.
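  • 17. Reinforcement: Iterative Approach. $R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^t(i,j,\ldots)} \frac{R^t_{uv}\,(\ldots S^{RC}_{(i,u),(j,v)} \ldots)}{2^k} + (1-\lambda)\,R^0_{ij}$, where “…” marks symbols that are not recoverable from the transcript. The first term is the relationship-based similarity (R & RC); the second term is the entity-based similarity (E & EC). $B^t(i,j,\ldots)$: ordered set of aligned neighbor pairs of (i, j) at iteration t. $R^t_{uv}$: relationship (R) similarity of i’s neighbor u and j’s neighbor v. $S^{RC}_{(i,u),(j,v)}$: relationship context (RC) similarity between relation pairs (i, u) and (j, v).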
  • 18. Matching: finding a 1:1 matching using a greedy algorithm. Steps: 1. Find the translation pair with the highest final similarity score in R∞. 2. Select the pair and remove the corresponding row and column from R∞. 3. Repeat steps 1 and 2 until the highest remaining similarity score < threshold.
  • 19. Experiments. Dataset: English Gigaword Corpus (Xinhua News Agency, 2008.01~2008.12; 100,746 news documents) and Chinese Gigaword Corpus (Xinhua News Agency, 2008.01~2008.12; 88,029 news documents). Approaches: EC (entity context similarity feature only); E (entity name similarity feature only); Shao (E+EC), combining entity name and entity context similarities; You (E+R), combining entity name and relationship similarities; Ours: E+EC+R (when γ = 0) and E+EC+R+RC. Measures: precision, recall, and F1-score.
  • 20. Experiments: effectiveness of the overall framework. 500 person named entities; λ = 0.15; 5-fold cross-validation for learning the threshold parameter. Other types of NE (100 location named entities).
  • 21. Directions: graph matching; graph cleansing [VLDB11]; scalable entity search. [Figure: US Presidents: Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, Dubya]
  • 22. Thanks. Questions? Visit www.postech.ac.kr/~swhwang for these papers.
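The entity name similarity of slide 12 and the greedy 1:1 matching of slide 18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Pinyin string is supplied by the caller rather than derived from the Chinese characters, the denominator of the name similarity is assumed to be the sum of the two lengths (the transcript does not show the operator), and the similarity matrix is represented as a dictionary keyed by (English NE, Chinese NE) pairs.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english: str, pinyin: str) -> float:
    """Entity name similarity between an English NE and the Pinyin of a Chinese NE.

    ASSUMPTION: the edit distance is normalized by the SUM of the two lengths.
    Case folding is also an assumption of this sketch.
    """
    e, p = english.lower(), pinyin.lower()
    if not e and not p:
        return 1.0
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))


def greedy_match(scores: Dict[Tuple[str, str], float],
                 threshold: float) -> List[Tuple[str, str, float]]:
    """Greedy 1:1 matching over a final similarity matrix.

    Visiting pairs in descending score order and skipping rows and columns
    that are already taken is equivalent to repeatedly selecting the best
    remaining pair and removing its row and column.
    """
    matches: List[Tuple[str, str, float]] = []
    used_rows, used_cols = set(), set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (row, col), score in ranked:
        if score < threshold:
            break  # every remaining pair scores below the threshold
        if row in used_rows or col in used_cols:
            continue
        matches.append((row, col, score))
        used_rows.add(row)
        used_cols.add(col)
    return matches
```

Under the assumed normalization, `name_similarity("Obama", "Aobama")` is 1 - 1/11, since the two strings differ by one inserted character. The threshold passed to `greedy_match` corresponds to the parameter that slide 20 says is learned by 5-fold cross-validation; any concrete value used with this sketch is illustrative.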