GOVERNMENT USERS
Conference
“Navigating the Human Terrain”
College Park, MD, May 20-21, 2008
Linguistic
Considerations of
Identity Resolution
David Murgatroyd
Software Architect
Basis Technology
2
Outline
 Introduction
 Linguistic Challenges
 Variation (Intentional & Unintentional)
 Composition
 Frequency
 Under-specification
 Multilinguality
 Integration Challenges
 Inputs & Outputs
 Properties
 Evaluation Challenges
 Corpora: Find or Build?
 Metrics: Adopt or Create?
 Conclusion
3
Introduction: An Exercise
Jim Killeen
Kileen, J. D.
Jaime Kilin
‫كلين‬ ‫جمس‬
 Is there a >50% chance these refer to the same
person? What if they are all US citizens? On a ferry to
Spain? In a documentary?
4
What is Identity Resolution?
 Identity Resolution (aka Entity Resolution):
 determining if two or more given references refer to
the same entity.
 Different from name matching as it’s about
identity of entities not similarity of names
 See also:
 Murgatroyd, D. Some Linguistic Considerations of
Entity Resolution and Retrieval. In Proceedings of
LREC 2008 Workshop on Resources and Evaluation for
Identity Matching, Entity Resolution and Entity
Management.
5
What sorts of references?
 Non-linguistic reference examples:
 Numerical identifiers
— SSN
— Some portions of address (Street Number, Zip Code)
 Visual identifiers (e.g., pictures, symbols)
 Biometrics (e.g., DNA, iris, signature, voice)
 Linguistic reference examples:
 Nouns or pronouns in documents (e.g., “the CEO of Basis”)
 Names of associated/related entities
— Locations (e.g., Street or City Name)
— Organizations
— Individuals
 Name of entity <- we’re going to focus on this one
6
Let’s focus on names of people
 Common and familiar
 Often fairly identifying piece of personal
information
 Demonstrate typical challenges of resolution
with linguistic data
8
Variation (Intentional)
 Variation may be intentional
 References may draw on a large set of names:
— Formality (e.g., nicknames)
— Transparency (e.g., aliases)
— Location (e.g., toponym)
— Life status
 Vocation (e.g., titles)
 Marital status (e.g., marriage/divorce/widowhood)
 Parenthood (e.g., patronymic)
 Faith (e.g., christening, pilgrimage)
 Death (e.g., posthumous names)
— Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
— Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)
Jim Killeen
9
Variation (Unintentional)
 Variation may be unintentional, arising from:
 Typos
— E.g., “Killeen” vs. “Kileen”
 Guessing spelling based on pronunciation
— E.g., “Caliin”
 Ambiguities inherent in the encoding (e.g., Unicode):
— Characters with the same glyph
 E.g., Latin and Cyrillic small “i”
— Characters with similar glyphs
 E.g., Latin “K” and Greenlandic “ĸ”
— Characters with composed/combined forms
 E.g., ņ (n with cedilla) vs. ņ (n + combining cedilla)
Kileen, J. D.
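The encoding ambiguities listed above can be checked with Python's standard `unicodedata` module. This is a minimal sketch, not a method from the talk; the helper name `same_after_nfc` is illustrative. It suggests that normalization merges composed and combining forms but leaves look-alike characters from different scripts distinct, so those need separate handling.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after Unicode NFC normalization (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u0146"     # LATIN SMALL LETTER N WITH CEDILLA (one code point)
combined = "n\u0327"    # 'n' followed by COMBINING CEDILLA (two code points)
latin_i = "i"           # U+0069 LATIN SMALL LETTER I
cyrillic_i = "\u0456"   # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
kra = "\u0138"          # LATIN SMALL LETTER KRA (Greenlandic)

print(composed == combined)                 # raw code point sequences differ
print(same_after_nfc(composed, combined))   # NFC composes n + cedilla
print(same_after_nfc(latin_i, cyrillic_i))  # look-alikes remain distinct
print(unicodedata.name(kra))
```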
10
Composition
 Names have differing orders:
 Given v. Surname: “Killeen, Jim” v. “Jim Killeen”
 Varies by culture
 Name references may be partial:
 “Jim” v. “Jim Killeen”
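The two composition issues above (component order and partial references) can be sketched in a few lines. The function names and the rule that a single comma marks "Surname, Given" order are assumptions made for illustration; real conventions vary by culture, as the slide notes.

```python
def name_tokens(name: str) -> list[str]:
    """Return lowercase name tokens in given-name-first order.

    Assumes a single comma marks 'Surname, Given' order (an illustrative
    convention only).
    """
    if name.count(",") == 1:
        surname, given = (part.strip() for part in name.split(","))
        name = f"{given} {surname}"
    return name.lower().split()


def is_partial_of(short: str, full: str) -> bool:
    """True if every token of `short` also appears in `full`."""
    return set(name_tokens(short)) <= set(name_tokens(full))


print(name_tokens("Killeen, Jim"))
print(is_partial_of("Jim", "Jim Killeen"))
```

Token containment only shows that two references are compatible; it does not by itself show that they refer to the same person.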
11
Under-specification
 Name components may be abbreviated
 Initials (e.g., “J. D.”)
 Abbreviations (e.g., “Jas.”)
 Name references may have incomplete…
 orthography (e.g., Semitic languages)
 segmentation (e.g., Asian languages)
 phonology (e.g., Ideographic languages)
Kileen, J. D.
‫كلين‬ ‫جمس‬
12
Frequency
 Any person can make up a name (an open class)
 A few are common, most are very uncommon
 Zipfian distribution
 Lesson:
 Valuable to know common names
 Valuable to have a strategy for unknown names
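One way to act on this lesson is to weight agreement on a name by how rare the name is, with smoothing as the strategy for unknown names. The sketch below is an assumption-laden illustration: the counts are invented, and `agreement_weight` is a hypothetical helper, not something from the talk.

```python
import math
from collections import Counter

# Toy counts, invented purely for illustration (not real census data).
SURNAME_COUNTS = Counter({"smith": 5000, "jones": 3000, "killeen": 20})


def agreement_weight(surname: str, counts: Counter = SURNAME_COUNTS) -> float:
    """Weight for two references sharing `surname`: rarer names weigh more.

    Add-one smoothing gives unseen names a finite, high weight instead of
    a division by zero.
    """
    total = sum(counts.values()) + len(counts) + 1  # +1 reserves mass for unseen names
    relative_freq = (counts[surname.lower()] + 1) / total
    return -math.log(relative_freq)


for name in ("Smith", "Killeen", "Zzyzx"):
    print(name, round(agreement_weight(name), 2))
```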
13
Multilinguality
 Names may appear in many languages-of-use
 This leads to variation at many linguistic levels.
 Orthographic:
 transliteration confronts skew in:
—orthographic-to-phonetic mappings of source and
target languages-of-use
—sound systems between the languages
‫كلين‬ ‫جمس‬ <-> James Klein
14
Multilinguality (cont’d)
 Syntactic:
 different languages-of-use may imply different name
word order
 Semantic:
 name words which communicate meaning (e.g.,
titles) may vary (e.g., “Jr.” for “الأصغر”, which
means “the younger”)
 Pragmatic:
 different languages-of-use may use different names
based on the audience (e.g., “Mr. Laden” vs. “الأمير”,
which means “the prince”)
16
Inputs & Outputs
 Input options include:
 Pair-wise: simple integration, but no shared effort
 Set-based: harder integration, but able to optimize
 Output options include:
 Feature-based: with weights/tuning
 Probability-based:
—more principled combination
—NOTE: similarity is not probability
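To illustrate why probability-based output combines in a more principled way, and why a similarity score is not a probability, here is a minimal sketch that adds evidence in log-odds space. It assumes the pieces of evidence are conditionally independent; the function name `combine` and the numbers are illustrative, not taken from the talk.

```python
import math


def combine(prior: float, likelihood_ratios: list[float]) -> float:
    """Posterior match probability from a prior and per-feature likelihood
    ratios P(evidence | match) / P(evidence | non-match), assuming the
    features are conditionally independent."""
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Evidence nine times likelier under a match still leaves a rare match unlikely.
print(combine(prior=0.001, likelihood_ratios=[9.0]))
# Two independent pieces of evidence multiply their ratios.
print(combine(prior=0.01, likelihood_ratios=[20.0, 5.0]))
```

The first call shows the slide's caveat: strong name evidence with a very low prior still yields a posterior below 1%, so a high similarity score cannot be read directly as a high chance of identity.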
17
Integration Properties
 Certain properties help make efficient
implementations:
 Reflexivity:
—Resolve(a,a) is always true
—NOTE: does not imply Resolve(a,a’) where a~a’
 Commutativity:
—Resolve(a,b) <=> Resolve(b,a)
 Transitivity:
—Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
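When these three properties hold, resolved references form equivalence classes, which a union-find (disjoint-set) structure can maintain efficiently. The sketch below is one possible illustration of that point, not the talk's implementation; the class and method names are hypothetical.

```python
class Resolver:
    """Union-find over references: merging keeps the resolved relation
    reflexive, commutative, and transitive by construction."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, ref: str) -> str:
        """Return the representative of the class containing `ref`."""
        self.parent.setdefault(ref, ref)
        root = ref
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[ref] != root:
            next_ref = self.parent[ref]
            self.parent[ref] = root
            ref = next_ref
        return root

    def link(self, a: str, b: str) -> None:
        """Record that references `a` and `b` resolve to the same entity."""
        self.parent[self.find(a)] = self.find(b)

    def resolved(self, a: str, b: str) -> bool:
        return self.find(a) == self.find(b)


r = Resolver()
r.link("Jim Killeen", "Kileen, J. D.")
r.link("Kileen, J. D.", "Jaime Kilin")
print(r.resolved("Jim Killeen", "Jaime Kilin"))  # follows by transitivity
```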
19
Corpora: Find or Build?
 Requirements:
 Annotated for ground truth
 Represent linguistic challenges
 Scalable/practical
 Options
 Adapt public “database” corpora:
— Wikipedia:
 Annotated: yes
 Representative: somewhat
 Scalable: yes
— Citation DBs:
 Annotated: no
 Representative: somewhat
 Scalable: yes
20
Corpora: Find or Build? (cont’d)
 Adapt public “document” corpora:
— Co-reference documents:
 Annotated: yes
 Representative: less so, since often a single document or language-of-use
 Scalable: yes
 Create corpora by hand:
— From scratch: “parrot sessions” (auditory or visual)
 Annotated: yes
 Representative: largely
 Scalable: no
— From un-annotated databases:
 Annotated: no
 Representative: yes
 Scalable/practical: no; databases may be private
— Synthesize from generative model
 Annotated: yes
 Representative: no, tied to generating model
 Scalable: yes
21
Metrics
 Back to our initial example
[Figure: the example references (Jim Killeen; Kileen, J. D.; Jaime Kilin; ‫كلين‬ ‫جمس‬; Jim; and further variants) linked in three ways: Reference, System A, System B]
Metrics: Adopt or Create?
 How to quantify the quality of the system’s resolutions
vs. the reference?
 Goals:
 Discriminative: separates good v. bad systems for users’ needs
 Interpretable: number aligns with intuition
 Considerations:
 Assume transitive closure (TC) of output?
 Apply weights to try to be more discriminative?
 Common concepts:
 Precision: % of stuff in answer that’s right
 Recall: % of right stuff in answer
 F-Score: Harmonic mean of these = 2*P*R/(P+R)
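The three common concepts above can be sketched over sets of links, where a link is an unordered pair of references placed in the same cluster. The clusterings below are hypothetical, and the convention of returning 1.0 when a link set is empty is an assumption of this sketch.

```python
from itertools import combinations


def links(clusters: list[set[str]]) -> set[frozenset[str]]:
    """All unordered pairs placed in the same cluster (transitive closure)."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def precision_recall_f(system: list[set[str]], reference: list[set[str]]):
    """Link-based precision, recall, and F = 2*P*R/(P+R)."""
    sys_links, ref_links = links(system), links(reference)
    correct = len(sys_links & ref_links)
    p = correct / len(sys_links) if sys_links else 1.0  # assumed convention
    r = correct / len(ref_links) if ref_links else 1.0  # assumed convention
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical clusterings over four references a, b, c, d.
reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print(precision_recall_f(system, reference))
```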
23
Candidate Metrics
 Pair-wise % correct: over all N*(N-1)/2 node pairs
 Pair-wise P&R: based on links drawn
 Edit-distance: # of links to add/subtract to correct
 Metrics used in document co-reference resolution:
 MUC-6: entity-based P&R on missing links from graph
 B-CUBED: average per-reference P&R of links
 CEAF (Constrained Entity-Alignment F): entities aligned
using some similarity measure; P&R are % of possible
similarity level achieved
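Two of the candidate metrics above are simple enough to sketch directly: pair-wise % correct over all N*(N-1)/2 pairs, and B-CUBED as per-reference precision and recall averaged over references. This is an illustrative reading of those definitions that assumes both clusterings partition the same set of references; the function names and the example clusterings are hypothetical.

```python
from itertools import combinations


def _cluster_of(clusters: list[set[str]]) -> dict[str, set[str]]:
    """Map each reference to the cluster that contains it."""
    return {member: cluster for cluster in clusters for member in cluster}


def pairwise_percent_correct(system: list[set[str]], reference: list[set[str]]) -> float:
    """Fraction of all N*(N-1)/2 pairs on which system and reference agree."""
    sys_of, ref_of = _cluster_of(system), _cluster_of(reference)
    pairs = list(combinations(sorted(ref_of), 2))
    agree = sum((b in sys_of[a]) == (b in ref_of[a]) for a, b in pairs)
    return agree / len(pairs)


def b_cubed(system: list[set[str]], reference: list[set[str]]) -> tuple[float, float]:
    """B-CUBED precision and recall, averaged over references."""
    sys_of, ref_of = _cluster_of(system), _cluster_of(reference)
    refs = list(ref_of)
    precision = sum(len(sys_of[m] & ref_of[m]) / len(sys_of[m]) for m in refs) / len(refs)
    recall = sum(len(sys_of[m] & ref_of[m]) / len(ref_of[m]) for m in refs) / len(refs)
    return precision, recall


# Hypothetical clusterings over four references a, b, c, d.
reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print(pairwise_percent_correct(system, reference))
print(b_cubed(system, reference))
```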
24
Comparing Metrics
[Figure: the example references grouped three ways: Reference, System A, System B]
System   % Correct       Pairwise F      MUC-6   B-CUBED   CEAF
         No TC    TC     No TC    TC     (TC)    (TC)      (TC)
A        82       79     62       61     80      78        90
B        79       82     73       71     89      85        81
Edit-dist (No TC, TC): 3, 6, 1, 4
My preference
25
Conclusion
 Identity resolution systems face linguistic
challenges
 They need to be carefully integrated to meet
these challenges
 Evaluation corpora should reflect these
challenges
 Evaluation metrics should align with qualitative
judgements
26
Bibliography
Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In
Proceedings of the First International Conference on Language Resources
and Evaluation Workshop on Linguistic Coreference.
Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the
American Statistical Association, Vol. 64, No. 328, pp. 1183--1210.
Luo, X. (2005). On coreference resolution performance metrics. In Proc. of
HLT-EMNLP, pp 25--32.
Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity
resolution with data confidences. In First International VLDB Workshop on
Clean Databases. Seoul, Korea.
Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and
Retrieval. In Proceedings of LREC 2008 Workshop on Resources and
Evaluation for Identity Matching, Entity Resolution and Entity
Management.
Spock Team (2008). The Spock Challenge. http://challenge.spock.com/
(Retrieved February 5.)
Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A
model-theoretic coreference scoring scheme. In Proceedings of the 6th
Message Understanding Conference (MUC6). Morgan Kaufmann, pp. 45--52.
27
Questions?
More information:
http://www.basistech.com

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 

Linguistic Considerations of Identity Resolution (2008)

  • 1. GOVERNMENT USERS Conference, “Navigating the Human Terrain”, College Park, MD, May 20-21, 2008. Linguistic Considerations of Identity Resolution. David Murgatroyd, Software Architect, Basis Technology.
  • 2. Outline
      - Introduction
      - Linguistic Challenges
          - Variation (Intentional & Unintentional)
          - Composition
          - Frequency
          - Under-specification
          - Multilinguality
      - Integration Challenges
          - Inputs & Outputs
          - Properties
      - Evaluation Challenges
          - Corpora: Find or Build?
          - Metrics: Adopt or Create?
      - Conclusion
  • 3. Introduction: An Exercise
      - Four name references: “Jim Killeen”; “Kileen, J. D.”; “Jaime Kilin”; “كلين جمس”
      - Is there a >50% chance these refer to the same person if they are: US citizens? On a ferry to Spain? In a documentary?
  • 4. What is Identity Resolution?
      - Identity Resolution (aka Entity Resolution): determining whether two or more given references refer to the same entity.
      - It differs from name matching: it is about the identity of entities, not the similarity of names.
      - See also: Murgatroyd, D. Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
  • 5. What sorts of references?
      - Non-linguistic reference examples:
          - Numerical identifiers: SSN; some portions of an address (street number, ZIP code)
          - Visual identifiers (e.g., pictures, symbols)
          - Biometrics (e.g., DNA, iris, signature, voice)
      - Linguistic reference examples:
          - Nouns or pronouns in documents (e.g., “the CEO of Basis”)
          - Names of associated/related entities: locations (e.g., street or city name), organizations, individuals
          - Name of the entity itself (the focus of this talk)
  • 6. Let’s focus on names of people
      - Common and familiar
      - Often a fairly identifying piece of personal information
      - They demonstrate the typical challenges of resolution with linguistic data
  • 8. Variation (Intentional)
      - Variation may be intentional (example reference: “Jim Killeen”)
      - References may draw on a large set of names, chosen according to:
          - Formality (e.g., nicknames)
          - Transparency (e.g., aliases)
          - Location (e.g., toponym)
          - Life status:
              - Vocation (e.g., titles)
              - Marital status (e.g., marriage/divorce/widowhood)
              - Parenthood (e.g., patronymic)
              - Faith (e.g., christening, pilgrimage)
              - Death (e.g., posthumous names)
          - Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
          - Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)
  • 9. Variation (Unintentional)
      - Variation may be unintentional (example reference: “Kileen, J. D.”), arising from:
          - Typos (e.g., “Killeen” vs. “Kileen”)
          - Guessing the spelling from the pronunciation (e.g., “Caliin”)
          - Ambiguities inherent in the encoding (e.g., Unicode):
              - Characters with the same glyph (e.g., Latin and Cyrillic small “i”)
              - Characters with similar glyphs (e.g., Latin “K” and Greenlandic “ĸ”)
              - Characters with composed/combined forms (e.g., ņ as a single “n with cedilla” character vs. ņ as “n” + combining cedilla)
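The encoding ambiguities on this slide can be illustrated with a short Python sketch. It assumes a resolver wants to compare names after Unicode normalization; the helper name `canonical` is hypothetical and not something from the talk. It shows that NFC normalization reconciles the composed and combined forms of ņ, but does nothing for look-alike characters from different scripts.

```python
import unicodedata


def canonical(name: str) -> str:
    """Return an NFC-normalized, case-folded form for comparison (hypothetical helper)."""
    return unicodedata.normalize("NFC", name).casefold()


precomposed = "\u0146"   # LATIN SMALL LETTER N WITH CEDILLA, one code point
combined = "n\u0327"     # "n" followed by COMBINING CEDILLA, two code points

# The raw strings differ, but their normalized forms agree.
raw_equal = precomposed == combined
normalized_equal = canonical(precomposed) == canonical(combined)

# Normalization does not merge identical-looking glyphs from different scripts.
latin_i = "i"
cyrillic_i = "\u0456"    # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
homoglyphs_equal = canonical(latin_i) == canonical(cyrillic_i)

print(raw_equal, normalized_equal, homoglyphs_equal)  # False True False
```

Handling same-glyph and similar-glyph characters needs an explicit confusables mapping or a script check on top of normalization; that step is left out here.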
  • 10. Composition
      - Names have differing orders:
          - Given name vs. surname first: “Killeen, Jim” vs. “Jim Killeen”
          - The order varies by culture
      - Name references may be partial:
          - “Jim” vs. “Jim Killeen”
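A minimal Python sketch of the two composition issues above, under two stated assumptions: a comma marks surname-first order, and a partial reference is treated as compatible with a fuller one when its tokens are a subset. The function names `name_tokens` and `compatible` are hypothetical. Compatibility here is only a name-level filter, not a decision that two references share an identity.

```python
def name_tokens(reference: str) -> list[str]:
    """Return lower-cased name tokens in given-name-first order."""
    if "," in reference:
        # Assumption: "Surname, Given" convention when a comma is present.
        surname, _, given = reference.partition(",")
        reference = f"{given} {surname}"
    return reference.lower().split()


def compatible(a: str, b: str) -> bool:
    """True when one reference's tokens are a subset of the other's."""
    tokens_a, tokens_b = set(name_tokens(a)), set(name_tokens(b))
    return tokens_a <= tokens_b or tokens_b <= tokens_a


print(name_tokens("Killeen, Jim"))                # ['jim', 'killeen']
print(compatible("Killeen, Jim", "Jim Killeen"))  # True
print(compatible("Jim", "Jim Killeen"))           # True
print(compatible("Jim Killeen", "Jaime Kilin"))   # False
```

Because name order varies by culture, the comma rule would need to be replaced or supplemented for names where no such marker exists.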
  • 11. Under-specification
      - Name components may be abbreviated (example reference: “Kileen, J. D.”):
          - Initials (e.g., “J. D.”)
          - Abbreviations (e.g., “Jas.”)
      - Name references may have incomplete (example reference: “كلين جمس”):
          - orthography (e.g., Semitic languages)
          - segmentation (e.g., Asian languages)
          - phonology (e.g., ideographic languages)
  • 12. Frequency
      - Any person can make up a name (names are an open class)
      - A few names are common; most are very uncommon
      - Zipfian distribution
      - Lesson:
          - It is valuable to know the common names
          - It is valuable to have a strategy for unknown names
  • 13. Multilinguality
      - Names may appear in many languages-of-use
      - This leads to variation at many linguistic levels
      - Orthographic: transliteration confronts skew in:
          - the orthographic-to-phonetic mappings of the source and target languages-of-use
          - the sound systems of the two languages
          - Example: كلين جمس <-> James Klein
  • 14. Multilinguality (cont’d)
      - Syntactic: different languages-of-use may imply different name word order
      - Semantic: name words that communicate meaning (e.g., titles) may vary (e.g., “Jr.” for “الصغر”, which means “the younger”)
      - Pragmatic: different languages-of-use may use different names based on the audience (e.g., “Mr. Laden” vs. “المير”, which means “the prince”)
  • 16. Inputs & Outputs
      - Input options include:
          - Pair-wise: simple integration, but no shared effort
          - Set-based: harder integration, but able to optimize
      - Output options include:
          - Feature-based: with weights/tuning
          - Probability-based: a more principled combination
          - NOTE: similarity is not probability
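One way to read “probability-based” output is sketched below in Python, in the spirit of the Fellegi-Sunter record-linkage model listed in the bibliography. This is an assumption about how such a combination could look, not the method used in the talk. The feature names and the m/u values are invented for illustration, and features are assumed to be conditionally independent.

```python
import math

# Hypothetical per-feature probabilities (illustrative values, not measured):
#   m = P(feature agrees | same entity)
#   u = P(feature agrees | different entities)
FEATURES = {
    "surname": (0.90, 0.01),
    "given_name": (0.85, 0.05),
    "birth_year": (0.95, 0.02),
}


def match_probability(agreements: dict[str, bool], prior: float) -> float:
    """Combine feature agreements into P(same entity), assuming independence."""
    log_odds = math.log(prior / (1.0 - prior))
    for feature, agrees in agreements.items():
        m, u = FEATURES[feature]
        if agrees:
            log_odds += math.log(m / u)
        else:
            log_odds += math.log((1.0 - m) / (1.0 - u))
    return 1.0 / (1.0 + math.exp(-log_odds))


all_agree = {"surname": True, "given_name": True, "birth_year": True}
none_agree = {"surname": False, "given_name": False, "birth_year": False}
print(round(match_probability(all_agree, prior=0.001), 3))
print(match_probability(none_agree, prior=0.001))
```

The prior matters as much as the features: the same agreements yield a very different probability for two passengers on one ferry than for two records drawn from a national database, which is the point of the slide's note that similarity is not probability.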
  • 17. Integration Properties
      - Certain properties help make implementations efficient:
          - Reflexivity: Resolve(a,a) is always true
              - NOTE: this does not imply Resolve(a,a’) where a~a’
          - Commutativity (symmetry): Resolve(a,b) <=> Resolve(b,a)
          - Transitivity: Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
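When all three properties hold, Resolve is an equivalence relation, so pairwise decisions can be stored as disjoint sets instead of an explicit list of pairs. The Python sketch below uses a union-find structure to show this; the class name `Resolver` and its methods are hypothetical, and the approach is offered as one common implementation choice rather than the one described in the talk.

```python
class Resolver:
    """Union-find over references: linking two references merges their sets."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def _find(self, ref: str) -> str:
        """Return the representative of the set containing `ref`."""
        self._parent.setdefault(ref, ref)
        root = ref
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[ref] != root:
            self._parent[ref], ref = root, self._parent[ref]
        return root

    def link(self, a: str, b: str) -> None:
        """Record a decision that `a` and `b` refer to the same entity."""
        self._parent[self._find(a)] = self._find(b)

    def resolve(self, a: str, b: str) -> bool:
        """True if `a` and `b` are in the same set (reflexive, symmetric, transitive)."""
        return self._find(a) == self._find(b)


r = Resolver()
r.link("Jim Killeen", "Kileen, J. D.")
r.link("Kileen, J. D.", "Jaime Kilin")
print(r.resolve("Jim Killeen", "Jaime Kilin"))  # True, by transitivity
print(r.resolve("Jaime Kilin", "Jim Killeen"))  # True, by symmetry
```

Note that forcing transitivity also propagates mistakes: one wrong link merges two whole sets.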
  • 19. Corpora: Find or Build?
      - Requirements:
          - Annotated for ground truth
          - Represent linguistic challenges
          - Scalable/practical
      - Options:
          - Adapt public “database” corpora:
              - Wikipedia: annotated: yes; representative: somewhat; scalable: yes
              - Citation DBs: annotated: no; representative: somewhat; scalable: yes
  • 20. Corpora: Find or Build? (cont’d)
      - Adapt public “document” corpora:
          - Co-reference documents: annotated: yes; representative: less so, as often limited to a single document/language-of-use; scalable: yes
      - Create corpora by hand:
          - From scratch, via “parrot sessions” (auditory or visual): annotated: yes; representative: largely; scalable: no
          - From un-annotated databases: annotated: no; representative: yes; scalable/practical: no, since databases may be private
          - Synthesized from a generative model: annotated: yes; representative: no, tied to the generating model; scalable: yes
  • 21. Metrics
      - Back to our initial example
      - [Figure: the four references (“Jim Killeen”, “Kileen, J. D.”, “Jaime Kilin”, “كلين جمس”) shown grouped three ways: Reference, System A, System B]
  • 22. Metrics: Adopt or Create?
      - How to quantify the quality of the system’s resolutions vs. the reference?
      - Goals:
          - Discriminative: separates good vs. bad systems for users’ needs
          - Interpretable: the number aligns with intuition
      - Considerations:
          - Assume the transitive closure (TC) of the output?
          - Apply weights to try to be more discriminative?
      - Common concepts:
          - Precision (P): % of the stuff in the answer that is right
          - Recall (R): % of the right stuff that is in the answer
          - F-Score: harmonic mean of the two = 2*P*R/(P+R)
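The three common concepts can be written down directly. The Python sketch below treats the answer and the truth as sets of items (for resolution, typically sets of links); the function name and the sample sets are made up for illustration.

```python
def precision_recall_f(answer: set, truth: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) of `answer` measured against `truth`."""
    correct = len(answer & truth)
    p = correct / len(answer) if answer else 0.0
    r = correct / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Four items returned, two of them right; three items were right in total.
p, r, f = precision_recall_f(answer={1, 2, 3, 4}, truth={1, 2, 5})
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```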
  • 23. Candidate Metrics
      - Pair-wise % correct: over all N*(N-1)/2 node pairs
      - Pair-wise P&R: based on links drawn
      - Edit-distance: # of links to add/subtract to correct
      - Metrics used in document co-reference resolution:
          - MUC-6: entity-based P&R on missing links from the graph
          - B-CUBED: average per-reference P&R of links
          - CEAF (Constrained Entity-Alignment F): entities are aligned using some similarity measure; P&R are the % of the possible similarity level achieved
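Two of these candidates are small enough to sketch in Python. The code assumes the system output and the reference are both partitions (lists of sets) over the same references, that links are taken after transitive closure within each set, and that B-CUBED follows the usual per-reference formulation of Bagga and Baldwin (1998). The function names and the toy partitions are hypothetical and are not the example from the slides.

```python
from itertools import combinations


def links(clusters: list[set[str]]) -> set[frozenset[str]]:
    """All within-cluster pairs, i.e. the links implied by transitive closure."""
    return {frozenset(pair) for c in clusters for pair in combinations(sorted(c), 2)}


def pairwise_pr(system: list[set[str]], reference: list[set[str]]) -> tuple[float, float]:
    """Precision and recall over links."""
    sys_links, ref_links = links(system), links(reference)
    correct = len(sys_links & ref_links)
    p = correct / len(sys_links) if sys_links else 1.0
    r = correct / len(ref_links) if ref_links else 1.0
    return p, r


def b_cubed(system: list[set[str]], reference: list[set[str]]) -> tuple[float, float]:
    """Average per-reference precision and recall."""
    sys_of = {m: c for c in system for m in c}
    ref_of = {m: c for c in reference for m in c}
    mentions = list(ref_of)
    p = sum(len(sys_of[m] & ref_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    r = sum(len(sys_of[m] & ref_of[m]) / len(ref_of[m]) for m in mentions) / len(mentions)
    return p, r


reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print(pairwise_pr(system, reference))  # precision 1/2, recall 1/3
print(b_cubed(system, reference))      # precision 3/4, recall 2/3
```

On this toy input the two metrics disagree noticeably about the same output, which illustrates why the choice of metric needs care.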
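  • 24. Comparing Metrics
      - [Figure: the four references (“Jim Killeen”, “Kileen, J. D.”, “Jaime Kilin”, “كلين جمس”) grouped three ways: Reference, System A, System B]
      - [Table: scores for System A and System B under % Correct (TC / No TC), Pairwise F (TC / No TC), MUC-6 (TC), B-CUBED (TC), CEAF (TC), and Edit-dist (TC / No TC), with one metric marked “My preference”. Values in extraction order: Edit-dist: 3, 6, 1, 4; System B: 81, 85, 89, 73, 71, 79, 82; System A: 90, 78, 80, 62, 61, 82, 79]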
  • 25. Conclusion
      - Identity resolution systems face linguistic challenges
      - They need to be carefully integrated to meet these challenges
      - Evaluation corpora should reflect these challenges
      - Evaluation metrics should align with qualitative judgements
  • 26. Bibliography
      - Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistic Coreference.
      - Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, Vol. 64, No. 328, pp. 1183-1210.
      - Luo, X. (2005). On coreference resolution performance metrics. In Proc. of HLT-EMNLP, pp. 25-32.
      - Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity resolution with data confidences. In First International VLDB Workshop on Clean Databases. Seoul, Korea.
      - Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
      - Spock Team (2008). The Spock Challenge. http://challenge.spock.com/ (Retrieved February 5.)
      - Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC6). Morgan Kaufmann, pp. 45-52.