This presentation on Named Entity Evolution Recognition is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
1. NEER: An Unsupervised Method for
Named Entity Evolution Recognition
(Advanced Level)
Prerequisite: NEER: An Unsupervised Method for Named Entity
Evolution Recognition (Beginner Level)
Nina N. Tahmasebi, Thomas Risse
L3S Research Center
Hannover, Germany
risse@L3s.de
2. Change Period
Named Entity Evolution
Named Entities (NE): people, places, companies...
Characteristics of Named Entity Evolution (NEE)
Same thing but different terms over time
Change occurs over short periods of time
Small or no concept shift
Announced to the public repeatedly
Goal: Find a method for named entity evolution
recognition that is independent of external
knowledge sources
Joseph Ratzinger
Pope Benedict
Pope Benedict XVI
Benedict XVI
Joseph Aloisius Ratzinger
Cardinal Ratzinger
Cardinal Joseph Ratzinger
3. Definitions
– Context Cwi: all terms related to word w at time i
– Temporal co-references: names used for the same entity at the same or different points in time.
– Direct temp. co-reference: co-references with lexical overlap
– Indirect temp. co-reference: co-references without lexical overlap
– Change period (CP): period of time where change occurs.
Cardinal Joseph Ratzinger
Pope Benedict XVI
CP = 2005
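The distinction between direct and indirect temporal co-references can be illustrated with a minimal sketch. It assumes that "lexical overlap" means sharing at least one whitespace-separated token and that the two names are already known to refer to the same entity; the function name `coreference_type` is hypothetical and not part of NEER.

```python
def coreference_type(name_a: str, name_b: str) -> str:
    """Classify two names of the same entity by lexical overlap.

    Assumption: overlap means at least one shared token (case-insensitive).
    """
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    return "direct" if tokens_a & tokens_b else "indirect"


print(coreference_type("Cardinal Joseph Ratzinger", "Joseph Ratzinger"))
print(coreference_type("Cardinal Joseph Ratzinger", "Pope Benedict XVI"))
```

Under this reading, the names on either side of the 2005 change period form an indirect co-reference, because they share no token.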
4. Our Method
1. Identify change periods
2. Create one context per CP.
3. Capture at least two co-references
No need to compare vastly different contexts!
[Figure: timeline (t1, t2, t3) with change periods marked for the terms walkman, discman, minidisc, mp3 player, ipod. One context per change period: C(walkman-discman), C(discman-minidisc), C(minidisc-mp3 player), C(mp3 player-ipod).]
5. Finding Change Periods
• Kleinberg’s burst detection
• Out-of-the-box Java implementation from CIShell
• Compare to manually found change periods (known CPs)
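To give a feel for what burst detection does, here is a simplified sketch of a two-state, batched variant of Kleinberg's model, solved with a Viterbi pass. It is not the CIShell implementation used in the slides. The function name `burst_states`, the parameters `s` and `gamma`, their default values, and the example counts are all assumptions made for illustration.

```python
import math


def log_binom(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def burst_states(relevant, totals, s=2.0, gamma=1.0):
    """Return 0 (normal) or 1 (burst) for each time batch.

    relevant[t]: documents mentioning the term in batch t
    totals[t]:   all documents in batch t
    Assumes at least one relevant document overall.
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)      # base rate
    p1 = min(s * p0, 0.9999)              # elevated rate in the burst state
    probs = (p0, p1)

    def cost(state, r, d):
        # Negative log-likelihood of r relevant out of d under this state.
        p = probs[state]
        return -(log_binom(d, r) + r * math.log(p) + (d - r) * math.log(1 - p))

    enter_burst = gamma * math.log(n)     # moving up costs; moving down is free
    best = [0.0, float("inf")]            # the sequence starts in the normal state
    back = []
    for r, d in zip(relevant, totals):
        new_best, pointers = [], []
        for state in (0, 1):
            from_normal = best[0] + (enter_burst if state == 1 else 0.0)
            from_burst = best[1]
            prev = 0 if from_normal <= from_burst else 1
            new_best.append(min(from_normal, from_burst) + cost(state, r, d))
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace the cheapest state sequence backwards.
    state = 0 if best[0] <= best[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


# Made-up yearly counts: the term spikes in the 4th and 5th batch.
print(burst_states([1, 1, 1, 10, 12, 1, 1], [100] * 7))
```

Batches labeled 1 would be the candidate change periods to compare against the manually found ones.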
6. Finding Direct Co-references
1. Extract text for each change period
2. Term & NE extraction
3. Build co-occurrence graph
4. Rules to merge terms from dictionary and graph
Sub-Term Rule: Cardinal Joseph Ratzinger ↔ Joseph Ratzinger
Prefix/suffix Rule: Cardinal Joseph Ratzinger ↔ Cardinal Ratzinger
Prolongation Rule: Pope John Paul + John Paul II = Pope John Paul II
Cardinal Joseph Ratzinger,
Cardinal Ratzinger,
Joseph Ratzinger
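The three merge rules can be sketched as token-level string checks. This is one possible reading inferred only from the examples on this slide, not the authors' exact definitions. The function names `is_subterm`, `shares_prefix_and_suffix`, and `prolong` are hypothetical.

```python
def is_subterm(short: str, long: str) -> bool:
    """Sub-term rule (assumed): `short` occurs as a contiguous token run in `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1)
    )


def shares_prefix_and_suffix(a: str, b: str) -> bool:
    """Prefix/suffix rule (assumed): different terms with the same first and last token."""
    ta, tb = a.split(), b.split()
    if len(ta) < 2 or len(tb) < 2 or ta == tb:
        return False
    return ta[0] == tb[0] and ta[-1] == tb[-1]


def prolong(left: str, right: str):
    """Prolongation rule (assumed): join terms where the end of `left`
    overlaps the start of `right`; return None if there is no overlap."""
    l, r = left.split(), right.split()
    for k in range(min(len(l), len(r)) - 1, 0, -1):  # longest overlap first
        if l[-k:] == r[:k]:
            return " ".join(l + r[k:])
    return None


print(is_subterm("Joseph Ratzinger", "Cardinal Joseph Ratzinger"))
print(shares_prefix_and_suffix("Cardinal Joseph Ratzinger", "Cardinal Ratzinger"))
print(prolong("Pope John Paul", "John Paul II"))
```

The prolongation check rebuilds a longer term from two overlapping shorter ones, which matches the slide's example of Pope John Paul and John Paul II.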
7. Detailed Merging
• Merge one-token terms (co-ref classes):
– Pope Benedict and Benedict = corefBenedict {Pope Benedict, Benedict}
– Benedict XVI and Benedict = corefBenedict {Benedict XVI, Benedict}
choose Benedict as representative – highest frequency
• Merge co-reference classes
corefBenedict {Pope Benedict, Benedict, Benedict XVI}
• Apply remaining rules:
– Merge corefBenedict with Pope Benedict XVI (sub-term rule)
corefBenedict {Pope Benedict, Benedict, Benedict XVI, Pope Benedict XVI}
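The merging steps above can be traced with a small sketch. It assumes that a one-token term is paired with any longer term that starts or ends with it, that classes sharing a member are merged, and that the sub-term rule then adds longer terms containing a class member. The function names and the frequencies are made up for illustration and are not taken from NEER.

```python
def contains(long: str, short: str) -> bool:
    """True if `short` is a contiguous token run inside the longer `long`."""
    l, s = long.split(), short.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1)
    )


def merge_coref_classes(freq):
    """freq maps each term to its frequency; returns {representative: class}."""
    terms = list(freq)

    # Step 1: pair each one-token term with terms that start or end with it.
    classes = []
    for single in (t for t in terms if len(t.split()) == 1):
        for other in terms:
            tokens = other.split()
            if len(tokens) > 1 and single in (tokens[0], tokens[-1]):
                classes.append({single, other})

    # Step 2: merge co-reference classes that share a member.
    merged = []
    for cls in classes:
        for existing in [m for m in merged if m & cls]:
            merged.remove(existing)
            cls = cls | existing
        merged.append(cls)

    # Step 3: sub-term rule, add longer terms that contain a class member.
    for term in terms:
        for cls in merged:
            if term not in cls and any(contains(term, m) for m in cls):
                cls.add(term)

    # The representative of each class is its most frequent member.
    return {max(cls, key=freq.get): cls for cls in merged}


# Hypothetical frequencies for the slide's example.
freq = {"Benedict": 50, "Pope Benedict": 30, "Benedict XVI": 20, "Pope Benedict XVI": 10}
print(merge_coref_classes(freq))
```

With these counts, the result is a single class represented by Benedict that contains all four terms, mirroring the final corefBenedict class on the slide.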
We use the characteristics of NEE to design our algorithm. Just to recap, the characteristics of NEE are 'instant' changes, small or no concept shifts, and announcements to the public. Let's go through a couple of definitions needed for the algorithm.
We start by extracting the documents corresponding to each change period, keeping those that include at least one of the strings of the name. For Pope Benedict XVI, we require only Benedict or Pope. From this dataset, we then extract nouns (max length 3) and named entities and sum up the frequencies. Stanford NER recognized Barack Obama but not President Barack Obama, and therefore Barack Obama is counted twice by NEER. All the terms are stored in a dictionary.
Then we build a co-occurrence graph using the extracted terms and the dictionary. We consider 9 terms on either side of a term. Next, we merge direct co-references according to three rules. Each time two terms are merged, we choose one representative, which is the one with the highest frequency. Here it matters that Barack Obama was found by both the Lingua Tagger and the Stanford Tagger. We have the prolongation rule because we limit the max length of the terms to reduce noise. The co-occurring terms are considered to be candidates for indirect co-references.
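The co-occurrence step can be sketched as follows. The sketch assumes the input is the sequence of already extracted dictionary terms in document order and that edges are undirected and weighted by count. The function name `cooccurrence_graph` and the sample terms are hypothetical.

```python
from collections import Counter


def cooccurrence_graph(terms, window=9):
    """Count how often two different terms occur within `window` terms of each other.

    Looking only to the right visits every pair within the window exactly once,
    which is equivalent to considering `window` terms on either side.
    """
    edges = Counter()
    for i, term in enumerate(terms):
        for other in terms[i + 1:i + 1 + window]:
            if other != term:
                edges[frozenset((term, other))] += 1
    return edges


# Made-up sequence of extracted terms from one change period.
terms = ["Cardinal Joseph Ratzinger", "conclave", "Pope Benedict XVI", "Vatican"]
graph = cooccurrence_graph(terms, window=9)
print(graph[frozenset(("Cardinal Joseph Ratzinger", "Pope Benedict XVI"))])
```

Terms connected in this graph that share no token with the entity's name are the kind of neighbors treated as candidates for indirect co-references.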