The document discusses adaptive entity linking. It presents the motivation for entity linking as enabling reuse of web knowledge and as a first step for ontology learning. The problem is that current entity linking approaches do not work across all domains and text types. The proposed solution is to use linked data datasets and a framework called AELA for adaptive entity linking. Experiments were conducted on an annotated dataset to analyze how the definition of an entity changes across domains and to identify entity types.
3. www.insight-centre.org
Motivation
• Entity Linking creates links from mentions in text to entities from a structured knowledge base. It:
– enables reusing knowledge already published on the web;
– can be used as the first step for ontology learning and population algorithms.
4. Problem
• Entity Linking has been performed using generic approaches.
• It does not work for all domains and types of text.
• There is no clear definition of “entity”.
5. Problem
• Research Question: “How to adapt a general Entity Linking Approach to a Domain?”
• Philosophical Question: “What is an Entity?”
7. Experiments
• What is an Entity?
• What has been identified as entities?
• How to manually detect entities from text?
• How does the definition of Entity change from one domain to another?
9. Experiments
• AIDA-CoNLL annotated dataset
– 1,387 Reuters documents (some of them are tables)
– Annotation of entities with links to Wikipedia.
11. Experiments – AIDA CoNLL
• Proper Nouns: 5576
– Names beginning with a capital letter
• Acronyms: 712
– Names with all letters in upper case
• Others: 20
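The three surface-form categories above can be sketched as a small rule-based classifier. This is only an illustration: the slides do not give the exact rules used for the counts, so the function name `surface_category` and the precise tests are assumptions, chosen to agree with the example lists in this deck (for instance, "U.S." is listed among the proper nouns and "1997 FED CUP" among the others).

```python
def surface_category(mention: str) -> str:
    """Classify a mention by its surface form only (no context, no knowledge base).

    Assumed rules (not confirmed by the slides):
    - acronym: every character is an upper-case letter
    - proper_noun: the first character is an upper-case letter
    - other: everything else
    """
    if mention.isalpha() and mention.isupper():
        return "acronym"
    if mention[:1].isupper():
        return "proper_noun"
    return "other"


# Mentions taken from the example slides of this deck.
examples = ["European Commission", "U.S.", "ENGLAND", "van der Sar", "1997 FED CUP"]
categories = {mention: surface_category(mention) for mention in examples}
for mention, category in categories.items():
    print(f"{mention!r}: {category}")
```

Note that "ENGLAND" falls into the acronym bucket under these rules even though it is not an acronym, which is the kind of case where surface form alone is misleading.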
12. AIDA CoNLL – Proper Nouns
• German
• British
• European Commission
• Germany
• European Union
• Britain
• Commission
• Franz Fischler
• France
• Spanish
• Loyola de Palacio
• Europe
• Bonn
• Hendrix
• U.S.
• Jimi Hendrix
• English
• Nottingham
• Australian
• China
• Taiwan
• Taipei
• Taiwan Strait
• Ukraine
• Taiwanese
• Lien Chan
• Chinese
• Foreign Ministry
15. AIDA CoNLL – Others
• interior ministry
• neo-Nazi
• neo-Nazism
• post-Soviet
• van der Sar
• 1860 Munich
• serie A
• 1990 World Cup
• 1992 European championship
• 2,000 Guineas
• 2000 Games
• pan-Turkism
• al-Akhbar
• al-Ram
• 1997 FED CUP
• 1998 World Cup
• 1995 World Cup
• 1. FC Cologne
• post-Communist
• cocker spaniels
19. AIDA CoNLL – Some findings
• Syntactic structure does not help in all cases.
– Proper Nouns may not begin with a capital letter.
– Not all words with all letters in upper case are Acronyms.
• There may be some “mention boundary” problems even in manual annotation.
21. AIDA CoNLL
• 1110 entities with name variations.
http://en.wikipedia.org/wiki/New_York_Jets: New York Jets; NY JETS
http://en.wikipedia.org/wiki/Butch_Harmon: Butch Harmon; Butch
http://en.wikipedia.org/wiki/Norway: Norway; Norwegian
http://en.wikipedia.org/wiki/Cincinnati_Reds: Cincinnati Reds; CINCINNATI Reds
http://en.wikipedia.org/wiki/Republika_Srpska: Bosnian Serb; Republika Srpska
http://en.wikipedia.org/wiki/John_Smoltz: John Smoltz; Smoltz
http://en.wikipedia.org/wiki/Rede_Globo: TV Globo; Globo
http://en.wikipedia.org/wiki/London_Wasps: London Wasps
http://en.wikipedia.org/wiki/Chicago_Cubs: CHICAGO CUBS; Chicago Cubs
http://en.wikipedia.org/wiki/England_cricket_team: ENGLAND; Englishmen
http://en.wikipedia.org/wiki/Alexander_Downer: Alexander Downer; Downer
http://en.wikipedia.org/wiki/Wales: Wales; Welsh
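A count such as "entities with name variations" can be obtained by grouping the annotated mentions by the Wikipedia URL they link to. The following is a minimal sketch, not the procedure used for the slides: the function name `name_variations`, the (mention, URL) input format, and the Bonn URL in the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def name_variations(annotations):
    """Group mentions by linked URL and keep URLs with two or more distinct surface forms."""
    forms = defaultdict(set)
    for mention, url in annotations:
        forms[url].add(mention)
    return {url: sorted(names) for url, names in forms.items() if len(names) > 1}


# Hypothetical sample of (mention, URL) pairs; the Bonn link is assumed, not taken from the slides.
annotations = [
    ("New York Jets", "http://en.wikipedia.org/wiki/New_York_Jets"),
    ("NY JETS", "http://en.wikipedia.org/wiki/New_York_Jets"),
    ("Norway", "http://en.wikipedia.org/wiki/Norway"),
    ("Norwegian", "http://en.wikipedia.org/wiki/Norway"),
    ("Norway", "http://en.wikipedia.org/wiki/Norway"),
    ("Bonn", "http://en.wikipedia.org/wiki/Bonn"),
]
variations = name_variations(annotations)
print(len(variations), "entities with name variations")
```

Repeated identical mentions of the same entity are collapsed by the set, so only entities referred to by at least two different strings are counted.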
23. AIDA CoNLL – Some findings
• Use of metonymy.
• Disambiguation (Norway vs. Norwegians).
• Mentions of an entity using only part of its name.
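To make these findings concrete, a variation can be compared against the title of the linked Wikipedia page. The sketch below is an assumption-laden illustration, not the analysis behind the slides: the function name `variation_kind`, the category labels, and the idea of using the page title as the canonical name are all hypothetical. It separates case-only variations and partial names from the remaining cases (such as "Norwegian" for Norway), which surface matching cannot explain and which would need real disambiguation.

```python
def variation_kind(canonical: str, mention: str) -> str:
    """Describe how a mention differs from a canonical name (assumed here to be the page title)."""
    if mention == canonical:
        return "same"
    if mention.casefold() == canonical.casefold():
        return "case"      # e.g. "CHICAGO CUBS" for "Chicago Cubs"
    canonical_tokens = canonical.casefold().split()
    mention_tokens = mention.casefold().split()
    if mention_tokens and all(token in canonical_tokens for token in mention_tokens):
        return "partial"   # e.g. "Butch" for "Butch Harmon"
    return "other"         # e.g. "Norwegian" for "Norway"


pairs = [
    ("Chicago Cubs", "CHICAGO CUBS"),
    ("Butch Harmon", "Butch"),
    ("Norway", "Norwegian"),
    ("New York Jets", "NY JETS"),
]
kinds = [variation_kind(canonical, mention) for canonical, mention in pairs]
print(kinds)
```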
30. Experiments
• How were those entities annotated?
• Which Wikipedia pages were chosen as representing entities?
• What is the Annotation Guideline?
32. Experiments
• Survey on Annotation Guidelines
– Question: “Is there any guideline for entity annotation?”
– Search Strategy:
  • Papers found by searching for “entity annotation guidelines”.
  • Guidelines from annotated corpora provided by Entity Recognition, Disambiguation and Linking challenges.
33. Experiments
• Survey on Annotation Guidelines
– Common Problems (differ from one domain to another)
  • Mention Boundaries
  • Name variations
  • Metonymy
– Annotation Process
– Evaluation
34. Next Steps
• Corpus Sampling for Annotation
• Development of Annotation Guidelines
– Domain/Task dependent
– Iterative Process
• Domains:
– Touristic Domain (TripAdvisor corpus)
– Electronics Domain
– Other
36. Next Steps
• What is an Entity?
• What has been identified as entities?
• How to manually detect entities from text?
• How does the definition of Entity change from one domain to another?
• How to identify the most frequent classes in a domain?