Entity extraction finds names in documents, providing important raw material for big decisions. But finding all mentions of the name “George Bush” is very different from finding all mentions of the 43rd US President.
Making big decisions from big data is hopeless unless analytics advance from providing snippets of text to providing statements of truth. Such advances present challenges both of accuracy and of usability. We’ll explore these challenges and demonstrate ways of addressing them.
View more slides from the Human Language Technology Conference 2012 here: http://info.basistech.com/hlt-2012-slides
Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference
1. Things, not Strings:
From Entity Extraction to Entity Resolution
David Murgatroyd
VP, Engineering
Basis Technology
Basis Technology – Human Language Technology Conference 2012 1
2. Motivation
Your job is to analyze reciprocal antagonism
between Christian and Islamic extremists across the
globe.
You want to find information on the Internet on
Christian extremist reaction to the killing of the U.S.
Ambassador to Libya.
18. Filter?
Add some filters (a/k/a facets)…
20. Filter?
Add some filters (a/k/a facets)…
[Figure: facet panel “Filter results by… People” with placeholder choices <choice 1>, <choice 2>, <choice 3>, …]
21. Filter?
But what can we use as choices?
22. Entity Extraction (Name Tagging)
Find the names of people, places, and organizations in a document.
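To make the idea of name tagging concrete, here is a minimal sketch. It assumes a tiny hand-written lookup table (`GAZETTEER`, hypothetical) in place of a trained tagger, so it only illustrates the input and output of the task, not how a production extractor works.

```python
import re

# Hypothetical gazetteer: a tiny lookup table standing in for a trained tagger.
GAZETTEER = {
    "J. Christopher Stevens": "PERSON",
    "Chris Stephens": "PERSON",
    "Libya": "LOCATION",
}

def tag_names(text):
    """Return (start, end, surface, type) for each gazetteer name found in text."""
    mentions = []
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            mentions.append((match.start(), match.end(), name, entity_type))
    return sorted(mentions)  # ordered by position in the document

doc = "Chris Stephens commented on the death of J. Christopher Stevens in Libya."
for start, end, surface, entity_type in tag_names(doc):
    print(f"{start:3d}-{end:3d}  {entity_type:10s}  {surface}")
```

Note that the output is a list of strings with types. Nothing here says which real-world person each string refers to.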
24. Filter choices?
But what can we use as choices?
25. Filter choices?
Choices: first way that each person was mentioned
in each document?
[Figure: facet panel “Filter results by… Persons named” listing Kris Stephens, Chris Stephens, Dan Cathy, George Little, …]
26. Filter?
Choices: first name string for each person in each
document?
[Figure: results filtered by persons named “Chris Stephens”; remaining choices under “Add filters…” are Dan Cathy, George Little, …]
28. Filter?
Problem: Ambiguity – one name, many entities
30. Filter?
Problem: Variety – one person, many names
[Figure: results filtered by persons named “Chris Stephens”; the “Add filters…” list shows Dan Cathy, George Little, …, and also Chris Stevens and J. Christopher Stevens]
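The two problems can be shown with a few lines of code. This is a sketch over made-up data: the `MENTIONS` list and its "true person" column are hypothetical ground truth that a real system would not have, included only so the two failure modes can be counted.

```python
from collections import defaultdict

# Hypothetical mentions: (document id, name string, true person).
MENTIONS = [
    ("doc1", "Chris Stephens", "person A"),
    ("doc2", "Chris Stephens", "person B"),
    ("doc3", "Chris Stevens", "ambassador"),
    ("doc4", "J. Christopher Stevens", "ambassador"),
]

def ambiguous_names(mentions):
    """Names that refer to more than one person (ambiguity)."""
    people = defaultdict(set)
    for _doc, name, person in mentions:
        people[name].add(person)
    return {name: sorted(p) for name, p in people.items() if len(p) > 1}

def varied_people(mentions):
    """People referred to by more than one name (variety)."""
    names = defaultdict(set)
    for _doc, name, person in mentions:
        names[person].add(name)
    return {person: sorted(n) for person, n in names.items() if len(n) > 1}

print(ambiguous_names(MENTIONS))
print(varied_people(MENTIONS))
```

A facet built on exact name strings merges doc1 and doc2 under one choice even though they concern different people, and splits doc3 and doc4 across two choices even though they concern the same person.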
31. Where does your favorite data set fall?
[Figure: chart with axes Ambiguity, Variety, and # of documents (Thousands, Millions, Billions)]
32. Deal with ambiguity and variety?
Magically group names by person across
documents.
33. Labels for choices?
But there’s still the problem of choices…
34. Labels for choices?
Use person’s name from highest ranked doc?
Still some ambiguity.
[Figure: facet panel “Filter results by… People” listing Kris Stephens, Chris Stephens 1, Chris Stephens 2, …]
35. Labels for choices?
Entity Resolution: group and also link to a
database of known entities (e.g., Wikipedia).
[Figure: facet panel “Filter results by… People” listing Kris Stephens, J. Christopher Stevens, Chris Stephens 1, Chris Stephens 2, Chris Stephens …]
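Linking a mention to a database of known entities can be sketched as candidate lookup followed by a context comparison. Everything below is an assumption for illustration: the `KNOWLEDGE_BASE` entries, their ids, and the word-overlap scoring are hypothetical and are not the method described in the talk.

```python
# Hypothetical knowledge base: entity id -> known aliases and context words.
KNOWLEDGE_BASE = {
    "kb:stevens-ambassador": {
        "aliases": {"j. christopher stevens", "chris stevens"},
        "context": {"ambassador", "libya", "diplomat"},
    },
    "kb:stevens-athlete": {  # made-up second entity sharing an alias
        "aliases": {"chris stevens"},
        "context": {"season", "team", "coach"},
    },
}

def link(name, context_words, kb=KNOWLEDGE_BASE):
    """Return the id of the best-matching entry, or None if no alias matches."""
    candidates = [eid for eid, entry in kb.items() if name.lower() in entry["aliases"]]
    if not candidates:
        return None  # not in the database: needs an inferred label instead
    words = {w.lower() for w in context_words}
    # Pick the candidate whose context words overlap most with the document.
    return max(candidates, key=lambda eid: len(kb[eid]["context"] & words))

print(link("Chris Stevens", ["the", "ambassador", "to", "Libya"]))
print(link("Kris Stephens", ["pastor"]))
```

A mention that returns `None` is a person the database does not know, which is the case the next slides address.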
36. Labels for choices?
For items not in the database, infer a unique
label (e.g., for hypothetical Wikipedia page).
[Figure: facet panel “Filter results by… People” listing Kris Stephens, J. Christopher Stevens, Chris Stephens, …]
37. Filter?
For items not in the database, infer a unique
label (e.g., for hypothetical Wikipedia page).
[Figure: facet panel “Filter results by… People” listing Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)]
38. Filter.
Let’s give it a try…
[Figure: facet panel “Filter results by… People” listing Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
39. Filter.
Let’s give it a try…
[Figure: results filtered by People: J. Christopher Stevens; remaining choices under “Add filters…” are Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …]
43. Does it work?
How do you measure?
44. How do you measure?
Imagine this was the result of applying the filter with
the name from Wikipedia.
[Figure: results filtered by People: J. Christopher Stevens; remaining choices under “Add filters…” are Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …]
45. How do you measure?
Precision: for each document, how much of the stuff
grouped with it is correct?
[Figure: three results returned for the filter “J. Christopher Stevens”, scored per document: ✗ 1/3 = 33%, ✓ 2/3 = 67%, ✓ 2/3 = 67%]
46. How do you measure?
Recall: for each document, how much of the correct
stuff is grouped with it?
[Figure: the two correct results for the filter “J. Christopher Stevens” each score 2/5 = 40%; three further items are marked ✗]
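The per-document definitions of precision and recall given on these slides match what is usually called the B-cubed metric, so the sketch below assumes that metric. The document ids and labels are hypothetical, chosen so that the scores reproduce the 2/3, 1/3, and 2/5 figures shown on the slides.

```python
def bcubed(system, gold):
    """Per-document precision and recall.

    system, gold: dicts mapping document id -> group label.
    Returns {doc: (precision, recall)}.
    """
    scores = {}
    for doc in system:
        same_system = {d for d in system if system[d] == system[doc]}
        same_gold = {d for d in gold if gold[d] == gold[doc]}
        correct = same_system & same_gold  # grouped with doc AND truly the same entity
        scores[doc] = (len(correct) / len(same_system), len(correct) / len(same_gold))
    return scores

# Hypothetical ground truth: five documents are about Stevens, one is not.
gold = {"d1": "stevens", "d2": "stevens", "d3": "other",
        "d4": "stevens", "d5": "stevens", "d6": "stevens"}
# Hypothetical system output: the filter returns d1, d2, d3 and misses d4-d6.
system = {"d1": "A", "d2": "A", "d3": "A", "d4": "B", "d5": "C", "d6": "D"}

for doc, (p, r) in sorted(bcubed(system, gold).items()):
    print(f"{doc}: precision={p:.2f} recall={r:.2f}")
```

Averaging the per-document values over all documents gives a single precision and a single recall for the whole collection.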
47. Does it work?
We often combine Precision and Recall
measurements into a single
measurement, called “F”.
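The slide does not say which variant of F is meant. A common choice is the weighted harmonic mean of precision and recall, sketched here with a `beta` parameter that defaults to the balanced F1; treat the function name and default as assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Using a precision of 2/3 and a recall of 2/5:
print(f_measure(2 / 3, 2 / 5))
```

Because it is a harmonic mean, F stays low unless both precision and recall are reasonably high.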
49. Where does your favorite data lie?
[Figure: corpora plotted by Ambiguity, Variety, and # of documents (Thousands, Millions, Billions), with a legend keyed by F score: F>=?, F>=70, F>=85]
Corpora: ACE 2005, WEPS-2, TAC pre-2012, TAC eng 2012, TAC zho 2012, TAC spa 2012, Basis Balanced, Basis Ambig, Basis Variance 1, Basis Variance 2
50. Trading off Errors
Let’s pretend you’re researching the pastors
instead.
51. Trading off Errors
What if you think there are too many (or too few)?
Add a slider to make the filter finer (or coarser).
[Figure: results filtered by People: Kris Stephens (pastor); remaining choices under “Add filters…” are J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
52. Trading off Errors
Make the filter finer.
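One way to think about such a slider is as a similarity threshold on the grouping step: raising it splits groups, lowering it merges them. The sketch below assumes plain string similarity from Python's `difflib` as a stand-in for a real resolution score, and single-link grouping; both are illustrative choices, not the approach described in the talk.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a real resolution score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_names(names, threshold):
    """Single-link grouping: merge names whose similarity is >= threshold."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

NAMES = ["Kris Stephens", "Chris Stephens", "Chris Stevens",
         "J. Christopher Stevens", "Dan Cathy"]
for threshold in (0.0, 0.8, 1.0):  # slider positions from coarse to fine
    print(threshold, group_names(NAMES, threshold))
```

At the coarsest setting every name lands in one group, and at the finest setting each distinct string is its own group. Settings in between trade wrongly merged people against wrongly split ones.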
54. Questions
• Suggested questions:
– Doesn’t Google already do this?
– Speed? Scale?
– Multi-lingual?
– What other uses are there for entity resolution
beyond faceted search?
55. Thank you!
For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090
56. Doesn’t Google already do this?
To some extent, when searching for famous entities.
57. Speed/Scale
• Support from BRAVE for scale in CY13!
• Research version:
– Tested on up to 1 million documents
– Sub-second processing per document
– Incremental updates (i.e., you see documents
published minutes ago)
59. Other uses for entity resolution?
• Supporting relationship resolution by resolving the
entities participating in them.
• Knowledge base population
• Integrating disparate data sets
• Alerting
• Improving relevance of search results
• Predictive Analytics