Michael Wick - Human Machine Cooperation: User Corrections for AKBC
1. Human Machine Cooperation: User Corrections for AKBC
Michael Wick, Karl Schultz, Andrew McCallum
University of Massachusetts, Amherst.
2. Motivation
• KBs for real-world decision making
• Problem: data needs integration
• AKBC/IE is scalable, but inaccurate
• Humans are more accurate, lack coverage
• Question: how do we combine human and machine KBC?
3. Goal: build a database of every scientist in the world.
4. Knowledge Base Construction
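[Figure: KBC pipeline. Text documents (.pdf, .bib, .html) → entity extraction (entity mentions) → relation extraction (relation mentions) → resolution (coref) → KB of entities and relations (structured data), which is queried and returns the “truth” as the answer. Example: mentions “Wei Li”, “W. Li”, and “Xinghua U.”; relation mention Attends(Wei Li, Xinghua U.); resolved entities Wei Li, W. Li, and Xinghua U.]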
Problem:
(1) errors snowball in IE pipeline
(2) errors persist in DB - forever
8. KB Coreference Errors
Record id=5: First: Fernando; Last: Pereira; Institution: Google, UPenn; Topics: CRF, IE, NLP; Venues: ICML, NIPS, EMNLP
Record id=3: First: Fernando; Last: Pereira; Institution: U. Edinburgh, SRI; Topics: logic programming, AI, urban traffic modeling, NLP; Venues: Logic programming
Coref? NO
Features:
1. Institution overlap... NO
2. Venue overlap... NO
3. Topic overlap... LOW
id=1
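The feature check on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' model: the Jaccard measure, the `overlap_label` helper, and the 0.5 threshold separating LOW from HIGH are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, compared case-insensitively."""
    a = {x.lower() for x in a}
    b = {x.lower() for x in b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_label(a, b, high=0.5):
    """Map a similarity score to NO / LOW / HIGH (the threshold is hypothetical)."""
    score = jaccard(a, b)
    if score == 0.0:
        return "NO"
    return "HIGH" if score >= high else "LOW"


# The two Fernando Pereira records from the slide.
rec5 = {
    "institution": ["Google", "UPenn"],
    "topics": ["CRF", "IE", "NLP"],
    "venues": ["ICML", "NIPS", "EMNLP"],
}
rec3 = {
    "institution": ["U. Edinburgh", "SRI"],
    "topics": ["logic programming", "AI", "urban traffic modeling", "NLP"],
    "venues": ["Logic programming"],
}

features = {name: overlap_label(rec5[name], rec3[name]) for name in rec5}
print(features)
```

With only a single shared topic (NLP) and no shared institutions or venues, a purely automatic comparison has little evidence that the two records describe the same person.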
11. Human Edits to Coreference
“Fernando Pereira with id=5 is Fernando Pereira with id=3”
“Fernando Pereira with id=2 is Fernando Pereira with id=1”
“Fernando Pereira with id=5 is Fernando Pereira with id=4”
16. Edits to Coreference
[Figure: a KB with coref errors receives a stream of user edits, including one good must-link edit and one bad must-link edit.]
Incorporate edits: how do we resolve conflicts?
17. Strategy 1: Most recent edit gets priority
Edit 1: good edit (must-link)
Edit 2: bad edit (must-link)
Edit order: 2 then 1
20. Strategy 1: Most recent edit gets priority
Edit 1: good edit (must-link)
Edit 2: bad edit (must-link)
Edit order: 1 then 2
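A toy sketch of why "most recent edit gets priority" is sensitive to edit order. Everything here is an assumption for illustration: records are mapped to entity labels, a must-link edit `(src, dst)` overwrites the entity of `src` with that of `dst`, and the entity names and the choice of which edit is the bad one are hypothetical.

```python
def apply_most_recent_wins(entity_of, edits):
    """Apply must-link edits in order; each edit overwrites the source record's entity."""
    result = dict(entity_of)
    for src, dst in edits:
        result[src] = result[dst]
    return result


# Hypothetical starting KB: three records, each in its own entity.
initial = {"id=3": "E_nlp", "id=4": "E_mpeg", "id=5": "E_unresolved"}

good_edit = ("id=5", "id=3")  # assumed correct: same person
bad_edit = ("id=5", "id=4")   # assumed incorrect: different person

bad_then_good = apply_most_recent_wins(initial, [bad_edit, good_edit])
good_then_bad = apply_most_recent_wins(initial, [good_edit, bad_edit])

print(bad_then_good["id=5"])  # E_nlp
print(good_then_bad["id=5"])  # E_mpeg
```

The same two edits leave the KB correct in one order and wrong in the other, so the result depends on arrival order rather than on which edit is right.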
23. Strategy 2: Deterministic integration of edits
Edit 1: good edit (must-link)
Edit 2: bad edit (must-link)
Enforce transitivity
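A minimal sketch of deterministic integration with transitivity enforced, using a union-find structure. This is an assumed implementation for illustration only; the record ids follow the earlier edit examples, and treating the edit that links id=5 to id=4 as the bad one is an assumption.

```python
def merge_must_links(records, must_links):
    """Honor every must-link edit and enforce transitivity via union-find."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in must_links:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


records = ["id=1", "id=2", "id=3", "id=4", "id=5"]
good_edits = [("id=5", "id=3"), ("id=2", "id=1")]
bad_edit = ("id=5", "id=4")  # assumed to be the incorrect edit

only_good = merge_must_links(records, good_edits)
with_bad = merge_must_links(records, good_edits + [bad_edit])
print(only_good)
print(with_bad)
```

Because every edit is trusted, a single bad must-link pulls id=4 into the same entity as id=3 and id=5, even though no edit linked id=3 and id=4 directly.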
27. How should edits be managed?
• User modification of “the truth” is risky
• Humans disagree
• Humans make mistakes
• “Truth” changes over time
• Our approach:
• edits as statistical evidence
• “truth” inferred from evidence
33. What is the truth?
[Figure: “the truth” sits on top of the evidence and is inferred by MCMC. Evidence: unstructured data (e.g., PDFs) processed by IE models (e.g., CRFs), structured data (e.g., ACM, DBLP), and user edits.]
41. Human Edits as Evidence
“The Fernando Pereira at Google is the Fernando Pereira at U. Edinburgh”
  (Name: Fernando Pereira, Institution: Google) must-link (Name: Fernando Pereira, Institution: U. Edinburgh)
“The CRF Fernando Pereira is the Prolog Fernando Pereira”
  (Name: Fernando Pereira, Topics: CRF) must-link (Name: Fernando Pereira, Topics: Prolog)
“The NLP Fernando Pereira is the MPEG Fernando Pereira”
  (Name: Fernando Pereira, Topics: NLP) must-link (Name: Fernando Pereira, Topics: MPEG)
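One way to picture an edit as evidence is as a pair of small mention records joined by a must-link. The sketch below is a hypothetical data structure, not the authors' schema; `Mention`, `MustLink`, and `edit_to_evidence` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A partial description of a person, as stated in a user edit."""
    name: str
    attributes: tuple  # e.g. (("Institution", "Google"),)


@dataclass(frozen=True)
class MustLink:
    """A user's claim that two mentions refer to the same entity."""
    left: Mention
    right: Mention


def edit_to_evidence(name, key, left_value, right_value):
    """Turn 'the <left_value> <name> is the <right_value> <name>' into two linked mentions."""
    left = Mention(name, ((key, left_value),))
    right = Mention(name, ((key, right_value),))
    return MustLink(left, right)


edits = [
    edit_to_evidence("Fernando Pereira", "Institution", "Google", "U. Edinburgh"),
    edit_to_evidence("Fernando Pereira", "Topics", "CRF", "Prolog"),
    edit_to_evidence("Fernando Pereira", "Topics", "NLP", "MPEG"),
]

# The mentions are stored as evidence rather than applied directly to the KB.
evidence = [m for e in edits for m in (e.left, e.right)]
print(len(evidence))
```

Storing the edit as mentions means it never overwrites an existing record; it simply becomes more input for coreference.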
42. Human Edits: Mentions Added to DB
KB records:
  Left record: First: Fernando; Last: Pereira; Institution: Google, UPenn; Topics: CRF, IE, NLP; Venues: ICML, NIPS, EMNLP
  Right record: First: Fernando; Last: Pereira; Institution: U. Edinburgh, SRI; Topics: logic programming, AI, urban traffic modeling, NLP; Venues: Logic programming
Edit mentions:
  (Name: Fernando Pereira, Institution: Google) and (Name: Fernando Pereira, Institution: U. Edinburgh)
  (Name: Fernando Pereira, Topics: CRF) and (Name: Fernando Pereira, Topics: Prolog)
  (Name: Fernando Pereira, Topics: NLP) and (Name: Fernando Pereira, Topics: MPEG)
43. Human Edits: Perform Coreference
KB records:
  Left record: First: Fernando; Last: Pereira; Institution: Google, UPenn; Topics: CRF, IE, NLP; Venues: ICML, NIPS, EMNLP
  Right record: First: Fernando; Last: Pereira; Institution: U. Edinburgh, SRI; Topics: logic programming, AI, urban traffic modeling, NLP; Venues: Logic programming
Edit mentions:
  (Name: Fernando Pereira, Institution: Google) and (Name: Fernando Pereira, Institution: U. Edinburgh)
  (Name: Fernando Pereira, Topics: CRF) and (Name: Fernando Pereira, Topics: Prolog)
  (Name: Fernando Pereira, Topics: NLP)
  (Name: Fernando Pereira, Topics: MPEG)
46. Human Edits: Perform Coreference
Left record: First: Fernando; Last: Pereira; Institution: Google, UPenn; Topics: CRF, IE, NLP; Venues: ICML, NIPS, EMNLP
Right record: First: Fernando; Last: Pereira; Institution: U. Edinburgh, SRI; Topics: logic programming, AI, urban traffic modeling, NLP; Venues: Logic programming
Coref? YES
Features:
1. Institution overlap... NO
2. Venue overlap... NO
3. Topic overlap... LOW
4. Should-link... YES
50. Incorrect edit
Left record: First: Fernando; Last: Pereira; Institution: Google, UPenn; Topics: CRF, IE, NLP; Venues: ICML, NIPS, EMNLP
Right record: First: Fernando; Last: Pereira; Institution: Superior Tecnic; Topics: MPEG; Venues: ICIP
Coref? NO
Features:
1. Institution overlap... NO
2. Venue overlap... NO
3. Topic overlap... NO
4. Should-link... YES
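The should-link feature can be read as one vote among several rather than a command. The toy linear score below illustrates that reading; it is not the hierarchical CRF used in the talk, and the numeric encoding in `VALUE` and the weights in `WEIGHTS` are hypothetical values chosen for the example (a real system would learn them).

```python
# Hypothetical numeric encoding of the feature values shown on the slides.
VALUE = {"NO": -1.0, "LOW": 0.5, "YES": 1.0}

# Hypothetical weights; should-link counts for more than any single overlap feature.
WEIGHTS = {
    "institution_overlap": 1.0,
    "venue_overlap": 1.0,
    "topic_overlap": 1.0,
    "should_link": 2.0,
}


def coref_score(features):
    """Weighted sum of the encoded feature values."""
    return sum(WEIGHTS[name] * VALUE[label] for name, label in features.items())


def is_coref(features):
    """Decide coreference by the sign of the score."""
    return coref_score(features) > 0


# Correct edit: a little topic overlap plus the user's should-link.
correct_edit = {
    "institution_overlap": "NO",
    "venue_overlap": "NO",
    "topic_overlap": "LOW",
    "should_link": "YES",
}
# Incorrect edit: no overlap at all, so the should-link is outvoted.
incorrect_edit = {
    "institution_overlap": "NO",
    "venue_overlap": "NO",
    "topic_overlap": "NO",
    "should_link": "YES",
}

print(is_coref(correct_edit), is_coref(incorrect_edit))
```

Under these assumed weights the edit tips the decision when the other evidence is merely weak, but cannot override it when all other evidence disagrees.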
54. Experiments
1. Build initial KB with automatic coreference
2. Simulate user edits
   (simulated edits include both good and bad must-link edits)
3. Apply edits: our probabilistic approach vs. two deterministic approaches
55. Hierarchical + Human Edits
Better incorporation of correct human edits
[Figure: database quality versus the number of correct human edits. x-axis: No. of human edits (0 to 30); y-axis: F1 accuracy (0.55 to 0.80). Legend (edit incorporation strategy): Epistemological (probabilistic), Overwrite, Maximally satisfy. Curve annotations: “Our probabilistic reasoning”, “Local satisfaction”, “Traditional overwrite”.]
56. Hierarchical + Human Edits
More robust to incorrect human edits
[Figure: database quality versus the number of errorful human edits. x-axis: 0 to 60; y-axis: Precision (0.4 to 0.8). Legend (edit incorporation strategy): Epistemological (probabilistic), Complete trust in users. Curve annotations: “Our probabilistic reasoning”, “Complete trust in humans”.]
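The plots report F1 and precision for the resulting database. The slides do not say how these are computed; a common choice for coreference is the pairwise variant, sketched below as an assumption. The record names in the example are hypothetical.

```python
from itertools import combinations


def coreferent_pairs(clusters):
    """All unordered pairs of records that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, and F1 of a predicted clustering against gold."""
    pred, true = coreferent_pairs(predicted), coreferent_pairs(gold)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: record "c" is wrongly grouped with "d".
gold = [{"a", "b", "c"}, {"d"}]
predicted = [{"a", "b"}, {"c", "d"}]

precision, recall, f1 = pairwise_prf(predicted, gold)
print(precision, recall, f1)
```

A bad must-link lowers precision by creating false pairs, while a missed link lowers recall, which is why the second plot tracks precision as errorful edits accumulate.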
58. Come see our poster!
• Technical details including
- Hierarchical CRF for coreference
- MCMC for inference
• Probabilistic incorporation of human edits
• Epistemological Databases
THANK YOU
Editor's notes
* Reminder of an epistemological database: streaming evidence is stored, truth is inferred
* “Coref is the foundation for everything”
* “Coref everywhere”
* I will speak today about our work scaling coreference to large scales