Over the last few years, information extraction tools have gained great popularity and brought significant performance improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve \emph{NE Coverage}, we propose a hybrid approach, where we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in (1) identifying mistakes, inconsistencies and ambiguities in the currently used ground truth, and (2) gathering ground truth annotations for NER that capture a multitude of opinions.
1. Web & Media Group
http://CrowdTruth.org
Harnessing Diversity in Crowds and Machines for Better NER Performance
Oana Inel and Lora Aroyo
Extended Semantic Web Conference, May 30th 2017
2. Web & Media Group
Ground Truth Datasets:
OKE2015: Open Knowledge Extraction challenge (ESWC) 2015
• 101 sentences
• 664 entities
OKE2016: Open Knowledge Extraction challenge (ESWC) 2016
• 55 sentences
• 340 entities
Named Entities of Type: Person, Place, Organization & Role
https://github.com/anuzzolese/oke-challenge
https://github.com/anuzzolese/oke-challenge-2016
What’s wrong with the ground truth for NER?
3. Web & Media Group
http://CrowdTruth.org
What’s wrong with the ground truth for NER?
• Inconsistencies in annotating entities of type ROLE and PERSON
... controversial [bishop] [Richard Williamson].
● [bishop] → role
● [Richard Williamson] → person
… and similar for: … appointed Knight Bachelor by the [Queen] [Elizabeth II] in 1988.
• But, “Bishop Petronius” is a PERSON: Bologna was reborn in the 5th century under [Bishop Petronius].
4. Web & Media Group
http://CrowdTruth.org
What’s wrong with the ground truth for NER?
• Inconsistencies across datasets
• 4 cases in OKE2015 where a concatenation of entities of type PLACE was treated as a single entity
… but the offsets do not match the actual string
• In OKE2016, such entities were classified as two separate entities of type PLACE
Such a man did post-doctoral work at the Salk Institute in San Diego in the laboratory of Renato Dulbecco, then worked at the Basel Institute for Immunology in [Basel, Switzerland].
5. Web & Media Group
http://CrowdTruth.org
What’s wrong with the ground truth for NER?
• There are errors in the ground truth
• 1 case in OKE2015: [One of the them] was an eminent scholar at Berkeley.
• Personal pronouns (co-references) and possessive pronouns are considered named entities of type PERSON
• 85/304 cases (OKE2015)
• 27/105 cases (OKE2016)
Giulio Natta was born in Imperia, Italy. [He] earned [his] degree in chemical engineering from the Politecnico di Milano university in Milan in 1924.
6. Web & Media Group
http://CrowdTruth.org
How reliable are the NER tools trained using this ground truth?
… but, then ...
7. Web & Media Group
How do state-of-the-art NER tools perform on this ground truth?
Performance of NER Tools:
• NERD-ML (http://nerd.eurecom.fr)
• TextRazor (https://www.textrazor.com)
• THD (http://ner.vse.cz/thd/)
• DBpediaSpotlight (http://dbpedia-spotlight.github.io/demo/)
• SemiTags (http://nlp.vse.cz/SemiTags/)
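As an aside, here is a minimal sketch of querying one of these tools programmatically; the endpoint URL, request parameters and JSON fields are assumptions based on DBpedia Spotlight's public demo API, not something stated on the slides:

import requests

# Minimal sketch: annotate a sentence with DBpedia Spotlight.
# Assumed (not from the slides): the public endpoint below, its
# "text"/"confidence" parameters, and the "Resources" JSON field.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # Each resource carries a surface form and its character offset.
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"])
            for r in response.json().get("Resources", [])]

print(annotate("Giulio Natta was born in Imperia, Italy."))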
8. Web & Media Group
http://CrowdTruth.org
… let’s see how NER tools perform on this GT
• They do not extract personal pronouns (co-references) and possessive pronouns as entities of type PERSON
• 83/85 missed cases in OKE2015
• 26/27 missed cases in OKE2016
Giulio Natta was born in Imperia, Italy. [He] earned [his] degree in chemical engineering from the Politecnico di Milano university in Milan in 1924.
• They extract multiple span variations for the same entity of type PERSON
• 73/92 cases in OKE2015
• 9/13 cases in OKE2016
The woman was awarded the Nobel Prize in Physics in 1963, which she shared with [[J.] [[Hans] D.] [Jensen]] and Eugene Wigner.
9. Web & Media Group
http://CrowdTruth.org
… let’s see how NER tools perform on this GT
● Many entities of type ORGANIZATION are a combination of ORGANIZATION + PLACE
• NER tools tend to extract the entity at each granularity
• The GT does not allow for overlapping entities or multiple perspectives
• 105/213 cases in OKE2015
• 62/157 cases in OKE2016
Such a man did post-doctoral work at the Salk Institute in San Diego in the laboratory of Renato Dulbecco, then worked at the [[[Basel] [Institute]] for Immunology] in Basel, Switzerland.
10. Web & Media Group
http://CrowdTruth.org
… and now let’s see their overall performance
● High disagreement between the NER tools
○ Similar performance in F1, but different #FP, #TP, #FN
11. Web & Media Group
http://CrowdTruth.org
… and now let’s see their overall performance
● High disagreement between the NER tools
○ Similar performance in F1, but different #FP, #TP, #FN
● Low recall, many entities missed
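For reference, the standard relation between these counts and the reported scores (textbook definitions, not from the slides):

\mathrm{Precision} = \frac{\#TP}{\#TP + \#FP}, \qquad \mathrm{Recall} = \frac{\#TP}{\#TP + \#FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

This is why tools with similar F1 can still differ in their individual #TP, #FP and #FN counts: different precision/recall trade-offs can yield the same harmonic mean.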
12. Web & Media Group
It is difficult to understand the reliability of the different NER tools,
and thus difficult to choose the “best one” for your use case.
13. Web & Media Group
Hypothesis
Aggregating the output of NER tools by harnessing the inter-tool disagreement (Multi-NER) performs better than the individual NER tools (Single-NER). A sketch of the aggregation follows below.
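A minimal sketch of such an aggregation, assuming each tool returns (start, end, type) spans; the function and the per-span score are illustrative, not the authors' implementation:

from collections import Counter

# Minimal Multi-NER aggregation sketch (illustrative). Each tool is
# assumed to be a callable returning a set of (start, end, type)
# spans for a sentence.
def multi_ner(sentence, tools):
    votes = Counter()
    for tool in tools:
        for span in tool(sentence):
            votes[span] += 1
    # Attach to every span the ratio of tools that extracted it,
    # keeping the inter-tool disagreement instead of discarding it.
    return {span: count / len(tools) for span, count in votes.items()}

Pooling every span that any tool extracts is what raises #TP and lowers #FN on the following slides, at the cost of more #FP.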
14. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
15. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
• Multi-NER: significantly higher #TP & lower #FN
16. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
• Multi-NER: significantly higher #TP & lower #FN
• Multi-NER: significantly higher #FP
17. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
The more the merrier?
Is performance correlated with the number of NER tools that extracted a given named entity?
● Performance comparison: the likelihood of an entity to be contained in the gold standard, based on how many NER tools extracted it
Sentence-entity score = the ratio of NER tools that extracted the entity (formalized below)
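In symbols (a restatement of the slide's definition; the notation is mine): for a sentence s, entity e and tool set T,

\mathrm{score}(s, e) = \frac{\lvert \{\, t \in T : t \text{ extracts } e \text{ in } s \,\} \rvert}{\lvert T \rvert}

With the five tools above, an entity extracted by three of them gets a score of 3/5 = 0.6.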
18. Web & Media Group
http://CrowdTruth.org
Disagreement-Aware Multi-NER vs. Single-NER
Multi-NER outperforms the SOTA Single-NER tools at a sentence-entity score threshold >= 0.4
19. Web & Media Group
http://CrowdTruth.org
Disagreement-Aware Multi-NER vs. Single-NER
Multi-NER at a sentence-entity score threshold >= 0.4 also outperforms the majority vote approach (sentence-entity score threshold = 0.6)
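A sketch of this thresholding step, reusing the illustrative multi_ner output from above; with five tools, majority vote corresponds to score >= 0.6 (at least 3 of 5), while the disagreement-aware setting keeps spans down to 0.4 (at least 2 of 5):

def filter_by_score(scored_spans, threshold=0.4):
    # Keep spans extracted by at least `threshold` of the tools.
    # threshold=0.6 reproduces majority vote over 5 tools;
    # threshold=0.4 is the disagreement-aware setting that
    # outperforms it on these datasets.
    return {span: s for span, s in scored_spans.items() if s >= threshold}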
20. Web & Media Group
Hypothesis
Crowdsourced ground truth that harnesses inter-annotator disagreement (Multi-NER+CrowdGT) produces diversity in annotations and thus improves the aggregated output of the NER tools.
21. Web & Media Group
http://CrowdTruth.org
Crowdsourcing for Better Ground Truth
• Identify all the valid expressions and their types
• Reduce the number of FP and FN cases
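A minimal sketch of turning such crowd judgments into a crowd sentence-entity score, in the spirit of the CrowdTruth metrics (the plain worker ratio below is my simplification; the actual CrowdTruth metrics are more involved):

def crowd_sentence_entity_score(judgments, span):
    # judgments: one set of (start, end, type) spans per worker,
    # listing the expressions that worker marked as valid entities.
    # The score is the fraction of workers who kept the span; low
    # scores flag likely false positives in the Multi-NER pool.
    kept = sum(1 for worker_spans in judgments if span in worker_spans)
    return kept / len(judgments)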
22. Web & Media Group
http://CrowdTruth.org
Crowdsourcing Task
23. Web & Media Group
http://CrowdTruth.org
Crowdsourcing for Better NER
For each crowd sentence-entity score, the crowd-enhanced Multi-NER (Multi-NER+Crowd) performs much better than the Multi-NER alone.
24. Web & Media Group
http://CrowdTruth.org
Crowdsourcing for Better Ground Truth
Multi-NER+CrowdGT, the Multi-NER enhanced through crowd-driven ground truth gathering, performs better than Multi-NER+Crowd.
25. Web & Media Group
http://CrowdTruth.org
Conclusions & Future Work
● a hybrid, crowd-driven Multi-NER approach for improved NER performance
● the crowd is able to correct the mistakes of the NER tools by reducing the total number of false positive cases
● the crowd-driven ground truth, which harnesses diversity, perspectives and granularities, proves to be more reliable
Future work:
● improving the event detection ground truth (FactBank) through disagreement-aware crowdsourcing
● training an event classifier with crowd-driven events
Data
http://data.crowdtruth.org
http://data.crowdtruth.org/crowdsourcing-ne-goldstandards/
Tool & Code
http://dev.CrowdTruth.org
http://github.com/CrowdTruth