Over the last few years, information extraction tools have gained great popularity and brought significant performance improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve \emph{NE Coverage}, we propose a hybrid approach, where we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in (1) identifying mistakes, inconsistencies and ambiguities in the currently used ground truth, and (2) gathering ground truth annotations for NER that capture a multitude of opinions.
1. Web & Media Group
http://CrowdTruth.org
Harnessing Diversity in Crowds and Machines for Better NER Performance
Oana Inel and Lora Aroyo
Extended Semantic Web Conference, May 30th 2017
2. Web & Media Group
Ground Truth Datasets:
OKE2015: Open Knowledge Extraction challenge (ESWC) 2015
• 101 sentences
• 664 entities
OKE2016: Open Knowledge Extraction challenge (ESWC) 2016
• 55 sentences
• 340 entities
Named Entities of Type: Person, Place, Organization & Role
https://github.com/anuzzolese/oke-challenge
https://github.com/anuzzolese/oke-challenge-2016
What’s wrong with the ground truth for NER?
3. Web & Media Group
http://CrowdTruth.org
What’s wrong with the ground truth for NER?
• Inconsistencies in annotating entities of type ROLE and PERSON
... controversial [bishop] [Richard Williamson].
● [bishop] → role
● [Richard Williamson] → person
… and similar for: … appointed Knight Bachelor by the [Queen] [Elizabeth II] in 1988.
• But, “Bishop Petronius” is a PERSON: Bologna was reborn in the 5th century under [Bishop Petronius].
4. Web & Media Group
http://CrowdTruth.org
What’s wrong with the ground truth for NER?
• Inconsistencies across datasets
• 4 cases in OKE2015 where a concatenation of entities of type PLACE was treated as a single entity
… but the offsets do not match the actual string
• In OKE2016, such entities were classified as two separate entities of type PLACE
Such a man did post-doctoral work at the Salk Institute in San Diego in the laboratory of Renato Dulbecco, then worked at the Basel Institute for Immunology in [Basel, Switzerland].
5. Web & Media Group
http://CrowdTruth.org
What’s wrong with the ground truth for NER?
• There are errors in the ground truth
• 1 case in OKE2015: [One of the them] was an eminent scholar at Berkeley.
• Personal pronouns (co-references) and possessive pronouns are considered named entities of type PERSON
• 85/304 cases (OKE2015)
• 27/105 cases (OKE2016)
Giulio Natta was born in Imperia, Italy. [He] earned [his] degree in chemical engineering from the Politecnico di Milano university in Milan in 1924.
6. Web & Media Group
http://CrowdTruth.org
How reliable are the NER tools trained using this ground truth?
… but, then ...
7. Web & Media Group
How do state-of-the-art NER tools perform on this ground truth?
Performance of NER Tools:
• NERD-ML (http://nerd.eurecom.fr)
• TextRazor (https://www.textrazor.com)
• THD (http://ner.vse.cz/thd/)
• DBpediaSpotlight (http://dbpedia-spotlight.github.io/demo/)
• SemiTags (http://nlp.vse.cz/SemiTags/)
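As an aside, here is a minimal sketch of querying one of these tools programmatically; the endpoint URL, request parameters and JSON fields are assumptions based on DBpedia Spotlight's public demo API, not something stated on the slides:

import requests

# Minimal sketch: annotate a sentence with DBpedia Spotlight.
# Assumed (not from the slides): the public endpoint below, its
# "text"/"confidence" parameters, and the "Resources" JSON field.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # Each resource carries a surface form and its character offset.
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"])
            for r in response.json().get("Resources", [])]

print(annotate("Giulio Natta was born in Imperia, Italy."))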
8. Web & Media Group
http://CrowdTruth.org
… let’s see how NER tools perform on this GT
• They do not extract personal pronouns (co-references) and possessive pronouns as entities of type PERSON
• 83/85 missed cases in OKE2015
• 26/27 missed cases in OKE2016
Giulio Natta was born in Imperia, Italy. [He] earned [his] degree in chemical engineering from the Politecnico di Milano university in Milan in 1924.
• They extract multiple span variations for the same entity of type PERSON
• 73/92 cases in OKE2015
• 9/13 cases in OKE2016
The woman was awarded the Nobel Prize in Physics in 1963, which she shared with [[J.] [[Hans] D.] [Jensen]] and Eugene Wigner.
9. Web & Media Group
http://CrowdTruth.org
… let’s see how NER tools perform on this GT
● Many entities of type ORGANIZATION are a combination of ORGANIZATION + PLACE
• NER tools tend to extract the entity at each granularity
• The GT does not allow for overlapping entities or multiple perspectives
• 105/213 cases in OKE2015
• 62/157 cases in OKE2016
Such a man did post-doctoral work at the Salk Institute in San Diego in the laboratory of Renato Dulbecco, then worked at the [[[Basel] [Institute]] for Immunology] in Basel, Switzerland.
10. Web & Media Group
http://CrowdTruth.org
… and now let’s see their overall performance
● High disagreement between the NER tools
○ Similar performance in F1, but different #FP, #TP, #FN
11. Web & Media Group
http://CrowdTruth.org
… and now let’s see their overall performance
● High disagreement between the NER tools
○ Similar performance in F1, but different #FP, #TP, #FN
● Low recall, many entities missed
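For reference, the standard relation between these counts and the reported scores (textbook definitions, not from the slides):

\mathrm{Precision} = \frac{\#TP}{\#TP + \#FP}, \qquad \mathrm{Recall} = \frac{\#TP}{\#TP + \#FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

This is why tools with similar F1 can still differ in their individual #TP, #FP and #FN counts: different precision/recall trade-offs can yield the same harmonic mean.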
12. Web & Media Group
It is difficult to understand the reliability of the different NER tools,
and thus difficult to choose the “best one” for your use case.
13. Web & Media Group
Hypothesis
Aggregating the output of NER tools by harnessing the inter-tool disagreement (Multi-NER) performs better than the individual NER tools (Single-NER). A sketch of the aggregation follows below.
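A minimal sketch of such an aggregation, assuming each tool returns (start, end, type) spans; the function and the per-span score are illustrative, not the authors' implementation:

from collections import Counter

# Minimal Multi-NER aggregation sketch (illustrative). Each tool is
# assumed to be a callable returning a set of (start, end, type)
# spans for a sentence.
def multi_ner(sentence, tools):
    votes = Counter()
    for tool in tools:
        for span in tool(sentence):
            votes[span] += 1
    # Attach to every span the ratio of tools that extracted it,
    # keeping the inter-tool disagreement instead of discarding it.
    return {span: count / len(tools) for span, count in votes.items()}

Pooling every span that any tool extracts is what raises #TP and lowers #FN on the following slides, at the cost of more #FP.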
14. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
15. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
• Multi-NER: significantly higher #TP & lower #FN
16. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
• Multi-NER: significantly higher #TP & lower #FN
• Multi-NER: significantly higher #FP
17. Web & Media Group
http://CrowdTruth.org
Multi-NER vs. SOTA Single-NER
The more the merrier?
Is performance correlated with the number of NER tools that extracted a given named entity?
● Performance comparison: the likelihood of an entity to be contained in the gold standard, based on how many NER tools extracted it
Sentence-entity score = the ratio of NER tools that extracted the entity (formalized below)
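In symbols (a restatement of the slide's definition; the notation is mine): for a sentence s, entity e and tool set T,

\mathrm{score}(s, e) = \frac{\lvert \{\, t \in T : t \text{ extracts } e \text{ in } s \,\} \rvert}{\lvert T \rvert}

With the five tools above, an entity extracted by three of them gets a score of 3/5 = 0.6.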
18. Web & Media Group
http://CrowdTruth.org
Disagreement-Aware Multi-NER vs. Single-NER
Multi-NER outperforms the SOTA Single-NER tools at a sentence-entity score threshold >= 0.4
19. Web & Media Group
http://CrowdTruth.org
Disagreement-Aware Multi-NER vs. Single-NER
Multi-NER at a sentence-entity score threshold >= 0.4 also outperforms the majority vote approach (sentence-entity score threshold = 0.6)
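A sketch of this thresholding step, reusing the illustrative multi_ner output from above; with five tools, majority vote corresponds to score >= 0.6 (at least 3 of 5), while the disagreement-aware setting keeps spans down to 0.4 (at least 2 of 5):

def filter_by_score(scored_spans, threshold=0.4):
    # Keep spans extracted by at least `threshold` of the tools.
    # threshold=0.6 reproduces majority vote over 5 tools;
    # threshold=0.4 is the disagreement-aware setting that
    # outperforms it on these datasets.
    return {span: s for span, s in scored_spans.items() if s >= threshold}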
20. Web & Media Group
Hypothesis
Crowdsourced ground truth that harnesses inter-annotator disagreement (Multi-NER+CrowdGT) produces diversity in annotations and thus improves the aggregated output of the NER tools.
21. Web & Media Group
http://CrowdTruth.org
Crowdsourcing for Better Ground Truth
• Identify all the valid expressions and their types
• Reduce the number of FP and FN cases
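A minimal sketch of turning such crowd judgments into a crowd sentence-entity score, in the spirit of the CrowdTruth metrics (the plain worker ratio below is my simplification; the actual CrowdTruth metrics are more involved):

def crowd_sentence_entity_score(judgments, span):
    # judgments: one set of (start, end, type) spans per worker,
    # listing the expressions that worker marked as valid entities.
    # The score is the fraction of workers who kept the span; low
    # scores flag likely false positives in the Multi-NER pool.
    kept = sum(1 for worker_spans in judgments if span in worker_spans)
    return kept / len(judgments)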
22. Web & Media Group
http://CrowdTruth.org
Crowdsourcing Task
23. Web & Media Group
http://CrowdTruth.org
Crowdsourcing for Better NER
For each crowd sentence-entity score, the crowd-enhanced Multi-NER (Multi-NER+Crowd) performs much better than the Multi-NER alone.
24. Web & Media Group
http://CrowdTruth.org
Crowdsourcing for Better Ground Truth
Multi-NER+CrowdGT, the Multi-NER enhanced through crowd-driven ground truth gathering, performs better than Multi-NER+Crowd.
25. Web & Media Group
http://CrowdTruth.org
Conclusions & Future Work
● a hybrid, crowd-driven Multi-NER approach for improved NER performance
● the crowd is able to correct the mistakes of the NER tools by reducing the total number of false positive cases
● the crowd-driven ground truth, which harnesses diversity, perspectives and granularities, proves to be more reliable
Future work:
● improving the event detection ground truth (FactBank) through disagreement-aware crowdsourcing
● training an event classifier with crowd-driven events
Data
http://data.crowdtruth.org
http://data.crowdtruth.org/crowdsourcing-ne-goldstandards/
Tool & Code
http://dev.CrowdTruth.org
http://github.com/CrowdTruth