[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entity extraction
1. [Apple | Organization] and [Oranges | Fruit]:
How to Evaluate NLP Tools for Entity Extraction
Gil Irizarry
VP Engineering
ODSC East 2020
2. BASIS TECHNOLOGY
Agenda
● About Me
● The Problem Space
● Defining the domain
● Assemble a test set
● Annotation Guidelines
● Review of measurement
● Evaluation examples
● Interannotator agreement
● The steps to evaluation
3. About Me
Gil Irizarry - VP Engineering at Basis Technology, responsible for NLP and Text
Analytics software development
https://www.linkedin.com/in/gilirizarry/
Basis Technology - leading provider of software solutions for extracting
meaningful intelligence from multilingual text and digital devices
5. The Problem Space
● You have some text to analyze. Which tool to choose?
● Related question: You have multiple text or data annotators. Which of them are
doing a good job?
● Both questions are made harder because the tools output different formats and
analyze data differently, and the annotators interpret data differently
● Start by defining the problem space
6. Defining the domain
● What space are you in?
● More importantly, in what domain will you evaluate tools?
● Are you:
○ Reading news
○ Scanning patents
○ Looking for financial fraud
7. Assemble a test set
● NLP systems are often trained on a general corpus. Often this corpus
consists of mainstream news articles.
● Do you use this domain or a more specific one?
● If more specific, do you train a custom model?
8. Annotation Guidelines
Examples requiring definition and agreement in guidelines:
● “Alice shook Brenda’s hand when she entered the meeting.” Is “Brenda” or
“Brenda’s” the entity to be extracted (in addition to Alice of course)?
● Are pronouns expected to be extracted and resolved? “She” in the previous
example
● What about tolerance to punctuation? The U.N. vs. the UN
● Should fictitious characters (“Harry Potter”) be tagged as “person”?
● When a location appears within an organization’s name, do you tag the location
and the organization extracted or just the organization (“San Francisco Association
of Realtors”)?
9. Annotation Guidelines
Examples requiring definition and agreement in guidelines:
● Do you tag the name of a person if it is used as a modifier (“Martin Luther King Jr.
Day”)?
● Do you tag “Twitter” in “You could try reaching out to the Twitterverse”?
● Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”?
● When do you include “the” in an entity?
● How do you differentiate between an entity that’s a company name and a product
by the same name? {[ORG]The New York Times} was criticized for an article about
the {[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}.
● “Washington and Moscow continued their negotiations.” Are Washington and
Moscow locations or organizations?
10. Annotation Guidelines
Non-entity extraction issues:
● How many levels of sentiment do you expect?
● Ontology and text classification - what categories do you expect?
● For language identification, are dialects identified as separate languages?
What about macrolanguages?
12. Annotation Guidelines
● Map to Universal Dependencies Guidelines where possible:
https://universaldependencies.org/guidelines.html
● Map to DBpedia ontology where possible:
http://mappings.dbpedia.org/server/ontology/classes/
● Map to known database such as Wikidata where possible:
https://www.wikidata.org/wiki/Wikidata:Main_Page
13. Review of measurement: precision
Precision is the fraction of retrieved documents that are relevant to the query
14. Review of measurement: recall
Recall is the fraction of the relevant documents that are successfully retrieved
15. Review of measurement: F-score
F-score is a harmonic mean of precision and recall
Precision and recall are ratios. In this case, a harmonic mean is more
appropriate for an average than an arithmetic mean.
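As a quick sketch (in Python, with illustrative counts), precision, recall, and their harmonic mean can be computed directly from true-positive, false-positive, and false-negative counts:

```python
# Sketch: precision, recall, and F1 from raw counts.
# The harmonic mean is pulled toward the smaller of the two ratios,
# which is why it suits averaging precision and recall.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

p = precision(tp=6, fp=1)    # 6/7 ≈ 0.86
r = recall(tp=6, fn=3)       # 6/9 ≈ 0.67
print(round(f1(p, r), 2))    # → 0.75
```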
16. Review of measurement: harmonic mean
A harmonic mean returns a single value to combine both precision and recall. For
two values a and b (here, precision and recall), the harmonic mean is
H = 2ab / (a + b), which maps to the F-score. Note that H is dominated by the
smaller value: if a is already larger than b, increasing a further barely raises
the overall score.
17. Review of measurement: F-score
The previous example of an F-score was actually an F1 score, which balances
precision and recall evenly. A more generalized form of the F-score is:
Fβ = (1 + β²) × P × R / (β² × P + R)
F2 (β = 2) weights recall higher than precision, and F0.5 (β = 0.5) weights
precision higher than recall
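A minimal sketch of the generalized form, assuming the standard definition Fβ = (1 + β²)PR / (β²P + R); the precision and recall values below are illustrative:

```python
# Sketch: generalized F-beta score.
# beta > 1 weights recall more heavily; beta < 1 weights precision.

def f_beta(p: float, r: float, beta: float) -> float:
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.9, 0.6
print(round(f_beta(p, r, 1.0), 3))   # → 0.72  (even balance, same as F1)
print(round(f_beta(p, r, 2.0), 3))   # → 0.643 (leans toward the lower recall)
print(round(f_beta(p, r, 0.5), 3))   # → 0.818 (leans toward the higher precision)
```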
18. Review of measurement: AP and MAP
● Average precision is a measure that combines recall and precision for
ranked retrieval results. For one information need, the average precision
is the mean of the precision scores after each relevant document is
retrieved
● Mean average precision is average precision over a range of queries
19. Review of measurement: MUC score
● Message Understanding Conference (MUC) scoring allows for taking
partial success into account
○ Correct: response = key
○ Partial: response ~= key
○ Incorrect: response != key
○ Spurious: key is blank and response is not
○ Missing: response is blank and key is not
○ Noncommittal: key and response are both blank
○ Recall = (correct + (partial x 0.5 )) / possible
○ Precision = (correct+(partial x 0.5)) / actual
○ Undergeneration = missing / possible
○ Overgeneration = spurious / actual
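These formulas can be sketched directly in Python; the counts in the example are hypothetical, not taken from any slide:

```python
# Sketch: MUC-style scoring from per-entity outcome counts.
# possible = gold entities   = correct + incorrect + partial + missing
# actual   = system entities = correct + incorrect + partial + spurious

def muc_scores(correct, incorrect, partial, missing, spurious):
    possible = correct + incorrect + partial + missing
    actual = correct + incorrect + partial + spurious
    return {
        "recall": (correct + 0.5 * partial) / possible,
        "precision": (correct + 0.5 * partial) / actual,
        "undergeneration": missing / possible,
        "overgeneration": spurious / actual,
    }

# Hypothetical counts: 8 correct, 1 partial, 1 missing, 1 spurious
scores = muc_scores(correct=8, incorrect=0, partial=1, missing=1, spurious=1)
print(scores["recall"], scores["precision"])  # → 0.85 0.85
```

Note how a partial match contributes half credit to both recall and precision, rather than counting as a plain miss.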
20. Evaluation Examples
As co-sponsor, Tim Cook was seated at a
table with Vogue editor Anna Wintour,
but he made time to get around and see
his other friends, including Uber CEO
Travis Kalanick. Cook's date for the night
was Laurene Powell Jobs, the widow of
Apple cofounder Steve Jobs. Powell
currently runs Emerson Collective, a
company that seeks to make
investments in education. Kalanick
brought a date as well, Gabi Holzwarth, a
well-known violinist.
21. Evaluation Examples - gold standard
As co-sponsor, Tim Cook was seated at a
table with Vogue editor Anna Wintour,
but he made time to get around and see
his other friends, including Uber CEO
Travis Kalanick. Cook's date for the
night was Laurene Powell Jobs, the
widow of Apple cofounder Steve Jobs.
Powell currently runs Emerson
Collective, a company that seeks to make
investments in education. Kalanick
brought a date as well, Gabi Holzwarth,
a well-known violinist.
22. Evaluation Examples - P, R, F
As co-sponsor, Tim Cook was seated at a
table with Vogue editor Anna Wintour,
but he made time to get around and see
his other friends, including Uber CEO
Travis Kalanick. Cook's date for the night
was Laurene Powell Jobs, the widow of
Apple cofounder Steve Jobs. Powell
currently runs Emerson Collective, a
company that seeks to make
investments in education. Kalanick
brought a date as well, Gabi Holzwarth, a
well-known violinist.
● (Green) TP = 6
● (Olive) FP = 1
● (Orange) TN = 3
● (Red) FN = 3
● Precision = 6/7 = .86
● Recall = 6/9 = .67
● F score = .75
23. Evaluation Examples - AP
As co-sponsor, Tim Cook was seated at a
table with Vogue editor Anna Wintour,
but he made time to get around and see
his other friends, including Uber CEO
Travis Kalanick. Cook's date for the night
was Laurene Powell Jobs, the widow of
Apple cofounder Steve Jobs. Powell
currently runs Emerson Collective, a
company that seeks to make
investments in education. Kalanick
brought a date as well, Gabi Holzwarth, a
well-known violinist.
● 1/1 (Green)
● 2/2 (Green)
● 3/3 (Green)
● 0/4 (Red)
● 4/5 (Green)
● 5/6 (Green)
● 0/7 (Red)
● 0/8 (Red)
● 6/9 (Green)
● AP = (1/1 + 2/2 + 3/3 + 4/5 + 5/6 +
6/9) / 6 = .88
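A small sketch reproducing this computation; `ranking` encodes the nine extracted mentions in order, 1 for a true positive (green) and 0 for an error (red):

```python
# Sketch: average precision over one ranked list of extracted mentions.
# At each true positive, record precision-at-that-rank; average over the hits.

def average_precision(ranking):
    hits, total = 0, 0.0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / k          # precision at this hit
    return total / hits if hits else 0.0

ranking = [1, 1, 1, 0, 1, 1, 0, 0, 1]
print(round(average_precision(ranking), 2))  # → 0.88
```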
24. Evaluation Examples - MUC scoring
Cook's date for the night was Laurene Powell Jobs, the widow of Apple
cofounder Steve Jobs.

Token      Gold     Eval     Result
Cook's     B-PER    B-PER    Partial
date       O-NONE   I-PER    Spurious
for        O-NONE   O-NONE   Correct
the        O-NONE   O-NONE   Correct
night      O-NONE   O-NONE   Correct
was        O-NONE   O-NONE   Correct
Laurene    B-PER    B-PER    Correct
Powell     I-PER    I-PER    Correct
Jobs,      I-PER    O-NONE   Missing
the        O-NONE   O-NONE   Correct
26. Interannotator Agreement
● Krippendorff's alpha is a reliability coefficient developed to measure the
agreement among observers, coders, judges, raters, or measuring
instruments drawing distinctions among typically unstructured
phenomena
● Cohen's kappa is a measure of the agreement between two raters who
determine which category a finite number of subjects belong to, whereby
agreement due to chance is factored out
● Interannotator agreement scoring determines the agreement between
different annotators annotating the same unstructured text
● It is not intended to measure the output of a tool against a gold standard
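As an illustrative sketch of Cohen's kappa for two annotators (the label sequences below are made up):

```python
# Sketch: Cohen's kappa for two annotators labeling the same items.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is chance agreement derived from each annotator's label frequencies.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["PER", "ORG", "PER", "LOC", "PER", "ORG"]
b = ["PER", "ORG", "LOC", "LOC", "PER", "PER"]
print(round(cohens_kappa(a, b), 2))  # → 0.48
```

The two annotators agree on 4 of 6 tokens, but because much of that agreement could occur by chance, kappa lands well below the raw 0.67 agreement rate.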
27. The Steps to Evaluation
● Define your requirements
● Assemble a valid test dataset
● Annotate the gold standard test dataset
● Get output from tools
● Evaluate the results
● Make your decision
Rosette is a full NLP stack from language identification to morphology to entity extraction and resolution
One tool will output 5 levels of sentiment and another only 3. One tool will output transitive vs. intransitive verbs and another will output only verbs. One will strip possessives (King’s Landing) and another won’t.
Finding data is relatively easy, but annotating data is hard
The Ukraine is now Ukraine, similarly Sudan. How do you handle the change over time?
Screenshot of the table of contents of our Annotation Guidelines: 42 pages. In some meetings, it's the only document under NDA. The header says "for all", meaning for all languages. We also have specific guidelines for some languages.
A harmonic mean is a better balance of two ratios than a simple average
In the diagram, increasing a would lower the overall score, since both G (the geometric mean) and H (the harmonic mean) would get smaller
Changing the beta value allows you to tune the harmonic mean and weight either precision or recall more heavily
https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_482
Precision is a single value. Average precision takes into account precision over a range of results. Mean average precision is the mean over a range of queries.
Annotated sample of people's names. Note "Cook's" and "Powell" as references to earlier names. Note that "Emerson Collective", as an organization name, is not highlighted.
AP = (sum over each true positive of (true positives so far / results so far)) / number of true positives
MAP is the mean of AP over a range of different queries, for example when varying tolerances or confidence thresholds
Possible: The number of entities hand-annotated in the gold evaluation corpus, equal to (Correct + Incorrect + Partial + Missing)
Actual: The number of entities tagged by the test NER system, equal to (Correct + Incorrect + Partial + Spurious)
(R) Recall = (correct + (1/2 partial)) / possible
(P) Precision = (correct + (1/2 partial)) / actual
F = (2 * P * R) / (P + R)