[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entity extraction

  1. [Apple | Organization] and [Oranges | Fruit]: How to Evaluate NLP Tools for Entity Extraction. Gil Irizarry, VP Engineering. ODSC East 2020.
  2. Agenda ● About Me ● The Problem Space ● Defining the domain ● Assemble a test set ● Annotation Guidelines ● Review of measurement ● Evaluation examples ● Interannotator agreement ● The steps to evaluation
  3. About Me: Gil Irizarry, VP Engineering at Basis Technology, responsible for NLP and text analytics software development (https://www.linkedin.com/in/gilirizarry/). Basis Technology is a leading provider of software solutions for extracting meaningful intelligence from multilingual text and digital devices.
  4. Rosette Capabilities
  5. The Problem Space ● You have some text to analyze. Which tool should you choose? ● Related question: you have multiple text or data annotators. Which of them are doing a good job? ● These questions are made harder because tools output different formats and analyze data differently, and annotators interpret data differently ● Start by defining the problem space
  6. Defining the domain ● What space are you in? ● More importantly, in what domain will you evaluate tools? ● Are you: ○ Reading news ○ Scanning patents ○ Looking for financial fraud
  7. Assemble a test set ● NLP systems are often trained on a general corpus; often this corpus consists of mainstream news articles ● Do you use this domain or a more specific one? ● If more specific, do you train a custom model?
  8. Annotation Guidelines. Examples requiring definition and agreement in the guidelines: ● “Alice shook Brenda’s hand when she entered the meeting.” Is “Brenda” or “Brenda’s” the entity to be extracted (in addition to Alice, of course)? ● Are pronouns expected to be extracted and resolved (“she” in the previous example)? ● What about tolerance to punctuation: the U.N. vs. the UN? ● Should fictitious characters (“Harry Potter”) be tagged as “person”? ● When a location appears within an organization’s name (“San Francisco Association of Realtors”), do you tag both the location and the organization, or just the organization?
  9. Annotation Guidelines. Examples requiring definition and agreement in the guidelines: ● Do you tag the name of a person when it is used as a modifier (“Martin Luther King Jr. Day”)? ● Do you tag “Twitter” in “You could try reaching out to the Twitterverse”? ● Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”? ● When do you include “the” in an entity? ● How do you differentiate between an entity that’s a company name and a product with the same name? {[ORG]The New York Times} was criticized for an article about the {[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}. ● “Washington and Moscow continued their negotiations.” Are Washington and Moscow locations or organizations? (A sketch of encoding such decisions as gold-standard spans follows the slide list.)
  10. Annotation Guidelines. Non-entity-extraction issues: ● How many levels of sentiment do you expect? ● Ontology and text classification: what categories do you expect? ● For language identification, are dialects identified as separate languages? What about macrolanguages?
  11. Annotation Guidelines (screenshot of the table of contents of Basis Technology’s annotation guidelines)
  12. Annotation Guidelines ● Map to the Universal Dependencies guidelines where possible: https://universaldependencies.org/guidelines.html ● Map to the DBpedia ontology where possible: http://mappings.dbpedia.org/server/ontology/classes/ ● Map to a known database such as Wikidata where possible: https://www.wikidata.org/wiki/Wikidata:Main_Page
  13. Review of measurement: precision. Precision is the fraction of retrieved documents that are relevant to the query: P = TP / (TP + FP).
  14. Review of measurement: recall. Recall is the fraction of the relevant documents that are successfully retrieved: R = TP / (TP + FN).
  15. Review of measurement: F-score. The F-score is the harmonic mean of precision and recall: F1 = 2 * P * R / (P + R). Precision and recall are ratios, so a harmonic mean is a more appropriate average than an arithmetic mean.
  16. Review of measurement: harmonic mean. The harmonic mean returns a single value that combines precision and recall. In the slide’s image, a and b map to precision and recall, and H maps to the F-score. Note that increasing a alone would do little to raise the overall score: the harmonic mean stays close to the smaller of the two values.
  17. Review of measurement: F-score. The previous example of the F-score was actually an F1 score, which weights precision and recall evenly. The more general form is Fβ = (1 + β^2) * P * R / (β^2 * P + R): F2 (β = 2) weights recall higher than precision, and F0.5 (β = 0.5) weights precision higher than recall.
  18. Review of measurement: AP and MAP ● Average precision (AP) combines recall and precision for ranked retrieval results. For one information need, it is the mean of the precision scores obtained after each relevant document is retrieved ● Mean average precision (MAP) is the mean of AP over a range of queries
  19. Review of measurement: MUC score ● Message Understanding Conference (MUC) scoring allows partial success to be taken into account: ○ Correct: response = key ○ Partial: response ~= key ○ Incorrect: response != key ○ Spurious: key is blank and response is not ○ Missing: response is blank and key is not ○ Noncommittal: key and response are both blank ○ Recall = (correct + (partial x 0.5)) / possible ○ Precision = (correct + (partial x 0.5)) / actual ○ Undergeneration = missing / possible ○ Overgeneration = spurious / actual
  20. Evaluation Examples: As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
  21. Evaluation Examples - gold standard: As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
  22. Evaluation Examples - P, R, F: As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist. ● (Green) TP = 6 ● (Olive) FP = 1 ● (Orange) TN = 3 ● (Red) FN = 3 ● Precision = 6/7 = .86 ● Recall = 6/9 = .67 ● F score = .75 (a scoring sketch follows the slide list)
  23. Evaluation Examples - AP: As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist. ● 1/1 (Green) ● 2/2 (Green) ● 3/3 (Green) ● 0/4 (Red) ● 4/5 (Green) ● 5/6 (Green) ● 0/7 (Red) ● 0/8 (Red) ● 6/9 (Green) ● AP = (1/1 + 2/2 + 3/3 + 4/5 + 5/6 + 6/9) / 6 = .88 (an AP sketch follows the slide list)
  24. Evaluation Examples - MUC scoring
      Sentence: Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs.
      Token | Gold | Eval | Result
      Cook's | B-PER | B-PER | Partial
      date | O-NONE | I-PER | Spurious
      for | O-NONE | O-NONE | Correct
      the | O-NONE | O-NONE | Correct
      night | O-NONE | O-NONE | Correct
      was | O-NONE | O-NONE | Correct
      Laurene | B-PER | B-PER | Correct
      Powell | I-PER | I-PER | Correct
      Jobs, | I-PER | O-NONE | Missing
      the | O-NONE | O-NONE | Correct
  25. Evaluation Examples - MUC scoring (same table as slide 24)
      Possible = Correct + Incorrect + Partial + Missing = 11 + 1 + 1 + 1 = 14
      Actual = Correct + Incorrect + Partial + Spurious = 11 + 1 + 1 + 2 = 15
      Precision = (correct + (1/2 partial)) / actual = 11.5 / 15 = 0.77
      Recall = (correct + (1/2 partial)) / possible = 11.5 / 14 = 0.82
      (a MUC scoring sketch follows the slide list)
  26. Interannotator Agreement ● Krippendorff’s alpha is a reliability coefficient developed to measure the agreement among observers, coders, judges, raters, or measuring instruments drawing distinctions among typically unstructured phenomena ● Cohen’s kappa is a measure of the agreement between two raters who determine which category each of a finite number of subjects belongs to, with agreement due to chance factored out ● Interannotator agreement scoring determines the agreement between different annotators annotating the same unstructured text ● It is not intended to measure the output of a tool against a gold standard (a Cohen’s kappa sketch follows the slide list)
  27. The Steps to Evaluation ● Define your requirements ● Assemble a valid test dataset ● Annotate the gold-standard test dataset ● Get output from the tools ● Evaluate the results ● Make your decision
  28. Thank You!
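
The guideline questions on slides 8-10 only become scoreable once they are encoded in a gold standard. The sketch below shows one common way to do that: character-offset spans with a label, using the company-versus-product sentence from slide 9. The span() helper, the tuple format, and the label names are illustrative assumptions, not Basis Technology's annotation schema.

```python
# Hypothetical gold-standard encoding for the slide 9 example sentence.
text = ("The New York Times was criticized for an article about the Netherlands "
        "in the June 4 edition of The New York Times.")

def span(substring, label, search_from=0):
    """Locate a substring and return a (start, end, label) character-offset span."""
    start = text.index(substring, search_from)
    return (start, start + len(substring), label)

gold = [
    span("The New York Times", "ORGANIZATION"),            # first mention: the company
    span("Netherlands", "LOCATION"),                        # guideline call: "the" left outside the span
    span("The New York Times", "PRODUCT", search_from=1),   # second mention: the newspaper as a product
]

for start, end, label in gold:
    print(text[start:end], "->", label)
```

Decisions such as “when do you include ‘the’ in an entity” show up directly in where each span starts and ends, which is why the guidelines need to settle them before annotation begins.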
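
Slides 13-17 and the worked example on slide 22 come down to a few lines of arithmetic. The following is a minimal sketch rather than any particular tool's scorer; it reproduces slide 22's counts (TP = 6, FP = 1, FN = 3) and adds the F-beta generalization from slide 17.

```python
# Minimal precision / recall / F-beta sketch, checked against the counts on slide 22.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

tp, fp, fn = 6, 1, 3                     # counts from slide 22
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 2), round(r, 2))          # 0.86 0.67
print(round(f_beta(p, r), 2))            # F1 = 0.75
print(round(f_beta(p, r, beta=2), 2))    # F2 (recall-weighted) = 0.7
```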
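
Slide 23's average precision can be checked the same way: walk the ranked list, record precision at each relevant hit, and average over the number of relevant items. The boolean-list representation and function names below are assumptions for illustration.

```python
# Average precision over a ranked list (True = relevant hit), using slide 23's green/red sequence.

def average_precision(ranked_hits):
    hits, precisions = 0, []
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            hits += 1
            precisions.append(hits / rank)   # precision at this recall point
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(queries):
    # MAP: the mean of AP over a range of queries.
    return sum(average_precision(q) for q in queries) / len(queries)

ranked = [True, True, True, False, True, True, False, False, True]
print(round(average_precision(ranked), 2))   # 0.88
```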
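
The MUC arithmetic on slides 19 and 25 is bookkeeping over the five outcome counts. This sketch applies the slide 19 formulas to slide 25's tallies (11 correct, 1 incorrect, 1 partial, 1 missing, 2 spurious); with half credit for the partial match the numerator is 11.5, giving precision 0.77 and recall 0.82.

```python
# MUC-style scoring from outcome counts, following the formulas on slide 19.

def muc_scores(correct, incorrect, partial, missing, spurious):
    possible = correct + incorrect + partial + missing   # entities in the gold standard
    actual = correct + incorrect + partial + spurious    # entities the system produced
    matched = correct + 0.5 * partial                    # partial matches earn half credit
    return {
        "precision": matched / actual,
        "recall": matched / possible,
        "undergeneration": missing / possible,
        "overgeneration": spurious / actual,
    }

scores = muc_scores(correct=11, incorrect=1, partial=1, missing=1, spurious=2)  # slide 25 tallies
print({name: round(value, 2) for name, value in scores.items()})
# {'precision': 0.77, 'recall': 0.82, 'undergeneration': 0.07, 'overgeneration': 0.13}
```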
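
Slide 26's Cohen's kappa can be computed directly from two annotators' label sequences: observed agreement minus chance agreement, normalized by the maximum possible agreement above chance. The toy label sequences below are invented for illustration; in practice you would feed in both annotators' token labels over the same test set.

```python
# Minimal Cohen's kappa sketch for two annotators labeling the same tokens.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: the probability that both annotators pick the same label by
    # chance, given each annotator's own label distribution.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

annotator_1 = ["PER", "PER", "O", "ORG", "O", "O", "LOC", "O"]   # toy token labels
annotator_2 = ["PER", "O",   "O", "ORG", "O", "O", "LOC", "O"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))          # 0.8
```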

Editor's Notes

  • Rosette is a full NLP stack from language identification to morphology to entity extraction and resolution
  • One tool will output 5 levels of sentiment and another only 3. One tool will output transitive vs. intransitive verbs and another will output only verbs. One will strip possessives (King’s Landing) and another won’t.
  • Finding data is relatively easy, but annotating data is hard
  • “The Ukraine” is now “Ukraine”; similarly, “the Sudan” is now “Sudan”. How do you handle such changes over time?
  • Screenshot of the table of contents of our Annotation Guidelines (42 pages). In some meetings, it’s the only document under NDA. The header says “for all”, meaning for all languages; we also have specific guidelines for some languages.
  • Images from wikipedia
  • Images from wikipedia
  • A harmonic mean is a better balance of two values than a simple average
  • Increasing a at b’s expense would lower the overall score, since both G and H would get smaller
  • Changing the beta value allows you to tune the harmonic mean and weight either precision or recall more heavily
  • https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_482
    Precision is a single value. Average precision takes into account precision over a range of results. Mean average precision is the mean over a range of queries.
  • http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
    https://pdfs.semanticscholar.org/f898/e821bbf4157d857dc512a85f49610638f1aa.pdf
  • Annotated sample of person names. Note “Cook’s” and “Powell” as references to earlier names. Note that “Emerson Collective” is not highlighted, since it is an organization name rather than a person.
  • Precision = TP / (TP + FP), Recall = TP / (TP + FN), F = 2 * (P * R) / (P + R)
  • AP = (sum of the precision at each true-positive position) / (number of true positives)
    MAP is the mean of AP over a range of different queries, for example varying the tolerances or confidences
  • https://pdfs.semanticscholar.org/f898/e821bbf4157d857dc512a85f49610638f1aa.pdf
    http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
  • Possible: The number of entities hand-annotated in the gold evaluation corpus, equal to (Correct + Incorrect + Partial + Missing)
    Actual: The number of entities tagged by the test NER system, equal to (Correct + Incorrect + Partial + Spurious)
    (R) Recall = (correct + (1/2 partial)) / possible
    (P) Precision = (correct + (1/2 partial)) / actual
    F = (2 * P * R) / (P + R)
  • http://www.real-statistics.com/reliability/interrater-reliability/cohens-kappa/
