
POLYGLOT-NER: Massive Multilingual Named Entity Recognition

The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require human-annotated NER datasets or language-specific resources like treebanks, parallel corpora, and orthographic rules. The novelty of our approach lies in using only language-agnostic techniques while achieving competitive performance.
Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise.
Our evaluation is twofold. First, we demonstrate the system performance on human-annotated datasets. Second, for languages where no gold-standard benchmarks are available, we propose a new method, distant evaluation, based on statistical machine translation.
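
The distant evaluation idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes per-sentence entity counts by category are already available for the reference annotator and for the annotator under test, and it counts an entity as omitted or added whenever the two counts differ. The function name `distant_errors`, the category tuple, and the choice of which count vector is the reference are assumptions made for illustration; the counts themselves come from the Spanish/English example in the slides below.

```python
CATEGORIES = ("Person", "Location", "Organization")


def distant_errors(reference, predicted):
    """Count omitted and added entities over aligned sentence pairs.

    reference, predicted: lists of dicts mapping category -> entity count,
    one dict per sentence, in the same order.
    Returns (omitted, added, reference_total).
    """
    if len(reference) != len(predicted):
        raise ValueError("sentence lists must be aligned")
    omitted = added = total = 0
    for ref, hyp in zip(reference, predicted):
        for cat in CATEGORIES:
            r, h = ref.get(cat, 0), hyp.get(cat, 0)
            omitted += max(0, r - h)  # counted in the reference, not found
            added += max(0, h - r)    # found, but not in the reference
            total += r
    return omitted, added, total


# Counts from the slide example; which side is the reference is an assumption.
reference = [{"Person": 1, "Location": 1, "Organization": 0}]
predicted = [{"Person": 0, "Location": 1, "Organization": 1}]
omitted, added, total = distant_errors(reference, predicted)
print(omitted, added, total)  # 1 1 2
```

Dividing `omitted` and `added` by `total` would give error rates, but that normalization is an assumption here, not a definition taken from the source.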

Published in: Science

  1. Polyglot-NER: Massive Multilingual Named Entity Recognition. SDM, May 2, 2015. Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena (Stony Brook University)
  2. Named Entity Recognition (NER) Problem. ■ Input: Plain text, T ■ Output: The spans of T that constitute proper names, and the classification of each entity's type.
  3. NER Examples. Input: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. Output: [Vancouver: Location] is a coastal seaport city on the mainland of [British Columbia: Location]. The city's mayor is [Gregor Robertson: Person].
  4. Multilingual NER. While many pipelines exist, most languages are unsupported. ❑ NLTK: English ❑ Stanford: English, Spanish, Chinese, Arabic ❑ OpenNLP: English, German, Dutch, Spanish ❑ Polyglot-NER: 40 major languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …)
  5. Does Multilingual Matter? Yes! Only 55% of the top 10 million websites are in English [1]. There are 51 languages on Wikipedia with 100,000+ articles [2]. [1] http://w3techs.com/technologies/history_overview/content_language/ms/y [2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
  6. Multilingual is Hard. Feature scarcity: NLP tasks typically rely on language-specific feature engineering (❑ orthographic features ❑ part-of-speech tags ❑ parallel corpora ❑ WordNet). Our solution: neural word embeddings. Annotation scarcity: we need NER examples, and labeled data is expensive. Our solution: Wikipedia/Freebase for training examples.
  7. Sub-problem: Word Representation. Input: Unstructured text. Output: Low-dimensional word embeddings.
  8. Distributed Word Representations. Big Idea: Give similar words similar representations. [Figure: the words pine, oak, rose, daisy, reading, writing, read, write, shown first with |V| explicit dimensions (|V|: size of vocabulary) and then with d latent dimensions, d << |V|.] Similar words share similar representations.
  9. Polyglot Embeddings. ● Wikipedia article text ● 137 languages ● Available: http://bit.ly/embeddings [Al-Rfou, Perozzi, Skiena, 13] [Figure: network diagram in which each word of "Imagination is greater than detail" is mapped through C into a projection layer, followed by hidden layer H and a score output.]
  10. Sub-Problem: Annotation Mining. Input: Wikipedia, Freebase. Output: Labeled NER training examples.
  11. Related Work
  12. Annotations from Wikipedia. Inter-wiki links are a great potential source of mentions. Freebase tells us which articles are entity articles.
  13. Example. Wiki text: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. Mapping (string, Freebase MID, Freebase category, NER label): “Vancouver”, /m/080h2, City, Location; “British Columbia”, /m/015jr, Region, Location; “Gregor Robertson”, /m/0grlms, Person, Person.
  14. The Bad News. Many false negatives in our dataset! ■ Wikipedia editors annotate only the first mention of an entity but not later ones. ■ Most of the named entity mentions are not linked! Example: Vancouver is a coastal seaport city on the mainland of British Columbia. Vancouver’s mayor is Gregor Robertson.
  15. The Good News. Positive labels are very high quality! We need to emphasize this in our training. ‘Learning Classifiers from Only Positive and Unlabeled Data’ [Elkan & Noto, 08]
  16. The trick: Oversampling. We can change the label distribution by oversampling from the positive labels. p is the percentage of positive labels in the training dataset. [Figure: results with initially no oversampling compared with p = 0.5, which is much better.]
  17. Cross-Domain Performance. Cross-domain testing on CoNLL. [Figure: results for oversampling and for oversampling + exact matching.]
  18. NER Demo @ http://bit.ly/polyglot-ner (legend: Location, Organization, Person)
  19. But How to Evaluate? ■ We have labeled data for a few languages ■ We would like to evaluate everything
  20. Distant Evaluation. Sentence pair related by machine translation: “John proviene de la ciudad de Nueva York.” / “John is coming from New York City.” Entity counts on the two sides: (Person: 1, Location: 1, Organization: 0) versus (Person: 0, Location: 1, Organization: 1). Calculate the error of omitting entities and the error of adding entities: here 1 and 1.
  21. Experimental Design. Distant Evaluation for Polyglot-NER: (1) Annotate English Wikipedia sentences using Stanford NER. (2) Randomly pick 1500 sentences that have at least one entity detected. (3) Translate these sentences into 40 languages using Google Translate. (4) Run Polyglot-NER on the translated datasets. (5) Per sentence, compare the number of entity chunks our annotators found to the number detected by Stanford. (6) Calculate the error of omitting entities (ℰ_𝓜) and the error of adding entities (ℰ_𝒜).
  22. Effect of Data Size. ■ Size of training data matters! ■ Tokenization is quite important when the word embeddings coverage is limited. [Figure: missing error plotted against # Words (log scale), with regions labeled "More Data Will Help", "Anomalies", and "Good".]
  23. Performance by Category. ℰ_𝒜: adding error. ℰ_𝓜: missing error. [Figure: panels for Person and Location.]
  24. Limitations. ■ Named entities don’t always translate well: ❑ Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …” ■ Need a working translation system for the language
  25. Take-aways. ■ NER in 40 languages! ■ Word embeddings and oversampling offer performance equal to or better than feature engineering for NER annotation mining. ■ Translation-based evaluation?
  26. Thanks! NER Demo: http://bit.ly/polyglot-ner NER Code: http://polyglot-nlp.com Contact: Bryan Perozzi, bperozzi@cs.stonybrook.edu, www.perozzi.net
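
The oversampling trick from slide 16 can be sketched in a few lines. This is one simple way to do it, not necessarily how the authors did: positive examples are resampled with replacement until they make up a fraction p of the training set. The function name `oversample_positives`, the `(token, label)` tuple layout, and the use of `"O"` as the non-entity label are assumptions made for illustration.

```python
import random


def oversample_positives(examples, p, seed=0):
    """Resample positive examples until they form a fraction p of the data.

    examples: list of (token, label) pairs; label "O" marks a non-entity
    (an assumed convention, not taken from the slides).
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    positives = [ex for ex in examples if ex[1] != "O"]
    negatives = [ex for ex in examples if ex[1] == "O"]
    if not positives:
        raise ValueError("need at least one positive example")
    # Solve n_pos / (n_pos + n_neg) = p for the target number of positives.
    target = round(p * len(negatives) / (1 - p))
    extra = max(0, target - len(positives))
    rng = random.Random(seed)
    resampled = positives + [rng.choice(positives) for _ in range(extra)]
    combined = resampled + negatives
    rng.shuffle(combined)
    return combined


# Toy data: 10% positive labels before oversampling.
data = [("Vancouver", "LOC")] * 10 + [("is", "O")] * 90
balanced = oversample_positives(data, p=0.5)
share = sum(label != "O" for _, label in balanced) / len(balanced)
print(len(balanced), share)  # 180 0.5
```

Negatives are left untouched, so the training set grows rather than shrinks; undersampling the negatives instead would be a different design choice with a smaller dataset.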
