Polyglot-NER: Massive Multilingual
Named Entity Recognition
SDM
May 2, 2015
Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena
Stony Brook University
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Named Entity Recognition (NER) Problem
■Input:
Plain text, T
■Output:
The spans of T that constitute proper names,
and the classification of the entity’s type.
NER Examples
Input: Vancouver is a coastal seaport city on the mainland
of British Columbia. The city's mayor is Gregor Robertson.
Output: [Vancouver: Location] is a coastal seaport city on the mainland
of [British Columbia: Location]. The city's mayor is [Gregor Robertson: Person].
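A minimal sketch of how the NER output for this example could be represented, assuming character-offset spans. The `EntitySpan` type and the `span_of` helper are hypothetical names introduced for illustration, and the annotations are written by hand to match the slide, not produced by a model.

```python
from typing import List, NamedTuple


class EntitySpan(NamedTuple):
    """Hypothetical container for one NER result."""
    start: int  # character offset where the mention begins (inclusive)
    end: int    # character offset where the mention ends (exclusive)
    label: str  # entity type, e.g. "Location" or "Person"


text = ("Vancouver is a coastal seaport city on the mainland of "
        "British Columbia. The city's mayor is Gregor Robertson.")


def span_of(text: str, mention: str, label: str) -> EntitySpan:
    """Locate the first occurrence of a mention and wrap it as a labeled span."""
    start = text.index(mention)
    return EntitySpan(start, start + len(mention), label)


# Hand-written annotations mirroring the slide (not model output).
spans: List[EntitySpan] = [
    span_of(text, "Vancouver", "Location"),
    span_of(text, "British Columbia", "Location"),
    span_of(text, "Gregor Robertson", "Person"),
]

for s in spans:
    print(text[s.start:s.end], "->", s.label)
```

Storing offsets rather than strings keeps the output unambiguous when the same name appears more than once in T.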
Multilingual NER
❑NLTK
■ English
❑Stanford
■ English, Spanish, Chinese, Arabic
❑OpenNLP
■ English, German, Dutch, Spanish
❑Polyglot-NER
■ 40 Major Languages!
(English, Spanish, French, German, Russian, Polish, Portuguese, Italian,
Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …)
While many pipelines exist, most languages are unsupported.
Does Multilingual Matter?
Yes!
Only 55% of the top 10 million websites are in English! [1]
There are 51 languages on Wikipedia with 100,000+
articles. [2]
[1] http://w3techs.com/technologies/history_overview/content_language/ms/y
[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
Multilingual is Hard
Feature Scarcity
NLP tasks typically rely on language-specific feature engineering:
❑ Orthographic features
❑ Part of Speech Tags
❑ Parallel Corpora
❑ WordNet
Our solution: neural word embeddings.
Annotation Scarcity
Need NER examples, but labeled data is expensive.
Our solution: Wikipedia/Freebase for training examples.
Sub-problem: Word Representation
Input: Unstructured text
Output: Low dimensional word embeddings
Distributed Word Representations
Big Idea: Give similar words similar representations
[Figure: the words pine, oak, rose, daisy, reading, writing, read, write shown twice: once with explicit dimensions (|V| in total) and once with latent dimensions (d in total).]
|V|: size of vocabulary
d << |V|
Similar words share similar representations.
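A small sketch contrasting the two representations, assuming one-hot vectors for the explicit case and a hand-made 3-dimensional matrix for the latent case. The dense values are invented for illustration only; real embeddings are learned from text and their dimensions are not individually interpretable.

```python
import numpy as np

vocab = ["pine", "oak", "rose", "daisy", "reading", "writing", "read", "write"]
V = len(vocab)

# Explicit representation: one dimension per vocabulary word (|V| dimensions).
one_hot = np.eye(V)

# Latent representation with d = 3 << |V|. Values are made up by hand so that
# related words end up close together; they are NOT learned embeddings.
dense = np.array([
    [0.9, 0.1, 0.0],  # pine
    [0.8, 0.2, 0.0],  # oak
    [0.1, 0.9, 0.0],  # rose
    [0.2, 0.8, 0.0],  # daisy
    [0.0, 0.0, 0.9],  # reading
    [0.0, 0.1, 0.9],  # writing
    [0.0, 0.0, 0.8],  # read
    [0.0, 0.1, 0.8],  # write
])


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


pine, oak, write = (vocab.index(w) for w in ("pine", "oak", "write"))

# Distinct one-hot vectors are orthogonal, so every pair looks equally unrelated.
print("one-hot pine/oak:", cosine(one_hot[pine], one_hot[oak]))

# In the latent space, pine is closer to oak than to write.
print("dense pine/oak:  ", cosine(dense[pine], dense[oak]))
print("dense pine/write:", cosine(dense[pine], dense[write]))
```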
Polyglot Embeddings
● Wikipedia article text
● 137 Languages
● Available:
○ http://bit.ly/embeddings
[Al-Rfou, Perozzi, Skiena, 13]
[Figure: each word of the window "Imagination is greater than detail" is mapped through the projection layer C; the result feeds the hidden layer H, which produces a score.]
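The layers named in the figure (projection C, hidden layer H, score) can be sketched as a forward pass over a five-word window. This is an illustrative sketch, not the authors' implementation: the layer sizes, the tanh nonlinearity, the random untrained weights, and the hinge ranking loss against a corrupted window are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; "banana" is only used to build a corrupted window.
vocab = {"imagination": 0, "is": 1, "greater": 2, "than": 3, "detail": 4, "banana": 5}
d, window, hidden = 8, 5, 16  # assumed sizes, chosen for illustration

C = rng.normal(scale=0.1, size=(len(vocab), d))        # projection layer: one row per word
W1 = rng.normal(scale=0.1, size=(window * d, hidden))  # weights into hidden layer H
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)                # weights from H to the scalar score


def score(words):
    """Score a window: look up embeddings, concatenate, apply H, reduce to a scalar."""
    ids = [vocab[w] for w in words]
    x = C[ids].reshape(-1)       # concatenated embeddings, length window * d
    h = np.tanh(x @ W1 + b1)     # hidden layer H (tanh is an assumption)
    return float(h @ w2)


def ranking_loss(real, corrupted, margin=1.0):
    """Hinge loss that asks the real window to outscore the corrupted one (assumed objective)."""
    return max(0.0, margin - score(real) + score(corrupted))


real = ["imagination", "is", "greater", "than", "detail"]
corrupted = ["imagination", "is", "banana", "than", "detail"]  # middle word replaced
print(score(real), ranking_loss(real, corrupted))
```

Training would adjust C, W1, b1 and w2 to reduce this loss; the rows of C are then the word embeddings.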
Sub-Problem: Annotation Mining
Input: Wikipedia, Freebase
Output: Labeled NER training examples
Related Work
Annotations from Wikipedia
Inter-wiki links are a great
potential source of mentions.
Wikipedia, Freebase
Freebase tells us which articles
are entity articles.
Example
Wiki Text:
Vancouver is a coastal seaport city on the mainland of
British Columbia. The city's mayor is Gregor Robertson.

String               Freebase MID   Freebase Category   NER Label
“Vancouver”          /m/080h2       City                Location
“British Columbia”   /m/015jr       Region              Location
“Gregor Robertson”   /m/0grlms      Person              Person
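The chain from linked string to Freebase MID to category to NER label can be sketched with plain dictionaries. The lookup tables hold only the three rows from this example; the function name, the token-index link format, and the "O" tag for unlabeled tokens are assumptions made for illustration.

```python
# Lookup tables filled with the three rows of the example only.
ARTICLE_TO_MID = {
    "Vancouver": "/m/080h2",
    "British Columbia": "/m/015jr",
    "Gregor Robertson": "/m/0grlms",
}
MID_TO_CATEGORY = {"/m/080h2": "City", "/m/015jr": "Region", "/m/0grlms": "Person"}
CATEGORY_TO_LABEL = {"City": "Location", "Region": "Location", "Person": "Person"}


def label_tokens(tokens, links):
    """Turn wiki links into per-token NER labels.

    links: list of (start_token, end_token_exclusive, linked_article_title).
    Tokens not covered by an entity link keep the tag "O" (an assumed convention).
    """
    labels = ["O"] * len(tokens)
    for start, end, title in links:
        mid = ARTICLE_TO_MID.get(title)          # which Freebase entity is linked?
        category = MID_TO_CATEGORY.get(mid)      # Freebase category of that entity
        label = CATEGORY_TO_LABEL.get(category)  # coarse NER label
        if label is None:
            continue  # the link does not point to a known entity article
        for i in range(start, end):
            labels[i] = label
    return labels


tokens = ("Vancouver is a coastal seaport city on the mainland of British Columbia . "
          "The city 's mayor is Gregor Robertson .").split()
links = [
    (0, 1, "Vancouver"),
    (4, 5, "Seaport"),            # a link to a non-entity article, left as "O"
    (10, 12, "British Columbia"),
    (18, 20, "Gregor Robertson"),
]
labels = label_tokens(tokens, links)
print(list(zip(tokens, labels)))
```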
The Bad News
Many false negatives in our dataset!
■ Wikipedia editors annotate only the first mention of
an entity but not later ones.
■ Most of the named entity mentions are not linked!
Example:
Vancouver is a coastal seaport city on the
mainland of British Columbia. Vancouver’s
mayor is Gregor Robertson.
The Good News
Positive labels are very
high quality!
Need to emphasize this in
our training.
‘Learning Classifiers from Only Positive and Unlabeled Data’ [Elkan & Noto, 08]
The trick: Oversampling
We can change the label distribution by oversampling from the
positive labels.
p is the fraction of positive labels in the training dataset.
[Figure: results as p varies. Initially there is no oversampling; p = 0.5 is much better.]
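One simple way to reach a target fraction p is to resample positive examples with replacement until they make up that share of the training set. This is a sketch under that assumption; the slides do not say whether the authors resample or reweight, and the function name and the "O" tag for non-entity tokens are hypothetical.

```python
import random


def oversample_positives(examples, p, seed=0):
    """Return a new list in which positive examples make up a fraction p.

    examples: list of (token, label) pairs; label "O" marks a non-entity token.
    Positives are duplicated by sampling with replacement; negatives are kept as-is.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    positives = [e for e in examples if e[1] != "O"]
    negatives = [e for e in examples if e[1] == "O"]
    if not positives:
        raise ValueError("need at least one positive example to oversample")

    # Solve n_pos / (n_pos + n_neg) = p for n_pos.
    target_pos = round(p * len(negatives) / (1.0 - p))
    extra = max(0, target_pos - len(positives))

    rng = random.Random(seed)
    resampled = positives + [rng.choice(positives) for _ in range(extra)]
    out = resampled + negatives
    rng.shuffle(out)
    return out


# Toy data: 2 positive tokens and 18 unlabeled ones (positive fraction 0.1).
data = [("Vancouver", "Location"), ("Robertson", "Person")] + [("the", "O")] * 18
balanced = oversample_positives(data, p=0.5)
n_pos = sum(1 for _, label in balanced if label != "O")
print(len(balanced), n_pos / len(balanced))
```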
Cross-Domain Performance
[Figure: cross-domain testing on CoNLL, comparing oversampling with oversampling + exact matching.]
NER Demo
@ http://bit.ly/polyglot-ner
Legend: Location Organization Person
But How to Evaluate?
■We have labeled data for only a few languages
■We would like to evaluate every language
Distant Evaluation
John is coming from New York City.
→ Machine Translation →
John proviene de la ciudad de Nueva York.
Calculate the error of omitting entities and the error of adding entities.
[Figure: per-category entity counts for the two sentences. Person: 1, Location: 1, Organization: 0 versus Person: 0, Location: 1, Organization: 1, i.e. one entity omitted and one entity added.]
Experimental Design
Distant Evaluation for Polyglot-NER:
1. Annotate English Wikipedia sentences using Stanford NER.
2. Randomly pick 1500 sentences that have at least one entity detected.
3. Translate these sentences into 40 languages using Google Translate.
4. Run Polyglot-NER on the translated datasets.
5. Per sentence, compare the number of entity chunks found by our
annotators (Polyglot-NER) with the number detected by Stanford NER.
6. Calculate the error of omitting (ℰ 𝓜) and adding entities (ℰ 𝒜)
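Steps 5 and 6 can be sketched as a count comparison per sentence and category. The slides do not give the exact formulas, so the normalization below (dividing by the total number of reference entities) is an assumption, as are the function name and the choice to treat the Stanford counts on the English sentences as the reference.

```python
CATEGORIES = ("Person", "Location", "Organization")


def distant_errors(reference, predicted):
    """Compute (omission error, addition error) from per-sentence entity counts.

    reference: list of dicts, counts per category from the reference tagger.
    predicted: list of dicts, counts per category on the translated sentences.
    Both errors are normalized by the total reference count (assumed choice).
    """
    omitted = added = total = 0
    for ref, pred in zip(reference, predicted):
        for cat in CATEGORIES:
            r, q = ref.get(cat, 0), pred.get(cat, 0)
            omitted += max(0, r - q)  # entities the reference found but we missed
            added += max(0, q - r)    # entities we reported beyond the reference
            total += r
    if total == 0:
        raise ValueError("reference contains no entities")
    return omitted / total, added / total


# Counts from the "John is coming from New York City." example.
reference = [{"Person": 1, "Location": 1, "Organization": 0}]
predicted = [{"Person": 0, "Location": 1, "Organization": 1}]
missing_error, adding_error = distant_errors(reference, predicted)
print(missing_error, adding_error)
```

Because only counts are compared, this measure does not check that the same mentions were tagged, which is what makes it usable without aligned gold annotations.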
Effect of Data Size
■ Size of training data matters!
■ Tokenization is quite important when the word embedding coverage is limited.
[Figure: missing error versus number of words (log scale), with regions annotated "More Data Will Help", "Anomalies", and "Good".]
Performance by Category
[Figure: adding error (ℰ 𝒜) and missing error (ℰ 𝓜), with panels for Person and Location.]
Limitations
■Named entities don’t always translate well:
❑Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …”
■Need a working translation system for the language
Take-aways
■NER in 40 languages!
■Word embeddings & oversampling offer performance equal to or
better than feature engineering for NER annotation mining.
■Translation-based evaluation?
Thanks!
NER Demo: http://bit.ly/polyglot-ner
NER Code: http://polyglot-nlp.com
bperozzi@cs.stonybrook.edu
www.perozzi.net
Bryan Perozzi
