Deep Learning Study Group @ Komachi Lab
Learning Character-level
Representations for Part-of-
Speech Tagging
Cícero Nogueira dos Santos, Bianca Zadrozny
IBM Research - Brazil, Av. Pasteur, 138/146 – Rio de Janeiro, 22296-903, Brazil
Tokyo Metropolitan University, Department of Information and Communication Systems
Komachi Lab, M1 塘優旗
2014/12/22 1
Introduction
• Distributed word representations (word embeddings) are an
invaluable resource for NLP
• Word embeddings capture syntactic and semantic information
about words; with a few exceptions (Luong et al., 2013), work on
representation learning has focused on the word level
• One task for which character-level information is very important
is POS tagging
• Usually, when a task needs morphological or word shape
information, the knowledge is included as handcrafted features
(Collobert, 2011)
• propose a new deep neural network (DNN) architecture that joins
word-level and character-level embeddings to perform POS
tagging
• The proposed DNN, which we call CharWNN
2014/12/22 2
Neural Network
Architecture
• The deep neural network we propose extends Collobert et
al.’s (2011) neural network architecture
• Given a sentence, the network gives for each word a score for
each tag τ ∈ T.
• In order to score a word, the network takes as input a fixed-
sized window of words centralized in the target word.
• The novelty in our network architecture is the inclusion of a
convolutional layer to extract character-level representations.
2014/12/22 3
2014/12/22 4
Collobert et al.’s (2011) neural network architecture
http://ronan.collobert.com/pub/matos/201_aistats.pdf
1st layer of the network
• transforms words into real-valued feature vectors
(embeddings)
2014/12/22 5
given a sentence consisting
of N words {w_1, ..., w_N}
every word w_n is converted into a vector u_n which is
composed of two sub-vectors:
the word-level embedding r^wrd and the character-level embedding r^wch
Word-Level Embeddings
2014/12/22 6
V^wrd : a fixed-sized word vocabulary
v^w : a one-hot vector of size |V^wrd| for the word w
d^wrd : the size of the word-level embedding, a hyper-parameter to be chosen by the user
W^wrd ∈ R^{d^wrd × |V^wrd|} : parameter to be learned
r^wrd = W^wrd v^w : the word-level embedding of w
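Not part of the original slides: a minimal numpy sketch of the word-level embedding lookup described above, assuming a tiny illustrative vocabulary and embedding size (the real V^wrd and d^wrd are much larger). Selecting by a one-hot vector is just taking a column of W^wrd.

```python
# Minimal sketch (not the authors' code): word-level embedding lookup r^wrd = W^wrd v^w.
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2, "UNKNOWN": 3}   # toy V^wrd
d_wrd = 4                                              # embedding size (hyper-parameter)
W_wrd = rng.normal(size=(d_wrd, len(vocab)))           # parameter to be learned

def word_embedding(word):
    """Return the column of W_wrd selected by the one-hot vector of `word`."""
    idx = vocab.get(word, vocab["UNKNOWN"])
    return W_wrd[:, idx]

print(word_embedding("cat"))   # a d_wrd-dimensional vector
```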
Character-Level-
Embeddings
• convolutional approach
(introduced by Waibel et al.
(1989))
• produces local features
around each character of the
word
• combines them to create a
fixed-sized character-level
embedding of the word.
2014/12/22 7
2014/12/22 8
given a word w composed
of M characters {c_1, c_2, ..., c_M}:
transform each character c_m
into a character embedding r^chr_m
V^chr : a fixed-sized character vocabulary
v^m : a one-hot vector of size |V^chr| for the character c_m
W^chr ∈ R^{d^chr × |V^chr|} : parameter to be learned
r^chr_m = W^chr v^m : character embedding
The input for the convolutional layer is
the sequence of character embeddings {r^chr_1, r^chr_2, ..., r^chr_M}
Convolutional Layer
2014/12/22 9
input :
the sequence of character embeddings {r^chr_1, ..., r^chr_M}
k^chr : the character context window size
• define a vector z_m for each character position by concatenating the window
of k^chr character embeddings centered on c_m:
z_m = (r^chr_{m-(k^chr-1)/2}, ..., r^chr_{m+(k^chr-1)/2})
2014/12/22 10
The convolutional layer computes the j-th element of the
vector r^wch, which is the character-level embedding of w:
[r^wch]_j = max_{1 ≤ m ≤ M} [W^0 z_m + b^0]_j
W^0 ∈ R^{cl_u × (d^chr k^chr)} : the weight matrix of
the convolutional layer
W^0, b^0 : parameters to be learned
hyper-parameters:
cl_u : # of convolutional units
(which corresponds to the size of the
character-level embedding of a word)
k^chr : the size of the character context window
d^chr : the size of the character vector
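A hedged numpy sketch of the convolution plus max-over-characters step just described: character embeddings are stacked into windows z_m, multiplied by W^0, and the element-wise maximum over character positions yields the fixed-size character-level embedding r^wch. The sizes and the zero padding at the word boundaries are illustrative assumptions.

```python
# Sketch of the character-level convolution + max pooling (toy sizes, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

d_chr, k_chr, cl_u = 5, 3, 8          # hyper-parameters (illustrative values)
chars = list("unfortunate")
char_vocab = {c: i for i, c in enumerate(sorted(set(chars)))}

W_chr = rng.normal(size=(d_chr, len(char_vocab)))   # character embedding matrix
W0 = rng.normal(size=(cl_u, d_chr * k_chr))          # convolutional weights
b0 = np.zeros(cl_u)

# character embeddings r^chr_m, zero-padded by (k_chr-1)/2 on each side
R = np.stack([W_chr[:, char_vocab[c]] for c in chars], axis=1)       # (d_chr, M)
pad = np.zeros((d_chr, (k_chr - 1) // 2))
Rp = np.concatenate([pad, R, pad], axis=1)

# z_m: concatenation of the k_chr character embeddings centered on position m
Z = np.stack([Rp[:, m:m + k_chr].reshape(-1) for m in range(len(chars))], axis=1)

# [r^wch]_j = max_m [W^0 z_m + b^0]_j
r_wch = (W0 @ Z + b0[:, None]).max(axis=1)
print(r_wch.shape)   # (cl_u,)
```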
Scoring
• follow Collobert et al.’s (2011) window approach to score all
tags T for each word in a sentence
• the assumption that the tag of a word depends mainly on its
neighboring words
2014/12/22 11
Given a sentence with N words {w_1, ..., w_N}, converted to a sequence of
joint word-level and character-level embeddings {u_1, ..., u_N}
• create a vector x_n resulting from the
concatenation of a sequence of k^wrd embeddings,
centralized in the n-th word
(a padding token is used for words near the sentence boundaries)
k^wrd, the size of the context window, is a hyper-parameter
• the vector x_n is processed by two usual NN layers:
s(x_n) = W^2 h(W^1 x_n + b^1) + b^2
2014/12/22 12
h : the transfer function, the hyperbolic tangent tanh
W^1, b^1, W^2, b^2 : parameters to
be learned
hl_u : # of hidden units, a hyper-parameter
s(x_n) ∈ R^{|T|} : scores of all tags for the n-th word
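A sketch, under assumed toy dimensions, of the window approach and the two layers above: the k^wrd joint embeddings around word n are concatenated into x_n and passed through a tanh layer and a linear layer to produce one score per tag.

```python
# Sketch of the window approach + two NN layers producing tag scores (toy sizes).
import numpy as np

rng = np.random.default_rng(0)

k_wrd, emb_dim, hl_u, n_tags = 5, 12, 16, 4   # hyper-parameters (illustrative)
N = 7                                          # sentence length
U = rng.normal(size=(N, emb_dim))              # joint embeddings u_1..u_N (word + char level)
padding = np.zeros(emb_dim)                    # padding token for boundary words

W1 = rng.normal(size=(hl_u, k_wrd * emb_dim)); b1 = np.zeros(hl_u)
W2 = rng.normal(size=(n_tags, hl_u));          b2 = np.zeros(n_tags)

def score(n):
    """s(x_n) = W2 tanh(W1 x_n + b1) + b2 for the n-th word."""
    half = (k_wrd - 1) // 2
    window = [U[i] if 0 <= i < N else padding for i in range(n - half, n + half + 1)]
    x_n = np.concatenate(window)
    return W2 @ np.tanh(W1 @ x_n + b1) + b2

scores = np.stack([score(n) for n in range(N)])   # (N, n_tags), one row per word
print(scores.shape)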
Structured Inference
• the tags of neighboring words are strongly dependent
• prediction scheme that takes into account the sentence
structure (Collobert et al., 2011)
2014/12/22 13
A_{t,u} : transition score from tag t to tag u
Given the sentence, the score for a tag path [t]_1^N is computed as follows:
S([w]_1^N, [t]_1^N, θ) = Σ_{n=1}^{N} ( A_{t_{n-1},t_n} + s(x_n)_{t_n} )
infer the predicted sentence tags using the Viterbi (1967) algorithm
θ : the set of all trainable network parameters
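Not from the paper's code: a small Viterbi sketch over the per-word tag scores and a transition matrix A, returning the highest-scoring tag path for a sentence. The toy inputs are assumptions.

```python
# Viterbi decoding over per-word tag scores and tag-transition scores (sketch).
import numpy as np

def viterbi(scores, A, A_init):
    """scores: (N, T) word-level tag scores s(x_n);
    A: (T, T) transition scores A[t, u] from tag t to tag u;
    A_init: (T,) scores for the initial tag."""
    N, T = scores.shape
    delta = A_init + scores[0]                 # best path score ending in each tag
    backptr = np.zeros((N, T), dtype=int)
    for n in range(1, N):
        cand = delta[:, None] + A + scores[n][None, :]   # (previous tag, current tag)
        backptr[n] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(backptr[n, path[-1]]))
    return path[::-1]

# toy example: 3 words, 2 tags
scores = np.array([[1.0, 0.2], [0.1, 1.5], [0.3, 0.4]])
A = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(viterbi(scores, A, A_init=np.zeros(2)))
```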
Network Training
• the network is trained by minimizing the negative log-likelihood over the
training set D (Collobert, 2011)
• we interpret the sentence score (5) as a conditional probability
over a path:
log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − log Σ_{∀[u]_1^N ∈ T^N} e^{S([w]_1^N, [u]_1^N, θ)}
• Equation 7 can be computed efficiently using dynamic
programming (Collobert, 2011)
2014/12/22 14
• use stochastic gradient descent (SGD) to minimize the
negative log-likelihood with respect to θ
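A hedged sketch of how the log of the sum over all tag paths in Equation 7 can be computed with the forward recursion (the dynamic program mentioned above), using log-sum-exp for numerical stability; the toy inputs are assumptions.

```python
# Forward recursion for log sum over all tag paths of exp(S(path)), i.e. the
# normalizer in the negative log-likelihood (sketch with toy inputs).
import numpy as np
from scipy.special import logsumexp

def sentence_log_likelihood(scores, A, A_init, tags):
    """scores: (N, T) tag scores; A: (T, T) transitions; tags: gold tag path."""
    N, _ = scores.shape
    # score of the gold path: sum of transition + emission scores
    gold = A_init[tags[0]] + scores[0, tags[0]]
    for n in range(1, N):
        gold += A[tags[n - 1], tags[n]] + scores[n, tags[n]]
    # log of the sum of exp(score) over all possible tag paths
    alpha = A_init + scores[0]
    for n in range(1, N):
        alpha = logsumexp(alpha[:, None] + A + scores[n][None, :], axis=0)
    return gold - logsumexp(alpha)

scores = np.array([[1.0, 0.2], [0.1, 1.5], [0.3, 0.4]])
A = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(sentence_log_likelihood(scores, A, np.zeros(2), tags=[0, 1, 1]))
```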
Experimental Setup
2014/12/22 15
Preprocessing and
Tokenizing Corpora
(the sources of unlabeled text; preprocessing and tokenizing)
• English
o December 2013 snapshot of
the English Wikipedia corpus
o the clean corpus contains
about 1.75 billion tokens
• Portuguese
o the Portuguese Wikipedia
o the CETENFolha corpus
o the CETEMPublico corpus
o concatenating the three
corpora, the resulting
corpus contains around
401 million words
2014/12/22 16
Unsupervised Learning of Word-Level
Embeddings using the word2vec tool
• word2vec
o skip-gram method
o context window : size 5
• use the same parameters except for the minimum word
frequency (English and Portuguese)
• minimum word frequency
o English : 10
o Portuguese : 5
• resulting vocabulary size
o English : 870,214 entries
o Portuguese : 453,990 entries
2014/12/22 17
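As an illustration only (not the authors' exact command), training skip-gram embeddings with the settings listed on this slide could look like the following gensim call; the corpus path and the embedding dimension are assumptions not given on the slide.

```python
# Illustrative gensim call; the file path and vector_size are assumed, sg=1 selects skip-gram.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("wikipedia_tokenized.txt")   # hypothetical preprocessed corpus
model = Word2Vec(
    sentences,
    sg=1,            # skip-gram method
    window=5,        # context window of size 5
    min_count=10,    # minimum word frequency (10 for English, 5 for Portuguese)
    vector_size=100, # embedding dimension (assumed; not stated on this slide)
)
model.wv.save_word2vec_format("word_embeddings.txt")
```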
Supervised Learning of
Character-Level Embeddings
• supervised learning (not unsupervised)
• the character vocabulary is quite small compared to the
word vocabulary
• the amount of data in the labeled POS tagging training
corpora is enough to effectively train character-level
embeddings
• trained using the raw (not lowercased) words
2014/12/22 18
POS Tagging Datasets
• English
o Wall Street Journal (WSJ)
portion of the Penn Treebank
(Marcus et al., 1993)
o 45 different POS tags
o same partitions used by other
researchers (Manning, 2011;
Søgaard, 2011; Collobert et al.,
2011)
o to fairly compare our results
with previous work
2014/12/22 19
• Portuguese
o Mac-Morpho corpus (Aluísio
et al., 2003)
o 22 POS tags
o the same training/test
partitions used by (dos Santos
et al., 2008)
o created a development set by
randomly selecting 5% of the
training set sentences
• out-of-the-supervised-vocabulary words (OOSV)
o do not appear in the training set
• out-of-the-unsupervised-vocabulary words (OOUV)
o do not appear in the vocabulary created using the unlabeled data
o words for which we do not have word embeddings
2014/12/22 20
Model Setup
• use the development sets to tune the neural network
hyper-parameters (to be chosen by the user)
• in SGD training, the learning rate is the hyper-parameter
that has the largest impact on prediction performance
• use a learning rate schedule that decreases the learning rate
λ according to the training epoch t
2014/12/22 21
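The slide does not reproduce the schedule formula; as a rough sketch, assuming the simple per-epoch decay λ_t = λ / t (the exact formula and the initial rate are assumptions here):

```python
# Sketch of SGD with a per-epoch learning-rate decay (formula and initial rate assumed;
# the slide only states that the rate decreases with the training epoch t).
base_lr = 0.0075

def learning_rate(epoch):
    # lambda_t = lambda / t, for epoch t = 1, 2, ...
    return base_lr / epoch

def sgd_step(params, grads, epoch):
    lr = learning_rate(epoch)
    return [p - lr * g for p, g in zip(params, grads)]
```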
Hyper-Parameter
• use the same set of hyper-parameters for both English and
Portuguese
• This provides some indication on the robustness of our approach to
multiple languages.
• The number of training epochs is the only difference in terms of
training setup for the two languages
o The WSJ corpus : 5 training epochs
o Mac-Morpho corpus : 8 training epochs.
2014/12/22 22
WNN model
(compared with proposed CharWNN)
• In order to assess the effectiveness of the proposed
character-level representation of words
• WNN
o essentially Collobert et al.’s (2011) NN architecture
o uses the same NN hyper-parameter values (when applicable) shown in Table 3
o represents a network which is fed with word representations only
2014/12/22 23
• it also includes two additional handcrafted features
o capitalization feature
o suffix feature : suffix of size two for English, size three for Portuguese
• capitalization and suffix embeddings have dimension 5
Experimental Results
2014/12/22 24
Results for POS Tagging of the
Portuguese Language
• these results demonstrate the effectiveness of our convolutional approach to
learning valuable character-level information for POS tagging
2014/12/22 25
Compare CharWNN with
state-of-the-art (Portuguese)
• compare the result of CharWNN with reported state-of-the-art
results for the Mac-Morpho corpus
• In both pieces of work, the authors use many handcrafted
features, most of them to deal with out-of-vocabulary words.
• This is a remarkable result, since we train our model from
scratch, i.e., without the use of any handcrafted features.
2014/12/22 26
Portuguese Similar Words:
“Character-Level Embeddings”
• check the effectiveness of the morphological information encoded
in character-level embeddings of words
• The similarity between two words is computed as the cosine
between the two vectors
• the character-level embeddings are very effective for learning
suffixes and prefixes.
2014/12/22 27
(Table 6: example OOSV and OOUV words and their most similar words in the training set)
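A small sketch of the nearest-neighbour check described on this slide: cosine similarity between a rare word's character-level embedding and those of training-set words. The vectors and words below are toy placeholders; in the real setup the r^wch vectors come from the convolutional layer shown earlier.

```python
# Cosine similarity between character-level embeddings (sketch with toy vectors).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vec, embeddings, topn=3):
    """embeddings: dict mapping training-set words to their r^wch vectors."""
    sims = {w: cosine(query_vec, v) for w, v in embeddings.items()}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

rng = np.random.default_rng(0)
train_embs = {w: rng.normal(size=8) for w in ["falar", "falou", "andou", "cantou"]}
print(most_similar(rng.normal(size=8), train_embs))
```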
Portuguese Similar Words:
“Word-Level Embeddings”
• same five out-of-vocabulary words and their respective most
similar words in the vocabulary extracted from the unlabeled data
• use word-level embeddings to compute the similarity
• the words are more semantically related, some of them being
synonyms
• morphologically less similar than the ones in Table 6
2014/12/22 28
(Table 7: the same OOSV and OOUV words and their most similar words by word-level embeddings)
Results for POS Tagging of the
English Language
• Using the character-level embeddings of words (CharWNN) we
again get better results than the ones obtained with the WNN that
uses handcrafted features
• our convolutional approach to learning character-level
embeddings can be successfully applied to the English language
2014/12/22 29
Compare CharWNN with
state-of-the-art (English)
• a comparison of our proposed CharWNN results with
reported state-of-the-art results for the WSJ corpus
• this is a remarkable result, since differently from the other
systems, our approach does not use any handcrafted features
2014/12/22 30
English Similar Words:
“Character-Level Embeddings”
• the similarity between the character-level embeddings of
rare words and words in the training set
• we present six OOSV words, and their respective most similar
words in the training set.
• character-level embedding approach is effective at learning
suffixes, prefixes and even word shapes
2014/12/22 31
English Similar Words:
“Word-Level Embeddings”
• the same six OOSV words and their respective most similar
words in the vocabulary extracted from the unlabeled data
• The similarity is computed using the word-level embeddings
• the retrieved words are morphologically less related than the
ones in Table 10
2014/12/22 32
Conclusion
• The main contributions
1. the idea of using convolutional neural networks to extract character-level
features and jointly use them with word-level features
2. demonstration that it is feasible to train state-of-the-art POS taggers for
different languages using the same model, with the same hyper-parameters,
and without any handcrafted features
3. the definition of a new state-of-the-art result for Portuguese POS tagging
on the Mac-Morpho corpus
• A weak point
o the introduction of additional hyper-parameters to be tuned
o However, we argue that it is generally preferable to tune hyper-parameters
than to handcraft features, since the former can be automated.
2014/12/22 33
Future Work
• analyzing in more detail the interrelation between the two
representations: word-level and character-level
• applying the proposed strategy to other NLP tasks
o text chunking, named entity recognition, etc
2014/12/22 34
Editor's Notes
1. Distributed word representations, also known as word embeddings, have recently been proven to be an invaluable resource for natural language processing (NLP). These representations capture syntactic and semantic information about words. With a few exceptions (Luong et al., 2013), work on learning of representations for NLP has focused exclusively on the word level. However, information about word morphology and shape is crucial for various NLP tasks. Usually, when a task needs morphological or word shape information, the knowledge is included as handcrafted features (Collobert, 2011). One task for which character-level information is very important is part-of-speech (POS) tagging, which consists in labeling each word in a text with a unique POS tag, e.g. noun, verb, pronoun and preposition. In this paper we propose a new deep neural network (DNN) architecture that joins word-level and character-level representations to perform POS tagging. The proposed DNN, which we call CharWNN, uses a convolutional layer that allows effective feature extraction from words of any size.
2. The deep neural network we propose extends Collobert et al.’s (2011) neural network architecture, which is a variant of the architecture first proposed by Bengio et al. (2003). Given a sentence, the network gives for each word a score for each tag τ ∈ T. In order to score a word, the network takes as input a fixed-sized window of words centralized in the target word. The novelty in our network architecture is the inclusion of a convolutional layer to extract character-level representations.
3. Robust methods to extract morphological information from words must take into consideration all characters of the word and select which features are more important for the task at hand. In the POS tagging task, informative features may appear in the beginning (like the prefix “un” in “unfortunate”), in the middle (like the hyphen in “self-sufficient” and the “h” in “10h30”), or at the end (like the suffix “ly” in “constantly”). In order to tackle this problem we use a convolutional approach, which was first introduced by Waibel et al. (1989).
4. The convolutional layer applies a matrix-vector operation to each window of size k^chr of successive windows in the sequence {r^chr_1, r^chr_2, ..., r^chr_M}. (I do not quite understand the phrase “the j-th element of r^wch”; the expression above should be cl_u-dimensional.)
5. Follow Collobert et al.’s (2011) window approach to score all tags T for each word in a sentence; the tag of a word depends mainly on its neighboring words, which suits POS tagging. Given a sentence with N words {w_1, w_2, ..., w_N}, which have been converted to joint word-level and character-level embeddings {u_1, u_2, ..., u_N}, tag scores are computed for the n-th word in the sentence.
6. The tags of neighboring words are strongly dependent. Therefore, a sentence-wise tag inference that captures structural information from the sentence can deal better with tag dependencies. Like in (Collobert et al., 2011), we use a prediction scheme that takes into account the sentence structure: a transition score A_{t,u} for jumping from tag t ∈ T to u ∈ T in successive words.
7. The network is trained by minimizing a negative log-likelihood over the training set D. We interpret the sentence score (5) as a conditional probability over a path. Equation 7 can be computed efficiently using dynamic programming. Use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to θ.
8. 4.1. Unsupervised Learning of Word-Level Embeddings. Word-level embeddings are learned without supervision from unlabeled corpora. Word-level embeddings capture syntactic and semantic information, which are crucial to POS tagging. We perform unsupervised learning of word-level embeddings using the word2vec tool. Corpora and preprocessing for each language: English: the source of unlabeled English text is the December 2013 snapshot of the English Wikipedia corpus. The corpus was processed using the following steps: (1) remove paragraphs that are not in English; (2) substitute non-Roman characters for a special character; (3) tokenize the text using the tokenizer available with the Stanford POS Tagger (Manning, 2011); (4) remove sentences that are less than 20 characters long (including white spaces) or have less than 5 tokens. Like in (Collobert et al., 2011) and (Luong et al., 2013), all words are lowercased and each numerical digit is substituted by a 0 (for instance, 1967 becomes 0000). After this processing, the clean corpus contains about 1.75 billion tokens. Portuguese: three sources of unlabeled text: the Portuguese Wikipedia, preprocessed using the same approach applied to the English Wikipedia corpus (however, we implemented our own Portuguese tokenizer); the CETENFolha corpus; and the CETEMPublico corpus, distributed in a tokenized format. We only include the parts of the corpora tagged as sentences; in addition, step (4) above is also applied. When concatenating the three corpora, the resulting corpus contains around 401 million words.
9. Use of word2vec. For both languages we use the same parameters when running the word2vec tool, except for the minimum word frequency. English: a word must occur at least 10 times in order to be included in the vocabulary, which resulted in a vocabulary of 870,214 entries. Portuguese: we use a minimum frequency of 5, which resulted in a vocabulary of 453,990 entries. To train our word-level embeddings we use word2vec’s skip-gram method with a context window of size five.
10. Character-level embeddings. We do not perform unsupervised learning of character-level embeddings. The character vocabulary is quite small compared to the word vocabulary, and we assume that the amount of data in the labeled POS tagging training corpora is enough to effectively train character-level embeddings. Character-level embeddings are trained using the raw (not lowercased) words.
11. 4.2 POS Tagging Datasets (datasets for POS tagging). English: Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993); we adopt the same partitions used by other researchers (Manning, 2011; Søgaard, 2011; Collobert et al., 2011) to fairly compare our results with previous work; 45 different POS tags. Portuguese: Mac-Morpho corpus (Aluísio et al., 2003), which contains around 1.2 million manually tagged words; it contains 22 POS tags (41 if punctuation marks are included) and 10 more tags that represent additional semantic aspects. We carry out tests without using the 10 additional tags. We use the same training/test partitions used by (dos Santos et al., 2008) and created a development set by randomly selecting 5% of the training set sentences.
  12. out-of-the-supervised-vocabulary words (OOSV) A word is considered OOSV if it does not appear in the training set out-of-the-unsupervised-vocabulary words (OOUV) OOUV words are the ones that do not appear in the vocabulary created using the unlabeled data (as described in Sec. 4.1), words for which we do not have word embeddings.
13. In order to assess the effectiveness of the proposed character-level representation of words, we compare the proposed architecture CharWNN with an architecture that uses only word embeddings and additional features instead of character-level embeddings of words. WNN is essentially Collobert et al.’s (2011) NN architecture; we use the same NN hyper-parameter values (when applicable) shown in Table 3. It represents a network which is fed with word representations only; for each word, its embedding is used. It also includes two additional handcrafted features; in our experiments, both capitalization and suffix embeddings have dimension five. Capitalization feature: has five possible values (all lowercased, first uppercased, all uppercased, contains an uppercased letter, all other cases). Suffix feature: suffix of size two for English and of size three for Portuguese.
14. We report POS tagging accuracy (Acc.). WNN, which does not use character-level embeddings, performs very poorly without the use of the capitalization feature. This happens because we do not have information about capitalization in our word embeddings, since we use lowercased words. When taking into account all words in the test set (column Acc.), CharWNN achieves a slightly better result than the one of WNN with two handcrafted features. Regarding out-of-vocabulary words, we can see that intra-word information is essential to better performance. For the subset of words for which we do not have word embeddings (column Acc. OOUV), the use of character-level information by CharWNN is responsible for an error reduction of 58% when compared to WNN without intra-word information (from 75.40% to 89.74%). These results demonstrate the effectiveness of our convolutional approach to learn valuable character-level information for POS tagging.
15. 4.4.1. PORTUGUESE CHARACTER-LEVEL EMBEDDINGS. We can check the effectiveness of the morphological information encoded in character-level embeddings of words by computing the similarity between the embeddings of rare words and words in the training set. In Table 6, we present four OOSV words and one OOUV word (“drograsse”), and their respective most similar words in the training set. The similarity between two words w_i and w_j is computed as the cosine between the two vectors r_i^wch and r_j^wch (character-level embeddings). We can see in Table 6 that the character-level embeddings are very effective for learning suffixes and prefixes.
16. In Table 7, we present the same five out-of-vocabulary words and their respective most similar words in the vocabulary extracted from the unlabeled data. We use word-level embeddings r^wrd to compute the similarity. The words are more semantically related, some of them being synonyms, but morphologically less similar than the ones in Table 6.
17. The similarity between the character-level embeddings of rare words and words in the training set: the character-level embedding approach is effective at learning suffixes, prefixes and even word shapes. We present six OOSV words and their respective most similar words in the training set. The similarity between two words is computed as the cosine between the two character-level embeddings of the words.
18. The same six OOSV words and their respective most similar words in the vocabulary extracted from the unlabeled data. The similarity is computed using the word-level embeddings (r^wrd). In Table 10 the retrieved words are morphologically more related than the ones in Table 11. Notice that due to our processing of the unlabeled data, we have to replace all numerical digits by 0 in order to look for the word-level embeddings of 83-year-old and 0.0055.
19. Analyzing in more detail the interrelation between the two representations: word-level and character-level.