Presented at the 12/22 Deep Learning Study Group @ Komachi Lab (小町研):
"Learning Character-level Representations for Part-of-Speech Tagging" by Cícero Nogueira dos Santos and Bianca Zadrozny.
1. Deep Learning Study Group @ Komachi Lab (小町研)
Learning Character-level Representations for Part-of-Speech Tagging
Cícero Nogueira dos Santos, Bianca Zadrozny
IBM Research - Brazil, Av. Pasteur, 138/146 – Rio de Janeiro, 22296-903, Brazil
Tokyo Metropolitan University, Department of Information and Communication Systems
Komachi Lab, M1 塘優旗
2. Introduction
• Distributed word representations (word embeddings) are an invaluable resource for NLP
• Word embeddings capture syntactic and semantic information about words; with a few exceptions (Luong et al., 2013), representation learning has focused exclusively on the word level
• One task for which character-level information is very important is POS tagging
• Usually, when a task needs morphological or word shape information, the knowledge is included as handcrafted features (Collobert, 2011)
• The paper proposes a new deep neural network (DNN) architecture that joins word-level and character-level embeddings to perform POS tagging
• The proposed DNN, which we call CharWNN
3. Neural Network Architecture
• The deep neural network we propose extends Collobert et al.'s (2011) neural network architecture
• Given a sentence, the network gives for each word a score for each tag τ ∈ T
• In order to score a word, the network takes as input a fixed-sized window of words centered on the target word
• The novelty in our network architecture is the inclusion of a convolutional layer to extract character-level representations
5. 1st Layer of the Network
• transforms words into real-valued feature vectors (embeddings)
• given a sentence consisting of N words {w_1, w_2, ..., w_N}
• every word w_n is converted into a vector u_n composed of two sub-vectors: the word-level embedding r^wrd and the character-level embedding r^wch
6. Word-Level Embeddings
• V^wrd : a fixed-sized word vocabulary
• v^w : a one-hot vector of size |V^wrd| representing the word w
• d^wrd : the word embedding dimension, a hyper-parameter to be chosen by the user
• W^wrd ∈ R^(d^wrd × |V^wrd|) : a parameter to be learned
• the word-level embedding is the matrix-vector product r^wrd = W^wrd v^w
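As a minimal sketch (not the paper's code), the lookup r^wrd = W^wrd v^w can be written in NumPy; the toy vocabulary, dimension, and random initialization below are illustrative assumptions:

```python
import numpy as np

d_wrd = 100                                # word embedding dimension d^wrd (illustrative)
vocab = {"the": 0, "dog": 1, "barks": 2}   # toy stand-in for V^wrd

# W^wrd is a parameter to be learned; random initialization is an assumption
W_wrd = np.random.randn(d_wrd, len(vocab)) * 0.01

def word_embedding(word: str) -> np.ndarray:
    """r^wrd = W^wrd v^w: multiplying by the one-hot vector v^w selects a column."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return W_wrd @ v        # equivalent to W_wrd[:, vocab[word]]

r_wrd = word_embedding("dog")   # shape (d_wrd,)
```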
8. Character-Level Embeddings
• given a word w composed of M characters {c_1, c_2, ..., c_M}, transform each character c_m into a character embedding r^chr_m
• V^chr : a fixed-sized character vocabulary
• v^c : a one-hot vector of size |V^chr| representing the character c
• W^chr ∈ R^(d^chr × |V^chr|) : a parameter to be learned
• the character embedding is the matrix-vector product r^chr = W^chr v^c
• the input for the convolutional layer is the sequence of character embeddings {r^chr_1, r^chr_2, ..., r^chr_M}
9. Convolutional Layer
• input: the sequence of character embeddings {r^chr_1, r^chr_2, ..., r^chr_M}
• k^chr : the character context window size
• define the vector z_m as the concatenation of the character embeddings in the window centered on the m-th character:
  z_m = ( r^chr_(m−(k^chr−1)/2), ..., r^chr_(m+(k^chr−1)/2) )
10. The convolutional layer computes the j-th element of the vector r^wch, which is the character-level embedding of w:
  [r^wch]_j = max_(1 ≤ m ≤ M) [W^0 z_m + b^0]_j
• W^0 ∈ R^(cl_u × (d^chr k^chr)) : the weight matrix of the convolutional layer
• W^0, b^0 : parameters to be learned
• hyper-parameters:
  o cl_u : the number of convolutional units (which corresponds to the size of the character-level embedding of a word)
  o k^chr : the size of the character context window
  o d^chr : the size of the character vector
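A hedged NumPy sketch of this convolution-plus-max-pooling step; the dimensions and the zero-padding at word boundaries are illustrative assumptions:

```python
import numpy as np

d_chr, k_chr, cl_u = 10, 3, 50     # d^chr, k^chr, cl_u (illustrative values)

def char_level_embedding(r_chr: np.ndarray, W0: np.ndarray, b0: np.ndarray) -> np.ndarray:
    """r_chr: (M, d_chr) embeddings of one word's characters.
    Returns r^wch, the cl_u-dimensional character-level embedding of the word."""
    M = r_chr.shape[0]
    half = (k_chr - 1) // 2
    padded = np.vstack([np.zeros((half, d_chr)), r_chr, np.zeros((half, d_chr))])
    scores = []
    for m in range(M):
        z_m = padded[m:m + k_chr].ravel()    # window vector z_m, size d_chr * k_chr
        scores.append(W0 @ z_m + b0)         # one cl_u-dimensional vector per position
    return np.max(np.stack(scores), axis=0)  # element-wise max over all positions

W0 = np.random.randn(cl_u, d_chr * k_chr) * 0.01   # learned in practice
b0 = np.zeros(cl_u)
r_wch = char_level_embedding(np.random.randn(7, d_chr), W0, b0)   # 7-character word
```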
11. Scoring
• follow Collobert et al.'s (2011) window approach to score all tags T for each word in a sentence
• assumption: the tag of a word depends mainly on its neighboring words
• given a sentence with N words, which have been converted to joint word-level and character-level embeddings {u_1, u_2, ..., u_N}
• create a vector x_n resulting from the concatenation of a sequence of k^wrd embeddings centralized in the n-th word; positions outside the sentence use a padding token
• k^wrd, the size of the context window, is a hyper-parameter
12. • the vector x_n is processed by two usual NN layers:
  s(x_n) = W^2 h(W^1 x_n + b^1) + b^2
• h : the transfer function, the hyperbolic tangent tanh
• W^1, b^1, W^2, b^2 : parameters to be learned
• hl_u : the number of hidden units, a hyper-parameter
• s(x_n) : scores for all tags for the n-th word
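Combining the window construction and the two layers, a minimal sketch follows; the weights W1, b1, W2, b2 and the padding embedding would be learned in practice, and shapes are illustrative:

```python
import numpy as np

def score_word(u: np.ndarray, n: int, k_wrd: int, W1, b1, W2, b2,
               pad: np.ndarray) -> np.ndarray:
    """Return s(x_n): one score per tag in T for the n-th word."""
    half = (k_wrd - 1) // 2
    window = [u[i] if 0 <= i < len(u) else pad      # padding token outside the sentence
              for i in range(n - half, n + half + 1)]
    x_n = np.concatenate(window)                    # x_n has size k_wrd * dim(u_n)
    hidden = np.tanh(W1 @ x_n + b1)                 # first layer, hl_u hidden units
    return W2 @ hidden + b2                         # second layer: |T| tag scores
```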
13. Structured Inference
• the tags of neighboring words are strongly dependent
• use a prediction scheme that takes into account the sentence structure (Collobert et al., 2011)
• A_(t,u) : transition score for jumping from tag t to tag u in successive words
• given the sentence, the score for a tag path [t]_1^N is computed as:
  S([w]_1^N, [t]_1^N, θ) = Σ_(n=1..N) ( A_(t_(n−1), t_n) + s(x_n)_(t_n) )
• infer the predicted sentence tags using the Viterbi (1967) algorithm
• θ : the set of all trainable network parameters
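A hedged sketch of Viterbi decoding over the per-word scores s(x_n) and the transition matrix A; a possible start-transition term is omitted for brevity:

```python
import numpy as np

def viterbi(emit: np.ndarray, A: np.ndarray) -> list:
    """emit: (N, T) per-word tag scores s(x_n); A: (T, T) transition scores A_(t,u).
    Returns the highest-scoring tag path."""
    N, T = emit.shape
    delta = emit[0].copy()                 # best score of a path ending in each tag
    back = np.zeros((N, T), dtype=int)
    for n in range(1, N):
        cand = delta[:, None] + A          # cand[prev, cur]: best path arriving via prev
        back[n] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emit[n]
    path = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):          # follow the back-pointers
        path.append(int(back[n, path[-1]]))
    return path[::-1]
```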
14. Network Training
• the network is trained by minimizing the negative log-likelihood over the training set D (Collobert, 2011)
• the sentence score (5) is interpreted as a conditional probability over a path:
  log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − log Σ_(all paths [u]_1^N) e^( S([w]_1^N, [u]_1^N, θ) )
• Equation 7 can be computed efficiently using dynamic programming (Collobert, 2011)
• use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to θ
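The dynamic program behind the normalization term is a forward pass in log space; a sketch with the same emit/A conventions as the Viterbi sketch above, start transitions again omitted:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(emit: np.ndarray, A: np.ndarray) -> float:
    """log Σ over all tag paths of e^S, in O(N·T²) time via dynamic programming."""
    alpha = emit[0].copy()                 # log-sum of all paths ending in each tag
    for n in range(1, emit.shape[0]):
        alpha = logsumexp(alpha[:, None] + A, axis=0) + emit[n]
    return float(logsumexp(alpha))

def neg_log_likelihood(emit: np.ndarray, A: np.ndarray, tags: list) -> float:
    """-log p([t] | [w], θ) = log-partition minus the gold path score S."""
    gold = emit[0, tags[0]] + sum(A[tags[n - 1], tags[n]] + emit[n, tags[n]]
                                  for n in range(1, len(tags)))
    return log_partition(emit, A) - gold
```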
16. Preprocessing and Tokenizing Corpora
(the sources of unlabeled text)
• English
  o December 2013 snapshot of the English Wikipedia corpus
  o the clean corpus contains about 1.75 billion tokens
• Portuguese
  o the Portuguese Wikipedia
  o the CETENFolha corpus
  o the CETEMPublico corpus
  o concatenating the three corpora, the resulting corpus contains around 401 million words
17. Unsupervised Learning of Word-Level Embeddings Using the word2vec Tool
• word2vec
  o skip-gram method
  o context window of size 5
• the same word2vec parameters are used for English and Portuguese, except for the minimum word frequency
• minimum word frequency
  o English : 10
  o Portuguese : 5
• resulting vocabulary size
  o English : 870,214 entries
  o Portuguese : 453,990 entries
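As a rough modern equivalent (an assumption: the paper uses the original word2vec tool, not gensim, and "sentences.txt" is a hypothetical tokenized, one-sentence-per-line corpus file):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    LineSentence("sentences.txt"),
    sg=1,             # skip-gram method
    window=5,         # context window of size 5
    min_count=10,     # minimum word frequency: 10 for English, 5 for Portuguese
    vector_size=100,  # embedding dimension (illustrative, not from the slide)
)
word_vectors = model.wv   # lookup table used to initialize W^wrd
```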
18. Supervised Learning of Character-Level Embeddings
• character-level embeddings are learned in a supervised way (not unsupervised)
• the character vocabulary is quite small compared to the word vocabulary
• the amount of data in the labeled POS tagging training corpora is assumed to be enough to effectively train character-level embeddings
• trained using the raw (not lowercased) words
19. POS Tagging Datasets
• English
  o Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993)
  o 45 different POS tags
  o the same partitions used by other researchers (Manning, 2011; Søgaard, 2011; Collobert et al., 2011), to fairly compare results with previous work
• Portuguese
  o Mac-Morpho corpus (Aluísio et al., 2003)
  o 22 POS tags
  o the same training/test partitions used by dos Santos et al. (2008)
  o a development set was created by randomly selecting 5% of the training set sentences
20. • out-of-the-supervised-vocabulary words (OOSV)
  o words that do not appear in the training set
• out-of-the-unsupervised-vocabulary words (OOUV)
  o words that do not appear in the vocabulary created using the unlabeled data
  o i.e., words for which we do not have word embeddings
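A small sketch of the two categories; the toy vocabularies are placeholders:

```python
# Toy vocabularies (placeholders) for the two out-of-vocabulary categories.
train_vocab = {"the", "dog", "barks"}         # words seen in the labeled training set
unsup_vocab = {"the", "dog", "barks", "cat"}  # vocabulary built from the unlabeled data

def oov_categories(word: str) -> list:
    cats = []
    if word not in train_vocab:
        cats.append("OOSV")   # out-of-the-supervised-vocabulary
    if word not in unsup_vocab:
        cats.append("OOUV")   # out-of-the-unsupervised-vocabulary: no word embedding
    return cats

print(oov_categories("cat"))    # ['OOSV']
print(oov_categories("meows"))  # ['OOSV', 'OOUV']
```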
21. Model Setup
• use the development sets to tune the neural network hyper-parameters (those to be chosen by the user)
• in SGD training, the learning rate is the hyper-parameter with the largest impact on prediction performance
• use a learning rate schedule that decreases the learning rate λ according to the training epoch t
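The slide does not give the exact schedule; one common choice consistent with the description (an assumption, not quoted from the paper) divides the initial rate by the epoch number:

```python
def learning_rate(initial_lr: float, epoch: int) -> float:
    """Decrease λ with the training epoch t (assumed schedule: λ_t = λ / t)."""
    return initial_lr / epoch          # epoch is 1-based

for t in range(1, 6):                  # e.g., the 5 training epochs used for the WSJ corpus
    lr = learning_rate(0.0075, t)      # 0.0075 is an illustrative initial λ
    # ... run one SGD epoch over the training sentences with rate lr ...
```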
22. Hyper-Parameters
• use the same set of hyper-parameters for both English and Portuguese
• this provides some indication of the robustness of the approach across languages
• the number of training epochs is the only difference in the training setup for the two languages
  o the WSJ corpus : 5 training epochs
  o the Mac-Morpho corpus : 8 training epochs
23. WNN Model (compared with the proposed CharWNN)
• a baseline used to assess the effectiveness of the proposed character-level representation of words
• WNN
  o essentially Collobert et al.'s (2011) NN architecture
  o uses the same NN hyper-parameter values (when applicable) shown in Table 3
  o represents a network which is fed with word representations only
• it also includes two additional handcrafted features
  o capitalization feature : five possible values (all lowercased, first uppercased, all uppercased, contains an uppercased letter, all other cases)
  o suffix feature : suffix of size two for English and of size three for Portuguese
• capitalization and suffix embeddings have dimension 5
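A hedged sketch of the two handcrafted features, following the five capitalization categories listed above; note that str.istitle is only an approximation of "first uppercased":

```python
def cap_feature(word: str) -> int:
    """Map a word to one of the five capitalization categories."""
    if not any(c.isalpha() for c in word):
        return 4                        # all other cases
    if word.islower():
        return 0                        # all lowercased
    if word.istitle():
        return 1                        # first uppercased (approximated via istitle)
    if word.isupper():
        return 2                        # all uppercased
    if any(c.isupper() for c in word):
        return 3                        # contains an uppercased letter
    return 4                            # all other cases

def suffix_feature(word: str, size: int = 2) -> str:
    """Suffix of size two for English (three for Portuguese)."""
    return word[-size:]                 # e.g., "running" -> "ng"

# Each feature value then indexes a learned embedding of dimension 5.
```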
25. Results for POS Tagging of the Portuguese Language
• the results demonstrate the effectiveness of the convolutional approach to learning valuable character-level information for POS tagging
26. Compare CharWNN with State-of-the-Art (Portuguese)
• compare the result of CharWNN with reported state-of-the-art results for the Mac-Morpho corpus
• in both pieces of prior work, the authors use many handcrafted features, most of them to deal with out-of-vocabulary words
• this is a remarkable result, since the model is trained from scratch, i.e., without the use of any handcrafted features
27. Portuguese Similar Words: Character-Level Embeddings
• check the effectiveness of the morphological information encoded in character-level embeddings of words
• the similarity between two words is computed as the cosine between their two character-level embedding vectors
• the character-level embeddings are very effective for learning suffixes and prefixes
[Table 6: OOSV and OOUV words with their most similar words in the training set]
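The similarity computation itself is simple; a minimal cosine sketch, used the same way for the word-level comparisons on the following slides:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query: np.ndarray, vocab_embs: dict, k: int = 5) -> list:
    """The k words whose embeddings are closest to `query` by cosine."""
    return sorted(vocab_embs, key=lambda w: -cosine(query, vocab_embs[w]))[:k]
```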
28. Portuguese Similar Words: Word-Level Embeddings
• the same five out-of-vocabulary words and their respective most similar words in the vocabulary extracted from the unlabeled data
• word-level embeddings are used to compute the similarity
• the retrieved words are more semantically related, some of them being synonyms
• they are morphologically less similar than the ones in Table 6
[Table 7: OOSV and OOUV words with their most similar words in the unlabeled-data vocabulary]
29. Results for POS Tagging of the English Language
• using the character-level embeddings of words (CharWNN), we again get better results than the ones obtained with the WNN that uses handcrafted features
• the convolutional approach to learning character-level embeddings can be successfully applied to the English language
30. Compare CharWNN with State-of-the-Art (English)
• a comparison of the proposed CharWNN results with reported state-of-the-art results for the WSJ corpus
• this is a remarkable result since, differently from the other systems, the approach does not use any handcrafted features
31. English Similar Words: Character-Level Embeddings
• the similarity between the character-level embeddings of rare words and words in the training set
• six OOSV words are presented with their respective most similar words in the training set
• the character-level embedding approach is effective at learning suffixes, prefixes, and even word shapes
32. English Similar Words: Word-Level Embeddings
• the same six OOSV words and their respective most similar words in the vocabulary extracted from the unlabeled data
• the similarity is computed using the word-level embeddings
• the retrieved words are morphologically less related than the ones in Table 10
33. Conclusion
• the main contributions:
  1. the idea of using convolutional neural networks to extract character-level features and jointly use them with word-level features
  2. the demonstration that it is feasible to train state-of-the-art POS taggers for different languages using the same model, with the same hyper-parameters, and without any handcrafted features
  3. the definition of a new state-of-the-art result for Portuguese POS tagging on the Mac-Morpho corpus
• a weak point:
  o the introduction of additional hyper-parameters to be tuned
  o however, the authors argue that it is generally preferable to tune hyper-parameters than to handcraft features, since the former can be automated
34. Future Work
• analyzing in more detail the interrelation between the two
representations: word-level and character-level
• applying the proposed strategy to other NLP tasks
o text chunking, named entity recognition, etc
Editor's Notes
Distributed word representations, also known as word embeddings, have recently been proven to be an invaluable resource for natural language processing (NLP).
These representations capture syntactic and semantic information about words.
With a few exceptions (Luong et al., 2013), work on learning of representations for NLP has focused exclusively on the word level.
However, information about word morphology and shape is crucial for various NLP tasks, such as part-of-speech (POS) tagging.
Usually, when a task needs morphological or word shape information, the knowledge is included as handcrafted features (Collobert, 2011).
One task for which character-level information is very important is part-of-speech (POS) tagging, which consists in labeling each word in a text with a unique POS tag, e.g. noun, verb, pronoun and preposition.
In this paper we propose a new deep neural network (DNN) architecture that joins word-level and character-level representations to perform POS tagging. The proposed DNN, which we call CharWNN, uses a convolutional layer that allows effective feature extraction from words of any size.
The deep neural network we propose extends Collobert et al.'s (2011) neural network architecture, which is a variant of the architecture first proposed by Bengio et al. (2003).
Given a sentence, the network gives for each word a score for each tag τ ∈ T .
In order to score a word, the network takes as input a fixed-sized window of words centralized in the target word.
The novelty in our network architecture is the inclusion of a convolutional layer to extract character-level representations.
Robust methods to extract morphological information from words must take into consideration all characters of the word and select which features are more important for the task at hand.
in the POS tagging task, informative features may appear in the beginning (like the prefix "un" in "unfortunate"), in the middle (like the hyphen in "self-sufficient" and the "h" in "10h30"), or at the end (like the suffix "ly" in "constantly").
In order to tackle this problem we use a convolutional approach, which was first introduced by Waibel et al. (1989).
The convolutional layer applies a matrix-vector operation to each window of size k^chr of successive character embeddings in the sequence {r^chr_1, r^chr_2, ..., r^chr_M}.
(Presenter's note: the description of "the j-th element of r^wch" is hard to follow; the expression above should be cl_u-dimensional.)
follow Collobert et al.’s (2011) window approach to score all tags T for each word in a sentence.
the tag of a word depends mainly on its neighboring words, for pos tagging
Given a sentence with N words {w_1, w_2, ..., w_N}, which have been converted to joint word-level and character-level embeddings {u_1, u_2, ..., u_N}, to compute tag scores for the n-th word in the sentence,
the tags of neighboring words are strongly dependent
Therefore, a sentence-wise tag inference that captures structural information from the sentence can deal better with tag dependencies.
Like in (Collobert et al., 2011), we use a prediction scheme that takes into account the sentence structure.
a transition score A_(t,u) for jumping from tag t ∈ T to u ∈ T in successive words,
the network is trained by minimizing the negative log-likelihood over the training set D.
we interpret the sentence score (5) as a conditional probability over a path
Equation 7 can be computed efficiently using dynamic programming
use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to θ
4.1. Unsupervised Learning of Word-Level Embeddings
Word-level embeddings are learned in an unsupervised way from unlabeled corpora.
Word-level embeddings capture syntactic and semantic information, which are crucial to POS tagging.
We perform unsupervised learning of word-level embeddings using the word2vec tool
Corpora and preprocessing for each language:
English
the source of unlabeled English text:
December 2013 snapshot of the English Wikipedia corpus
The corpus was processed using the following steps:
(1) remove paragraphs that are not in English;
(2) substitute non-roman characters with a special character;
(3) tokenize the text using the tokenizer available with the Stanford POS Tagger (Manning, 2011);
(4) remove sentences that are less than 20 characters long (including white spaces) or have less than 5 tokens.
Like in (Collobert et al., 2011) and (Luong et al., 2013),
lowercase all words
substitute each numerical digit by a 0 (for instance, 1967 becomes 0000).
After the above processing steps, the clean corpus contains about 1.75 billion tokens.
Portuguese
three sources of unlabeled text:
the Portuguese Wikipedia;
preprocessed using the same approach applied to the English Wikipedia corpus; however, we implemented our own Portuguese tokenizer.
the CETENFolha corpus;
the CETEMPublico corpus
distributed in a tokenized format
We only include the parts of the corpora tagged as sentences.
In addition, step (4) above is also applied.
When concatenating the three corpora, the resulting corpus contains around 401 million words.
Using word2vec:
For both languages we use the same parameters when running the word2vec tool, except for the minimum word frequency.
English
a word must occur at least 10 times in order to be included in the vocabulary, which resulted in a vocabulary of 870,214 entries.
Portuguese
we use a minimum frequency of 5, which resulted in a vocabulary of 453,990 entries.
To train our word-level embeddings we use word2vec’s skip-gram method with a context window of size five.
character-level embeddings.
do not perform unsupervised learning of character-level embeddings.
the character vocabulary is quite small if compared to the word vocabulary
assume that the amount of data in the labeled POS tagging training corpora is enough to effectively train character-level embeddings.
character-level embeddings are trained using the raw (not lowercased) words
4.2 POS Tagging Datasets
English
Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993)
adopt the same partitions used by other researchers(Manning, 2011; Søgaard, 2011; Collobert et al., 2011) to fairly compare our results with previous work
45 different POS tags
Portuguese
Mac-Morpho corpus (Aluísio et al., 2003)
contains around 1.2 million manually tagged words
contains 22 POS tags (41 if punctuation marks are included) and 10 more tags that represent additional semantic aspects
We carry out tests without using the 10 additional tags.
the same training/test partitions used by (dos Santos et al., 2008)
created a development set by randomly selecting 5% of the training set sentences.
out-of-the-supervised-vocabulary words (OOSV)
A word is considered OOSV if it does not appear in the training set
out-of-the-unsupervised-vocabulary words (OOUV)
OOUV words are the ones that do not appear in the vocabulary created using the unlabeled data (as described in Sec. 4.1), words for which we do not have word embeddings.
In order to assess the effectiveness of the proposed character-level representation of words,
we compare
the proposed architecture CharWNN with
an architecture that uses only word embeddings and additional features instead of character-level embeddings of words.
WNN
WNN is essentially Collobert et al.’s (2011) NN architecture.
use the same NN hyper-parameters values (when applicable) shown in Table 3
represents a network which is fed with word representations only,
for each word, its embedding is the concatenation of its word embedding and the embeddings of the two handcrafted features below
it also includes two additional handcrafted features :
In our experiments, both capitalization and suffix embeddings have dimension five.
capitalization feature : has five possible values
all lowercased, first uppercased, all uppercased, contains an uppercased letter, all other cases.
suffix feature : suffix of size two for English and of size three for Portuguese
we report POS Tagging accuracy (Acc.)
WNN, which does not use character-level embeddings, performs very poorly without the use of the capitalization feature.
This happens because we do not have information about capitalization in our word embeddings, since we use lowercased words.
When taking into account all words in the test set (column Acc.), CharWNN achieves a slightly better result than the one of WNN with two handcrafted features.
Regarding out-of-vocabulary words, we can see that intra-word information is essential to better performance.
For the subset of words for which we do not have word embeddings (column Acc. OOUV), the use of character-level information by CharWNN is responsible for an error reduction of 58% when compared to WNN without intra-word information (from 75.40% to 89.74%).
These results demonstrate the effectiveness of our convolutional approach to learn valuable character-level information for POS Tagging.
4.4.1. PORTUGUESE CHARACTER-LEVEL EMBEDDINGS
We can check the effectiveness of the morphological information encoded in character-level embeddings of words by computing the similarity between the embeddings of
rare words and words in the training set.
In Table 6, we present four OOSV words and one OOUV word (“drograsse”), and their respective most similar words in the training set.
The similarity between two words w_i and w_j is computed as the cosine between the two vectors r_i^wch and r_j^wch (character-level embeddings).
We can see in Table 6 that the character-level embeddings are very effective for learning suffixes and prefixes.
In Table 7, we present the same five out-of-vocabulary words and their respective most similar words in the vocabulary extracted from the unlabeled data
we use word-level embeddings r^wrd to compute the similarity
the words are more semantically related, some of them being synonyms
morphologically less similar than the ones in Table 6.
the similarity between the character-level embeddings of rare words and words in the training set
character-level embedding approach is effective at learning suffixes, prefixes and even word shapes
we present six OOSV words, and their respective most similar words in the training set.
the similarity between two words is computed as the cosine between the two character-level embeddings of the words.
the same six OOSV words and their respective most similar words in the vocabulary extracted from the unlabeled data.
The similarity is computed using the word-level embeddings (rwrd).
in Table 10 the retrieved words are morphologically more related than the ones in Table 11
Notice that due to our processing of the unlabeled data, we have to replace all numerical digits by 0 in order to look for the word-level embeddings of 83-year-old and 0.0055.
analyzing in more detail the interrelation between the two representations: word-level and character-level