2. Some of the challenges in Language Understanding
• Language is ambiguous:
  – Every sentence has many possible interpretations.
• Language is productive:
  – We will always encounter new words or new constructions.
• Language is culturally specific
Example: possible part-of-speech taggings of "fruit flies like a banana":
fruit  flies  like  a   banana
NN     NN     VB    DT  NN
NN     VB     P     DT  NN
NN     NN     P     DT  NN
NN     VB     VB    DT  NN
3. ML: Traditional Approach
• For each new problem/question:
  – Gather as much LABELED data as you can get
  – Throw some algorithms at it (mainly put in an SVM and keep it at that)
  – If you have actually tried more algorithms: pick the best
  – Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.)
  – Repeat…
5. Deep Learning: Why for NLP?
• Beat state of the art in:
  – Language Modeling (Mikolov et al. 2011) [WSJ AR task]
  – Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
  – Sentiment Classification (Socher et al. 2011)
  – MNIST hand-written digit recognition (Ciresan et al. 2010)
  – Image Recognition (Krizhevsky et al. 2012) [ImageNet]
6. Language semantics
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
7. One-hot encoding
• Form a vocabulary that maps lemmatized words to a unique ID (the position of the word in the vocabulary)
• Typical vocabulary sizes vary between 10,000 and 250,000
8. One-hot encoding
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID
  – For vocabulary size D = 10, the one-hot vector of word ID w = 4 is e(w) = [0 0 0 1 0 0 0 0 0 0]
• A one-hot encoding makes no assumption about word similarity
• All words are equally different from each other
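A minimal sketch of this encoding in Python (the toy vocabulary size and the 0-based indexing are assumptions for illustration):

```python
import numpy as np

def one_hot(word_id, vocab_size):
    """Return the one-hot vector e(w): all 0s except a 1 at the word's ID."""
    e = np.zeros(vocab_size)
    e[word_id] = 1.0
    return e

# Vocabulary size D = 10; index 3 (0-based) reproduces the slide's example
# vector [0 0 0 1 0 0 0 0 0 0].
print(one_hot(3, 10))
```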
9. Word representation
• Standard
  – Bag of Words
  – A one-hot encoding
  – 20k to 50k dimensions
  – Can be improved by factoring in document frequency
• Word embedding
  – Neural word embeddings
  – Uses a vector space that attempts to predict a word given a context window
  – 200-400 dimensions
Word embeddings make semantic similarity and synonyms possible.
10. Distributional representations
• "You shall know a word by the company it keeps" (J. R. Firth 1957)
• One of the most successful ideas of modern statistical NLP!
11.
• Word Embeddings (Bengio et al. 2001; Bengio et al. 2003), based on the idea of distributed representations for symbols (Hinton 1986)
• Neural word embeddings (Mnih and Hinton 2007; Collobert & Weston 2008; Turian et al. 2010; Collobert et al. 2011; Mikolov et al. 2011)
12. Neural distributional representations
• Neural word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as a dense vector
[Figure: the word "Human" represented as a dense vector of real values]
14. Word embeddings
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning.
18. Word Embeddings
• One of the most exciting areas of research in deep learning
• Introduced by Bengio et al. 2003
• W: words → R^n is a parameterized function mapping words in some language to high-dimensional vectors (200 to 500 dimensions)
  – W("cat") = (0.2, -0.4, 0.7, ...)
  – W("mat") = (0.0, 0.6, -0.1, ...)
• Typically, the function is a lookup table, parameterized by a matrix θ, with a row for each word: Wθ(wn) = θn
• W is initialized with random vectors for each word.
• The word embedding learns meaningful vectors in order to perform some task.
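A minimal sketch of W as such a lookup table (the toy vocabulary, dimensionality, and initialization scale are assumptions for illustration):

```python
import numpy as np

vocab = ["cat", "sat", "on", "the", "mat"]        # toy vocabulary
dim = 300                                         # embedding dimension
theta = np.random.randn(len(vocab), dim) * 0.01   # one random row per word

def W(word):
    """Look up the embedding row for a word: W_theta(w_n) = theta_n."""
    return theta[vocab.index(word)]

print(W("cat")[:5])   # first few coordinates of the (untrained) "cat" vector
```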
19. Learning word vectors (Collobert et al. JMLR 2011)
• Idea: a word and its context is a positive training example; a random word in the same context gives a negative training example
20. Example
• Train a network to predict whether a 5-gram (sequence of five words) is "valid"
• Source
  – Any text corpus (e.g., Wikipedia)
• Corrupt half of the 5-grams to get negative training examples (see the sketch below)
  – Make the 5-gram nonsensical, e.g. by swapping one word for a random one
  – "cat sat song the mat"
21. Neural network to determine if a 5-gram is 'valid' (Bottou 2011)
• Look up each word in the 5-gram through W
• Feed those vectors into the network R
• R tries to predict whether the 5-gram is 'valid' or 'invalid'
  – R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
  – R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0
• The network needs to learn good parameters for both W and R.
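A minimal sketch of such a scorer (the layer sizes, initialization, and the choice of a single tanh hidden layer with a sigmoid output are assumptions; in training, W and R are learned jointly):

```python
import numpy as np

vocab = ["cat", "sat", "on", "the", "mat", "song"]   # toy vocabulary
dim = 50
theta = np.random.randn(len(vocab), dim) * 0.01      # embedding table W

def W(word):
    return theta[vocab.index(word)]

U = np.random.randn(100, 5 * dim) * 0.01   # hidden-layer weights of R
b = np.zeros(100)
v = np.random.randn(100) * 0.01            # output weights of R
c = 0.0

def R(*word_vecs):
    x = np.concatenate(word_vecs)            # concatenate the five word vectors
    h = np.tanh(U @ x + b)                   # hidden layer
    return 1 / (1 + np.exp(-(v @ h + c)))    # probability that the 5-gram is 'valid'

print(R(W("cat"), W("sat"), W("on"), W("the"), W("mat")))   # untrained: about 0.5
```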
23. Idea
• "a few people sing well" → "a couple people sing well"
• The validity of the sentence doesn't change
• If W maps synonyms (like "few" and "couple") close together,
  – from R's perspective little changes.
24. Bingo
• The number of possible 5-grams is massive
• But there is only a small number of data points to learn from
• Similar classes of words
  – "the wall is blue" → "the wall is red"
• Multiple words
  – "the wall is blue" → "the ceiling is red"
• Shifting "red" closer to "blue" makes the network R perform better.
25. Word embedding property
• Analogies between words are encoded in the difference vectors between words.
  – W("woman") − W("man") ≈ W("aunt") − W("uncle")
  – W("woman") − W("man") ≈ W("queen") − W("king")
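A sketch of checking this property with pre-trained vectors via gensim (assuming gensim and its model downloader are available; the specific model name is an assumption):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pre-trained word vectors

# W("king") - W("man") + W("woman") should land near W("queen")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```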
27. Word embedding property: Shared representations
• "The use of word representations… has become a key 'secret sauce' for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling." (Luong et al. 2013)
28.
• W and F learn to perform task A. Later, G can learn to perform B based on W.
34. Simple RNN training
• Input vector: 1-of-N encoding (one-hot)
• Repeated epochs
  – S(0): vector of small values (0.1)
  – Hidden layer: 30–500 units
  – All training data from the corpus are presented sequentially
  – Initial learning rate: 0.1
  – Error function
  – Standard backpropagation with stochastic gradient descent
• Convergence is achieved after 10–20 epochs
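A minimal sketch of one forward step of such a simple (Elman) RNN language model, matching the setup above: one-hot input, a small initial state, a sigmoid hidden layer, and a softmax over the next word (sizes, names, and the specific nonlinearities are assumptions for illustration):

```python
import numpy as np

V, H = 10000, 100                        # vocabulary size, hidden units
U = np.random.randn(H, V) * 0.01         # input-to-hidden weights
Wr = np.random.randn(H, H) * 0.01        # recurrent (hidden-to-hidden) weights
Vout = np.random.randn(V, H) * 0.01      # hidden-to-output weights

s = np.full(H, 0.1)                      # s(0): vector of small values
x = np.zeros(V); x[42] = 1.0             # 1-of-N (one-hot) input word

s = 1 / (1 + np.exp(-(U @ x + Wr @ s)))  # new hidden state s(t)
y = np.exp(Vout @ s); y /= y.sum()       # softmax distribution over the next word
```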
35. Word2vec (Mikolov et al., 2013)
• Log-linear model
• Previous models: non-linear hidden layer -> complexity
• Continuous word vectors are learned using a simple model
36. Continuous BoW (CBOW) Model
• Similar to the feed-forward NNLM, but
  – the non-linear hidden layer is removed
• Called CBOW (continuous BoW) because the order of the words is lost

38. Continuous Skip-gram Model
• Similar to CBOW, but
  – tries to maximize classification of a word based on another word in the same sentence
• Predicts words within a certain window
• Observations
  – Larger window size => better quality of the resulting word vectors, higher training time
  – More distant words are usually less related to the current word than those close to it
  – Give less weight to the distant words by sampling them less often in the training examples (see the sketch below)
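A minimal sketch of generating skip-gram training pairs with a shrinking (dynamic) window, so that distant context words are sampled less often (the helper name and toy sentence are assumptions, not the reference word2vec code):

```python
import random

def skipgram_pairs(tokens, max_window=5):
    """Return (center, context) pairs; the effective window is drawn at random,
    so more distant words are sampled less often."""
    pairs = []
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split(), max_window=2))
```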
42. Recursive neural networks
• The output of a module goes into a module of the same type
• Tree-structured neural networks
• No fixed number of inputs
43. Building on Word Vector Space Models
• But how can we represent the meaning of longer phrases?
• By mapping them into the same vector space! (see the composition sketch below)
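A minimal sketch of the recursive composition idea behind this: two child vectors from the same space are combined into a parent vector of the same dimension, so phrases live in the same space as words (the composition matrix, tanh nonlinearity, and sizes are assumptions):

```python
import numpy as np

d = 50
Wc = np.random.randn(d, 2 * d) * 0.01   # composition weights
b = np.zeros(d)

def compose(left, right):
    """Parent vector = tanh(Wc [left; right] + b), same dimensionality as the children."""
    return np.tanh(Wc @ np.concatenate([left, right]) + b)

very, good = np.random.randn(d), np.random.randn(d)   # stand-in word vectors
phrase = compose(very, good)                          # vector for the phrase "very good"
print(phrase[:5])
```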
62. Language Modeling
• A language model is a probabilistic model that assigns a probability to any sequence of words: p(w1, ..., wT)
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
• Plays a crucial role in speech recognition and machine translation systems
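The standard factorization behind this (not spelled out on the slide) writes the joint probability as a product of conditionals, which the n-gram models below approximate with a fixed-length history:

```latex
p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})
```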
63. N-gram models
• An n-gram is a sequence of n words
  – unigrams (n=1): "is", "a", "sequence", etc.
  – bigrams (n=2): ["is", "a"], ["a", "sequence"], etc.
  – trigrams (n=3): ["is", "a", "sequence"], ["a", "sequence", "of"], etc.
• n-gram models estimate the conditional probability from n-gram counts (a counting sketch follows)
• The counts are obtained from a training corpus (a dataset of word text)
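A minimal sketch of the count-based estimate for a bigram model (maximum-likelihood counts on a toy corpus, with no smoothing; all names are illustrative):

```python
from collections import Counter

corpus = "an n-gram is a sequence of n words".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (w_{t-1}, w_t)
unigrams = Counter(corpus)                   # counts of w_{t-1}

def p(word, prev):
    """Estimate p(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("a", "is"))   # p("a" | "is") from the toy corpus -> 1.0
```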