2. Some of the challenges in Language Understanding
• Language is ambiguous:
  – Every sentence has many possible interpretations.
• Language is productive:
  – We will always encounter new words or new constructions.
• Language is culturally specific
Example: possible part-of-speech taggings of "fruit flies like a banana":
fruit  flies  like  a   banana
NN     NN     VB    DT  NN
NN     VB     P     DT  NN
NN     NN     P     DT  NN
NN     VB     VB    DT  NN
3. ML: Traditional Approach
• For each new problem/question:
  – Gather as much LABELED data as you can get
  – Throw some algorithms at it (mainly put in an SVM and keep it at that)
  – If you have actually tried more algorithms: pick the best
  – Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.)
  – Repeat…
5. Deep Learning: Why for NLP?
• Beat state of the art in:
  – Language Modeling (Mikolov et al. 2011) [WSJ AR task]
  – Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
  – Sentiment Classification (Socher et al. 2011)
  – MNIST hand-written digit recognition (Ciresan et al. 2010)
  – Image Recognition (Krizhevsky et al. 2012) [ImageNet]
6. Language semantics
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
7. One-hot encoding
• Form a vocabulary that maps lemmatized words to a unique ID (the position of the word in the vocabulary)
• Typical vocabulary sizes vary between 10,000 and 250,000
8. One-hot encoding
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID
  – For vocabulary size D = 10, the one-hot vector of word ID w = 4 is e(w) = [0 0 0 1 0 0 0 0 0 0]
• A one-hot encoding makes no assumption about word similarity
• All words are equally different from each other
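A minimal sketch of this encoding in Python (the toy vocabulary size and the 0-based indexing are assumptions for illustration):

```python
import numpy as np

def one_hot(word_id, vocab_size):
    """Return the one-hot vector e(w): all 0s except a 1 at the word's ID."""
    e = np.zeros(vocab_size)
    e[word_id] = 1.0
    return e

# Vocabulary size D = 10; index 3 (0-based) reproduces the slide's example
# vector [0 0 0 1 0 0 0 0 0 0].
print(one_hot(3, 10))
```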
9. Word representation
• Standard
  – Bag of Words
  – A one-hot encoding
  – 20k to 50k dimensions
  – Can be improved by factoring in document frequency
• Word embedding
  – Neural word embeddings
  – Uses a vector space that attempts to predict a word given a context window
  – 200-400 dimensions
Word embeddings make semantic similarity and synonyms possible.
10. Distributional representations
• "You shall know a word by the company it keeps" (J. R. Firth 1957)
• One of the most successful ideas of modern statistical NLP!
11.
• Word Embeddings (Bengio et al. 2001; Bengio et al. 2003), based on the idea of distributed representations for symbols (Hinton 1986)
• Neural word embeddings (Mnih and Hinton 2007; Collobert & Weston 2008; Turian et al. 2010; Collobert et al. 2011; Mikolov et al. 2011)
12. Neural distributional representations
• Neural word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as a dense vector
[Figure: the word "Human" represented as a dense vector of real values]
14. Word embeddings
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning.
18. Word Embeddings
• One of the most exciting areas of research in deep learning
• Introduced by Bengio et al. 2003
• W: words → R^n is a parameterized function mapping words in some language to high-dimensional vectors (200 to 500 dimensions)
  – W("cat") = (0.2, -0.4, 0.7, ...)
  – W("mat") = (0.0, 0.6, -0.1, ...)
• Typically, the function is a lookup table, parameterized by a matrix θ, with a row for each word: Wθ(wn) = θn
• W is initialized with random vectors for each word.
• The word embedding learns meaningful vectors in order to perform some task.
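A minimal sketch of W as such a lookup table (the toy vocabulary, dimensionality, and initialization scale are assumptions for illustration):

```python
import numpy as np

vocab = ["cat", "sat", "on", "the", "mat"]        # toy vocabulary
dim = 300                                         # embedding dimension
theta = np.random.randn(len(vocab), dim) * 0.01   # one random row per word

def W(word):
    """Look up the embedding row for a word: W_theta(w_n) = theta_n."""
    return theta[vocab.index(word)]

print(W("cat")[:5])   # first few coordinates of the (untrained) "cat" vector
```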
19. Learning word vectors (Collobert et al. JMLR 2011)
• Idea: a word and its context is a positive training example; a random word in the same context gives a negative training example
20. Example
• Train a network to predict whether a 5-gram (sequence of five words) is "valid"
• Source
  – Any text corpus (e.g., Wikipedia)
• Corrupt half of the 5-grams to get negative training examples (see the sketch below)
  – Make the 5-gram nonsensical, e.g. by swapping one word for a random one
  – "cat sat song the mat"
21. Neural network to determine if a 5-gram is 'valid' (Bottou 2011)
• Look up each word in the 5-gram through W
• Feed those vectors into the network R
• R tries to predict whether the 5-gram is 'valid' or 'invalid'
  – R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
  – R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0
• The network needs to learn good parameters for both W and R.
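A minimal sketch of such a scorer (the layer sizes, initialization, and the choice of a single tanh hidden layer with a sigmoid output are assumptions; in training, W and R are learned jointly):

```python
import numpy as np

vocab = ["cat", "sat", "on", "the", "mat", "song"]   # toy vocabulary
dim = 50
theta = np.random.randn(len(vocab), dim) * 0.01      # embedding table W

def W(word):
    return theta[vocab.index(word)]

U = np.random.randn(100, 5 * dim) * 0.01   # hidden-layer weights of R
b = np.zeros(100)
v = np.random.randn(100) * 0.01            # output weights of R
c = 0.0

def R(*word_vecs):
    x = np.concatenate(word_vecs)            # concatenate the five word vectors
    h = np.tanh(U @ x + b)                   # hidden layer
    return 1 / (1 + np.exp(-(v @ h + c)))    # probability that the 5-gram is 'valid'

print(R(W("cat"), W("sat"), W("on"), W("the"), W("mat")))   # untrained: about 0.5
```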
23. Idea
• "a few people sing well" → "a couple people sing well"
• The validity of the sentence doesn't change
• If W maps synonyms (like "few" and "couple") close together,
  – from R's perspective little changes.
24. Bingo
• The number of possible 5-grams is massive
• But there is only a small number of data points to learn from
• Similar classes of words
  – "the wall is blue" → "the wall is red"
• Multiple words
  – "the wall is blue" → "the ceiling is red"
• Shifting "red" closer to "blue" makes the network R perform better.
25. Word embedding property
• Analogies between words are encoded in the difference vectors between words.
  – W("woman") − W("man") ≈ W("aunt") − W("uncle")
  – W("woman") − W("man") ≈ W("queen") − W("king")
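A sketch of checking this property with pre-trained vectors via gensim (assuming gensim and its model downloader are available; the specific model name is an assumption):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pre-trained word vectors

# W("king") - W("man") + W("woman") should land near W("queen")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```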
27. Word embedding property: Shared representations
• "The use of word representations… has become a key 'secret sauce' for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling." (Luong et al. 2013)
28.
• W and F learn to perform task A. Later, G can learn to perform B based on W.
34. Simple RNN training
• Input vector: 1-of-N encoding (one-hot)
• Repeated epochs
  – S(0): vector of small values (0.1)
  – Hidden layer: 30–500 units
  – All training data from the corpus are presented sequentially
  – Initial learning rate: 0.1
  – Error function
  – Standard backpropagation with stochastic gradient descent
• Convergence is achieved after 10–20 epochs
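A minimal sketch of one forward step of such a simple (Elman) RNN language model, matching the setup above: one-hot input, a small initial state, a sigmoid hidden layer, and a softmax over the next word (sizes, names, and the specific nonlinearities are assumptions for illustration):

```python
import numpy as np

V, H = 10000, 100                        # vocabulary size, hidden units
U = np.random.randn(H, V) * 0.01         # input-to-hidden weights
Wr = np.random.randn(H, H) * 0.01        # recurrent (hidden-to-hidden) weights
Vout = np.random.randn(V, H) * 0.01      # hidden-to-output weights

s = np.full(H, 0.1)                      # s(0): vector of small values
x = np.zeros(V); x[42] = 1.0             # 1-of-N (one-hot) input word

s = 1 / (1 + np.exp(-(U @ x + Wr @ s)))  # new hidden state s(t)
y = np.exp(Vout @ s); y /= y.sum()       # softmax distribution over the next word
```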
35. Word2vec (Mikolov et al., 2013)
• Log-linear model
• Previous models: non-linear hidden layer -> complexity
• Continuous word vectors are learned using a simple model
36. Continuous BoW (CBOW) Model
• Similar to the feed-forward NNLM, but
  – the non-linear hidden layer is removed
• Called CBOW (continuous BoW) because the order of the words is lost

38. Continuous Skip-gram Model
• Similar to CBOW, but
  – tries to maximize classification of a word based on another word in the same sentence
• Predicts words within a certain window
• Observations
  – Larger window size => better quality of the resulting word vectors, higher training time
  – More distant words are usually less related to the current word than those close to it
  – Give less weight to the distant words by sampling them less often in the training examples (see the sketch below)
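A minimal sketch of generating skip-gram training pairs with a shrinking (dynamic) window, so that distant context words are sampled less often (the helper name and toy sentence are assumptions, not the reference word2vec code):

```python
import random

def skipgram_pairs(tokens, max_window=5):
    """Return (center, context) pairs; the effective window is drawn at random,
    so more distant words are sampled less often."""
    pairs = []
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split(), max_window=2))
```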
42. Recursive neural networks
• The output of a module goes into a module of the same type
• Tree-structured neural networks
• No fixed number of inputs
43. Building on Word Vector Space Models
• But how can we represent the meaning of longer phrases?
• By mapping them into the same vector space! (see the composition sketch below)
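A minimal sketch of the recursive composition idea behind this: two child vectors from the same space are combined into a parent vector of the same dimension, so phrases live in the same space as words (the composition matrix, tanh nonlinearity, and sizes are assumptions):

```python
import numpy as np

d = 50
Wc = np.random.randn(d, 2 * d) * 0.01   # composition weights
b = np.zeros(d)

def compose(left, right):
    """Parent vector = tanh(Wc [left; right] + b), same dimensionality as the children."""
    return np.tanh(Wc @ np.concatenate([left, right]) + b)

very, good = np.random.randn(d), np.random.randn(d)   # stand-in word vectors
phrase = compose(very, good)                          # vector for the phrase "very good"
print(phrase[:5])
```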
62. Language Modeling
• A language model is a probabilistic model that assigns a probability to any sequence of words: p(w1, ..., wT)
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
• Plays a crucial role in speech recognition and machine translation systems
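The standard factorization behind this (not spelled out on the slide) writes the joint probability as a product of conditionals, which the n-gram models below approximate with a fixed-length history:

```latex
p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})
```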
63. N-gram models
• An n-gram is a sequence of n words
  – unigrams (n=1): "is", "a", "sequence", etc.
  – bigrams (n=2): ["is", "a"], ["a", "sequence"], etc.
  – trigrams (n=3): ["is", "a", "sequence"], ["a", "sequence", "of"], etc.
• n-gram models estimate the conditional probability from n-gram counts (a counting sketch follows)
• The counts are obtained from a training corpus (a dataset of word text)
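A minimal sketch of the count-based estimate for a bigram model (maximum-likelihood counts on a toy corpus, with no smoothing; all names are illustrative):

```python
from collections import Counter

corpus = "an n-gram is a sequence of n words".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (w_{t-1}, w_t)
unigrams = Counter(corpus)                   # counts of w_{t-1}

def p(word, prev):
    """Estimate p(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("a", "is"))   # p("a" | "is") from the toy corpus -> 1.0
```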