Representation Learning
of Text for NLP
Anuj Gupta
@anujgupta82
anujgupta82@gmail.com
anujgupta-82
About Instructor
Anuj is Director – Machine Learning at Huawei
Technologies. Prior to this he was heading ML efforts at
FreshWorks, Airwoot and Droom; working in the area of
NLP, Vision, Machine Learning, Deep learning
Speaker at prestigious forums like Anthill, PyData, Fifth
Elephant, ICDCN, PODC, IIT Delhi, IIIT Hyderabad.
Co-organizer of special interest groups like DLBLR.
@anujgupta82
anujgupta82@gmail.com
Objective of this Workshop
•  Deep dive into state-of-the-art techniques for representing text data.
•  By the end of this workshop, you would have gained a deeper understanding of key
ideas, maths and code powering these techniques.
•  You will be able to apply these techniques in solving NLP problems of your interest.
•  Help you achieve higher accuracies.
•  Help you achieve deeper understanding.
•  Target audience: Data science teams, industry practitioners, researchers, enthusiasts in
the area of NLP
•  This will be a very hands-on workshop
4
I learn best with toy
code
that I can play with.
But unless you know
key concepts, you can’t
code.
In this workshop, we will do both
5
Outline
Workshop is divided into 4 modules. We will cover modules 1 and 2 on Day 1, and modules
3 and 4 on Day 2. The github repo has folders for each of the 4 modules containing the
respective notebooks.
• Module 1
• Archaic techniques
• Module 2
• Word vectors
• Module 3
• Sentence/Paragraph/Document vectors
• Module 4
• Character vectors
6
• Check List
• Github repo installed
• Jupyter up and running
• Loud and clearly audible
• Ground Rules
• Got a question ? Stop me then & there and ask.
• No questions are stupid.
• Please respect others' time.
• ML ≠ Software Engineering
• ML is not merely code.
• Garbage in, garbage out
• If a model does (or does not) work well, you need to understand why.
• Fairness, Bias
• These are often driven by the assumptions, hypotheses and maths of the model
7
Recent Advances
8
9
https://blog.openai.com/unsupervised-sentiment-neuron/
Sentiment Neuron
Resurrect your dead friend as an AI
10
Luka - Eugenia lost her friend Roman in an accident. Determined not to lose his memory, she gathered
all the texts Roman sent over his short life and made a chatbot – a program that responds automatically
to text messages. Now whenever she is missing Roman, Eugenia sends the chatbot a message and
Roman’s words respond.
Google smart reply
11
12
Module 1
Topics
•  Introduction to NLP
•  Examples of various NLP tasks
•  Archaic Techniques
•  Using pretrained embeddings
Key Learning outcomes:
•  Basics of NLP
•  One hot encoding
•  Bag of words
•  N-gram
•  TF-IDF
•  Why these techniques are bad
•  How can you use pretrained embeddings
•  Issues with using pre-trained word embeddings
13
What is NLP
•  Concerned with programming computers to fruitfully process large amounts of natural
language data.
•  It is at the intersection of computer science, artificial intelligence and
computational linguistics
14
Some NLP tasks:
•  Spell Checking
•  Finding Synonyms
• Keyword Search
15
Sentiment analysis
16
•  Co-reference (e.g. What does "he" or "it" refer to given a document?)
17
•  Machine Translation (e.g. translate Chinese text to English)
18
NLP is not easy !
19
Input to any NLP system
•  Cannot directly feed the raw text to machine learning algorithms;
•  One must first convert to some numerical form.
•  Why ? ML algorithms assume that all features used to represent an
observation are numeric
•  This conversion from raw text to a suitable numerical form is called
text representation.
20
• Example: we wish to build a system for sentiment analysis
• Sentiment is embedded in the meaning. Hence, to correctly predict
sentiment, we must understand the meaning of the sentence.
• To extract the right meaning from a sentence, following are the most
crucial data points:
• Break the sentence into lexical units and derive the meaning for each of the lexical
units.
• Understand syntactic (grammatical) structure of the sentence.
• Understand the context in which the sentence appears.
• The semantics (meaning) of the sentence comes from the above 3 points
combined together.
21
• The text representation scheme that we choose must facilitate the extraction of the
above-mentioned data points in the best possible manner.
• The process of extracting these data points is also called feature
extraction or feature encoding.
• Only once we have extracted the right features, can one aim to use a
suitable machine learning algorithm that can better utilize these
features and deliver satisfactory results.
22
23
NLP pipeline
Raw Text → Preprocessing → Tokenization to get language units → Mathematical
representation of language units → Build train/test data → Train model using
training data → Test the model on test data
The first and arguably most important common denominator across
all NLP tasks is : how we represent text as input to our models.
• Machine does not understand text and needs
a numeric representation.
• Images have a natural representation
scheme (RGB matrix); for text there is no obvious
way.
• An integral part of any NLP pipeline
Why text representation is important?
24
• Like images, speech also has a very natural representation
• For text there is no obvious way
Why text representation is important?
25
• An integral part of any NLP pipeline
• Representation learning
• set of techniques that learn features : a transformation of the raw data input to
a representation that can be effectively exploited in machine learning tasks.
• Part of feature engineering/learning.
• Get rid of “hand-designed” features and representation
• Unsupervised feature learning - obviates manual feature engineering
What & Why Representation learning
26
Raw Input → Representation Learning → Learning Algorithm → Output
27
Larger Picture
Vector Space Models
• Represent text units (characters, phonemes, words, sentences,
paragraphs, documents) as vectors of numbers.
• Vector space model or term vector model - an algebraic model for
representing text documents as vectors.
• Similarity between 2 documents = cosine similarity. Cosine of the angle
between the vectors.
• We will see VSM in various flavors
28
•  One hot encoding
•  Bag of words
•  N-gram
•  TF-IDF
29
Legacy Techniques
One hot encoding
•  Map each word to a unique ID
•  Typical vocabulary sizes will vary between 10k and 250k.
30
•  Use the word ID to get a basic representation of the
word.
•  This is done via one-hot encoding of the ID
•  one-hot vector of an ID is a vector filled with 0s,
except for a 1 at the position associated with the
ID.
•  ex.: for vocabulary size D=10, the one-hot
vector of word (w) ID=4 is e(w) = [ 0 0 0 1 0
0 0 0 0 0 ]
31
•  Begins by building a dictionary that maps the vocabulary of the corpus to identifiers
•  Map each word to a unique ID
32
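A minimal Python sketch of this mapping (the tiny corpus here is illustrative, not from the workshop repo):

```python
import numpy as np

corpus = ["the cat sat on the hat", "the dog ate the cat and the hat"]

# Build the word -> ID dictionary
vocab = sorted({w for sent in corpus for w in sent.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_id):
    """Return a |V|-dimensional vector with a single 1 at the word's ID."""
    vec = np.zeros(len(word_to_id))
    vec[word_to_id[word]] = 1.0
    return vec

print(word_to_id)
print(one_hot("cat", word_to_id))
```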
•  One-hot encoding makes no assumption about word similarity
•  all words are equally similar/different from each other
•  this is a natural representation to start with, though a poor one
Drawbacks
• Size of input vector scales with size of vocabulary
• Must pre-determine vocabulary size.
• Cannot scale to large or infinite vocabularies (Zipf’s law!)
• Computationally expensive - large input vector results in far too many
parameters to learn.
• “Out-of-Vocabulary” (OOV) problem
• How would you handle unseen words in the test set?
• One solution is to have an “UNKNOWN” symbol that represents low-
frequency or unseen words
33
• No relationship between words
• Each word is an independent unit vector
•  D(“cat”, “refrigerator”) = D(“cat”, “cats”)
•  D(“spoon”, “knife”) = D(“spoon”, “dog”)
• In the ideal world…
• Relationships between word vectors reflects relationships between words
• Features of word embeddings reflect features of words
• These vectors are sparse:
• Vulnerable to overfitting: with sparse vectors most computations go to
zero, so the resulting loss function has very few parameters to update.
34
Bag of Words
• Analyse different “bags of words” and classify them accordingly.
• Vocab = set of all the words in corpus
• Document = Words in document w.r.t vocab with multiplicity
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
35
Pros
+ Quick and Simple
+ This is a very natural scheme for text representation.
+ Captures multiplicity of word occurrence in a document.
+ Documents with same words will have their vectors closer to each other
in euclidean space as compared to documents with completely different
words.
S1 : Dog bites man. S2 : Man bites dog. S3 : Dog eats meat. S4 : Man eats
food.
One possible assignment is: dog = 1, bites = 2, man = 3, meat = 4, food = 5
and eats = 6
S1 : [1,1,1,0,0,0], S2 : [1,1,1,0,0,0], S3 : [1,0,0,1,0,1], S4 : [0,0,1,0,1,1]
36
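A quick sketch using scikit-learn's CountVectorizer (assuming scikit-learn ≥ 1.0; its default tokenizer lower-cases text, so the vocabulary ordering may differ slightly from the slide):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat sat on the hat",
             "The dog ate the cat and the hat"]

vectorizer = CountVectorizer()           # bag-of-words with raw counts
X = vectorizer.fit_transform(sentences)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.toarray())                         # per-sentence word counts
```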
Cons
- Too simple
- Orderless
- No notion of syntactic/semantic similarity
- Does not capture the similarity between different words that mean the
same. Say, 3 sentences - ‘I run’, ‘I ran’ and ‘I ate’. All three will be equally
apart.
- Out of vocabulary words are simply ignored. There is no provision to
handle new words at test time. The only way out is an ‘UNK’ token, factored
in at train time.
- Word ordering is lost, hence context of words is lost. In the bag of words
scheme the ordering of words does not matter; it is only the
frequency of words that gets captured.
37
N-gram model
• Attempt to incorporate word ordering into the encoded vector.
•  break the sentences/documents into chunks of n contiguous words/tokens
• Vocab = set of all n-grams in corpus.
• Document = n-grams in document w.r.t vocab with multiplicity
For bigram:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and, and the}
Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Sentence 2 : { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
38
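The same idea restricted to bigrams, again sketched with CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat sat on the hat",
             "The dog ate the cat and the hat"]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams + bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(sentences)

print(bigram_vectorizer.get_feature_names_out())
print(X.toarray())
```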
Pros & Cons
+ Tries to incorporate order of words
- Very large vocab set
- No notion of syntactic/semantic similarity
39
Term Frequency–Inverse Document Frequency (TF-IDF)
• Captures importance of a word to a document in a corpus.
• Importance increases proportionally to the number of times a word
appears in the document; but is inversely proportional to the frequency of
the word in the corpus.
• TF(t) = (Number of times term t appears in a document) / (Total number
of terms in the document).
• IDF(t) = log (Total number of documents / Number of documents with
term t in it).
• TF-IDF (t) = TF(t) * IDF(t)
40
• S1 : Dog bites man. S2 : Man bites dog. S3 : Dog eats meat. S4 : Man
eats food.
• the idf values for the terms are:
dog = log2(4/3) = 0.4114
bites = log2(4/2) = 1
man = log2(4/3) =0.4114
• tf values: since each term appears exactly once and each document
has exactly 3 terms, the tf score for each term is ⅓
• Therefore, tf-idf scores are:
dog = 0.4114 * 0.33 = 0.135
bites = 1* 0.33 = 0.33
man = 0.4114 * 0.33 = 0.135
41
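A small sketch that reproduces the numbers above by hand. Note that library implementations (e.g. scikit-learn's TfidfVectorizer) use a smoothed IDF, so their values will differ slightly:

```python
import math

docs = [["dog", "bites", "man"],
        ["man", "bites", "dog"],
        ["dog", "eats", "meat"],
        ["man", "eats", "food"]]

N = len(docs)
vocab = sorted({w for d in docs for w in d})

# IDF(t) = log2(N / number of documents containing t)
idf = {t: math.log2(N / sum(t in d for d in docs)) for t in vocab}

def tfidf(doc):
    # TF(t) = count of t in doc / total terms in doc
    return {t: (doc.count(t) / len(doc)) * idf[t] for t in doc}

print(tfidf(docs[0]))   # e.g. dog ≈ 0.138, bites ≈ 0.333, man ≈ 0.138
```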
Pros & Cons
• Pros:
• Easy to compute
• Has some basic metric to extract the most descriptive terms in a document
• Thus, can easily compute the similarity between 2 documents using it
• Disadvantages:
• Based on the bag-of-words (BoW) model, therefore it does not capture position
in text, semantics, co-occurrences in different documents, etc.
• Thus, TF-IDF is only useful as a lexical level feature.
• Cannot capture semantics (unlike topic models, word embeddings)
42
KEEP
CALM
it’s time
for
#coding
43
Bottom Line
• More often than not, how rich your input representation is has a huge bearing on
the quality of your downstream ML models.
• For NLP, archaic techniques treat words as atomic symbols. Thus every 2
words are equally apart.
• They don’t have any notion of either syntactic or semantic similarity between
parts of language.
• This is one of the chief reasons for poor/mediocre performance of NLP based
models.
But this has changed dramatically in past few years
44
45
Module 2
Word Vectors
Topics
• Word level language models
• tSNE : Visualizing word-embeddings
• Demo of word vectors.
Key Learning outcomes:
•  Key ideas behind word vectors
•  Maths powering their formulation
•  Bigram, SkipGram, CBOW
•  Train your own word vectors
•  Visualize word embeddings
•  GloVe
•  How GloVe differs from Word2Vec
•  Evaluating word vectors
•  tSNE
•  how tSNE is different from PCA
47
Distributional & Distributed Representations
48
Distributional representations
•  Linguistic aspect.
•  Based on co-occurrence/ context
•  Distributional hypothesis: linguistic units with similar distributions
have similar meanings.
•  Meaning is defined by the context in which a word appears. This is
‘connotation’.
•  This is in contrast with ‘denotation’ - the literal meaning of a word.
Rock literally means a stone, but can also be used to refer to a person as solid and stable. “Anthill rocks”
•  The distributional property is usually induced from document or context
or textual vicinity (like sliding window).
49
Distributed representations
•  Compact, dense and low dimensional representation.
•  Differs from distributional representations as the constraint is to seek
efficient dense representation, not just to capture the co-occurrence
similarity.
•  Each single component of vector representation does not have any
meaning of its own. Meaning is smeared across all dimensions.
•  The interpretable features (for example, word contexts in case of
word2vec) are hidden and distributed among uninterpretable vector
components.
50
•  Embedding: Mapping between space with one dimension per linguistic
unit (character, morpheme, word, phrase, paragraph, sentence, document)
to a continuous vector space with much lower dimension.
•  For the rest of this presentation, “meaning” of linguistic unit is represented
by a vector of real numbers.
51
Using pre-trained word embeddings
•  Most popular - Google’s word2vec, Stanford’s GloVe
•  Use it as a dictionary - query with the word, and use the vector
returned.
•  Sentence (S) - “The cat sat on the table”
•  Challenges:
•  Representing sentence/document/paragraph.
•  sum
•  Average of the word vectors.
•  Weighted mean
52
•  Handling Out Of Vocabulary (OOV) words.
•  Transfer learning (i.e. fine tuning on data).
53
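A hedged sketch of the "dictionary" usage with gensim's downloader (assuming gensim ≥ 4 and an internet connection on first run; the model name is one of several pretrained GloVe files gensim can fetch):

```python
import gensim.downloader as api

# Downloaded on first use and cached afterwards
glove = api.load("glove-wiki-gigaword-50")

print(glove["cat"][:5])                  # first few dimensions of the vector
print(glove.most_similar("cat", topn=3))

# The OOV problem in action: unseen tokens have no vector
word = "frshworks_typo"
if word in glove:
    vec = glove[word]
else:
    print(f"'{word}' is out of vocabulary")
```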
For the rest of this presentation we will see various techniques to build/
train our own embeddings
54
55
Demo
Global Matrix Factorization
56
John Rupert Firth
“You shall know a word by the company it keeps”
-1957
• English linguist
• Most famous quote in NLP (probably)
• Modern interpretation: Co-occurrence is a good
indicator of meaning
• One of the most successful ideas of modern
statistical NLP
57
Co-occurrence with SVD
•  Define a word using the words in its context.
•  Words that co-occur
•  Building a co-occurrence matrix M.
Context = previous word and
next word
Corpus ={“I like deep learning.”
“I like NLP.”
“I enjoy flying.”}
58
•  Imagine we do this for a large
corpus of text
•  row vector xdog describes usage
of word dog in the corpus
•  can be seen as coordinates of
point in n-dimensional
Euclidean space Rn
•  Reduce dimensions using SVD: M = U Σ Vᵀ
59
•  Given a matrix of m × n dimensionality, construct a m × k matrix, where k << n
•  M = U Σ Vᵀ
•  U is an m × m orthogonal matrix (UUᵀ = I)
•  Σ is an m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ1 ≥
σ2 ≥ · · · ≥ σr ≥ 0, where r = min(m, n)) [σi’s are known as singular values]
•  V is an n × n orthogonal matrix (VVᵀ = I)
•  We construct M’ s.t. rank(M’) = k
• We compute M’ = U Σ’ Vᵀ, where Σ’ is Σ with only the k largest singular values kept
•  k captures the desired percentage of variance
•  Then, the submatrix formed by the first k columns of U (of size |V| × k) is our desired word embedding matrix.
60
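A minimal numpy sketch of the co-occurrence + SVD pipeline on the toy corpus above (context = previous and next word):

```python
import numpy as np

corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix M: context = previous word and next word
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1

# SVD: M = U Σ V^T ; keep the first k columns of U as word embeddings
U, S, Vt = np.linalg.svd(M)
k = 2
embeddings = U[:, :k]
print(dict(zip(vocab, embeddings.round(2))))
```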
Result of SVD based Model
K = 2 K = 3
61
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
62
Pros & Cons
+ Simple method
+ Captures some sense (though weak) of similarity between words.
-  Matrix is extremely sparse.
-  Quadratic cost to train (perform SVD)
-  Drastic imbalance in frequencies can adversely impact quality of
embeddings.
-  Adding new words is expensive.
Take home : we worked with statistics of the corpus rather than working with
the corpus directly. This will recur in GloVe
63
BiGram Model
Idea: Directly learn low-dimensional word vectors?
64
Language Models
•  Filter out good sentences from bad ones.
•  Good = semantically and syntactically correct.
•  Modeled via the probability of a given sequence of n words
Pr (w1, w2, ….., wn)
•  S1 = “the cat jumped over the dog”, Pr(S1) ~ 1
•  S2 = “jumped over the the cat dog”, Pr(S2) ~ 0
65
Unary Language Models
66
Binary Language Models
67
BiGram Model
•  Objective : given wi , predict wi+1
•  Training data: given sequence of n words < w1, w2, ….., wn >, extract bi-gram
pairs (wi-1 , wi)
•  Knowns:
•  input – output training examples : (wi-1 , wi)
•  Vocab of training corpus (V) = ∪ (wi)
•  Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding
dimensions. Usually a hyper parameter.
•  Model : shallow net
68
Architecture
[diagram: w_{i-1} → embedding matrix lookup → scoring layer → softmax layer → predicted w_i]
69
•  Feed index of wi-1 as input to network.
•  Use index to lookup embedding matrix.
•  Perform affine transform on word embedding to get a score vector.
•  Compute probability for each word.
•  Set 1-hot vector of wi as target.
•  Set loss = cross-entropy between probability vector and target vector.
Steps
70
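A numpy sketch of one forward pass of this shallow net (dimensions and indices are illustrative; the full training loop is in notebook 1):

```python
import numpy as np

V, d = 10, 4                      # vocab size, embedding dimension
E = np.random.randn(V, d) * 0.01  # embedding matrix
W = np.random.randn(d, V) * 0.01  # weights of the scoring (affine) layer
b = np.zeros(V)

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

prev_id, target_id = 3, 7         # indices of w_{i-1} and w_i

emb = E[prev_id]                  # 1. embedding lookup
scores = emb @ W + b              # 2. affine transform -> score vector
probs = softmax(scores)           # 3. probabilities over the vocab
loss = -np.log(probs[target_id])  # 4. cross-entropy against 1-hot target
print(loss)
```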
Softmax
71
Cross Entropy
72
● Per word, we have 2 vectors :
1.  As row in Embedding layer (E)
2.  As column in weights layer (used for affine transformation)
● It’s common to take average of the 2 vectors.
● It’s common to normalise the vectors. Divide by norm.
● An alternative way to compute ŷi : #(wi, wi-1) / Σj #(wj, wi-1), j∈V
● Use co-occurrence matrix to compute these counts.
Remarks
73
I learn best with toy code,
that I can play with.
- Andrew Trask
jupyter notebook 1
74
CBOW
SkipGram
75
CBOW
•  Continuous Bag of words.
•  Proposed by Mikolov et al. in 2013
•  Conceptually, very similar to Bi-gram model
•  In the bigram model, there were 2 key drawbacks:
1.  The context was very small – we took only wi-1 , while predicting wi
2.  Context is not just preceding words; but following words too.
76
•  “the brown cat jumped over the dog”
Context = the brown cat over the dog
Target = jumped
•  Context window = k words on either side of the word to be
predicted.
•  Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
•  W = total number of unique windows
•  Each window is a sliding block of 2k+1 words
77
CBOW Model
•  Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc
•  Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
•  Knowns:
•  input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
•  Vocab of training corpus (V) = ∪(wi)
•  Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding
dimensions. Usually a hyper parameter.
78
Architecture
79
•  Feed indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size
k.
•  Use indexes to lookup embedding matrix.
•  Average these vectors to get v̂ = (v_{c−k} + … + v_{c−1} + v_{c+1} + … + v_{c+k}) / 2k
•  Perform affine transform on vˆ to get a score vector.
•  Turn scores in probabilities for each word.
•  Set 1-hot vector of wc as target.
•  Set loss = cross-entropy between probability vector and target vector.
Steps
80
Maths behind the scene
•  Optimization objective J = - log Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
•  Maximizing Pr() = Minimizing – log Pr()
•  Let v̂ = (v_{c−k} + . . . + v_{c−1} + v_{c+1} + . . . + v_{c+k}) / 2k
•  Then, the RHS becomes −u_cᵀ v̂ + log Σ_{j=1}^{|V|} exp(u_jᵀ v̂)
•  gradient descent to update all relevant word vectors uc and wj.
81
Skip-Gram model
•  2nd model proposed by Mikolov et al. in 2013
•  Turns CBOW over its head.
•  CBOW = given context, predict the target word
•  Skip Gram = given target, predict context
•  “the brown cat jumped over the dog”
Target = jumped
Context = the, brown, cat, over, the, dog
82
•  Objective : given wc , predict wc−k, . . . , wc−1, wc+1, . . . , wc+k
•  Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract target and context pairs (wc, wc−k) , (wc, wc−1) , (wc, wc+1), (wc, wc+k)
•  Knowns:
•  input – output training examples : (wc, wc−k) , (wc, wc−1) , (wc, wc+1), (wc, wc+k)
• Vocab of training corpus (V) = ∪ (wi)
•  Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding
dimensions. Usually a hyper parameter.
83
Architecture
84
•  Feed index of x_c as input to the network.
•  Use index to look up the embedding matrix.
•  Perform affine transform on the word embedding to get a score vector.
•  Turn scores into probabilities for each word.
•  Set 1-hot vectors of the context words as targets.
•  Set loss = sum of cross-entropies between the probability vector and the target vectors.
Steps
85
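In practice these models are rarely written from scratch; a gensim sketch covering both variants (assuming gensim ≥ 4, where the embedding size parameter is `vector_size`):

```python
from gensim.models import Word2Vec

sentences = [["the", "brown", "cat", "jumped", "over", "the", "dog"],
             ["the", "dog", "ate", "the", "cat", "and", "the", "hat"]]

# sg=0 -> CBOW, sg=1 -> skip-gram; window = context words on either side
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["cat"][:5])
print(skipgram.wv.most_similar("cat", topn=3))
```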
Maths behind the scene
•  Optimization objective J = - log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | wc)
•  gradient descent to update all relevant word vectors uc and wj.
86
Evaluating Word vectors
87
•  How to quantitatively evaluate the quality of word vectors?
•  Intrinsic Evaluation :
•  Word Vector Analogies
•  Extrinsic Evaluation :
•  Downstream NLP task
88
Intrinsic Evaluation
•  Specific Intermediate subtasks
•  Easy to compute.
•  Analogy completion:
•  a : b :: c : ?  →  find the word d that completes the analogy, e.g.
man : woman :: king : ?
•  Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
•  Discarding the input words from the search!
•  Problem: What if the information is there but not linear?
89
90
Extrinsic Evaluation
•  Real task at hand
•  Ex: Sentiment analysis.
•  Not very robust.
•  End result is a function of whole process and not just embeddings.
•  Process:
•  Data pipelines
•  Algorithm(s)
•  Fine tuning
•  Quality of dataset
91
Speed Up
92
Bottleneck
•  Recall, to calculate probability, we use softmax. The denominator is
sum across entire vocab.
•  Further, this is calculated for every window.
•  Too expensive.
•  A single update of parameters requires iterating over |V|. Our vocab
is usually in the millions.
93
To approximate the probability, don't use the entire vocab.
There are 2 popular lines of attack to achieve this:
• Modify the structure of the softmax
• Hierarchical Softmax
•  Sampling techniques : don’t use entire vocabulary to compute the sum
•  Negative sampling
94
●  Arrange words in vocab as leaf units of a
balanced binary tree.
●  |V| leaves and |V| - 1 internal nodes
●  Each leaf node has a unique path from root to
the leaf
●  Probability of a word (leaf node Lw) =
Probability of the path from root node to leaf Lw
●  No output vector representation for words,
unlike softmax.
●  Instead every internal node has a d-dimension
vector associated with it - v’n(w, j)
Hierarchical Softmax
n(w, j) means the j-th unit on the path from root to the
word w
●  Product of probabilities over nodes in the path
●  Each probability is computed using sigmoid
●  Inside each sigmoid we check whether the (j+1)-th node on the path is the left child of the j-th node or not
●  v’_{n(w,j)}ᵀ h : vector product between the vector on the hidden layer and the vector for the
inner node in consideration
Example
●  p(w = w2) : we start at the root and navigate to leaf w2, multiplying the
sigmoid probabilities at each internal node along the path
●  Cost: reduced from O(|V|) to O(log |V|)
● In practice, use Huffman tree
Negative Sampling
● Given (w, c) : word and context
● Let P(D=1|w,c) be probability that (w, c) came from the corpus data.
● P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data.
●  Lets model P(D=1|w,c) with sigmoid:
● Objective function (J):
○  maximize P(D=1|w,c) if (w, c) is in the corpus data.
○  maximize P(D=0|w,c) if (w, c) is not in the corpus data.
● We take a simple maximum likelihood approach of these two probabilities.
θ denotes the parameters of the model; in our case U and V - the input and output word vectors.
Taking log on both sides:
● Now, maximizing log likelihood = minimizing negative log likelihood.
● 
●  D ̃ is “false” or negative “Corpus” with wrong sentences - "jumped cat dog the the over"
●  Generate D ̃ on the fly by randomly sampling this negative from the word bank.
●  For skip-gram, our new objective function for observing the context word w_{c−m+j} given
the center word wc would be:
regular softmax loss for skip-gram
●  Likewise for CBOW, our new objective function for observing the center
word uc given the context vector
●  In the above formulation, {u˜k |k = 1 . . . K} are sampled from Pn(w).
●  best Pn(w) = Unigram distribution raised to the power of 3/4
●  Usually K = 20-30 works well.
regular softmax loss for CBOW
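A tiny sketch of drawing K negative samples from the unigram distribution raised to the 3/4 power (the counts below are made up for illustration):

```python
import numpy as np

# Illustrative unigram counts from a corpus
counts = {"the": 50, "cat": 10, "sat": 5, "on": 8, "hat": 3}
words = list(counts)

freqs = np.array([counts[w] for w in words], dtype=float)
p_n = freqs ** 0.75
p_n /= p_n.sum()                # P_n(w) proportional to unigram(w)^(3/4)

K = 5                           # number of negative samples per positive pair
negatives = np.random.choice(words, size=K, p=p_n)
print(negatives)
```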
GloVe
http://www.spndev.com/neww2v.html
Word2Vec model on Fox News broadcasts
Global matrix factorization methods
●  Use co-occurrence counts
●  Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret &
Collobert)
+ Fast training
+  Efficient usage of statistics
+ Captures word similarity
-  Do badly on analogy tasks
-  Disproportionate importance given to large counts
105
Local context window method
●  Use window to determine context of a word
●  Ex: Skip-gram/CBOW ( Mikolov et al), NNLM(Bengio et al), HLBL, (Collobert &
Weston)
+  Capture word similarity.
+  Also perform better on analogy tasks
-  Slow down with increase in corpus size
-  Inefficient usage of statistics
106
Combining the best of both worlds
●  Glove model tries to combine the two major model families :-
○  Global matrix factorization (co-occurrence counts)
○  Local context window (context comes from window)
= Co-occurrence counts with context distance
107
Co-occurrence counts with context distance
●  Uses context distance : weight each word in context window using its
distance from the center word
●  This ensures nearby words have more influence than far off ones.
●  Sentence -> “I like NLP”
○  Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5
○  Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0
○  Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0
●  Corpus C: I like NLP. I like cricket.
Co-occurrence matrix for C
108
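A sketch of building such a distance-weighted co-occurrence matrix with 1/distance weighting, matching the "I like NLP" example above:

```python
import numpy as np

corpus = ["I like NLP", "I like cricket"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 2
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[idx[w], idx[sent[j]]] += 1.0 / abs(i - j)   # weight by 1/distance

print(vocab)
print(X)
```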
Issues with Co-occurrence Matrix
●  Long tail distribution
●  Frequent words contribute disproportionately
(use weight function to fix this)
●  Use Log for normalization
●  Avoid log 0 : Add 1 to each Xij
109
Intuition for Glove
● Think of matrix factorization algorithms used in recommendation systems.
● Latent Factor models
○  Find features that describe the characteristics of rated objects.
○  Item characteristics and user preferences are described using vectors which are called factor
vectors z
○  Assumption: Ratings can be inferred from a model put together from a smaller number of
parameters
110
Latent Factor models
●  The dot product r̂_ui = q_iᵀ p_u estimates user u’s interest in item i
○  where, qi : factor vector for item i,
pu : factor vector for user u,
r̂_ui : estimated user interest
●  How to compute vectors for items and users ?
111
Matrix Factorization
● rui : known rating of user u for item i
●  predicted rating : r̂_ui = q_iᵀ p_u
●  Similarly the glove model tries to model the co-occurrence counts with the
following equation : w_iᵀ w̃_j + b_i + b̃_j = log(X_ij)
112
Weighting function
● f(x) = (x / x_max)^α for x < x_max, and 1 otherwise
● Properties of f(X)
○ vanishes at 0 i.e. f(0) = 0
○ monotonically increasing
○ f(x) should be relatively small for large values of x
●  Empirically α = 0.75, x_max = 100 works best
113
Loss Function
●  Scalable.
●  Fast training
○  Training time doesn’t depend on the corpus size
○  Always fitting to a |V| x |V| matrix.
●  Good performance with small corpus, and small vectors.
114
● Input :
○ Xij (|V| x |V| matrix) : co-occurrence matrix
● Parameters
○  W (|V| x |D| matrix) & W˜ (|V| x |D| matrix) :
■  wi and wj˜ representation of the ith & jth words from W and W˜ matrices respectively.
○ bi (|V| x 1) column vector : variable for incorporating biases in terms
○ bj (1 x |V|) row vector : variable for incorporating biases in terms
Training
115
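A numpy sketch of the weighting function and the GloVe weighted least-squares loss J = Σ_ij f(X_ij) (w_iᵀ w̃_j + b_i + b̃_j − log X_ij)²; this is an illustrative implementation, not the reference GloVe code:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: small counts get small weight, large counts are capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares loss summed over the non-zero entries of X."""
    i, j = np.nonzero(X)
    diff = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return (f(X[i, j]) * diff ** 2).sum()

V, D = 5, 8
X = np.random.randint(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts
W, W_t = np.random.randn(V, D), np.random.randn(V, D)
b, b_t = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_t, b, b_t))
```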
●  Train on Wikipedia data
● |V| = 2000
●  Window size = 3
●  Iterations = 10000
● D = 50
● Learn two representations for each word in |V|.
● reg = 0.01
● Use momentum optimizer with momentum=0.9.
Quick Experiment
116
Results - months & centuries
117
Countries & languages
118
military terms
119
Music
120
Countries & Residents
Languages
Countries
121
Notebook
122
Glove Notebook
t-SNE
Artworks mapped using Machine Learning.
Art work Mapped using t-SNE
https://artsexperiments.withgoogle.com/tsnemap/#47.68,1025.98,361.43,51.29,0.00,271.67
Objective
●  Given a collection of N high-dimensional objects x1, x2, …. xN.
●  How can we get a feel for how these objects are (relatively) arranged ?
125
Introduction
● Build map(low dimension) s.t. distances between points reflect “similarities” in
the data :
● Minimize some objective function that measures the discrepancy between
similarities in the data and similarities in the map
126
Principal Components Analysis
127
Visualization with t-SNE
128
Principal component analysis
●  PCA mainly tries to preserve large pairwise distances in the map.
● Is that what we want ?
129
Goals
●  Preserve Distances
●  Preserve the neighborhood of each point
130
t-SNE High dimension
● Measure pairwise similarities between high dimensional objects
xi
xj
131
t-SNE Lower dimension
● Measure pairwise similarities between low dimensional map points
132
t-SNE
● We have measure of similarity of data points in High Dimension
● We have measure of similarity of data points in Low Dimension
● We need a distance measure between the two.
● Once we have distance measure, all we want is : to minimize it
133
One possible choice - KL divergence
●  It’s a measure of how one probability distribution diverges from a second
expected probability distribution
134
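For reference, the quantities being compared and the objective t-SNE minimizes (high-dimensional similarities p, low-dimensional similarities q using a Student-t kernel):

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}

C = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}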
KL divergence applied to t-SNE
Objective function (C)
●  We want nearby points in high-D to remain nearby in low-D
○  In case it's not, then
■  pij will be large (because points are nearby)
■  but qij will be small (because points are far away)
■  This will result in a larger penalty
■  In contrast, if both pij and qij are large : lower penalty
135
KL divergence applied to t-SNE
● Likewise, we want far away points in high-D to remain (relatively) far away in
low-D
○  In case it's not, then
■  pij will be small (because points are far away)
■  but qij will be large (because points are nearby)
■  This will result in a lower penalty
●  t-SNE mainly preserves local similarity structure of the data
136
t-Distributed Stochastic Neighbor Embedding
● Move points around to minimize :
137
Why a Student t-Distribution ?
● t-SNE tries to retain local structure of this data in the map
● Result : dissimilar points have to be modelled as far apart in the map
● Hinton has shown that the Student t-distribution is very similar to a Gaussian
distribution
●  Local structures preserved
●  global structure is lost
138
Deciding the effective number of neighbours
●  We need to decide the radii in different parts of the space, so that we can keep
the effective number of neighbours about constant.
●  A big radius leads to a high entropy for the distribution over neighbors of i.
●  A small radius leads to a low entropy.
●  So decide what entropy you want and then find the radius that produces that
entropy.
●  It's easier to specify 2^entropy
○  This is called the perplexity
○  It is the effective number of neighbors.
139
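A minimal scikit-learn sketch of the API (the input matrix here is random, standing in for your word embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)          # e.g. 200 word vectors of dimension 50

tsne = TSNE(n_components=2,          # map to 2-D for plotting
            perplexity=30.0,         # effective number of neighbours
            init="pca",
            random_state=42)
Y = tsne.fit_transform(X)
print(Y.shape)                       # (200, 2)
```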
140
https://distill.pub/2016/misread-tsne/
Hyper parameters really matter: Playing with perplexity
●  projected 100 data points clearly separated in two different clusters with tSNE
●  Applied tSNE with different values of perplexity
●  With perplexity=2, local variations in the data dominate
●  With perplexity in the range 5-50, as suggested in the paper, the plots still capture some structure in the data
141
Hyper parameters really matter: Playing with #iterations
●  Perplexity set to 30.0
●  Applied tSNE with different number of iterations
●  Takeaway : different datasets may require different number of iterations
142
Cluster sizes can be misleading
●  Uses tSNE to plot two clusters with different standard deviation
●  Bottom line: we cannot infer cluster sizes from t-SNE plots
143
Distances in t-SNE plots
●  At lower perplexity clusters look equidistant
●  At perplexity=50, tSNE captures some notion of global geometry in the data
●  50 data points in each sub cluster
144
Distances in t-SNE plots
●  tSNE is not able to capture global geometry even at perplexity=50.
●  key take away : well separated clusters may not mean anything in tSNE.
●  200 data points in each sub cluster
145
Random noise doesn’t always look random
●  For this experiment, we generated random points from gaussian distribution
●  Plots with lower perplexity, showing misleading structures in the data
146
You can see some shapes sometimes
●  Axis aligned gaussian distribution
●  For certain values of perplexity, long clusters look almost correct.
●  tSNE tends to expand regions which are denser
147
Experiments
Notebook
148
t-SNE Applications
https://aiexperiments.withgoogle.com/bird-sounds/view/
Why word2vec does
better than others ?
150
At heart, they are all the same !!
● It has been shown that in essence GloVe and word2vec are no different
from traditional methods like PCA, LSA etc (Levy et al. 2015 call them
DSM )
● GloVe ⋍ PCA/LSA is straightforward (both factorize global counts matrix)
● word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015)
● They show that in essence word2vec also factorizes word context matrix
(PMI)
151
● Despite this “equality” of algorithm, word2vec is still known to do better
on several tasks.
● Why ?
○ Levy et al. 2015 show : magic lies in Hyperparameters
152
Hyperparameters
● Pre-processing
○  Dynamic context window
○  Subsampling frequent words
○  Deleting rare words
● Post-processing
○  Adding context words
○  Vector normalization
153
Pre-processing
● Dynamic Context window
○  In DSM, context window: unweighted & constant size.
○  Glove & SGNS - give more weightage to closer terms
○  SGNS - even the window size can be dynamic and take a value between 1 & max of windowsize.
● Subsampling frequent words
○  SGNS dilutes frequent words by randomly removing words whose frequency f is higher than
some threshold t, with probability 1 − √(t/f)
● Deleting rare words
○  In SGNS, rare words are also deleted before creating context windows.
154
Post-processing
● Adding context vectors
○  Glove adds word vectors and the context vectors for the final representation.
● Vector normalization
○  All vectors can be normalized to unit length
155
Key Take Home
● Hyperparameters vs Algorithms
○  Hyperparameter settings are more important than the algorithm choice
○  No single algorithm consistently outperforms the other ones
● Hyperparameters vs more data
○  Training on a larger corpus helps on some tasks
○  In many cases, tuning hyperparameters is more beneficial
156
References
Idea of word vectors is not new.
•  Learning representations by back-propagating errors (Rumelhart et al. 1986)
•  A neural probabilistic language model (Bengio et al., 2003)
•  NLP from Scratch (Collobert & Weston, 2008)
•  Word2Vec (Mikolov et al. 2013)
• Sebastian Ruder’s 3 part Blog series
• Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher
• word2vec Parameter Learning Explained by X Rong
157
•  GloVe :
• https://nlp.stanford.edu/pubs/glove.pdf
•  https://www.youtube.com/watch?v=tRsSi_sqXjI
•  http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
•  https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf
•  t-SNE:
• http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
•  http://distill.pub/2016/misread-tsne/
•  https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne
•  https://youtu.be/RJVL80Gg3lA
•  KL Divergence
•  http://tdhopper.com/blog/2015/Sep/04/cross-entropy-and-kl-divergence/
158
•  Cross Entropy :
•  https://www.youtube.com/watch?v=tRsSi_sqXjI
•  http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
•  Softmax:
•  https://en.wikipedia.org/wiki/Softmax_function
•  http://cs231n.github.io/linear-classify/#softmax
•  Tensor Flow
•  1.0 API docs
•  CS20SI
159
• Module 1
• Introduction
• Archaic Techniques
• Using pretrained embeddings
• Module 2
• Introduction to embeddings
• Using pretrained embeddings
• Word level representation
• Visualizing word embedding
• Module 3
• Sentence/Paragraph/Document level representation
• Skip-Thought Vectors
• Module 4
• Character level representation
160
Doc2Vec
• Document level language models
Key Learning outcomes:
•  Combining word vectors
•  Key ideas behind document vectors
•  DM, DBOW
•  How are they similar/different from
word vectors
•  Drawbacks of these approaches
•  Skip-Thought vectors
•  RNNs: LSTM, GRU
•  Architecture of skip-thought vectors
162
Module 3
163
Story generation from images
164
Sentence Representation
Task : Train a ML model for sentiment classification.
Problem :
Given a sentence, predict its sentiment.
Solution:
1)  Represent the sentence in mathematical format
2)  Train a model on data - sentence, label
How do you represent the sentence ? we want a representation that captures the
semantics of the sentence.
165
We already have word vectors.
Can we use these to come up with a way to represent the sentence ?
Eg :- “the cat sat on the table”
We have vectors for “the”, “cat”, “sat”, “on”, “the” & “table”.
How can we use the vectors for words to get vector for sentence ?
166
Possible Solutions
Sentence (S) - “The cat sat on the table”
Concatenation : Our sentence is one word followed by another.
So, its representation can be - word vectors for every word in sentence in same
order.
Sv = [wvThe wvcat wvsat wvon wvthe wvtable]
Each word is represented by a d-dimensional vector, so a sentence with k words
has k X d dimensions.
Problem : Different sentences in corpus will have different lengths. Most ML
models work with fixed length input.
167
Mean of word vectors: Sv = (1/k) Σᵢ wvᵢ
Weighted average of the word vectors: Sv = Σᵢ αᵢ wvᵢ
168
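A small sketch of the averaging approach; `word_vectors` stands for any pretrained lookup (e.g. the gensim KeyedVectors loaded earlier), and OOV words are simply skipped:

```python
import numpy as np

def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of in-vocabulary words; zero vector if none found."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Example (assuming `glove` from the earlier gensim snippet):
# sv = sentence_vector("The cat sat on the table", glove, glove.vector_size)
```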
Fallacies
●  Different sentences with same words but different ordering will give same
vector.
○  “are you good” vs “you are good”
●  Negation - opposite meaning but very similar words
○  “I do want a car” vs “I don’t want a car”
If word vectors for “do” and “don’t” are close by, then in this case their
sentence vectors will also be close by. If these 2 sentences are in opposite
Classes, we are in trouble.
●  Sentence vector generated via simple operations on word vectors - often do
not capture syntactic and semantics properties.
169
Motivation
●  Build vector representation at sentence/paragraph/document level such that it
has the following properties :
○  Syntactic properties:
■  Ordering of words
○  Semantic properties:
■  Sentences that have the same meaning should come together.
■  Capturing negation.
○  Provide fixed length representation for variable length text.
170
Solution
●  Doc2Vec*
○  Distributed Memory (DM)
○  Distributed Bag Of Words (DBOW)
●  We will study these 2 methods to learn a representation for text at paragraph
level. However, this is applicable directly at sentence and document level too.
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
171
Distributed Memory (DM)
●  We saw that word2vec uses context words to predict the target word.
●  In distributed memory model, we simply extend the above idea - we use
paragraph vector along with context word vectors to predict the next word.
●  S = “The cat sat on the table”
•  (Sv , wvThe, wvcat, wvsat) → wvon
172
Architecture
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
[diagram: paragraph-vector matrix D of size |N| × d_dv and word-vector matrix W of size |V| × d_w]
173
Details
●  Each document is represented by a ddv dimensional vector.
●  Each word is represented by dw dimensional vector.
●  Index the vectors for document d and word w1, w2 & w3 (i.e. The, cat & sat)
●  These vectors are then combined (concatenate/average) for predicting next
word (w4) in document.
174
Details
●  Objective of word vector model.
●  Prediction is obtained through multi class classification.
●  Each yi is the un-normalized log-probability for output word i: y = b + U·h,
●  where U, b are the softmax parameters, and h is constructed by a concatenation
or average of word vectors extracted from W (together with the paragraph vector from D).
●  Cross entropy loss function is used to learn the representation of the word
and each document vector.
175
Generating representation at test time
Sentence : “I got back home.”
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
176
Distributed Bag of words(DBOW)
●  We saw that word2vec uses target word to predict the context words.
●  In dbow model, we simply extend the above idea - we use paragraph vector
to predict the words.
●  S = “The cat sat on the table”
(Sv ) → (wvThe, wvcat, wvsat, wvon )
177
Architecture
●  Words and the ordering of the words
uniquely define a paragraph.
●  Reversing this : a paragraph uniquely
defines the words and their ordering
present in the paragraph.
●  Thus, given a paragraph representation,
we should be able to predict the words in
the paragraph
●  This is precisely what DBOW does.
178
DBOW
●  Each document is represented by a ddv dimensional vector.
●  Softmax layer outputs a |V| dimensional vector (this is nothing but probability
distribution over words).
●  Essentially, we are trying to learn a document representation ddv which can
predict the words in any window on the document.
179
Details
●  Random windows are sampled from each document.
●  Document vector is used to make a prediction for words in this window.
●  Cross entropy loss function is used to learn the representation of the word
and each document vector.
180
Generating representation at test time
Sentence : “I got back home.”
181
Evaluation
•  Paragraph vec + 9 words to predict
10th word
•  Input: Concatenates 400 dim. DBOW
and DM vectors.
•  Predicts test-set paragraph vec’s from
frozen train-set word vec’s
Stanford IMDB movie review data set
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
182
Visualization
Visualization of wikipedia paragraph vectors
183
Sentence Similarity
Input sentence - “Distributed Representations of Sentences and Documents”
184
LDA vs para2vec
Terms similar to “machine learning”
185
Drawbacks
●  Inference needs to be performed at test time, for generating vector
representation of a sentence in test corpus.
●  This scales poorly for application which incorporate large amount of text.
186
187
Hacker’s way for quick implementation
Gensim notebook
gensim notebook
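A compact gensim sketch of both variants (assuming gensim ≥ 4; dm=1 gives Distributed Memory, dm=0 gives DBOW). The full walkthrough is in the gensim notebook:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the cat sat on the table",
          "the dog ate the cat and the hat",
          "i got back home"]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

dm_model = Doc2Vec(tagged, dm=1, vector_size=50, window=2, min_count=1, epochs=40)
dbow_model = Doc2Vec(tagged, dm=0, vector_size=50, min_count=1, epochs=40)

# Inference step required at test time (one of the drawbacks discussed above)
new_vec = dm_model.infer_vector("i could see the cat".split())
print(new_vec[:5])
print(dm_model.dv.most_similar([new_vec], topn=2))
```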
Tensor Flow Implementation
188
Tensorflow implementation
Skip Thoughts
Motivation
●  Although various techniques exist for generating sentence and paragraph
vectors, there is a lack of a generalized framework for sentence encoding.
●  Encode a sentence based on its neighbours (encode a sentence and try to
generate the two neighbouring sentences in the decoding layer).
●  Doc2vec requires explicit inference in order to generate the vector
representation of a sentence at test time.
190
Introduction to skip-thoughts
●  word2vec skip gram model applied at sentence level.
●  Instead of using a word to predict its surrounding words, use a sentence to
predict their surrounding sentences.
●  Corpus : I got back home. I could see the cat on the steps. This was
strange.
si-1 : I got back home.
si : I could see the cat on the steps.
si+1 : This was strange.
191
Introduction to skip-thoughts
● We need an ML model that can (sequentially) consume variable length sentences
● and, after consumption, use the knowledge gained from the whole sentence to
predict the neighbouring sentences
● FFNs and CNNs can neither consume sequential text nor have any persistence
192
RNN
●  Motivation: How do humans understand language
○  “How are you ? Lets go for a coffee ? ...”
●  As we read from left to right, we don’t understand each word in isolation,
completely throwing away previous words. We understand each word in
conjunction with our understanding from previous words.
●  Traditional neural networks (FFNs, CNNs) can not reason based on
understanding from previous words - no information persistence.
193
RNN
●  RNN are designed to do exactly this - they have loops in them, allowing
information to persist.
●  In the above diagram, A, looks at input xt and produces hidden state ht. A
loop allows information to be passed from one step of the network to the next.
Thus, using x0 to xt-1 while consuming xt.
Image borrowed from Christopher Olah’s blog
194
●  To better understand the loop in RNN, let us unroll it.
Time
●  The chain depicts information(state) being passed from one step to another.
●  Popular RNNs = LSTM, GRU Image borrowed from Christopher Olah’s blog
195
196
197
In CNNs we have parameters shared across space. In RNNs parameters are shared across time
198
Architecture of RNN
●  All RNNs have a chain of repeating modules of neural network.
●  In basic RNNs, this repeating module will have a very simple structure, such
as a single tanh layer.
Image borrowed from Christopher Olah’s
199
Image borrowed from suriyadeepan’s
The state consists of a single “hidden” vector h
200
The Dark side
●  RNN's have difficulty dealing with long-range dependencies.
●  “Nitin says Ram is an awesome person to talk to, you should definitely meet
him”.
●  In theory they can “summarize all the information until time t with hidden state
ht”
●  In practice, this is far from true.
201
●  This is primarily due to deficiencies in the training algorithm - BPTT (Back
Propagation Through Time)
●  Gradients are computed via chain rule. So either the gradients become:
○  Too small (Vanishing gradients)
■  Multiplying n of these small gradients (<1) results in even smaller gradient.
○  Too big (Exploding gradients)
■  Multiplying n of these large gradients (>1) results in even larger gradient.
202
LSTM
●  LSTMs are specifically designed to handle long term dependencies.
●  The way they do it is using cell memory: The LSTM does have the ability to
remove or add information to the cell state, carefully regulated by structures
called “gates”.
●  Gates control what information is to be added or deleted.
203
●  “forget gate” decides what information to throw from cell state.
●  It looks at ht−1 and xt, and outputs a number between 0 and 1 for each number
in the cell state Ct−1. A 1 represents “completely keep this” while a 0
represents “completely get rid of this.”
Image borrowed from Christopher Olah’s
204
●  “input gate” decides which values in cell state to update.
●  tanh layer creates candidate values which may be added to the state
Image borrowed from Christopher Olah’s
205
●  “forget gate” & “input gate” come together to update cell state.
Image borrowed from Christopher Olah’s
206
●  “output gate” decides the output.
Image borrowed from Christopher Olah’s
207
●  There are many variants.
●  Each variant has some gates that control what is stored/deleted.
●  At heart of any LSTM implementation are these equations.
●  By making memory cell additive, they circumvent the problem of diminishing
gradients.
●  For exploding gradients - use gradient clipping.
208
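The standard LSTM gate equations, in the notation of Olah's blog from which the figures above are borrowed:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)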
GRU
●  GRU units are simplification of LSTM units.
●  Gated recurrent units have 2 gates.
●  GRU does not have internal memory
●  GRU does not use a second nonlinearity for computing the output
209
Details
●  Reset Gate
○  Combine new input with previous memory.
●  Update Gate
○  How long the previous memory should stay.
210
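The corresponding GRU equations (z: update gate, r: reset gate):

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t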
LSTM & GRU Benefits
●  Remember for longer temporal durations
●  RNN has issues for remembering longer durations
●  Able to have feedback flow at different strengths depending on inputs
211
Visual difference between LSTM & GRU
212
Encoding
●  Let x1, x2, … xN be the words in sentence si, where N is the number of words.
●  Encoder produces an output representation at time step t, which is the
representation of the sequence x1, x2, ...xt.
●  Hidden state h_i^N is the output representation of the entire sentence.
213
Encoding
Corpus :
I got back home. I could see the cat on the steps. This was strange.
214
Decoding
●  Decoder conditions on the encoder output hi.
●  One decoder is used for next sentence, while another decoder is used for the
previous sentence.
●  Decoders share the vocabulary V, but learn the other parameters separately.
215
Decoding Unit
216
Details
●  Given h^t_{i+1}, the probability of word w^t_{i+1} given the previous t − 1 words and
the encoder vector is P(w^t_{i+1} | w^{<t}_{i+1}, h_i) ∝ exp( v_{w^t_{i+1}} · h^t_{i+1} )
●  where v_{w^t_{i+1}} denotes the row of V corresponding to the word w^t_{i+1}
●  A similar computation is performed for the previous sentence s_{i-1}
217
Objective Function
●  Given a tuple (si−1, si , si+1), the objective is the sum of the log-probabilities for
the forward(si+1) and backward(si-1) sentences conditioned on the encoder
representation:
●  The total objective is the above summed over all such training tuples.
218
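Written out, the objective for a single tuple is:

\sum_t \log P(w^t_{i+1} \mid w^{<t}_{i+1}, h_i) + \sum_t \log P(w^t_{i-1} \mid w^{<t}_{i-1}, h_i)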
Nearest Neighbour through skip-thoughts
219
220
https://www.youtube.com/watch?v=cersRTtjFcU
References
●  Doc2vec
○  Distributed Representations of Sentences and Documents
○  Medium article
○  Doc2vec tutorial
○  Document Embedding with Paragraph Vectors
○  https://deeplearning4j.org/doc2vec
○  https://groups.google.com/forum/#!topic/gensim/0GVxA055yOU
○  https://amsterdam.luminis.eu/2016/11/15/machine-learning-example/
○  https://github.com/wangz10/tensorflow-playground/blob/master/doc2vec.py
○  https://blog.acolyer.org/2016/06/01/distributed-representations-of-sentences-and-documents/
○  https://deeplearning4j.org/doc2vec
221
●  Skip-thoughts
o  Skip-Thought Vectors
o  https://github.com/ryankiros/skip-thoughts
o  https://www.intelnervana.com/building-skip-thought-vectors-document-understanding/
o  https://gab41.lab41.org/lab41-reading-group-skip-thought-vectors-fec68c05aa92
222
223
Module 4
Char2Vec
Topics:
• Drawbacks of doc2vec
• Character level language modeling
Key Learning outcomes:
•  Character based language models
•  RNNs - LSTM, GRU
•  Magic : RNN + char2vec
•  Extending skipgram, CBOW to characters
•  Tweet2vec
•  Basics of CNN
•  charCNN
225
https://code.facebook.com/posts/1686672014972296/deal-or-no-deal-training-ai-bots-to-negotiate/
Drawbacks
●  Until now we built language models at word/sentence/paragraph/document
level.
●  There are couple of major problems with them:
○  Out Of Vocabulary (OOV) - how to handle missing words ?
○  Low frequency count - Zipf’s Law tells us that in any natural language corpus a majority of
the vocabulary word types will either be absent or occur in low frequency.
○  Blind to subword information - “event”, “eventfully”, “uneventful”, “uneventfully” should have
structurally related embeddings.
228
○  Each word vector is independent - so you may have vectors for “run”, “ran”, “running” but there is
no (clean) way to use them to obtain vector for “runs”. Poor estimate of unseen words.
○  Storage space - have to store a large number of word vectors. English wikipedia contains 60 million
sentences with 6 billion tokens of which ~ 20 million are unique words. This is typically countered
by capping the vocabulary size.
○  Generative models: Imagine you feed k words/sentences to the model, and ask it to predict (k+1)st
word/sentence.
■  How well is such a model likely to do ?
■  Badly
■  Why ?
■  Large output space.
229
Way forward
●  Construct vector representation from smaller pieces:
○  Morphemes:
■  Meaningful morphological unit of a language that cannot be further divided (e.g. for
‘incoming’ morphemes are : in, come, ing)
■  Ideal primitive. By definition they are minimal meaning bearing units of a language.
■  Given a word, breaking it into morphemes is non-trivial.
■  Requires morphological tagger as preprocessing step (Botha and Blunsom 2014; Luong,
Socher, and Manning 2013)
○  Characters:
■  Fundamental unit
■  Easy to identify
■  How character compose to give meaning is not very clear. “Less”, “Lesser”, “Lessen”,
“lesson”
■  Most languages have a relatively small character set
230
●  For the rest of this presentation, we will treat text as a sequence of characters
- feeding 1 character at a time to our model.
●  For this we need models that are capable of taking and processing
sequences. FFNs and CNNs cannot do this; RNNs can.
●  RNN - Recurrent Neural Networks
○  LSTM
○  GRU
231
●  Imagine we are working with english language.
●  Roughly ~70 unique characters.
●  Easiest character embedding - 1 hot vectors in 70 dimension space.
●  Every 2 characters are equally distant(near by). Is there any use of such
embedding ? YES
Simplest char2vec
232
Unreasonable effectiveness of RNN*
●  Blog by Andrej Karpathy in 2015
●  Demonstrated the power of character level language models.
●  Central problem: Given k (continuous) characters (from a text corpora),
predict (k+1)st character.
●  Very very interesting results
* karpathy.github.io/2015/05/21/rnn-effectiveness/
233
●  Shakespeare’s work ●  Linux Source Code
234
●  Algebraic Geometry ●  NSF Research Awards abstracts
235
char2vec : Toy Example
Example training
sequence: “hello”
Vocabulary: [h,e,l,o]
236
Let’s implement it !
●  Take input text (say Shakespeare’s novels), and using a sliding window of
length (k+1) slice the raw text in contiguous chunks of (k+1) characters
●  Split each chunk into (X,y) pairs where first k characters become X and (k
+1)th character is the y. This becomes our training data.
237
●  Map each character to a unique id
●  Say we have d unique characters in our corpus
●  Each character is a vector of d dimensions in 1-hot format
●  A sequence of k characters is : 2d tensor of k x d
●  Dataset X is : 3d tensor of m sequences, each of k x d
●  Y is 2d tensor : m x d. Why ?
[diagram: each character is a d-dimensional one-hot vector; a sequence of k characters is a k × d matrix; m such sequences form the m × k × d tensor X, and Y is an m × d matrix of one-hot next characters]
238
●  We will use keras
●  A super simple library on top of TF/Theano
●  Meant for both beginners and advanced.
●  Exceptionally useful for quick prototyping.
●  Super popular on kaggle
Almost there ….
239
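A minimal Keras sketch of the model we are about to build (assuming a TensorFlow-backed Keras; layer sizes are illustrative and the notebook's exact architecture may differ):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

k, d = 40, 70        # sequence length, number of unique characters
m = 1000             # number of training sequences (illustrative)

# X: m sequences of k one-hot characters; y: one-hot next character
X = np.zeros((m, k, d), dtype="float32")
y = np.zeros((m, d), dtype="float32")

model = Sequential([
    Input(shape=(k, d)),
    LSTM(128),                       # consume the character sequence
    Dense(d, activation="softmax"),  # distribution over the next character
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
# model.fit(X, y, batch_size=128, epochs=10)   # once X, y hold real data
```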
Char2vec Notebook
240
Some more awesome applications of char2vec
Writing with machine
DeepDrumpf
241
Similar idea applied via CNN
●  Similarly Zhang et al. have applied CNNs instead of RNNs directly to 1-hot
character vectors.
“Text Understanding from Scratch” Xiang Zhang, Yann LeCun
“Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Junbo Zhao,
Yann LeCun
242
Dense char2vec
●  1-hot encoding of characters is fairly straight forward and very useful.
●  But people have shown learning a dense character level representation can
work even better (improved results, or similar results with fewer params).
●  Also results in fewer parameters in the input layer and its subsequent layer
(though not by much) (# of edges between the embedding layer and the next layer).
●  Simplest way to learn dense character vectors ?
243
CBOW & SkipGram
●  Original CBOW and Skip-Gram were based on words.
●  Use the same architecture, but character level i.e.
○  CBOW = given characters in context, predict the target character
○  Skip Gram = given target character, predict characters in context
244
We have given the notebook for character level skip-gram.
Notebook for character level CBOW : take home assignment !
245
How good is the embedding ?
●  Word vectors or document vectors are evaluated using both intrinsic and
extrinsic evaluation.
●  Character vectors have only extrinsic evaluation.
●  Makes no sense to say something like r : s :: a : b
●  Even from human perspective, a character has no meaning on its own.
●  Building character embedding is relatively cheap, hence most tasks specific
architectures have this component built into them.
Man : King :: Woman : Queen Sentiment analysis
246
Tweet2Vec*
●  Twitter - Informal language, slang, spelling errors,
abbreviations, new and ever evolving vocabulary,
and special characters.
●  For most twitter corpora, the size of the vocabulary is
~30-50% of the number of documents.
●  Can not use word level approaches - very large
vocabulary size.
●  Not only does this make it practically infeasible, it also
affects the quality of word vectors. Why ?
* Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et
al.
247
Task
●  Given a tweet, predict its hashtag.
●  “Shoutout to @xfoml Project in rob wittig talk #ELO17”
●  Super easy to collect a dataset.
248
Designing N/W
●  raw characters → character embedding → bi-directional GRU
●  Why bi-directional GRU (BGRU) ?
○  Language is not just a forward sequence.
○  “He went to ___?___”
○  “He went to ___?___ to buy grocery”
○  Its both past words and future words that determine the missing word.
○  (BGRU) exploits this - it has 2 independent GRU networks. One consumes text in
forward direction while other in backward direction.
249
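A simplified Keras sketch in the spirit of this architecture (the sizes and the plain Bidirectional wrapper are illustrative assumptions; the paper combines the forward and backward GRU states with learned matrices):

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, GRU, Dense

n_chars, max_len, n_hashtags = 128, 140, 500   # illustrative sizes

model = Sequential()
model.add(Embedding(n_chars, 64, input_length=max_len))   # char id -> dense vector
model.add(Bidirectional(GRU(128)))                        # forward + backward GRU
model.add(Dense(n_hashtags, activation="softmax"))        # one score per hashtag
model.compile(loss="categorical_crossentropy", optimizer="adam")
```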
Architecture
Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al.
250
Loss function
●  The final tweet embedding is used to produce a score for every hashtag.
●  Scores are converted to probabilities using softmax.
●  This gives a distribution over hashtags.
●  This is compared against the true distribution.
●  Cross entropy is used to measure the gap between the 2 distributions.
●  This is the loss function (J); see the sketch below.
251
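In NumPy terms, the sketch below shows the same computation for a single tweet (the 3 hashtags and their scores are made up for illustration):

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])   # score per hashtag from the tweet embedding
true = np.array([1.0, 0.0, 0.0])      # 1-hot true hashtag

probs = np.exp(scores) / np.exp(scores).sum()   # softmax -> distribution over hashtags
J = -np.sum(true * np.log(probs))               # cross entropy between the 2 distributions
print(probs, J)
```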
Results
Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al.
252
Results
Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al.
Rank of predicted hashtag
253
Time to Code
254
Char CNN
●  Convolutional Neural Nets (CNN)* have been super successful in the area of
vision.
●  CNN treats an image as a signal in the spatial domain.
●  Can it be applied to text ? Yes
○  Text = stream of characters
○  Since characters come one after another - this is a signal in the time domain
○  Embedding matrix as the input matrix
* LeNet-5 by Yann LeCun
Image: pixels spread in space. The position of each pixel is fixed; changing it will
change the image.
Text: characters spread in time. The (1d) position of each character is fixed;
changing it will change the sentence.
255
Basics of CNN
●  Input : Image
●  Image is nothing but a signal in space.
●  Represented by matrix with values (RGB)
●  Each value ~ intensity of the Red, Green and Blue channels respectively.
256
CNN architecture
●  2 key operations:
○  Convolution
○  Pooling
257
●  In simplest terms : given 2 signals x() and h(), convolution combines the
2 signals: (x ∗ h)(t) = ∫ x(τ) h(t − τ) dτ
●  In the discrete space: (x ∗ h)[n] = Σk x[k] h[n − k]
●  For our case, the image is x()
●  h() is called the filter/kernel/feature detector. A well known concept in the world
of image processing.
Convolution
258
●  Ex: Filters for edge
detection, blurring,
sharpening, etc
●  It is usually a small
matrix - 3x3, 5x5, 5x7
etc
●  There are well known
predefined filters
https://en.wikipedia.org/wiki/Kernel_(image_processing)
259
Filter (3x3):
1 0 1
0 1 0
1 0 1

Image (5x5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Convolution over the top-left 3x3 patch of the image:
  1*1 + 1*0 + 1*1
+ 0*0 + 1*1 + 1*0
+ 0*1 + 0*0 + 1*1
= 4
●  Convolved feature is nothing but taking a part of the image and applying the
filter over it - taking pairwise products and adding them.
260
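The same example in NumPy (a minimal sketch; strictly speaking, the sliding sum-of-products below is cross-correlation, which is what CNN libraries actually compute):

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# filter applied to the top-left 3x3 patch: pairwise products, then sum
print((image[0:3, 0:3] * kernel).sum())          # -> 4

# full convolved feature map (stride 1, no padding) is 3x3
fmap = np.array([[(image[i:i+3, j:j+3] * kernel).sum() for j in range(3)]
                 for i in range(3)])
print(fmap)
```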
●  Convolved feature map is nothing but sliding the filter over entire image
and applying convolution at each step, as shown in diagram below:
Filter:
1 0 1
0 1 0
1 0 1
https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-
261
●  Image processing over the past many decades has
built many filters for specific tasks.
●  In DL (CNN), rather than using predefined filters,
we learn the filters.
●  We start with small random values and update
them using gradients.
●  Stride: by how much we shift the filter.
Filter with unknown entries (learned during training):
? ? ?
? ? ?
? ? ?
262
●  It’s a simple technique for down sampling.
●  In CNNs, downsampling, or "pooling" layers are often placed after
convolutional layers.
●  They are used mainly to reduce the feature map dimensionality for
computational efficiency. This in turn improves actual performance.
●  Takes disjoint chunks of the image (typically 2×2) and aggregates
them into a single value.
●  Average, max, min, etc. Most popular is max-pooling.
Pooling
https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
263
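A tiny NumPy sketch of 2×2 max-pooling on a toy feature map (the values are arbitrary):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [0, 2, 5, 7],
                 [1, 1, 3, 4]])

# group the 4x4 map into disjoint 2x2 blocks and keep the max of each block
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 2]
                #  [2 7]]
```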
Putting it all together
https://adeshpande3.github.io
264
Deep Learning + Language Modeling
•  Traditionally uses architecture such as Recurrent Neural Networks (RNN).
•  Sequential processing : one unit after other.
•  Over time, advancements happened and concepts like 2-way ordering
(bidirectional), memory (LSTM), attention, etc. got added.
•  Some people explored the possibility of using CNN for Language modeling:
•  Pixels spread in space. So they are nothing but signal in space.
•  Words/tokens/characters spread in time. So they are nothing but signal in time.
265
CNNs for Language Modeling
266
•  Input for any NLP task are sentences/paras/docs in the form of matrix
•  Each row of this matrix represents a unit/token of text – character, morpheme,
word etc (typically row = 1-hot or embedding representation of that unit)
•  Unlike images, where filter slides over local patches of an image; in NLP we
typically use filters that slide over full rows of the matrix i.e. the “width” of our
filters is usually the same as the width of the input matrix. [1D or temporal
convolutions]
•  The height, or region size varies. Typically, window slides over 2-5 words at a
time.
267
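A minimal Keras sketch of such a 1D/temporal convolution over a sequence of word embeddings, in the spirit of Kim (2014). All sizes are illustrative, and the real model concatenates several region sizes (e.g. 3, 4, 5) rather than using a single one:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, emb_dim, max_len = 20000, 100, 50    # illustrative sizes

model = Sequential()
model.add(Embedding(vocab_size, emb_dim, input_length=max_len))
# each filter spans the full embedding width and a region of 3 words at a time
model.add(Conv1D(filters=128, kernel_size=3, activation="relu"))
model.add(GlobalMaxPooling1D())                  # max-over-time pooling
model.add(Dense(1, activation="sigmoid"))        # e.g. binary sentiment
model.compile(loss="binary_crossentropy", optimizer="adam")
```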
268
•  Lots of success of CNNs is attributed to :
•  Location Invariance : where an object appears in an image doesn't matter so much
•  Local Compositionality : a bunch of local objects combine/compose to give more complex
objects.
269
•  In CNN+NLP, both aforementioned properties go for a toss
•  Where a word comes in a sentence can change the meaning drastically.
○  Man bites dog.
Dog bites man.
•  Parts of phrases could be separated by several other words. Words do compose in some
ways, but how exactly this works, what higher level representations actually “mean” – these
aren’t as obvious as in the Computer Vision case.
○  “Tim said Robert has lot of experience, he feels you should definitely meet him”
•  Both key advantages gone, why are we even thinking of applying CNNs to
text ? RNNs should be the way to go.
270
•  “All models are wrong, but some are useful”
•  This is not about CNNs vs RNNs (may be both are bad!)
•  This is about
•  Understanding key difficulties
•  Are there some aspects of language modeling where CNNs can do a better job.
•  Helps us to better understand strength & weakness of each model.
•  Turns out that CNNs applied to certain NLP problems perform quite well. Especially
classification tasks - Sentiment Analysis, Spam Detection or Topic
Categorization.
•  CNNs are usually fast, very fast.
271
Major works in this sub-area
•  Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP 2014
•  Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. COLING-2014
•  Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with
Convolutional-Pooling Structure for Information Retrieval. CIKM ’14.
•  Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-
Speech Tagging. ICML-14.
•  Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text
Classification, 1–9.
•  Wenpeng Yin, Hinrich Schutze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: attention-based
convolutional neural network for modeling sentence pairs.
272
•  Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schutze. 2016. Combining recurrent
and convolutional neural networks for relation classification. In Proceedings of NAACL HLT.
pages 534–539.
•  Ying Wen, Weinan Zhang, Rui Luo, and Jun Wang.2016. Learning text representation using
recurrent convolutional neural network with highway layers. SIGIR Workshop on Neural
Information Retrieval
•  Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with
gated convolutional networks. arXiv preprint arXiv:1612.08083
•  Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schutze Comparative Study of CNN and
RNN for Natural Language Processing
•  Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language
Models. (Uses a hybrid of CNN and RNN)
273
Deep Dive
274
Character-Aware Neural Language Models *
•  Problem statement: Given t words w1, w2, ….., wt ; predict wt+1
•  Traditional models : words fed as inputs in the form of word embedding.
•  Here the input word embedding is replaced by the output of a character-level CNN.
•  Uses sub-word information.
•  Traditionally sub word information is fed in terms of morphemes;
●  Unbreakable : Un ("not") – break (root word) – able (“can be done”)
* “Character-Aware Neural Language Models” Y kim et. al 2015 275
•  Identifying morphemes is non-trivial. Requires morphological tagging as
preprocessing.
•  Y. Kim et al. leverage sub-word information through a character-level CNN.
•  Learn an embedding for each character.
•  A word w is then nothing but the embeddings of its constituent characters.
•  For each word, we apply convolution on its character embeddings to obtain features.
•  These are then fed to LSTM via highway layers.
•  Does not use word embeddings at all.
•  In most language models, a large % of the parameters are due to word
embeddings. Thus, we get a much smaller number of parameters to learn.
276
Details
●  C - vocabulary of characters.
●  D - dimensionality of character embeddings.
●  R - matrix of character embeddings.
● Let word wk = [c1,....,cl] i.e. made from l characters, where
l is the length of wk
● The character-level representation of wk is given by the matrix
● Ck ∈ ℝ D x l, where the jth column corresponds to the character
embedding for the jth character of word wk
●  Apply a filter/kernel H of width w to Ck to obtain the feature map fk (of length l − w + 1).
●  The ith element of fk is given by:
●  fk[i] = tanh( ⟨ Ck[∗, i : i + w − 1], H ⟩ + b ),
where Ck[∗, i : i + w − 1] is the ith to (i + w − 1)th columns of Ck
●  ⟨A, B⟩ = Tr(A Bᵀ) is called the Frobenius (inner) product
[Diagram: the character embedding matrix R (D × |C|); the representation Ck (D × l)
of word wk = c1 c2 … cl built from its character embeddings; and the resulting
feature map fk of length l − w + 1]
277
•  To capture the most important feature - we take the max over time:
●  yk = maxi fk[i] is the feature corresponding to filter H when applied to word wk.
●  (~ find the most important character n-gram)
•  Likewise, they apply multiple, say h, filters : H1, …., Hh.
•  Then, yk = [y1k, …, yhk] is the input representation of word wk,
where yjk is the max-pooled feature from filter Hj.
●  At this point of time we can either:
•  Construct MLP over yk
•  Feed yk to LSTM
278
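A NumPy sketch of the convolution + max-over-time step for a single word (dimensions are illustrative; real models use many filters of several widths):

```python
import numpy as np

D, l, w, h = 15, 7, 3, 4          # char-emb dim, word length, filter width, # filters
Ck = np.random.randn(D, l)        # character-level representation of word wk
H = np.random.randn(h, D, w)      # h filters, each of shape D x w
b = np.random.randn(h)

# fk[i] = tanh(<Ck[:, i:i+w], H> + b), i.e. Frobenius product at each position
fk = np.tanh(np.array([[(Ck[:, i:i + w] * H[j]).sum() + b[j]
                        for i in range(l - w + 1)] for j in range(h)]))

yk = fk.max(axis=1)               # max over time, one feature per filter
print(fk.shape, yk.shape)         # (4, 5) (4,)
```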
● Instead, to gain further improvements, rather than feeding yk directly to the LSTM, they pass it via a Highway
network*
●  Highway network:
● Basic idea: carry some part of the input directly to the output.
● The remaining input is processed and then taken forward.
● Very similar to residual networks.
● F() is typically : an affine transformation followed by tanh.
● In Highway networks, we learn “what parts of the input are to be
carried forward via the highway”
● This is done via a gating mechanism: a transform gate t = σ(WT y + bT) and a carry gate (1 − t),
giving z = t ⊙ F(y) + (1 − t) ⊙ y
279
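A NumPy sketch of a single highway layer wired as described above (random weights, only to show the gating):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_H, b_H, W_T, b_T):
    t = sigmoid(W_T @ y + b_T)     # transform gate
    g = np.tanh(W_H @ y + b_H)     # F(): affine transformation followed by tanh
    return t * g + (1 - t) * y     # carry gate (1 - t) passes part of the input through

dim = 8
y = np.random.randn(dim)
z = highway(y, np.random.randn(dim, dim), np.zeros(dim),
            np.random.randn(dim, dim), np.zeros(dim))
print(z.shape)   # (8,)
```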
280
In a nutshell
Results
281
Key take home
•  CNNs + NLP surely holds a lot of promise.
•  Pretty successful in classification settings.
•  Can prove to be a great tool to model the input aspects of NLP.
•  What about non-classification settings ?
•  Sequence labeling (NER)
•  Sequence generation (MT)
•  As of today, not so successful.
•  Though people have tried a lot of ideas there too.
•  de-convolutions in generative settings
•  Some architectures use different embeddings as different channels. 282
More Resources
•  https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
•  https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f
•  wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
•  https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with-
convolutional-neural-networks-on-microsoft-azure/
•  https://www.aclweb.org/anthology/P/P14/P14-1062.xhtml
•  https://github.com/yoonkim/lstm-char-cnn
•  https://github.com/yoonkim/CNN_sentence
•  https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32
•  “Comparative Study of CNN and RNN for Natural Language Processing” Wenpeng Yin et. al 2017, arXiv:
1702.01923 [cs.CL]
283
284
References
●  Tweet2vec:
○  “Character-based Neural Embeddings for Tweet Clustering” - Vakulenko et. al
○  “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network” - Sakaguchi et. al
●  Basics of CNN
○  https://adeshpande3.github.io
285
●  CNN on text:
○  https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f
○  https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-
embedding-and-convolutional-neural-networks-on-keras-163197aef623
○  Seminal paper - “Convolutional Neural Networks for Sentence Classification” Y kim
○  “Text Understanding from Scratch” Xiang Zhang, Yann LeCun
○  “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Yann LeCun
○  “Character-Aware Neural Language Models” Y kim
●  Character Embeddings:
○  “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Yann LeCun
○  “Character-Aware Neural Language Models” Y kim
○  “Exploring the Limits of Language Modeling” Google brain team.
○  “Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation”
286
Summary
● We learnt various ways to build representations at :
○  Word level
○  sentence/paragraph/document level
○  character level
● We discussed the key architectures used in representation learning and
fundamental ideas behind them.
● Core idea being : context units and target units.
● We also saw strengths and weaknesses of each of these ideas.
287
● Start with pretrained embeddings. This serves as baseline.
● Use rigorous evaluation - both intrinsic and extrinsic.
● If you have a lot of data, fine-tuning pretrained embeddings can improve
performance on the extrinsic task.
● If your dataset is small - worth trying GloVe. Don’t try fine tuning.
● Embeddings and task are closely tied. An embedding that works beautifully for
NER might fail miserably for sentiment analysis.
○  “It was a great movie”
○  “Such a boring movie”
If you are training word vectors and in your corpus “great” and “boring” come in similar context, then their
vectors will be closer in embedding space. Thus, they may be difficult to separate.
288
● Hyperparameters matter : many a time they are the key distinguisher.
● Character embeddings are usually task specific. Thus, they often tend to do
better.
● However, character embeddings can be expensive to train.
● Building blocks are the same: new architectures can be built using the same
principles.
● State of the art (for practitioners) - FastText from Facebook.
○  trains embeddings for character n-grams
○  Character n-grams(“beautiful”) : {“bea”, “eau”, “aut”, ………} (see the sketch below)
289
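For intuition, a small sketch of extracting the character n-grams of a word (FastText additionally adds boundary markers '<' and '>' and uses several values of n; both are omitted here for simplicity):

```python
def char_ngrams(word, n=3):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

print(char_ngrams("beautiful"))
# {'bea', 'eau', 'aut', 'uti', 'tif', 'ifu', 'ful'}
```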
•  Please upvote the repo
•  Run the notebooks. Play, experiment with them. Break them.
•  If you come across any bug, please open an issue on our github repo.
•  Want to contribute to this repo? Great! Please contact us
•  https://github.com/anujgupta82/Representation-Learning-for-NLP
Thank You 290

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Simplilearn
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 

Was ist angesagt? (20)

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Natural language procssing
Natural language procssing Natural language procssing
Natural language procssing
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
BERT
BERTBERT
BERT
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1
 
Nlp
NlpNlp
Nlp
 

Ähnlich wie NLP Bootcamp 2018 : Representation Learning of text for NLP

Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
Aravind Reddy
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
Aravind Reddy
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
Kuppusamy P
 
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptx
SHIBDASDUTTA
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 

Ähnlich wie NLP Bootcamp 2018 : Representation Learning of text for NLP (20)

NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
NLP,expert,robotics.pptx
NLP,expert,robotics.pptxNLP,expert,robotics.pptx
NLP,expert,robotics.pptx
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptx
 
Introducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en AzureIntroducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en Azure
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
ARTIFICIAL INTELLIGENCE---UNIT 4.pptx
ARTIFICIAL INTELLIGENCE---UNIT 4.pptxARTIFICIAL INTELLIGENCE---UNIT 4.pptx
ARTIFICIAL INTELLIGENCE---UNIT 4.pptx
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
sentiment analysis
sentiment analysis sentiment analysis
sentiment analysis
 
Knowledge base system appl. p 1,2-ver1
Knowledge base system appl.  p 1,2-ver1Knowledge base system appl.  p 1,2-ver1
Knowledge base system appl. p 1,2-ver1
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
P-1.1.9.ppt
P-1.1.9.pptP-1.1.9.ppt
P-1.1.9.ppt
 
Natural Language Processing.pptx
Natural Language Processing.pptxNatural Language Processing.pptx
Natural Language Processing.pptx
 

Mehr von Anuj Gupta

Mehr von Anuj Gupta (9)

ODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systems
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakes
 
Sarcasm Detection: Achilles Heel of sentiment analysis
Sarcasm Detection: Achilles Heel of sentiment analysisSarcasm Detection: Achilles Heel of sentiment analysis
Sarcasm Detection: Achilles Heel of sentiment analysis
 
Recent Advances in NLP
  Recent Advances in NLP  Recent Advances in NLP
Recent Advances in NLP
 
Talk from NVidia Developer Connect
Talk from NVidia Developer ConnectTalk from NVidia Developer Connect
Talk from NVidia Developer Connect
 
Synthetic Gradients - Decoupling Layers of a Neural Nets
Synthetic Gradients - Decoupling Layers of a Neural NetsSynthetic Gradients - Decoupling Layers of a Neural Nets
Synthetic Gradients - Decoupling Layers of a Neural Nets
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Representation Learning for NLP
Representation Learning for NLPRepresentation Learning for NLP
Representation Learning for NLP
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

NLP Bootcamp 2018 : Representation Learning of text for NLP

  • 1. Representation Learning of Text for NLP Anuj Gupta @anujgupta82 anujgupta82@gmail.com anujgupta-82
  • 2. About Instructor Anuj is Director – Machine Learning at Huawei Technologies. Prior to this he was heading ML efforts at FreshWorks, Airwoot and Droom; working in the area of NLP, Vision, Machine Learning, Deep learning Speaker at prestigious forums like Anthill, PyData, Fifth Elephant, ICDCN, PODC, IIT Delhi, IIIT Hyderabad. Co-organizer special interest groups like DLBLR. @anujgupta82 anujgupta82@gmail.com
  • 3. Objective of this Workshop •  Deep dive into state-of-the-art techniques for representing text data. •  By the end of this workshop, you would have gained a deeper understanding of key ideas, maths and code powering these techniques. •  You will be able to apply these techniques in solving NLP problems of your interest. •  Help you achieve higher accuracies. •  Help you achieve deeper understanding. •  Target audience: Data science teams, industry practitioners, researchers, enthusiast in the area of NLP •  This will be a very hands-on workshop 4
  • 4. I learn best with toy code that I can play with. But unless you know key concepts, you can’t code. In this workshop, we will do both 5
  • 5. Outline Workshop is divided into 4 modules. We will cover module 1 and 2 Day 1. Module 3 and 4 on Day 2. The github repo has folders for each of the 4 modules containing respective notebooks. • Module 1 • Archaic techniques • Module 2 • Word vectors • Module 3 • Sentence/Paragraph/Document vectors • Module 4 • Character vectors 6
  • 6. • Check List • Github repo installed • Jupyter up and running • Loud and clearly audible • Ground Rules • Got a question ? Stop me then & there and ask. • No questions are stupid. • Please respect other’s time. • ML ≠ Software Engineering • ML is not merely code. • Garbage out Garbage in • If a model (does not) work well, you need to understand why it does so. • Fairness, Bias • These are often driven by the assumptions, hypothesis and maths of the model 7
  • 9. Resurrect your dead friend as an AI 10 Luka - Eugenia lost her friend Roman in an accident. Determined not to lose his memory, she gathered all the texts Roman sent over his short life and made a chatbot – a program that responds automatically to text messages. Now whenever she is missing Roman, Eugenia sends the chatbot a message and Roman’s words respond.
  • 12. Topics •  Introduction to NLP •  Examples of various NLP tasks •  Archaic Techniques •  Using pretrained embeddings Key Learning outcomes: •  Basics of NLP •  One hot encoding •  Bag of words •  N-gram •  TF-IDF •  Why these techniques are bad •  How can you use pretrained embeddings •  Issues with using pre-trained word embeddings 13
  • 13. What is NLP •  Concerned with programming computers to fruitfully process large natural language. •  It is at the intersection of computer science, artificial intelligence and computational linguistics 14
  • 14. Some NLP tasks: •  Spell Checking •  Finding Synonyms • Keyword Search 15
  • 16. •  Co-reference (e.g. What does "he" or "it" refer to given a document?) 17
  • 17. •  Machine Translation (e.g. translate Chinese text to English) 18
  • 18. NLP is not easy ! 19
  • 19. Input to any NLP system •  Cannot directly feed the raw text to machine learning algorithms; •  One must first convert to some numerical form. •  Why ? ML algorithms assume that all features used to represent an observation are numeric •  This conversion from raw text to a suitable numerical form is called text representation. 20
  • 20. • Example: we wish to build a system for sentiment analysis • Sentiment is embedded in the meaning. Hence, to correct predict sentiment, we understand the meaning of the sentence. • To extract the right meaning from a sentence, following are the most crucial data points: • Break the sentence into lexical units and derive the meaning for each of the lexical units. • Understand syntactic (grammatical) structure of the sentence. • Understand the context in which the sentence appears. • The semantics (meaning) of the sentence comes from the above 3 points combined together. 21
  • 21. • The text representation scheme that we choose to represent our text, must facilitate the extraction of the above mentioned data points in the best possible manner. • The process of extracting these data points is also called feature extraction or feature encoding. • Only once we have extracted the right features, can one aim to use a suitable machine learning algorithm that can better utilize these features and deliver satisfactory results. 22
  • 22. 23 NLP pipeline Raw Text Preprocessing Tokenization to get language units. Mathematical representation of language unit Build train/test data Train model using training data Test the model on test data The first and arguably most important common denominator across all NLP tasks is : how we represent text as input to our models.
  • 23. • Machine does not understand text and need a numeric representation. • Images have a natural representation scheme • RGB matrix), for text there is no obvious way • An integral part of any NLP pipeline Why text representation is important? 24
  • 24. • Like images, speech also has a very natural way • For text there is no obvious way Why text representation is important? 25
  • 25. • An integral part of any NLP pipeline • Representation learning • set of techniques that learn features : a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. • Part of feature engineering/learning. • Get rid of “hand-designed” features and representation • Unsupervised feature learning - obviates manual feature engineering What & Why Representation learning 26
  • 27. Vector Space Models • Represent text units (characters, phonemes, words, sentences, paragraphs, documents) as vectors of numbers. • Vector space model or term vector model - an algebraic model for representing text documents as vectors. • Similarity between 2 documents = cosine similarity. Cosine of the angle between the vectors. • We will see VSM in various flavors 28
  • 28. •  One hot encoding •  Bag of words •  N-gram •  TF-IDF 29 Legacy Techniques
  • 29. One hot encoding •  Map each word to a unique ID •  Typical vocabulary sizes will vary between 10k and 250k. 30
  • 30. •  Use word ID, to get a basic representation of word through. •  This is done via one-hot encoding of the ID •  one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID. •  ex.: for vocabulary size D=10, the one-hot vector of word (w) ID=4 is e(w) = [ 0 0 0 1 0 0 0 0 0 0 ] 31 •  Begins by building a dictionary that maps the vocabulary of the corpus to identifiers •  Map each word to a unique ID
  • 31. 32 •  One-hot encoding makes no assumption about word similarity •  all words are equally similar/different from each other •  this is a natural representation to start with, though a poor one
  • 32. Drawbacks • Size of input vector scales with size of vocabulary • Must pre-determine vocabulary size. • Cannot scale to large or infinite vocabularies (Zipf’s law!) • Computationally expensive - large input vector results in far too many parameters to learn. • “Out-of-Vocabulary” (OOV) problem • How would you handle unseen words in the test set? • One solution is to have an “UNKNOWN” symbol that represents low- frequency or unseen words 33
  • 33. • No relationship between words • Each word is an independent unit vector •  D(“cat”, “refrigerator”) = D(“cat”, “cats”) •  D(“spoon”, “knife”) = D(“spoon”, “dog”) • In the ideal world… • Relationships between word vectors reflects relationships between words • Features of word embeddings reflect features of words • These vectors are sparse: • Vulnerable to overfitting: sparse vectors most computations go to zero resultant loss function has very few parameters to update. 34
  • 34. Bag of Words • Analyse different “bags of words” and classify them accordingly. • Vocab = set of all the words in corpus • Document = Words in document w.r.t vocab with multiplicity Sentence 1: "The cat sat on the hat" Sentence 2: "The dog ate the cat and the hat” Vocab = { the, cat, sat, on, hat, dog, ate, and } Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 } Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1} 35
  • 35. Pros + Quick and Simple + This is a very natural scheme for text representation. + Captures multiplicity of word occurrence in a document. + Documents with same words will have their vectors closer to each other in euclidean space as compared to documents with completely different words. S1 : Dog bites man. S2 : Man bites dog. S3 : Dog eats meat. S4 : Man eats food. One possible assignment is: dog = 1, bites = 2, man = 3, meat = 4, food = 5 and eats = 6 S1 : [1,1,1,0,0,0], S2 : [1,1,1,0,0,0], S3 : [1,0,0,1,0,1], S4 : [0,0,1,1,0,1] 36
  • 36. Cons - Too simple - Orderless - No notion of syntactic/semantic similarity - Does not capture the similarity between different words that mean the same. Say, 3 sentences - ‘I run’, ‘I ran’ and ‘I ate’. All three will equally apart. - Out of vocabulary words are simply ignored. There is no provision to handle new words at test time. Only way out ‘UNK’ token and factor ‘UNK’ token at train time. - Word ordering is lost, hence context of words is lost. In bag of words scheme word ordering of words does not matter. It is only only the frequency of words that gets captured. 37
  • 37. N-gram model • Attempt to incorporate word ordering into the encoded vector. •  break the sentences/documents into chunks of n contiguous words/tokens • Vocab = set of all n-grams in corpus. • Document = n-grams in document w.r.t vocab with multiplicity For bigram: Sentence 1: "The cat sat on the hat" Sentence 2: "The dog ate the cat and the hat” Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and, and the} Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0} Sentence 2 : { 1, 0, 0, 0, 0, 1, 1, 1, 1, 1} 38
  • 38. Pros & Cons + Tries to incorporate order of words - Very large vocab set - No notion of syntactic/semantic similarity 39
  • 39. Term Frequency–Inverse Document Frequency (TF-IDF) • Captures importance of a word to a document in a corpus. • Importance increases proportionally to the number of times a word appears in the document; but is inversely proportional to the frequency of the word in the corpus. • TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). • IDF(t) = log (Total number of documents / Number of documents with term t in it). • TF-IDF (t) = TF(t) * IDF(t) 40
  • 40. • S1 : Dog bites man. S2 : Man bites dog. S3 : Dog eats meat. S4 : Man eats food. • the idf values for the terms are: dog = log2(4/3) = 0.4114 bites = log2(4/2) = 1 man = log2(4/3) =0.4114 • tf values. Since each term appears exactly once and each document has exactly 3 terms., tf score for each term is ⅓ • Therefore, tf-idf scores are: dog = 0.4114 * 0.33 = 0.135 bites = 1* 0.33 = 0.33 man = 0.4114 * 0.33 = 0.135 41
  • 41. Pros & Cons • Pros: • Easy to compute • Has some basic metric to extract the most descriptive terms in a document • Thus, can easily compute the similarity between 2 documents using it • Disadvantages: • Based on the bag-of-words (BoW) model, therefore it does not capture position in text, semantics, co-occurrences in different documents, etc. • Thus, TF-IDF is only useful as a lexical level feature. • Cannot capture semantics (unlike topic models, word embeddings) 42
  • 43. Bottom Line • More often than not, how rich your input representation is has huge bearing on the quality of your downstream ML models. • For NLP, archaic techniques treat words as atomic symbols. Thus every 2 words are equally apart. • They don’t have any notion of either syntactic or semantic similarity between parts of language. • This is one of the chief reasons for poor/mediocre performance of NLP based models. But this has changed dramatically in past few years 44
  • 46. Topics • Word level language models • tSNE : Visualizing word-embeddings • Demo of word vectors. Key Learning outcomes: •  Key ideas behind word vectors •  Maths powering their formulation •  Bigram, SkipGram, CBOW •  Train your own word vectors •  Visualize word embeddings •  GloVe •  How GloVe different Word2Vec •  Evaluating word vectors •  tSNE •  how is tSNE it different from PCA 47
  • 47. Distributional & Distributed Representations 48
  • 48. Distributional representations •  Linguistic aspect. •  Based on co-occurrence/ context •  Distributional hypothesis: linguistic units with similar distributions have similar meanings. •  Meaning is defined by the context in which a word appears. This is ‘connotation’. •  This is contrast with ‘denotation’ - literal meaning of a word. Rock-literally means a stone but can also be used to refer to a person as solid and stable. “Anthill rocks” •  The distributional property is usually induced from document or context or textual vicinity (like sliding window). 49
  • 49. Distributed representations •  Compact, dense and low dimensional representation. •  Differs from distributional representations as the constraint is to seek efficient dense representation, not just to capture the co-occurrence similarity. •  Each single component of vector representation does not have any meaning of its own. Meaning is smeared across all dimensions. •  The interpretable features (for example, word contexts in case of word2vec) are hidden and distributed among uninterpretable vector components. 50
  • 50. •  Embedding: Mapping between space with one dimension per linguistic unit (character, morpheme, word, phrase, paragraph, sentence, document) to a continuous vector space with much lower dimension. •  For the rest of this presentation, “meaning” of linguistic unit is represented by a vector of real numbers. 51 good
  • 51. Using pre-trained word embeddings •  Most popular - Google’s word2vec, Stanford’s GloVe •  Use it as a dictionary - query with the word, and use the vector returned. •  Sentence (S) - “The cat sat on the table” •  Challenges: •  Representing sentence/document/paragraph. •  sum •  Average of the word vectors. •  Weighted mean 52
  • 52. •  Handling Out Of Vocabulary (OOV) words. •  Transfer learning (i.e. fine tuning on data). 53
  • 53. For the rest of this presentation we will see various technique to build/ train our own embeddings 54
  • 56. John Rupert Firth “You shall know a word by the company it keeps” -1957 • English linguist • Most famous quote in NLP (probably) • Modern interpretation: Co-occurrence is a good indicator of meaning • One of the most successful ideas of modern statistical NLP 57
  • 57. Co-occurrence with SVD •  Define a word using the words in its context. •  Words that co-occur •  Building a co-occurrence matrix M. Context = previous word and next word Corpus ={“I like deep learning.” “I like NLP.” “I enjoy flying.”} 58
  • 58. •  Imagine we do this for a large corpus of text •  row vector xdog describes usage of word dog in the corpus •  can be seen as coordinates of point in n-dimensional Euclidean space Rn •  Reduce dimensions using SVD = M 59
  • 59. •  Given a matrix of m × n dimensionality, construct a m × k matrix, where k << n •  M = U Σ VT •  U is an m × m orthogonal matrix (UUT = I) •  Σ is a m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0, where r = min(m, n)) [σi’s are known as singular values] •  V is an n × n orthogonal matrix (VVT = I) •  We construct M’ s.t. rank(M’) = k • We compute M’ = U Σ’ V, where Σ’ = Σ with k largest singular values •  k captures desired percentage variance •  Then, submatrix U v,k is our desired word embedding matrix. 60
  • 60. Result of SVD based Model K = 2 K = 3 61
  • 61. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence Rohde et al. 2005 62
  • 62. Pros & Cons + Simple method + Captures some sense (though weak) of similarity between words. -  Matrix is extremely sparse. -  Quadratic cost to train (perform SVD) -  Drastic imbalance in frequencies can adversely impact quality of embeddings. -  Adding new words is expensive. Take home : we worked with statistics of the corpus rather than working with the corpus directly. This will recur in GloVe 63
  • 63. BiGram Model Idea:  Directly  learn  low-­‐dimensional  word  vectors  ?   64
  • 64. Language Models •  Filter out good sentences from bad ones. •  Good = semantically and syntactically correct. •  Modeled this via probability of given sequence of n words Pr (w1, w2, ….., wn) •  S1 = “the cat jumped over the dog”, Pr(S1) ~ 1 •  S2 = “jumped over the the cat dog”, Pr(S2) ~ 0 65
  • 67. BiGram Model •  Objective : given wi , predict wi+1 •  Training data: given sequence of n words < w1, w2, ….., wn >, extract bi-gram pairs (wi-1 , wi) •  Knowns: •  input – output training examples : (wi-1 , wi) •  Vocab of training corpus (V) = U (wi) •  Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. •  Model : shallow net 68
  • 69. •  Feed index of wi-1 as input to network. •  Use index to lookup embedding matrix. •  Perform affine transform on word embedding to get a score vector. •  Compute probability for each word. •  Set 1-hot vector of wi as target. •  Set loss = cross-entropy between probability vector and target vector. Steps 70
  • 72. ● Per word, we have 2 vectors : 1.  As row in Embedding layer (E) 2.  As column in weights layer (used for afine transformation) ● It’s common to take average of the 2 vectors. ● It’s common to normalise the vectors. Divide by norm. ● An alternative way to compute ŷi : # (wi, wi-1) / # (wj, wi-1) ∀ j∈V ● Use co-occurrence matrix to compute these counts. Remarks 73
  • 73. I learn best with toy code, that I can play with. - Andrew Trask jupyter notebook 1 74
  • 75. CBOW •  Continuous Bag of words. •  Proposed by Mikolov et al. in 2013 •  Conceptually, very similar to Bi-gram model •  In the bigram model, there were 2 key drawbacks: 1.  The context was very small – we took only wi-1 , while predicting wi 2.  Context is not just preceding words; but following words too. 76
  • 76. •  “the brown cat jumped over the dog” Context = the brown cat over the dog Target = jumped •  Context window = k words on either side of the word to be predicted. •  Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k) •  W = total number of unique windows •  Each window is sliding block 2c+1 words 77
  • 77. CBOW Model •  Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc •  Training data: given sequence of n words < w1, w2, ….., wn >, for each window extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc ) •  Knowns: •  input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc ) •  Vocab of training corpus (V) = ∪(wi) •  Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. 78
  • 79. •  Feed indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size k. •  Use indexes to lookup embedding matrix. •  Average these vectors to get vˆ = (vc−k+vc−1+...+vc+1+vc+k ) / 2m •  Perform affine transform on vˆ to get a score vector. •  Turn scores in probabilities for each word. •  Set 1-hot vector of wc as target. •  Set loss = cross-entropy between probability vector and target vector. Steps 80
  • 80. Maths behind the scene •  Optimization objective J = - log Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k) •  Maximizing Pr() = Minimizing – log Pr() •  Let vˆ = (wc−k + . . . + wc−1 + wc+1 + . . . + wc+k )/ 2m •  Then, RHS •  gradient descent to update all relevant word vectors uc and wj. 81
  • 81. Skip-Gram model •  2nd model proposed by Mikolov et al. in 2013 •  Turns CBOW over its head. •  CBOW = given context, predict the target word •  Skip Gram = given target, predict context •  “the brown cat jumped over the dog” Target = jumped Context = the, brown, cat, over, the, dog 82
  • 82. •  Objective : given wc , predict wc−k, . . . , wc−1, wc+1, . . . , wc+k •  Training data: given sequence of n words < w1, w2, ….., wn >, for each window extract target and context pairs (wc, wc−k) , (wc, wc−1) , (wc, wc+1), (wc, wc+k) •  Knowns: •  input – output training examples : (wc, wc−k) , (wc, wc−1) , (wc, wc+1), (wc, wc+k) • Vocab of training corpus (V) = ∪ (wi) •  Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. 83
  • 84. •  Feed index of xc •  Use index to lookup embedding matrix. •  Perform affine transform on vˆ to get a score vector. •  Turn scores in probabilities for each word. •  Set 1-hot vector of wc as target. •  Set loss = cross-entropy between probability vector and target vector. Steps 85
  • 85. Maths behind the scene •  Optimization objective J = - log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | , wc) •  gradient descent to update all relevant word vectors uc and wj. 86
  • 87. •  How to quantitatively evaluate the quality of word vectors? •  Intrinsic Evaluation : •  Word Vector Analogies •  Extrinsic Evaluation : •  Downstream NLP task 88
  • 88. Intrinsic Evaluation •  Specific Intermediate subtasks •  Easy to compute. •  Analogy completion: •  a:b :: c:? d = man:woman :: king:? •  Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions •  Discarding the input words from the search! •  Problem: What if the information is there but not linear? 89
  • 89. 90
  • 90. Extrinsic Evaluation •  Real task at hand •  Ex: Sentiment analysis. •  Not very robust. •  End result is a function of whole process and not just embeddings. •  Process: •  Data pipelines •  Algorithm(s) •  Fine tuning •  Quality of dataset 91
  • 92. Bottleneck •  Recall, to calculate probability, we use softmax. The denominator is sum across entire vocab. •  Further, this is calculated for every window. •  Too expensive. •  Single update of parameters requires to iterate over |V|. Our vocab usually is in millions. 93
  • 93. To approximate probability, dont use the entire vocab. There are 2 popular line of attacks to achieve this: • Modify the structure the softmax • Hierarchical Softmax •  Sampling techniques : don’t use entire vocabulary to compute the sum •  Negative sampling 94
  • 94. ●  Arrange words in vocab as leaf units of a balanced binary tree. ●  |V| leaves |V| - 1 internal nodes ●  Each leaf node has a unique path from root to the leaf ●  Probability of a word (leaf node Lw) = Probability of the path from root node to leaf Lw ●  No output vector representation for words, unlike softmax. ●  Instead every internal node has a d-dimension vector associated with it - v’n(w, j) Hierarchical Softmax n(w, j) means the j-th unit on the path from root to the word w
  • 95. ●  Product of probabilities over nodes in the path ●  Each probability is computed using sigmoid ●  ●  Inside it we check : if (j+1)th node on path left child of jth node or not ●  v’n(w, j) T h : vector product between vector on hidden layer and vector for the inner node in consideration.
  • 96. ●  p(w = w2) ●  We start at root, and navigate to leaf w2 ●  ●  ●  p(w = w2) ●  Example
  • 97. ●  Cost: O(|V|) to O(log ⁡|V| ) ● In practice, use Huffman tree
  • 98. Negative Sampling ● Given (w, c) : word and context ● Let P(D=1|w,c) be probability that (w, c) came from the corpus data. ● P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data. ●  Lets model P(D=1|w,c) with sigmoid: ● Objective function (J): ○  maximize P(D=1|w,c) if (w, c) is in the corpus data. ○  maximize P(D=0|w,c) if (w, c) is not in the corpus data. ● We take a simple maximum likelihood approach of these two probabilities.
  • 99. θ is parameters of the model. In our case U and V - input, output word vectors. Took log on both side
  • 100. ● Now, maximizing log likelihood = minimizing negative log likelihood. ●  ●  D ̃ is “false” or negative “Corpus” with wrong sentences - "jumped cat dog the the over" ●  Generate D ̃ on the fly by randomly sampling this negative from the word bank. ●  For skip-gram, our new objective function for observing the context word wc − m + j given the center word wc would be : regular softmax loss for skip-gram
  • 101. ●  Likewise for CBOW, our new objective function for observing the center word uc given the context vector ●  In the above formulation, {u˜k |k = 1 . . . K} are sampled from Pn(w). ●  best Pn(w) = Unigram distribution raised to the power of 3/4 ●  Usually K = 20-30 works well. regular softmax loss for CBOW
  • 102. GloVe
  • 104. Global matrix factorization methods ●  Use co-occurrence counts ●  Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert) + Fast training +  Efficient usage of statistics + Captures word similarity -  Do badly on analogy tasks -  Disproportionate importance given to large counts 105
  • 105. Local context window method ●  Use window to determine context of a word ●  Ex: Skip-gram/CBOW ( Mikolov et al), NNLM(Bengio et al), HLBL, (Collobert & Weston) +  Capture word similarity. +  Also performance better on analogy tasks -  Slow down with increase in corpus size -  Inefficient usage of statistics 106
  • 106. Combining the best of both worlds ●  Glove model tries to combine the two major model families :- ○  Global matrix factorization (co-occurrence counts) ○  Local context window (context comes from window) = Co-occurrence counts with context distance 107
  • 107. Co-occurrence counts with context distance ●  Uses context distance : weight each word in context window using its distance from the center word ●  This ensures nearby words have more influence than far off ones. ●  Sentence -> “I like NLP” ○  Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5 ○  Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0 ○  Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0 ●  Corpus C: I like NLP. I like cricket. Co-occurrence matrix for C 108
  • 108. Issues with Co-occurrence Matrix ●  Long tail distribution ●  Frequent words contribute disproportionately (use weight function to fix this) ●  Use Log for normalization ●  Avoid log 0 : Add 1 to each Xij X21 109
  • 109. Intuition for Glove ● Think of matrix factorization algorithms used in recommendation systems. ● Latent Factor models ○  Find features that describe the characteristics of rated objects. ○  Item characteristics and user preferences are described using vectors which are called factor vectors z ○  Assumption: Ratings can be inferred from a model put together from a smaller number of parameters 110
  • 110. Latent Factor models ●  Dot product estimates user’s interest in the item ○  where, qi : factor vector for item i. pu : factor vector for user u i : estimated user interest ●  How to compute vectors for items and users ? 111
  • 111. Matrix Factorization ● rui : known rating of user u for item i ●  predicted rating : ●  Similarly glove model tries to model the co-occurrence counts with the following equation : 112
  • 112. Weighting function . ● Properties of f(X) ○ vanish at 0 i.e. f(0) = 0 ○ monotonically increasing ○ f(x) should be relatively small for large values of x ●  Empirically 𝞪 = 0.75, xmax=100 works best 113
  • 113. Loss Function J = Σi,j f(Xij) (wi ⋅ w̃j + bi + b̃j − log Xij)² ●  Scalable. ●  Fast training ○  Training time does not depend on the corpus size ○  Always fitting to a |V| x |V| matrix. ●  Good performance with a small corpus and small vectors. 114
  • 114. Training ● Input : ○ Xij (|V| x |V| matrix) : co-occurrence matrix ● Parameters ○  W (|V| x |D| matrix) & W˜ (|V| x |D| matrix) : ■  wi and w̃j are the representations of the ith & jth words from the W and W˜ matrices respectively. ○ bi (|V| x 1) column vector : bias terms ○ bj (1 x |V|) row vector : bias terms 115
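A small NumPy sketch of the weighting function and the weighted least-squares GloVe objective from the previous slides. This is a toy dense-matrix version for illustration (real implementations, and the workshop notebook, work on the non-zero co-occurrence pairs only); all sizes are illustrative.

    import numpy as np

    def glove_weight(x, x_max=100.0, alpha=0.75):
        """Weighting function f(X_ij): 0 at X_ij = 0, capped at 1 for large counts."""
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_loss(X, W, W_tilde, b, b_tilde):
        """Weighted least-squares GloVe objective over a dense |V| x |V| count matrix X."""
        log_X = np.log(X + 1.0)                                   # add 1 to avoid log 0
        pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
        return np.sum(glove_weight(X) * (pred - log_X) ** 2)

    # toy usage: |V| = 5 words, D = 3 dimensional vectors
    rng = np.random.default_rng(0)
    V, D = 5, 3
    X = rng.integers(0, 50, size=(V, V)).astype(float)
    W, W_t = rng.normal(size=(V, D)), rng.normal(size=(V, D))
    b, b_t = np.zeros(V), np.zeros(V)
    print(glove_loss(X, W, W_t, b, b_t))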
  • 115. ●  Train on Wikipedia data ● |V| = 2000 ●  Window size = 3 ●  Iterations = 10000 ● D = 50 ● Learn two representations for each word in |V|. ● reg = 0.01 ● Use momentum optimizer with momentum=0.9. Quick Experiment 116
  • 116. Results - months & centuries 117
  • 122. t-SNE
  • 123. Artworks mapped using Machine Learning. Art work Mapped using t-SNE https://artsexperiments.withgoogle.com/tsnemap/#47.68,1025.98,361.43,51.29,0.00,271.67
  • 124. Objective ●  Given a collection of N high-dimensional objects x1, x2, …. xN. ●  How can we get a feel for how these objects are (relatively) arranged ? 125
  • 125. Introduction ● Build map(low dimension) s.t. distances between points reflect “similarities” in the data : ● Minimize some objective function that measures the discrepancy between similarities in the data and similarities in the map 126
  • 128. Principal component analysis ●  PCA mainly tries to preserve large pairwise distances in the map. ● Is that what we want ? 129
  • 129. Goals ●  Preserve distances ●  Preserve the neighborhood of each point 130
  • 130. t-SNE High dimension ● Measure pairwise similarities between high dimensional objects xi xj 131
  • 131. t-SNE Lower dimension ● Measure pairwise similarities between low dimensional map points 132
  • 132. t-SNE ● We have measure of similarity of data points in High Dimension ● We have measure of similarity of data points in Low Dimension ● We need a distance measure between the two. ● Once we have distance measure, all we want is : to minimize it 133
  • 133. One possible choice - KL divergence ●  It’s a measure of how one probability distribution diverges from a second expected probability distribution 134
  • 134. KL divergence applied to t-SNE ● Objective function : C = KL(P || Q) = Σi Σj pij log(pij / qij) ●  We want nearby points in high-D to remain nearby in low-D ○  If that is not the case, then ■  pij will be large (because the points are nearby) ■  but qij will be small (because the points are far away) ■  This results in a larger penalty ■  In contrast, if both pij and qij are large : lower penalty 135
  • 135. KL divergence applied to t-SNE ● Likewise, we want far-away points in high-D to remain (relatively) far away in low-D ○  If that is not the case, then ■  pij will be small (because the points are far away) ■  but qij will be large (because the points are nearby) ■  This results in only a small penalty ●  t-SNE therefore mainly preserves the local similarity structure of the data 136
  • 136. t-Distributed Stochastic Neighbor Embedding ● Move the map points around to minimize C = KL(P || Q) 137
  • 137. Why a Student t-distribution ? ● t-SNE tries to retain the local structure of the data in the map ● Result : dissimilar points have to be modelled as far apart in the map ● Van der Maaten and Hinton use a Student t-distribution because it resembles a Gaussian but has much heavier tails, which gives dissimilar points room to be placed far apart ● Local structure vs global structure ●  Local structure is preserved ●  global structure may be lost 138
  • 138. Deciding the effective number of neighbours ●  We need to decide the radii in different parts of the space, so that we can keep the effective number of neighbours about constant. ●  A big radius leads to a high entropy for the distribution over the neighbors of i. ●  A small radius leads to a low entropy. ●  So decide what entropy you want and then find the radius that produces that entropy. ●  It is easier to specify 2^entropy ○  This is called the perplexity ○  It is the effective number of neighbors. 139
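The perplexity experiments on the following slides can be approximated with scikit-learn's t-SNE. A minimal sketch (the synthetic two-cluster data and the perplexity values are illustrative):

    import numpy as np
    from sklearn.manifold import TSNE

    # two well-separated Gaussian clusters in 50 dimensions
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(50, 50)),
                   rng.normal(10.0, 1.0, size=(50, 50))])

    # perplexity ~ effective number of neighbours; try a few values
    for perplexity in (2, 5, 30, 50):
        Y = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
        print(perplexity, Y.shape)   # plot Y[:, 0] vs Y[:, 1] to inspect the map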
  • 140. Hyperparameters really matter: playing with perplexity ●  100 data points, clearly separated into two clusters, projected with t-SNE ●  Applied t-SNE with different values of perplexity ●  With perplexity = 2, local variations in the data dominate ●  With perplexity in the range 5-50, as suggested in the paper, the plots still capture some structure in the data 141
  • 141. Hyperparameters really matter: playing with #iterations ●  Perplexity set to 30.0 ●  Applied t-SNE with different numbers of iterations ●  Takeaway : different datasets may require different numbers of iterations 142
  • 142. Cluster sizes can be misleading ●  t-SNE plot of two clusters with different standard deviations ●  Bottom line : we cannot read cluster sizes off t-SNE plots 143
  • 143. Distances in t-SNE plots ●  At lower perplexity the clusters look equidistant ●  At perplexity = 50, t-SNE captures some notion of the global geometry in the data ●  50 data points in each sub-cluster 144
  • 144. Distances in t-SNE plots ●  t-SNE is not able to capture the global geometry even at perplexity = 50 ●  Key takeaway : well-separated clusters may not mean anything in t-SNE ●  200 data points in each sub-cluster 145
  • 145. Random noise doesn’t always look random ●  For this experiment, we generated random points from a Gaussian distribution ●  The plots with lower perplexity show misleading structure in the data 146
  • 146. You can see some shapes sometimes ●  Axis-aligned Gaussian distribution ●  For certain values of perplexity, elongated clusters look almost correct ●  t-SNE tends to expand regions that are denser 147
  • 149. Why does word2vec do better than the others ? 150
  • 150. At heart they are all the same !! ● It has been shown that, in essence, GloVe and word2vec are no different from traditional methods like PCA, LSA etc. (Levy et al. 2015 call them DSMs) ● GloVe ⋍ PCA/LSA is straightforward (both factorize a global counts matrix) ● word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015) ● They show that, in essence, word2vec also factorizes a word-context matrix (PMI) 151
  • 151. ● Despite this “equality” of algorithms, word2vec is still known to do better on several tasks. ● Why ? ○ Levy et al. 2015 show : the magic lies in the hyperparameters 152
  • 152. Hyperparameters ● Pre-processing ○  Dynamic context window ○  Subsampling frequent words ○  Deleting rare words ● Post-processing ○  Adding context vectors ○  Vector normalization 153
  • 153. Pre-processing ● Dynamic context window ○  In DSMs, the context window is unweighted & of constant size. ○  GloVe & SGNS give more weight to closer terms ○  SGNS - even the window size can be dynamic, taking a value between 1 & the maximum window size. ● Subsampling frequent words ○  SGNS dilutes frequent words by randomly removing words whose frequency f is higher than some threshold t, with probability p = 1 − √(t / f) (see the sketch below) ● Deleting rare words ○  In SGNS, rare words are also deleted before creating the context windows. 154
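A minimal NumPy sketch of the subsampling step, assuming the discard probability 1 − √(t/f(w)) from Mikolov et al.; the toy corpus and the threshold value are illustrative.

    import numpy as np

    def subsample(tokens, t=1e-5, seed=0):
        """Randomly drop frequent tokens: discard w with probability 1 - sqrt(t / f(w))."""
        rng = np.random.default_rng(seed)
        words, counts = np.unique(tokens, return_counts=True)
        freq = dict(zip(words, counts / counts.sum()))
        keep = lambda w: rng.random() < np.sqrt(t / freq[w])   # keep probability sqrt(t/f)
        return [w for w in tokens if freq[w] <= t or keep(w)]

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    print(subsample(corpus, t=0.1))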
  • 154. Post-processing ● Adding context vectors ○  GloVe adds the word vectors and the context vectors to form the final representation. ● Vector normalization ○  All vectors can be normalized to unit length 155
  • 155. Key Take Home ● Hyperparameters vs algorithms ○  Hyperparameter settings are more important than the choice of algorithm ○  No single algorithm consistently outperforms the others ● Hyperparameters vs more data ○  Training on a larger corpus helps on some tasks ○  In many cases, tuning hyperparameters is more beneficial 156
  • 156. References Idea of word vectors is not new. •  Learning representations by back-propagating errors (Rumelhart et al. 1986) •  A neural probabilistic language model (Bengio et al., 2003) •  NLP from Scratch (Collobert & Weston, 2008) •  Word2Vec (Mikolov et al. 2013) • Sebastian Ruder’s 3 part Blog series • Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher • word2vec Parameter Learning Explained by X Rong 157
  • 157. •  GloVe : • https://nlp.stanford.edu/pubs/glove.pdf •  https://www.youtube.com/watch?v=tRsSi_sqXjI •  http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/ •  https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf •  t-SNE: • http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf •  http://distill.pub/2016/misread-tsne/ •  https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne •  https://youtu.be/RJVL80Gg3lA •  KL Divergence •  http://tdhopper.com/blog/2015/Sep/04/cross-entropy-and-kl-divergence/ 158
  • 158. •  Cross Entropy : •  https://www.youtube.com/watch?v=tRsSi_sqXjI •  http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/ •  Softmax: •  https://en.wikipedia.org/wiki/Softmax_function •  http://cs231n.github.io/linear-classify/#softmax •  Tensor Flow •  1.0 API docs •  CS20SI 159
  • 159. • Module 1 • Introduction • Archaic Techniques • Using pretrained embeddings • Module 2 • Introduction to embeddings • Using pretrained embeddings • Word level representation • Visualizing word embedding • Module 3 • Sentence/Paragraph/Document level representation • Skip-Thought Vectors • Module 4 • Character level representation 160
  • 161. • Document level language models Key Learning outcomes: •  Combining word vectors •  Key ideas behind document vectors •  DM, DBOW •  How are they similar/different from word vectors •  Drawbacks of these approaches •  Skip-Thought vectors •  RNNs: LSTM, GRU •  Architecture of skip-thought vectors 162 Module 3
  • 163. Story generation from images 164
  • 164. Sentence Representation Task : Train an ML model for sentiment classification. Problem : Given a sentence, predict its sentiment. Solution: 1)  Represent the sentence in a mathematical format 2)  Train a model on data - (sentence, label) pairs How do you represent the sentence ? We want a representation that captures the semantics of the sentence. 165
  • 165. We already have word vectors. Can we use these to come up with a way to represent the sentence ? Eg :- “the cat sat on the table” We have vectors for “the”, “cat”, “sat”, “on”, “the” & “table”. How can we use the vectors for words to get vector for sentence ? 166
  • 166. Possible Solutions Sentence (S) - “The cat sat on the table” Concatenation : Our sentence is one word followed by another. So, its representation can be - word vectors for every word in sentence in same order. Sv = [wvThe wvcat wvsat wvon wvthe wvtable] Each word is represented by a d-dimensional vector, so a sentence with k words has k X d dimensions. Problem : Different sentences in corpus will have different lengths. Most ML models work with fixed length input. 167
  • 167. Mean of word vectors: a simple or weighted average of the word vectors Sv = (1/k) Σi=1..k wvi for a sentence of k words (each word vector can also be given its own weight) 168
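A minimal sketch of averaging word vectors into a sentence vector. Here word_vectors is assumed to be a dict-like lookup (e.g. a gensim KeyedVectors object); the 50-dimensional toy vectors are illustrative.

    import numpy as np

    def sentence_vector(sentence, word_vectors, dim=50, weights=None):
        """Average the vectors of the words in the sentence (equal weights by default)."""
        words = sentence.lower().split()
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        if not vecs:
            return np.zeros(dim)
        w = np.ones(len(vecs)) if weights is None else np.asarray(weights[:len(vecs)])
        return np.average(np.stack(vecs), axis=0, weights=w)

    # toy usage with random vectors
    rng = np.random.default_rng(0)
    wv = {w: rng.normal(size=50) for w in "the cat sat on table".split()}
    print(sentence_vector("The cat sat on the table", wv).shape)   # (50,)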
  • 168. Fallacies ●  Different sentences with same words but different ordering will give same vector. ○  “are you good” vs “you are good” ●  Negation - opposite meaning but very similar words ○  “I do want a car” vs “I don’t want a car” If word vectors for “do” and “don’t” are close by, then in this case their sentence vectors will also be close by. If these 2 sentences are in opposite Classes, we are in trouble. ●  Sentence vector generated via simple operations on word vectors - often do not capture syntactic and semantics properties. 169
  • 169. Motivation ●  Build vector representation at sentence/paragraph/document level such that it has the following properties : ○  Syntactic properties: ■  Ordering of words ○  Semantic properties: ■  Sentences that have the same meaning should come together. ■  Capturing negation. ○  Provide fixed length representation for variable length text. 170
  • 170. Solution ●  Doc2Vec* ○  Distributed Memory (DM) ○  Distributed Bag Of Words (DBOW) ●  We will study these 2 methods to learn a representation for text at paragraph level. However, this is applicable directly at sentence and document level too. * Le, Quoc; et al. "Distributed Representations of Sentences and Documents" 171
  • 171. Distributed Memory (DM) ●  We saw that word2vec uses context words to predict the target word. ●  In the distributed memory model, we simply extend this idea - we use the paragraph vector along with the context word vectors to predict the next word. ●  S = “The cat sat on the table” ●  (Sv , wvThe, wvcat, wvsat) → wvon 172
  • 172. Architecture (figure): D, the para2vec matrix of size |N| x ddv, and W, the word2vec matrix of size |V| x dW. * Le, Quoc; et al. "Distributed Representations of Sentences and Documents" 173
  • 173. Details ●  Each document is represented by a ddv dimensional vector. ●  Each word is represented by dw dimensional vector. ●  Index the vectors for document d and word w1, w2 & w3 (i.e. The, cat & sat) ●  These vectors are then combined (concatenate/average) for predicting next word (w4) in document. 174
  • 174. Details ●  Objective of the word-vector model : maximize the average log-probability (1/T) Σt log p(wt | wt−k, …, wt+k) ●  The prediction is obtained through multi-class classification (softmax) ●  Each yi is the un-normalized log-probability of output word i : y = b + U ⋅ h(wt−k, …, wt+k ; W, D) ●  where U, b are the softmax parameters and h is constructed by concatenating or averaging the word and paragraph vectors extracted from W and D. ●  A cross-entropy loss is used to learn the representation of each word and each document vector. 175
  • 175. Generating representation at test time Sentence : “I got back home.” ●  At test time W, U and b are frozen; the vector for the new sentence is obtained by gradient descent on its paragraph vector alone (an explicit inference step). * Le, Quoc; et al. "Distributed Representations of Sentences and Documents" 176
  • 176. Distributed Bag of Words (DBOW) ●  We saw that word2vec uses the target word to predict the context words. ●  In the DBOW model, we simply extend this idea - we use the paragraph vector to predict the words in the paragraph. ●  S = “The cat sat on the table” (Sv) → (wvThe, wvcat, wvsat, wvon, …) 177
  • 177. Architecture ●  Words and the ordering of the words uniquely define a paragraph. ●  Reversing this : a paragraph uniquely defines the words and their ordering present in the paragraph. ●  Thus, given a paragraph representation, we should be able to predict the words in the paragraph ●  This is precisely what DBOW does. 178
  • 178. DBOW ●  Each document is represented by a ddv dimensional vector. ●  Softmax layer outputs a |V| dimensional vector (this is nothing but probability distribution over words). ●  Essentially, we are trying to learn a document representation ddv which can predict the words in any window on the document. 179
  • 179. Details ●  Random windows are sampled from each document. ●  The document vector is used to predict the words in this window. ●  A cross-entropy loss is used to learn the representation of the words and each document vector. 180
  • 180. Generating representation at test time Sentence : “I got back home.” 181
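For reference, a minimal sketch using gensim's Doc2Vec (gensim 4.x API; the toy corpus and hyperparameters are illustrative, not the workshop notebook). dm=1 gives the DM model, dm=0 gives DBOW; infer_vector performs the test-time inference step described above.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = ["the cat sat on the table",
              "i got back home",
              "i could see the cat on the steps"]
    train = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

    # dm=1 -> Distributed Memory, dm=0 -> DBOW
    model = Doc2Vec(train, vector_size=50, window=3, min_count=1, dm=1, epochs=40)

    # test time: a new paragraph vector is inferred by gradient descent
    vec = model.infer_vector("i got back home".split())
    print(vec.shape)                      # (50,)
    print(model.dv.most_similar([vec]))   # nearest training documents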
  • 181. Evaluation •  Paragraph vector + 9 words used to predict the 10th word •  Input: concatenation of 400-dim DBOW and DM vectors •  Test-set paragraph vectors are inferred with the train-set word vectors frozen •  Dataset: Stanford IMDB movie review data set * Le, Quoc; et al. "Distributed Representations of Sentences and Documents" 182
  • 183. Sentence Similarity Input sentence - “Distributed Representations of Sentences and Documents” 184
  • 184. LDA vs para2vec Terms similar to “machine learning” 185
  • 185. Drawbacks ●  Inference needs to be performed at test time to generate the vector representation of a sentence in the test corpus. ●  This scales poorly for applications that process large amounts of text. 186
  • 187. Hacker’s way for quick implementation : gensim notebook, TensorFlow implementation 188
  • 189. Motivation ●  Although various techniques exist for generating sentence and paragraph vectors, there is a lack of a generalized framework for sentence encoding. ●  Idea : encode a sentence based on its neighbours (encode a sentence and try to generate the two neighbouring sentences in the decoding layer). ●  Doc2vec requires explicit inference at test time to generate the vector representation of a sentence. 190
  • 190. Introduction to skip-thoughts ●  word2vec skip gram model applied at sentence level. ●  Instead of using a word to predict its surrounding words, use a sentence to predict their surrounding sentences. ●  Corpus : I got back home. I could see the cat on the steps. This was strange. si-1 : I got back home. si : I could see the cat on the steps. si+1 : This was strange. 191
  • 191. Introduction to skip-thoughts ● We need an ML model that can (sequentially) consume variable-length sentences ● And, after consuming a sentence, use the knowledge gained from the whole sentence to predict the neighbouring sentences ● FFNs and CNNs can neither consume sequential text nor persist information 192
  • 192. RNN ●  Motivation: How do humans understand language ○  “How are you ? Lets go for a coffee ? ...” ●  As we read from left to right, we don’t understand each word in isolation, completely throwing away previous words. We understand each word in conjunction with our understanding from previous words. ●  Traditional neural networks (FFNs, CNNs) can not reason based on understanding from previous words - no information persistence. 193
  • 193. RNN ●  RNN are designed to do exactly this - they have loops in them, allowing information to persist. ●  In the above diagram, A, looks at input xt and produces hidden state ht. A loop allows information to be passed from one step of the network to the next. Thus, using x0 to xt-1 while consuming xt. Image borrowed from Christopher Olah’s blog 194
  • 194. ●  To better understand the loop in RNN, let us unroll it. Time ●  The chain depicts information(state) being passed from one step to another. ●  Popular RNNs = LSTM, GRU Image borrowed from Christopher Olah’s blog 195
  • 195. 196
  • 196. 197
  • 197. In CNN we have parameters shared across space. In RNN parameters are shared across time 198
  • 198. Architecture of RNN ●  All RNNs have a chain of repeating modules of neural network. ●  In basic RNNs, this repeating module will have a very simple structure, such as a single tanh layer. Image borrowed from Christopher Olah’s 199
  • 199. Image borrowed from suriyadeepan’s blog. The state consists of a single “hidden” vector h 200
  • 200. The Dark Side ●  RNNs have difficulty dealing with long-range dependencies. ●  “Nitin says Ram is an awesome person to talk to, you should definitely meet him”. ●  In theory they can “summarize all the information until time t with hidden state ht” ●  In practice, this is far from true. 201
  • 201. ●  This is primarily due to deficiencies in the training algorithm - BPTT (Back Propagation Through Time) ●  Gradients are computed via chain rule. So either the gradients become: ○  Too small (Vanishing gradients) ■  Multiplying n of these small gradients (<1) results in even smaller gradient. ○  Too big (Exploding gradients) ■  Multiplying n of these large gradients (>1) results in even larger gradient. 202
  • 202. LSTM ●  LSTMs are specifically designed to handle long term dependencies. ●  The way they do it is using cell memory: The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called “gates”. ●  Gates control what information is to be added or deleted. 203
  • 203. ●  “forget gate” decides what information to throw from cell state. ●  It looks at ht−1 and xt, and outputs a number between 0 and 1 for each number in the cell state Ct−1. A 1 represents “completely keep this” while a 0 represents “completely get rid of this.” Image borrowed from Christopher Olah’s 204
  • 204. ●  “input gate” decides which values in cell state to update. ●  tanh layer creates candidate values which may be added to the state Image borrowed from Christopher Olah’s 205
  • 205. ●  “forget gate” & “input gate” come together to update cell state. Image borrowed from Christopher Olah’s 206
  • 206. ●  “output gate” decides the output. Image borrowed from Christopher Olah’s 207
  • 207. ●  There are many variants. ●  Each variant has some gates that control what is stored/deleted. ●  At the heart of any LSTM implementation are these equations: ft = σ(Wf ⋅ [ht−1, xt] + bf), it = σ(Wi ⋅ [ht−1, xt] + bi), C̃t = tanh(WC ⋅ [ht−1, xt] + bC), Ct = ft ⊙ Ct−1 + it ⊙ C̃t, ot = σ(Wo ⋅ [ht−1, xt] + bo), ht = ot ⊙ tanh(Ct) ●  By making the memory-cell update additive, LSTMs circumvent the problem of vanishing gradients. ●  For exploding gradients - use gradient clipping. 208
  • 208. GRU ●  GRU units are simplification of LSTM units. ●  Gated recurrent units have 2 gates. ●  GRU does not have internal memory ●  GRU does not use a second nonlinearity for computing the output 209
  • 209. Details ●  Reset Gate ○  Combine new input with previous memory. ●  Update Gate ○  How long the previous memory should stay. 210
  • 210. LSTM & GRU Benefits ●  Remember for longer temporal durations ●  RNN has issues for remembering longer durations ●  Able to have feedback flow at different strengths depending on inputs 211
  • 211. Visual difference between LSTM & GRU 212
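A minimal Keras sketch showing that LSTM and GRU layers are drop-in replacements for each other in a toy sequence classifier (all sizes and the random data are illustrative):

    import numpy as np
    from tensorflow.keras import layers, models

    vocab_size, seq_len, embed_dim = 5000, 40, 64

    def build_model(recurrent_layer):
        # sequence classifier: token ids -> embedding -> recurrent layer -> sigmoid
        return models.Sequential([
            layers.Input(shape=(seq_len,), dtype="int32"),
            layers.Embedding(vocab_size, embed_dim),
            recurrent_layer,                          # an LSTM or a GRU: same interface
            layers.Dense(1, activation="sigmoid"),
        ])

    lstm_model = build_model(layers.LSTM(64))
    gru_model = build_model(layers.GRU(64))           # GRU: 2 gates, no separate cell state
    lstm_model.compile(optimizer="adam", loss="binary_crossentropy")

    X = np.random.randint(0, vocab_size, size=(8, seq_len))
    y = np.random.randint(0, 2, size=(8, 1))
    lstm_model.fit(X, y, epochs=1, verbose=0)
    print(lstm_model.predict(X, verbose=0).shape)     # (8, 1)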
  • 212. Encoding ●  Let x1, x2, … xN be the words in sentence si, where N is the number of words. ●  The encoder produces an output representation at each time step t, which is the representation of the sequence x1, x2, ... xt. ●  The hidden state hiN at the last step is the output representation of the entire sentence. 213
  • 213. Encoding Corpus : I got back home. I could see the cat on the steps. This was strange. 214
  • 214. Decoding ●  Decoder conditions on the encoder output hi. ●  One decoder is used for next sentence, while another decoder is used for the previous sentence. ●  Decoders share the vocabulary V, but learn the other parameters separately. 215
  • 216. Details ●  Given the decoder state hti+1, the probability of word wti+1 given the previous t − 1 words and the encoder vector is P(wti+1 | w<ti+1, hi) ∝ exp(v wti+1 ⋅ hti+1) ●  where v wti+1 denotes the row of V corresponding to the word wti+1 ●  A similar computation is performed for the previous sentence si−1 217
  • 217. Objective Function ●  Given a tuple (si−1, si, si+1), the objective is the sum of the log-probabilities for the forward (si+1) and backward (si−1) sentences conditioned on the encoder representation: Σt log P(wti+1 | w<ti+1, hi) + Σt log P(wti−1 | w<ti−1, hi) ●  The total objective is the above summed over all such training tuples. 218
  • 218. Nearest Neighbour through skip-thoughts 219
  • 220. References ●  Doc2vec ○  Distributed Representations of Sentences and Documents ○  Medium article ○  Doc2vec tutorial ○  Document Embedding with Paragraph Vectors ○  https://deeplearning4j.org/doc2vec ○  https://groups.google.com/forum/#!topic/gensim/0GVxA055yOU ○  https://amsterdam.luminis.eu/2016/11/15/machine-learning-example/ ○  https://github.com/wangz10/tensorflow-playground/blob/master/doc2vec.py ○  https://blog.acolyer.org/2016/06/01/distributed-representations-of-sentences-and-documents/ ○  https://deeplearning4j.org/doc2vec 221
  • 221. ●  Skip-thoughts o  Skip-Thought Vectors o  https://github.com/ryankiros/skip-thoughts o  https://www.intelnervana.com/building-skip-thought-vectors-document-understanding/ o  https://gab41.lab41.org/lab41-reading-group-skip-thought-vectors-fec68c05aa92 222
  • 224. Topics: • Drawbacks of doc2vec • Character level language modeling Key Learning outcomes: •  Character based language models •  RNNs - LSTM, GRU •  Magic : RNN + char2vec •  Extending skipgram, CBOW to characters •  Tweet2vec •  Basics of CNN •  charCNN 225
  • 226.
  • 227. Drawbacks ●  Until now we built language models at the word/sentence/paragraph/document level. ●  There are a couple of major problems with them: ○  Out Of Vocabulary (OOV) - how to handle missing words ? ○  Low frequency counts - Zipf’s Law tells us that in any natural language corpus a majority of the vocabulary word types will either be absent or occur in low frequency. ○  Blind to subword information - “event”, “eventfully”, “uneventful”, “uneventfully” should have structurally related embeddings. 228
  • 228. ○  Each word vector is independent - so you may have vectors for “run”, “ran”, “running” but there is no (clean) way to use them to obtain vector for “runs”. Poor estimate of unseen words. ○  Storage space - have to store large number word vectors. English wikipedia contains 60 million sentences with 6 billion tokens of which ~ 20 million are unique words. This is typically countered by capping the vocabulary size. ○  Generative models: Imagine you feed k words/sentences to the model, and ask it to predict (k+1)st word/sentence. ■  How well is such a model likely to do ? ■  Badly ■  Why ? ■  Large output space. 229
  • 229. Way forward ●  Construct vector representations from smaller pieces: ○  Morphemes: ■  Meaningful morphological unit of a language that cannot be further divided (e.g. for ‘incoming’ the morphemes are : in, come, ing) ■  Ideal primitive. By definition they are the minimal meaning-bearing units of a language. ■  Given a word, breaking it into morphemes is non-trivial. ■  Requires a morphological tagger as a preprocessing step (Botha and Blunsom 2014; Luong, Socher, and Manning 2013) ○  Characters: ■  Fundamental unit ■  Easy to identify ■  How characters compose to give meaning is not very clear. “Less”, “Lesser”, “Lessen”, “lesson” ■  Most languages have a relatively small character set. 230
  • 230. ●  For the rest of this presentation, we will treat text as a sequence of characters - feeding 1 character at a time to our model. ●  For this we need models that are capable of taking in and processing sequences - FFNs and CNNs do not fit naturally. ●  RNN - Recurrent Neural Networks ○  LSTM ○  GRU 231
  • 231. Simplest char2vec ●  Imagine we are working with the English language. ●  Roughly ~70 unique characters. ●  Easiest character embedding - 1-hot vectors in a 70-dimensional space. ●  Every 2 characters are equally distant (nearby). Is there any use of such an embedding ? YES 232
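A minimal sketch of such a 1-hot character encoding (the toy text is illustrative; a real English corpus would have roughly 70 unique characters):

    import numpy as np

    text = "hello world"
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}

    def one_hot(ch):
        v = np.zeros(len(chars))
        v[char_to_id[ch]] = 1.0
        return v

    print(one_hot("h"))   # every pair of characters is equally distant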
  • 232. Unreasonable effectiveness of RNNs* ●  Blog by Andrej Karpathy in 2015 ●  Demonstrated the power of character-level language models. ●  Central problem: given k contiguous characters (from a text corpus), predict the (k+1)st character. ●  Very, very interesting results * karpathy.github.io/2015/05/21/rnn-effectiveness/ 233
  • 233. ●  Shakespeare’s work ●  Linux Source Code 234
  • 234. ●  Algebraic Geometry ●  NSF Research Awards abstracts 235
  • 235. char2vec : Toy Example Example training sequence: “hello” Vocabulary: [h,e,l,o] 236
  • 236. Let’s implement it ! ●  Take input text (say Shakespeare’s novels), and using a sliding window of length (k+1) slice the raw text in contiguous chunks of (k+1) characters ●  Split each chunk into (X,y) pairs where first k characters become X and (k +1)th character is the y. This becomes our training data. 237
  • 237. ●  Map each character to a unique id ●  Say we have d unique characters in our corpus ●  Each character is a vector of d dimensions in 1-hot format ●  A sequence of k characters is : a 2d tensor of k x d ●  The dataset X is : a 3d tensor of m sequences, each k x d ●  Y is a 2d tensor : m x d. Why ? 238
  • 238. Almost there …. ●  We will use Keras ●  A super simple library on top of TF/Theano ●  Meant for both beginners and advanced users. ●  Exceptionally useful for quick prototyping. ●  Super popular on Kaggle 239
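A minimal Keras sketch of the character-level language model described above: slide a window of k characters, 1-hot encode them, and predict the (k+1)st character. The corpus path, k, and the layer sizes are illustrative assumptions, not the workshop notebook code.

    import numpy as np
    from tensorflow.keras import layers, models

    # any plain-text corpus; the path is illustrative. Keep it small: the 1-hot tensors are dense.
    text = open("shakespeare.txt").read().lower()[:20000]
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}
    k, d = 40, len(chars)

    # sliding window of k+1 chars: first k chars -> X, the (k+1)st -> y
    m = len(text) - k
    X = np.zeros((m, k, d), dtype=np.float32)
    y = np.zeros((m, d), dtype=np.float32)
    for i in range(m):
        for t, ch in enumerate(text[i:i + k]):
            X[i, t, char_to_id[ch]] = 1.0
        y[i, char_to_id[text[i + k]]] = 1.0

    model = models.Sequential([
        layers.Input(shape=(k, d)),
        layers.LSTM(128),
        layers.Dense(d, activation="softmax"),   # distribution over the next character
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(X, y, batch_size=128, epochs=1)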
  • 240. Some more awesome applications of char2vec Writing with machine DeepDrumpf 241
  • 241. Similar idea applied via CNN ●  Similarly, Zhang et al. have applied a CNN, instead of an RNN, directly to 1-hot character vectors. “Text Understanding from Scratch” Xiang Zhang, Yann LeCun “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Junbo Zhao, Yann LeCun 242
  • 242. Dense char2vec ●  1-hot encoding of characters is fairly straightforward and very useful. ●  But people have shown that learning a dense character-level representation can work even better (improved results, or similar results with fewer parameters). ●  It also results in fewer parameters in the input layer and its subsequent layer, though not by much (# of edges between the embedding layer and the next layer). ●  Simplest way to learn dense character vectors ? 243
  • 243. CBOW & SkipGram ●  Original CBOW and Skip-Gram were based on words. ●  Use the same architecture, but character level i.e. ○  CBOW = given characters in context, predict the target character ○  Skip Gram = given target character, predict characters in context 244
  • 244. We have given the notebook for character level skip-gram. Notebook for character level CBOW : take home assignment ! 245
  • 245. How good is the embedding ? ●  Word vectors and document vectors are evaluated using both intrinsic and extrinsic evaluation (intrinsic: Man : King :: Woman : Queen; extrinsic: sentiment analysis). ●  Character vectors have only extrinsic evaluation. ●  It makes no sense to say something like r : s :: a : b ●  Even from a human perspective, a character has no meaning on its own. ●  Building character embeddings is relatively cheap, hence most task-specific architectures have this component built into them. 246
  • 246. Tweet2Vec* ●  Twitter - informal language, slang, spelling errors, abbreviations, new and ever-evolving vocabulary, and special characters. ●  For most Twitter corpora : the size of the vocabulary is ~30-50% of the number of documents. ●  Cannot use word-level approaches - very large vocabulary size. ●  Not only does this make them practically infeasible, it also affects the quality of the word vectors. Why ? * Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al. 247
  • 247. Task ●  Given a tweet, predict its hashtag. ●  “Shoutout to @xfoml Project in rob wittig talk #ELO17” ●  Super easy to collect a dataset. 248
  • 248. Designing the N/W ●  raw characters → character embedding → bi-directional GRU ●  Why a bi-directional GRU (BGRU) ? ○  Language is not just a forward sequence. ○  “He went to ___?___” ○  “He went to ___?___ to buy grocery” ○  Both past words and future words determine the missing word. ○  A BGRU exploits this - it has 2 independent GRU networks. One consumes the text in the forward direction, the other in the backward direction. 249
  • 249. Architecture Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al. 250
  • 250. Loss function ●  Final tweet embedding is used to produce score for every hashtag. ●  Scores are converted to probability using softmax ●  This gives a distribution over hashtags. ●  This is compared against true distribution. ●  Cross entropy is used to measure the gap between 2 distributions. ●  This is loss function(J) 251
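A minimal Keras sketch in the spirit of this architecture: character ids, a dense character embedding, a bidirectional GRU, and a softmax over hashtags. All sizes are illustrative, and this uses Keras' default Bidirectional merging rather than the authors' exact combination of forward and backward states.

    import numpy as np
    from tensorflow.keras import layers, models

    max_chars, n_chars, n_hashtags = 150, 70, 500   # illustrative sizes

    model = models.Sequential([
        layers.Input(shape=(max_chars,), dtype="int32"),
        layers.Embedding(n_chars, 32),                      # dense character embeddings
        layers.Bidirectional(layers.GRU(64)),               # forward + backward GRU
        layers.Dense(n_hashtags, activation="softmax"),     # distribution over hashtags
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # toy batch: character ids in, hashtag ids out
    X = np.random.randint(0, n_chars, size=(16, max_chars))
    y = np.random.randint(0, n_hashtags, size=(16,))
    model.fit(X, y, epochs=1, verbose=0)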
  • 251. Results Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al. 252
  • 252. Results Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al. Rank of predicted hashtag 253
  • 254. Char CNN ●  Convolutional Neural Nets (CNNs)* have been super successful in the area of vision (* e.g. LeNet-5 by Yann LeCun). ●  A CNN treats an image as a signal in the spatial domain: pixels spread in space, the position of each pixel is fixed, and changing it changes the image. ●  Can it be applied to text ? Yes ○  Text = a stream of characters ○  Characters spread in time - the position (1d) of each character is fixed, and changing it changes the sentence. So text is a signal in the time domain. ○  The embedding matrix is the input matrix. 255
  • 255. Basics of CNN ●  Input : image ●  An image is nothing but a signal in space. ●  Represented by a matrix of (R, G, B) values ●  Each value ~ intensity of the red, green and blue channels respectively. 256
  • 256. CNN architecture ●  2 key operations: ○  Convolution ○  Pooling 257
  • 257. Convolution ●  In the simplest terms : given 2 signals x() and h(), convolution combines the 2 signals ●  In discrete space: (x ∗ h)[n] = Σk x[k] h[n − k] ●  In our case the image is x() ●  h() is called the filter/kernel/feature detector - a well-known concept in the world of image processing. 258
  • 258. ●  Ex: Filters for edge detection, blurring, sharpen, etc ●  It is usually a small matrix - 3x3, 5x5, 5x7 etc ●  There are well known predefined filters https://en.wikipedia.org/wiki/Kernel_(image_processing) 259
  • 259. ●  A convolved feature is nothing but taking a part of the image and applying the filter over it - taking pairwise products and adding them. ●  Example : the filter [[1,0,1],[0,1,0],[1,0,1]] applied to the top-left 3x3 patch [[1,1,1],[0,1,1],[0,0,1]] of the 5x5 image [[1,1,1,0,0],[0,1,1,1,0],[0,0,1,1,1],[0,0,1,1,0],[0,1,1,0,0]] gives (1*1 + 1*0 + 1*1) + (0*0 + 1*1 + 1*0) + (0*1 + 0*0 + 1*1) = 4 260
  • 260. ●  The convolved feature map is nothing but sliding the filter [[1,0,1],[0,1,0],[1,0,1]] over the entire image and applying convolution at each step, as shown in the diagram below. https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in- Filter 261
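A minimal NumPy sketch of this sliding-window convolution (as in deep learning libraries, this is cross-correlation, i.e. the kernel is not flipped); it reproduces the worked 5x5 example above:

    import numpy as np

    def convolve2d(image, kernel):
        """Valid 2-D convolution (no kernel flipping, i.e. cross-correlation as used in CNNs)."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.array([[1, 1, 1, 0, 0],
                      [0, 1, 1, 1, 0],
                      [0, 0, 1, 1, 1],
                      [0, 0, 1, 1, 0],
                      [0, 1, 1, 0, 0]])
    kernel = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 0, 1]])
    print(convolve2d(image, kernel))   # top-left entry is 4, as in the worked example above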
  • 261. ●  Image processing over the past many decades has built many filters for specific tasks. ●  In DL (CNNs), rather than using predefined filters, we learn the filters. ●  We start with small random values and update them using gradients (the filter entries are unknowns to be learnt). ●  Stride: by how much we shift the filter. 262
  • 262. Pooling ●  A simple technique for down-sampling. ●  In CNNs, downsampling or "pooling" layers are often placed after convolutional layers. ●  They are used mainly to reduce the feature-map dimensionality for computational efficiency. This in turn can improve actual performance. ●  Takes disjoint chunks of the image (typically 2×2) and aggregates them into a single value. ●  Average, max, min, etc. The most popular is max-pooling. https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html 263
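A minimal NumPy sketch of non-overlapping 2×2 max-pooling (the toy feature map is illustrative):

    import numpy as np

    def max_pool(feature_map, size=2):
        """Max-pooling over disjoint size x size chunks (stride equal to the pool size)."""
        H, W = feature_map.shape
        H, W = H - H % size, W - W % size                   # drop any ragged border
        blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
        return blocks.max(axis=(1, 3))

    fm = np.array([[1, 3, 2, 4],
                   [5, 6, 1, 2],
                   [7, 2, 9, 1],
                   [3, 4, 2, 8]])
    print(max_pool(fm))    # [[6, 4], [7, 9]]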
  • 263. Putting it all together https://adeshpande3.github.io 264
  • 264. Deep Learning + Language Modeling •  Traditionally uses architecture such as Recurrent Neural Networks (RNN). •  Sequential processing : one unit after other. •  Over time advancements happened and concepts like : 2 way ordering (Bidirectional), memory(LSTM), attention etc got added. •  Some people explored the possibility of using CNN for Language modeling: •  Pixels spread in space. So they are nothing but signal in space. •  Words/tokens/characters spread in time. So they are nothing but signal in time. 265
  • 265. CNNs for Language Modeling 266
  • 266. •  Input for any NLP task are sentences/paras/docs in the form of matrix •  Each row of this matrix represents a unit/token of text – character, morpheme, word etc (typically row = 1-hot or embedding representation of that unit) •  Unlike images, where filter slides over local patches of an image; in NLP we typically use filters that slide over full rows of the matrix i.e. the “width” of our filters is usually the same as the width of the input matrix. [1D or temporal convolutions] •  The height, or region size varies. Typically, window slides over 2-5 words at a time. 267
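A minimal Keras sketch of such a temporal (1D) convolution over a token-embedding matrix, followed by max-over-time pooling, for a toy binary classification task (all sizes and the random data are illustrative):

    import numpy as np
    from tensorflow.keras import layers, models

    vocab_size, seq_len, embed_dim = 10000, 100, 64

    model = models.Sequential([
        layers.Input(shape=(seq_len,), dtype="int32"),
        layers.Embedding(vocab_size, embed_dim),                # each row = one token
        layers.Conv1D(128, kernel_size=3, activation="relu"),   # filter spans the full embedding width, 3 tokens tall
        layers.GlobalMaxPooling1D(),                            # max over time
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    X = np.random.randint(0, vocab_size, size=(8, seq_len))
    y = np.random.randint(0, 2, size=(8, 1))
    model.fit(X, y, epochs=1, verbose=0)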
  • 267. 268
  • 268. •  A lot of the success of CNNs is attributed to : •  Location invariance : where an object appears in an image doesn’t matter so much •  Local compositionality : a bunch of local features combine/compose to give more complex objects. 269
  • 269. •  In CNN+NLP, both aforementioned properties go for a toss •  Where a word appears in a sentence can change the meaning drastically. ○  Man bites dog. Dog bites man. •  Parts of phrases could be separated by several other words. Words do compose in some ways, but how exactly this works, and what higher-level representations actually “mean” - these aren’t as obvious as in the computer-vision case. ○  “Tim said Robert has a lot of experience, he feels you should definitely meet him” •  With both key advantages gone, why are we even thinking of applying CNNs to text ? RNNs should be the way to go. 270
  • 270. •  “All models are wrong, but some are useful” •  This is not about CNNs vs RNNs (may be both are bad!) •  This is about •  Understanding key difficulties •  Are there some aspects of language modeling where CNNs can do a better job. •  Helps us to better understand strength & weakness of each model. •  Turns out that CNNs applied to certain NLP problems perform quite well. Esp classification tasks - Sentiment Analysis, Spam Detection or Topic Categorization. •  CNNs are usually fast, very fast. 271
  • 271. Major works in this sub-area •  Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP 2014 •  Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING-2014 •  Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. CIKM ’14. •  Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of- Speech Tagging. ICML-14. •  Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9. •  Wenpeng Yin, Hinrich Schutze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: attention-based convolutional neural network for modeling sentence pairs. 272
  • 272. •  Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schutze. 2016. Combining recurrent and convolutional neural networks for relation classification. In Proceedings of NAACL HLT. pages 534–539. •  Ying Wen, Weinan Zhang, Rui Luo, and Jun Wang.2016. Learning text representation using recurrent convolutional neural network with highway layers. SIGIR Workshop on Neural Information Retrieval •  Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083 •  Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schutze Comparative Study of CNN and RNN for Natural Language Processing •  Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language Models. (Uses a hybrid of CNN and RNN) 273
  • 274. Character-Aware Neural Language Models * •  Problem statement: Given t words w1, w2, ….., wt ; predict wt+1 •  Traditional models : words fed as inputs in the form of word embedding. •  Here input embedding is replaced by output of character level CNN. •  Uses sub word information. •  Traditionally sub word information is fed in terms of morphemes; ●  Unbreakable : Un ("not") – break (root word) – able (“can be done”) * “Character-Aware Neural Language Models” Y kim et. al 2015 275
  • 275. •  Identifying morphemes is non-trivial. Requires morphological tagging as a preprocessing step. •  Y. Kim et al. leverage subword information through a character-level CNN. •  Learn an embedding for each character. •  A word w is then nothing but the embeddings of its constituent characters. •  For each word, we apply convolution on its character embeddings to obtain features. •  These are then fed to an LSTM via highway layers. •  Does not use word embeddings at all. •  In most language models, a large % of the parameters are due to the word embeddings. Thus, we get a much smaller number of parameters to learn. 276
  • 276. Details ●  C - vocabulary of characters. ●  D - dimensionality of character embeddings. ●  R - matrix of character embeddings (|C| x D). ● Let word wk = [c1, ...., cl], i.e. made from l characters, where l is the length of wk ● The character-level representation of wk is given by the matrix Ck ∈ ℝ D X l, where the jth column corresponds to the character embedding of the jth character of word wk ●  Apply a filter/kernel H of width w to Ck to obtain a feature map fk of length l − w + 1. ●  The ith element of fk is given by: fk[i] = tanh(⟨Ck[∗, i : i + w − 1], H⟩ + b) ●  Ck[∗, i : i + w − 1] is the ith to (i + w − 1)th columns of Ck ●  ⟨A, B⟩ is the Frobenius inner product 277
  • 277. •  To capture the most important feature, we take the max over time: yk = maxi fk[i] ●  yk is the feature corresponding to filter H when applied to word wk (~ find the most important character n-gram). •  Likewise, they apply multiple filters H1, …., Hh. •  Then yk = [y1k, …, yhk] is the input representation of word wk. ●  At this point we can either: •  Construct an MLP over yk •  Feed yk to the LSTM 278
  • 278. ● Instead, to gain improvements, rather than feeding yk directly to the LSTM, they pass it via a highway network* (see the sketch below) ●  Highway network: ● Basic idea: carry some part of the input directly to the output, while the remaining input is processed and then taken forward. ● Very similar to residual networks. ● z = t ⊙ F(yk) + (1 − t) ⊙ yk, where F() is typically an affine transformation followed by tanh. ● In highway networks, we learn “what parts of the input are carried forward via the highway”. ● This is done via a gating mechanism: a transform gate t and a carry gate (1 − t). 279
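A minimal NumPy sketch of a single highway layer as described above, with a tanh transformation and a learned transform gate. The weights here are random placeholders; a negative gate bias, as is common practice, initially favours carrying the input through.

    import numpy as np

    def highway_layer(x, W_H, b_H, W_T, b_T):
        """Highway layer: t * tanh(W_H x + b_H) + (1 - t) * x, with transform gate t."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        t = sigmoid(W_T @ x + b_T)          # transform gate; (1 - t) is the carry gate
        h = np.tanh(W_H @ x + b_H)          # candidate transformation F(x)
        return t * h + (1.0 - t) * x        # carry part of the input straight through

    d = 8
    rng = np.random.default_rng(0)
    x = rng.normal(size=d)
    out = highway_layer(x,
                        rng.normal(size=(d, d)), np.zeros(d),
                        rng.normal(size=(d, d)), -2.0 * np.ones(d))  # negative bias: favour carrying
    print(out.shape)   # (8,) - same dimensionality as the input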
  • 281. Key take home •  CNNs + NLP surely hold a lot of promise. •  Pretty successful in the classification setting. •  Can prove to be a great tool to model the input side of NLP. •  What about non-classification settings ? •  Sequence labeling (NER) •  Sequence generation (MT) •  As of today, not so successful •  Though people have tried lots of ideas there too. •  de-convolutions in generative settings •  Some architectures use different embeddings as different channels. 282
  • 282. More Resources •  https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/ •  https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f •  wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ •  https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with- convolutional-neural-networks-on-microsoft-azure/ •  https://www.aclweb.org/anthology/P/P14/P14-1062.xhtml •  https://github.com/yoonkim/lstm-char-cnn •  https://github.com/yoonkim/CNN_sentence •  https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32 •  “Comparative Study of CNN and RNN for Natural Language Processing” Wenpeng Yin et. al 2017, arXiv: 1702.01923 [cs.CL] 283
  • 283. 284
  • 284. References ●  Tweet2vec: ○  “Character-based Neural Embeddings for Tweet Clustering” - Vakulenko et. al ○  “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network” - Sakaguchi et. al ●  Basics of CNN ○  https://adeshpande3.github.io 285
  • 285. ●  CNN on text: ○  https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f ○  https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word- embedding-and-convolutional-neural-networks-on-keras-163197aef623 ○  Seminal paper - “Convolutional Neural Networks for Sentence Classification” Y kim ○  “Text Understanding from Scratch” Xiang Zhang, Yann LeCun ○  “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Yann LeCun ○  “Character-Aware Neural Language Models” Y kim ●  Character Embeddings: ○  “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Yann LeCun ○  “Character-Aware Neural Language Models” Y kim ○  “Exploring the Limits of Language Modeling” Google brain team. ○  “Finding Function in Form: Compositional Character Models for Open Vocabulary Word 286
  • 286. Summary ● We learnt various ways to build representation at : ○  Word level ○  sentence/paragraph/document level ○  character level ● We discussed the key architectures used in representation learning and fundamental ideas behind them. ● Core idea being : context units and target units. ● We also saw strengths and weaknesses of each of these ideas. 287
  • 287. ● Start with pretrained embeddings. This serves as baseline. ● Use rigorous evaluation - both intrinsic and extrinsic. ● If you have lot of data, fine tuning pretrained embeddings can improve performance on extrinsic task. ● If your dataset is small - worth trying GloVe. Don’t try fine tuning. ● Embeddings and task are closely tied. An embedding that works beautifully for NER might fail miserably for sentiment analysis. ○  “It was a great movie” ○  “Such a boring movie” If you are training word vectors and in your corpus “great” and “boring” come in similar context, then their vectors will be closer in embedding space. Thus, they may be difficult to separate. 288
  • 288. ● Hyperparameters matter : many a time they are the key distinguisher. ● Character embeddings are usually task-specific. Thus, they often tend to do better. ● However, character embeddings can be expensive to train. ● The building blocks are the same: new architectures can be built using the same principles. ● State of the art (for practitioners) - FastText from Facebook (see the sketch below). ○  trains embeddings for character n-grams ○  Character n-grams(“beautiful”) : {“bea”, “eau”, “aut”, ………} 289
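A minimal gensim sketch of FastText (gensim 4.x API; the toy sentences and hyperparameters are illustrative). Because each word vector is built from its character n-grams, even out-of-vocabulary words get a representation.

    from gensim.models import FastText

    sentences = [["the", "cat", "sat", "on", "the", "table"],
                 ["it", "was", "a", "beautiful", "uneventful", "day"]]

    # each word vector is composed from its character n-grams (here 3- to 6-grams)
    model = FastText(sentences, vector_size=50, window=3, min_count=1,
                     min_n=3, max_n=6, epochs=20)

    print(model.wv["beautiful"].shape)      # (50,)
    print(model.wv["beautifully"].shape)    # OOV word still gets a vector from its n-grams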
  • 289. •  Please upvote the repo •  Run the notebooks. Play, experiment with them. Break them. •  If you come across any bug, please open an issue on our github repo. •  Want to contribute to this repo ? Great ! Please contact us •  https://github.com/anujgupta82/Representation-Learning-for-NLP Thank You 290