Representation Learning of
Text : Word Vectors
Anuj Gupta
Satyam Saxena
@anujgupta82, @Satyam8989
anujgupta82@gmail.com, satyamiitj89@gmail.com
Outline
• Session 1
•Introduction
•Bigram model
•Skip Gram model
•CBOW model
•Evaluation
•Speed Up
•Session 2
•GloVe
•t-SNE
•Secret Ingredients
2
Introduction
Example of NLP tasks :
Easy
• Spell Checking
• Keyword Search
• Finding Synonyms
Medium
• Parsing information from websites, documents, etc.
3
4
Hard
• Machine Translation (e.g. Translate Chinese text to English)
• Semantic Analysis (What is the meaning of query statement?)
• Co-reference (e.g. What does "he" or "it" refer to given a document?)
• Question Answering (e.g. Answering Jeopardy questions).
The first and arguably most important common denominator across
all NLP tasks is : how we represent text as input to our models.
• Machine does not understand text.
• We need numeric representation
• An integral part of any NLP pipeline.
• Unlike images (RGB matrix), for text there is no obvious way.
Legacy Techniques*
• Bag of words
• N-gram
• TF-IDF
* Details in appendix
Bottom Line
• More often than not, how rich your input representation is has a huge bearing
on the quality of your downstream ML models.
• For NLP, archaic techniques treat words as atomic symbols; thus every two
words are equally far apart.
• They don’t have any notion of either syntactic or semantic similarity
between parts of language.
• This is one of the chief reasons for the poor/mediocre performance of
NLP-based models.
But this has changed dramatically in the past few years
6
Distributional & Distributed Representations
7
Distributional representations
• Linguistic aspect.
• Based on co-occurrence/ context
• Distributional hypothesis: linguistic units with similar distributions
have similar meanings.
• The distributional property is usually induced from document or
context or textual vicinity (like sliding window).
8
Distributed representations
• Compact, dense and low dimensional representation.
• Differs from distributional representations as the constraint is to seek
efficient dense representation, not just to capture the co-occurrence
similarity.
• Each single component of vector representation does not have any
meaning of its own.
• The interpretable features (for example, word contexts in case of
word2vec) are hidden and distributed among uninterpretable vector
components.
9
• Embedding: a mapping from a space with one dimension per linguistic
unit (word, character, phrase, sentence, document) to a continuous vector
space of much lower dimension.
“You shall know a word by the company it keeps” - J R Firth
• One of the most successful ideas of modern statistical NLP
10
Global Matrix Factorization
11
Co-occurrence with SVD
• Define a word using the words in its context.
• Words that co-occur
• Building a co-occurrence matrix M.
Context = previous word and
next word
Corpus ={“I like deep learning.”
“I like NLP.”
“I enjoy flying.”}
12
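A minimal sketch (not from the slides) of building such a co-occurrence matrix in Python, using the toy corpus above with the previous and next word as context:

```python
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokenized = [s.split() for s in corpus]

vocab = sorted({w for sent in tokenized for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# M[i][j] = number of times word j appears as the previous or next word of word i
M = [[0] * len(vocab) for _ in vocab]
for sent in tokenized:
    for pos, word in enumerate(sent):
        for ctx_pos in (pos - 1, pos + 1):
            if 0 <= ctx_pos < len(sent):
                M[idx[word]][idx[sent[ctx_pos]]] += 1
```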
• Imagine we do this for a large
corpus of text
• row vector x_dog describes usage
of the word dog in the corpus
• can be seen as coordinates of a
point in n-dimensional
Euclidean space R^n
• Reduce dimensions of M using SVD
13
• Given a matrix of m × n dimensionality, construct an m × k matrix, where k << n
• M = U Σ V^T
• U is an m × m orthogonal matrix (U U^T = I)
• Σ is an m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ_1 ≥
σ_2 ≥ · · · ≥ σ_r ≥ 0, where r = min(m, n)) [the σ_i’s are known as singular values]
• V is an n × n orthogonal matrix (V V^T = I)
• We construct M’ s.t. rank(M’) = k
• We compute M’ = U Σ’ V^T, where Σ’ is Σ with only the k largest singular values kept
• k captures the desired percentage of variance
• Then, the submatrix of U with its first k columns is our desired word embedding matrix.
14
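A hedged sketch of this dimensionality-reduction step with NumPy; `M` is the co-occurrence matrix built earlier and `k` is the number of dimensions to keep (names are illustrative):

```python
import numpy as np

def svd_embeddings(M, k=2):
    # Full SVD: M = U @ diag(S) @ Vt, with singular values in S sorted largest first
    U, S, Vt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    # Keep the first k columns of U as the word embedding matrix (as on the slide);
    # some variants additionally scale by the top-k singular values: U[:, :k] * S[:k]
    return U[:, :k]

# embeddings = svd_embeddings(M, k=2)   # one k-dimensional row per vocabulary word
```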
Result of SVD based Model
[Plots of word vectors from the SVD-based model, for K = 2 and K = 3]
15
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
16
Pros & Cons
+ Simple method
+ Captures some sense (though weak) of similarity between words.
- Matrix is extremely sparse.
- Quadratic cost to train (perform SVD)
- Drastic imbalance in frequencies can adversely impact quality of
embeddings.
- Adding new words is expensive.
Take home : we worked with statistics of the corpus rather than working with
the corpus directly. This will recur in GloVe
17
BiGram Model
Idea: Directly learn low-dimensional word vectors ?
18
Language Models
• Filter out good sentences from bad ones.
• Good = semantically and syntactically correct.
• Modeled via the probability of a given sequence of n words
Pr (w1, w2, ….., wn)
• S1 = “the cat jumped over the dog”, Pr(S1) ~ 1
• S2 = “jumped over the the cat dog”, Pr(S2) ~ 0
19
Unary Language Models
20
Binary Language Models
21
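The formulas on these two slides did not survive extraction; the standard factorizations they refer to are reconstructed below (not copied from the deck):

```latex
% Unary (unigram) language model: words are treated as independent
\Pr(w_1, \dots, w_n) \approx \prod_{i=1}^{n} \Pr(w_i)

% Binary (bigram) language model: each word depends only on the previous word
\Pr(w_1, \dots, w_n) \approx \prod_{i=2}^{n} \Pr(w_i \mid w_{i-1})
```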
BiGram Model
• Objective : given wi , predict wi+1
• Training data: given sequence of n words < w1, w2, ….., wn >, extract bi-gram
pairs (wi-1 , wi)
• Knowns:
• input – output training examples : (wi-1 , wi)
• Vocab of training corpus (V) = ∪ (wi)
• Unknowns: word embeddings. Model as a matrix E of size |V| x d, where d = embedding
dimension, usually a hyperparameter.
• Model : shallow net
22
[Architecture figure: input wi-1 → embedding lookup → scoring layer → softmax layer → output wi]
Architecture
23
• Feed index of wi-1 as input to network.
• Use index to lookup embedding matrix.
• Perform affine transform on word embedding to get a score vector.
• Compute probability for each word.
• Set 1-hot vector of wi as target.
• Set loss = cross-entropy between probability vector and target vector.
Steps
24
Softmax
25
Cross Entropy
26
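The softmax and cross-entropy formulas on these slides were lost in extraction. Below is a small NumPy sketch (my own, not the authors' code) of the full forward pass of the bigram model described above; `E` is the embedding matrix, `W`/`b` the scoring layer, and all names are illustrative:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; exponentiate and normalise
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy(probs, target_index):
    # Cross-entropy against a 1-hot target reduces to -log of the true word's probability
    return -np.log(probs[target_index])

def bigram_forward(E, W, b, prev_index, next_index):
    v = E[prev_index]            # embedding lookup for w_{i-1}
    scores = W @ v + b           # affine transform -> one score per vocab word
    probs = softmax(scores)      # probability distribution over the vocab
    return cross_entropy(probs, next_index)

# Toy usage with a vocab of 5 words and 3-dimensional embeddings
V, d = 5, 3
rng = np.random.default_rng(0)
E, W, b = rng.normal(size=(V, d)), rng.normal(size=(V, d)), np.zeros(V)
loss = bigram_forward(E, W, b, prev_index=1, next_index=3)
```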
27
●Per word, we have 2 vectors :
1. As a row in the embedding layer (E)
2. As a column in the weights layer (used for the affine transformation)
●It’s common to take the average of the 2 vectors.
●It’s common to normalise the vectors (divide by the norm).
●An alternative way to compute ŷi : # (wi, wi-1) / Σ_{j∈V} # (wj, wi-1)
●Use the co-occurrence matrix to compute these counts.
Remarks
I learn best with toy code,
that I can play with.
- Andrew Trask
jupyter notebook 1
28
CBOW
SkipGram
29
CBOW
• Continuous Bag of words.
• Proposed by Mikolov et al. in 2013
• Conceptually, very similar to Bi-gram model
• In the bigram model, there were 2 key drawbacks:
1. The context was very small – we took only wi-1 , while predicting wi
2. Context is not just preceding words; but following words too.
30
• “the brown cat jumped over the dog”
Context = the brown cat over the dog
Target = jumped
• Context window = k words on either side of the word to be
predicted.
• Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
• W = total number of unique windows
• Each window is a sliding block of 2k+1 words
31
CBOW Model
• Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc
• Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
• Knowns:
• input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
• Vocab of training corpus (V) = ∪(wi)
• Unknowns: word embeddings. Model as a matrix E of size |V| x d, where d = embedding
dimension, usually a hyperparameter.
32
Architecture
33
• Feed indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size
k.
• Use indexes to lookup embedding matrix.
• Average these vectors to get vˆ = (vc−k + ... + vc−1 + vc+1 + ... + vc+k) / 2k
• Perform affine transform on vˆ to get a score vector.
• Turn scores into probabilities for each word.
• Set 1-hot vector of wc as target.
• Set loss = cross-entropy between probability vector and target vector.
Steps
34
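A minimal sketch of these CBOW steps (illustrative names, not the authors' code); it reuses the same softmax/cross-entropy idea as the bigram model above:

```python
import numpy as np

def cbow_forward(E, W, b, context_indices, target_index):
    # Look up and average the context word embeddings (the v-hat of the slide)
    v_hat = E[context_indices].mean(axis=0)
    scores = W @ v_hat + b                     # affine transform -> score per vocab word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax
    return -np.log(probs[target_index])        # cross-entropy vs. the 1-hot target
```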
Maths behind the scene
• Optimization objective J = - log Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
• Maximizing Pr() = Minimizing – log Pr()
• Let vˆ = (vc−k + . . . + vc−1 + vc+1 + . . . + vc+k )/ 2k
• Then, RHS
• gradient descent to update all relevant word vectors uc and wj.
35
Skip-Gram model
• 2nd model proposed by Mikolov et al. in 2013
• Turns CBOW on its head.
• CBOW = given context, predict the target word
• Skip Gram = given target, predict context
• “the brown cat jumped over the dog”
Target = jumped
Context = the, brown, cat, over, the, dog
36
• Objective : given wc , predict wc−k, . . . , wc−1, wc+1, . . . , wc+k
• Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract target and context pairs (wc, wc−k), . . . , (wc, wc−1), (wc, wc+1), . . . , (wc, wc+k)
• Knowns:
• input – output training examples : (wc, wc−k), . . . , (wc, wc−1), (wc, wc+1), . . . , (wc, wc+k)
• Vocab of training corpus (V) = ∪ (wi)
• Unknowns: word embeddings. Model as a matrix E of size |V| x d, where d = embedding
dimension, usually a hyperparameter.
37
Architecture
38
• Feed index of xc
• Use index to lookup embedding matrix.
• Perform affine transform on the embedding of xc to get a score vector.
• Turn scores into probabilities for each word.
• Set 1-hot vectors of the context words as targets.
• Set loss = cross-entropy between probability vector and target vector.
Steps
39
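A small sketch of how the (target, context) training pairs can be extracted from a tokenized sentence for skip-gram (window size `k` and all names are assumptions for illustration):

```python
def skipgram_pairs(tokens, k=2):
    """Yield (target, context) pairs for every position, looking k words to each side."""
    pairs = []
    for c, target in enumerate(tokens):
        for offset in range(-k, k + 1):
            ctx = c + offset
            if offset != 0 and 0 <= ctx < len(tokens):
                pairs.append((target, tokens[ctx]))
    return pairs

# skipgram_pairs("the brown cat jumped over the dog".split(), k=2)
# -> [('the', 'brown'), ('the', 'cat'), ('brown', 'the'), ('brown', 'cat'), ...]
```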
Maths behind the scene
• Optimization objective J = - log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | wc)
• gradient descent to update all relevant word vectors uc and wj.
40
Evaluating Word vectors
41
• How to quantitatively evaluate the quality of word vectors?
• Intrinsic Evaluation :
• Word Vector Analogies
• Extrinsic Evaluation :
• Downstream NLP task
42
Intrinsic Evaluation
• Specific Intermediate subtasks
• Easy to compute.
• Analogy completion:
• a:b :: c:?   d = the word whose vector is closest (by cosine) to xb − xa + xc
• e.g. man:woman :: king:?
• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?
43
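A sketch of this analogy test using cosine similarity over a word-vector matrix (`vectors`, `word2idx`, `idx2word` are illustrative names, not from the deck):

```python
import numpy as np

def analogy(a, b, c, vectors, word2idx, idx2word):
    """Return d such that a:b :: c:d, excluding the three input words from the search."""
    target = vectors[word2idx[b]] - vectors[word2idx[a]] + vectors[word2idx[c]]
    # Cosine similarity of every word vector with the target direction
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target) + 1e-9)
    for w in (a, b, c):                      # discard the input words from the search
        sims[word2idx[w]] = -np.inf
    return idx2word[int(np.argmax(sims))]
```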
44
Extrinsic Evaluation
• Real task at hand
• Ex: Sentiment analysis.
• Not very robust.
• End result is a function of whole process and not just embeddings.
• Process:
• Data pipelines
• Algorithm(s)
• Fine tuning
• Quality of dataset
45
Speed Up
46
Bottleneck
• Recall, to calculate probability, we use softmax. The denominator is
sum across entire vocab.
• Further, this is calculated for every window.
• Too expensive.
• Single update of parameters requires iterating over |V|. Our vocab
is usually in the millions.
47
To approximate the probability, don't use the entire vocab.
There are 2 popular lines of attack to achieve this:
•Modify the structure of the softmax
•Hierarchical Softmax
• Sampling techniques : don’t use entire vocabulary to compute the sum
• Negative sampling
48
● Arrange words in vocab as leaf units of a
balanced binary tree.
● |V| leaves, |V| - 1 internal nodes
● Each leaf node has a unique path from root to
the leaf
● Probability of a word (leaf node Lw) =
Probability of the path from root node to leaf Lw
● No output vector representation for words,
unlike softmax.
● Instead every internal node has a d-dimension
vector associated with it - v’n(w, j)
Hierarchical Softmax
n(w, j) means the j-th unit on the path from root to the
word w
● Product of probabilities over nodes in the path
● Each probability is computed using sigmoid
● Inside it we check : whether the (j+1)th node on the path is the left child of the jth node or not
● v’n(w, j)^T h : inner product between the vector on the hidden layer and the vector for the
inner node in consideration.
● p(w = w2) : we start at the root and navigate to the leaf w2,
multiplying the probabilities of the branch decisions along the path.
Example
● Cost: O(|V|) to O(log |V| )
●In practice, use Huffman tree
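A sketch of the hierarchical-softmax probability of one word, given its path of inner nodes and left/right decisions (the tree itself, e.g. a Huffman tree, is assumed to already exist; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_probability(h, path_nodes, path_is_left, inner_vectors):
    """
    h            : hidden-layer vector for the current input
    path_nodes   : indices of the inner nodes on the path from the root to the word's leaf
    path_is_left : for each step, True if the next node is the left child
    inner_vectors: matrix of d-dimensional vectors, one per inner node
    """
    prob = 1.0
    for node, go_left in zip(path_nodes, path_is_left):
        s = sigmoid(inner_vectors[node] @ h)
        prob *= s if go_left else (1.0 - s)   # sigma(v'^T h) for left, 1 - sigma for right
    return prob
```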
Negative Sampling
●Given (w, c) : word and context
●Let P(D=1|w,c) be probability that (w, c) came from the corpus data.
●P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data.
● Let's model P(D=1|w,c) with a sigmoid:
●Objective function (J):
○ maximize P(D=1|w,c) if (w, c) is in the corpus data.
○ maximize P(D=0|w,c) if (w, c) is not in the corpus data.
●We take a simple maximum likelihood approach of these two probabilities.
θ denotes the parameters of the model; in our case U and V - the input and output word vectors.
Taking log on
both sides
●Now, maximizing log likelihood = minimizing negative log likelihood.
●
● D̃ is a “false” or negative “corpus” with wrong sentences - "jumped cat dog the the over"
● Generate D̃ on the fly by randomly sampling words from the corpus.
● For skip-gram, our new objective function for observing the context word wc−m+j given
the center word wc would be :
regular softmax loss for skip-gram
● Likewise for CBOW, our new objective function for observing the center
word uc given the context vector
● In the above formulation, {u˜k | k = 1 . . . K} are sampled from Pn(w).
● best Pn(w) = Unigram distribution raised to the power of 3/4
● Usually K = 20-30 works well.
regular softmax loss for CBOW
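A sketch of drawing K negative samples from the unigram distribution raised to the 3/4 power, as the slide suggests (the word counts and names are illustrative):

```python
import numpy as np

def negative_sampler(word_counts, power=0.75):
    """Return a function that draws K negative words from P_n(w) ∝ count(w)^0.75."""
    counts = np.array(list(word_counts.values()), dtype=float)
    probs = counts ** power
    probs /= probs.sum()
    words = list(word_counts.keys())

    def sample(K=20):
        return list(np.random.choice(words, size=K, p=probs))

    return sample

# sampler = negative_sampler({"the": 100, "cat": 10, "jumped": 3, "nlp": 1})
# negatives = sampler(K=5)
```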
GloVe
Global matrix factorization methods
● Use co-occurrence counts
● Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert)
+ Fast training
+ Efficient usage of statistics
+ Captures word similarity
- Do badly on analogy tasks
- Disproportionate importance given to large counts
58
Local context window method
● Use window to determine context of a word
● Ex: Skip-gram/CBOW ( Mikolov et al), NNLM(Bengio et al), HLBL, (Collobert & Weston)
+ Capture word similarity.
+ Also perform better on analogy tasks
- Slow down as corpus size increases
- Inefficient usage of statistics
59
Combining the best of both worlds
● Glove model tries to combine the two major model families :-
○ Global matrix factorization (co-occurrence counts)
○ Local context window (context comes from window)
= Co-occurrence counts with context distance
60
Co-occurrence counts with context distance
● Uses context distance : weight each word in context window using its
distance from the center word
● This ensures nearby words have more influence than far off ones.
● Sentence -> “I like NLP”
○ Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5
○ Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0
○ Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0
● Corpus C: I like NLP. I like cricket.
Co-occurrence matrix for C
61
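A sketch of building the distance-weighted co-occurrence counts described above (1/distance weighting inside a window; the window size and names are assumptions):

```python
from collections import defaultdict

def weighted_cooccurrence(sentences, window=3):
    """X[(w, c)] accumulates 1/d for every context word c at distance d from w."""
    X = defaultdict(float)
    for sent in sentences:
        tokens = sent.split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    X[(w, tokens[j])] += 1.0 / abs(i - j)
    return X

# X = weighted_cooccurrence(["I like NLP", "I like cricket"])
# X[("I", "like")] == 2.0 (1.0 from each sentence), X[("I", "NLP")] == 0.5, ...
```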
Issues with Co-occurrence Matrix
● Long tail distribution
● Frequent words contribute disproportionately
(use weight function to fix this)
● Use Log for normalization
● Avoid log 0 : Add 1 to each Xij
62
Intuition for Glove
●Think of matrix factorization algorithms used in recommendation systems.
●Latent Factor models
○ Find features that describe the characteristics of rated objects.
○ Item characteristics and user preferences are described using vectors which are called factor
vectors
○ Assumption: Ratings can be inferred from a model put together from a smaller number of
parameters
63
Latent Factor models
● Dot product qi · pu estimates the user’s interest in the item
○ where, qi : factor vector for item i.
pu : factor vector for user u
r̂ui : estimated interest of user u in item i
● How to compute vectors for items and users ?
64
Matrix Factorization
●rui : known rating of user u for item i
● predicted rating :
● Similarly glove model tries to model the co-occurrence counts with the
following equation :
65
Weighting function
.
●Properties of f(X)
○vanish at 0 i.e. f(0) = 0
○monotonically increasing
○f(x) should be relatively small for large values of x
● Empirically 𝞪 = 0.75, xmax=100 works best
66
Loss Function
● Scalable.
● Fast training
○ Training time doesn’t depend on the corpus size
○ Always fitting to a |V| x |V| matrix.
● Good performance with small corpus, and small vectors.
67
●Input :
○Xij (|V| x |V| matrix) : co-occurrence matrix
●Parameters
○ W (|V| x |D| matrix) & W˜ (|V| x |D| matrix) :
■ wi and w˜j are the representations of the ith & jth words from the W and W˜ matrices respectively.
○bi (|V| x 1) column vector : variable for incorporating bias terms
○bj (1 x |V|) row vector : variable for incorporating bias terms
68
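A hedged sketch of the GloVe objective as described on these slides: a weighted least-squares fit of wi·w˜j + bi + b˜j to log Xij, with the weighting function f capped at 1 (parameter names are illustrative, and α = 0.75, xmax = 100 follow the earlier slide):

```python
import numpy as np

def weight_fn(x, x_max=100.0, alpha=0.75):
    # f(0) = 0, monotonically increasing, saturates at 1 for large counts
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Sum over non-zero cells of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    loss = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += weight_fn(X[i, j]) * diff ** 2
    return loss
```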
Training
● Train on Wikipedia data
●|V| = 2000
● Window size = 3
● Iterations = 10000
●D = 50
●Learn two representations for each word in |V|.
●reg = 0.01
●Use momentum optimizer with momentum=0.9.
69
Quick Experiment
Results - months & centuries
70
Countries & languages
71
military terms
72
Music
73
Countries & Languages
Languages
Countries
74
t-SNE
Objective
● Given a collection of N high-dimensional objects x1, x2, …. xN.
● How can we get a feel for how these objects are (relatively) arranged ?
76
Introduction
●Build a map (low dimensional) such that distances between map points reflect the “similarities” in
the data :
●Minimize some objective function that measures the discrepancy between
similarities in the data and similarities in the map
77
Principal Components Analysis
78
Principal component analysis
● PCA mainly tries to preserve large pairwise distances in the map.
●Is that what we want ?
79
Goals
● Preserve Distances
● Preserve the neighborhood of each point
80
t-SNE High dimension
●Measure pairwise similarities between high dimensional objects
81
[Figure: high-dimensional points xi and xj]
t-SNE Lower dimension
●Measure pairwise similarities between low dimensional map points
82
t-SNE
●We have measure of similarity of data points in High Dimension
●We have measure of similarity of data points in Low Dimension
●We need a distance measure between the two.
●Once we have a distance measure, all we want is to minimize it
83
One possible choice - KL divergence
● It’s a measure of how one probability distribution diverges from a second
expected probability distribution
84
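A tiny sketch of KL divergence between two discrete distributions P and Q (illustrative; a small epsilon avoids log 0):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); zero only when P and Q match."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```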
KL divergence applied to t-SNE
Objective function (C)
● We want nearby points in high-D to remain nearby in low-D
○ In case it's not, then
■ pij will be large (because points are nearby)
■ but qij will be small (because points are far away)
■ This will result in a larger penalty
■ In contrast, if both pij and qij are large : lower penalty 85
KL divergence applied to t-SNE
●Likewise, we want far away points in high-D to remain (relatively) far away in
low-D
○ In case it's not, then
■ pij will be small (because points are far away)
■ but qij will be large (because points are nearby)
■ This will result in a lower penalty
● t-SNE mainly preserves local similarity structure of the data
86
t-Distributed Stochastic Neighbor Embedding
●Move points around to minimize :
87
Why a Student t-Distribution ?
●t-SNE tries to retain local structure of this data in the map
●Result : dissimilar points have to be modelled as far apart in the map
●Hinton has shown that the Student t-distribution is very similar to the Gaussian
distribution
88
● Local structures are preserved
● Global structure is lost
Deciding the effective number of neighbours
● We need to decide the radii in different parts of the space, so that we can keep
the effective number of neighbours about constant.
● A big radius leads to a high entropy for the distribution over neighbors of i.
● A small radius leads to a low entropy.
● So decide what entropy you want and then find the radius that produces that
entropy.
● It's easier to specify 2^entropy
○ This is called the perplexity
○ It is the effective number of neighbors.
89
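A sketch of the idea on this slide: given a target perplexity, search for the Gaussian radius (sigma) whose conditional distribution over a point's neighbours has entropy log2(perplexity). The names and bisection tolerances are assumptions, not taken from any t-SNE implementation:

```python
import numpy as np

def perplexity_of(distances, sigma):
    """Perplexity 2^H of the distribution over neighbours induced by a Gaussian of width sigma."""
    distances = np.asarray(distances, dtype=float)   # distances to the *other* points
    p = np.exp(-distances ** 2 / (2 * sigma ** 2))
    p /= p.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return 2 ** entropy

def find_sigma(distances, target_perplexity=30.0, lo=1e-3, hi=1e3, iters=50):
    # Perplexity increases with sigma, so a simple bisection finds the matching radius
    for _ in range(iters):
        mid = (lo + hi) / 2
        if perplexity_of(distances, mid) > target_perplexity:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```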
90
Experiments
Hyper parameters really matter: Playing with perplexity
● Projected 100 data points, clearly separated into two different clusters, with tSNE
● Applied tSNE with different values of perplexity
● With perplexity=2, local variations in the data dominate
● With perplexity in the range 5-50, as suggested in the paper, plots still capture some structure in the data
91
Hyper parameters really matter: Playing with #iterations
● Perplexity set to 30.0
● Applied tSNE with different number of iterations
● Takeaway : different datasets may require different number of iterations
92
Cluster sizes can be misleading
● Used tSNE to plot two clusters with different standard deviations
● Bottom line: we cannot judge cluster sizes in t-SNE plots
93
Distances in t-SNE plots
● At lower perplexity clusters look equidistant
● At perplexity=50, tSNE captures some notion of global geometry in the data
● 50 data points in each sub cluster
94
Distances in t-SNE plots
● tSNE is not able to capture global geometry even at perplexity=50.
● key take away : well separated clusters may not mean anything in tSNE.
● 200 data points in each sub cluster
95
Random noise doesn’t always look random
● For this experiment, we generated random points from a Gaussian distribution
● Plots with lower perplexity show misleading structures in the data
96
You can see some shapes sometimes
● Axis-aligned Gaussian distribution
● For certain values of perplexity, long clusters look almost correct.
● tSNE tends to expand regions which are denser
97
98
Why word2vec does
better than others ?
99
At heart they are all the same !!
●It has been shown that in essence GloVe and word2vec are no different
from traditional methods like PCA, LSA etc. (Levy et al. 2015 call them
DSMs)
●GloVe ⋍ PCA/LSA is straightforward (both factorize global counts
matrix)
●word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015)
●They show that in essence word2vec also factorizes word context matrix
(PMI)
100
●Despite this “equality” of algorithms, word2vec is still known to do better
on several tasks.
●Why ?
○Levy et al. 2015 show : magic lies in Hyperparameters
101
Hyperparameters
●Pre-processing
○ Dynamic context window
○ Subsampling frequent words
○ Deleting rare words
●Post-processing
○ Adding context words
○ Vector normalization
Pre-processing
●Dynamic Context window
○ In DSMs, the context window is unweighted & of constant size.
○ GloVe & SGNS give more weight to closer terms
○ SGNS - even the window size can be dynamic, taking a value between 1 & the max window size.
●Subsampling frequent words
○ SGNS dilutes frequent words by randomly removing words whose frequency f is higher than
some threshold t, with probability
●Deleting rare words
○ In SGNS, rare words are also deleted before creating context windows. 102
Post-processing
●Adding context vectors
○ Glove adds word vectors and the context vectors for the final representation.
●Vector normalization
○ All vectors can be normalized to unit length
103
Key Take Home
●Hyperparameters vs Algorithms
○ Hyperparameter settings are more important than the algorithm choice
○ No single algorithm consistently outperforms the others
●Hyperparameters vs more data
○ Training on a larger corpus helps on some tasks
○ In many cases, tuning hyperparameters is more beneficial
104
References
Idea of word vectors is not new.
• Learning representations by back-propagating errors (Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
•Sebastian Ruder’s 3 part Blog series
•Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher
•word2vec Parameter Learning Explained by X Rong
105
References
• GloVe :
•https://nlp.stanford.edu/pubs/glove.pdf
• https://www.youtube.com/watch?v=tRsSi_sqXjI
• http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
• https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf
• t-SNE:
•http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
• http://distill.pub/2016/misread-tsne/
• https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne
• https://youtu.be/RJVL80Gg3lA
• KL Divergence
• http://tdhopper.com/blog/2015/Sep/04/cross-entropy-and-kl-divergence/
106
References
• Cross Entropy :
• https://www.youtube.com/watch?v=tRsSi_sqXjI
• http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
• Softmax:
• https://en.wikipedia.org/wiki/Softmax_function
• http://cs231n.github.io/linear-classify/#softmax
• Tensor Flow
• 1.0 API docs
• CS20SI
107
https://fifthelephant.talkfunnel.com/2017/17-learning-representations-of-text-for-nlp
108
Appendix
109
Bag of Words
• Vocab = set of all the words in corpus
• Document = Words in document w.r.t vocab with multiplicity
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
110
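A sketch of building these bag-of-words vectors directly from the two sentences (lower-casing, whitespace tokenisation, and a sorted vocab order are assumptions of this sketch):

```python
def bag_of_words(sentences):
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for sent in tokenized for w in sent})
    # One count vector per sentence, with multiplicity, in vocab order
    return vocab, [[sent.count(w) for w in vocab] for sent in tokenized]

vocab, vectors = bag_of_words(["The cat sat on the hat",
                               "The dog ate the cat and the hat"])
```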
Pros & Cons
+ Quick and Simple
- Too simple
- Orderless
- No notion of syntactic/semantic similarity
111
N-gram model
• Vocab = set of all n-grams in corpus
• Document = n-grams in document w.r.t vocab with multiplicity
For bigram:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and,
and the}
Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Sentence 2 : { 1, 0, 0, 0, 0, 1, 1, 1, 1, 1}
112
Pros & Cons
+ Tries to incorporate order of words
- Very large vocab set
- No notion of syntactic/semantic similarity
113
Term Frequency–Inverse Document Frequency (TF-IDF)
• Captures importance of a word to a document in a corpus.
• Importance increases proportionally to the number of times a word appears in the
document; but is offset by the frequency of the word in the corpus.
• TF(t) = (Number of times term t appears in a document) / (Total number of terms
in the document).
• IDF(t) = log (Total number of documents / Number of documents with term t in
it).
• TF-IDF (t) = TF(t) * IDF(t)
114
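A sketch of these formulas in Python; log base 10 is assumed here so that the worked example on the next slide comes out the same:

```python
import math

def tf(term, document_tokens):
    return document_tokens.count(term) / len(document_tokens)

def idf(term, corpus_tokens):
    n_docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(len(corpus_tokens) / n_docs_with_term)

def tf_idf(term, document_tokens, corpus_tokens):
    return tf(term, document_tokens) * idf(term, corpus_tokens)
```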
Example
• Document D1 contains 100 words.
• cat appears 3 times in D1
• TF(cat) = 3 / 100
= 0.03
• Corpus contains 10 million documents
• cat appears in 1000 documents
• IDF(cat) = log (10,000,000 / 1,000)
= 4
• TF-IDF (cat) = 0.03 * 4 = 0.12
115
Pros & Cons
• Pros:
• Easy to compute
• Has some basic metric to extract the most descriptive terms in a document
• Thus, can easily compute the similarity between 2 documents using it
• Disadvantages:
• Based on the bag-of-words (BoW) model, therefore it does not capture position
in text, semantics, co-occurrences in different documents, etc.
• Thus, TF-IDF is only useful as a lexical level feature. (presence/absence)
• Cannot capture semantics (unlike topic models, word embeddings)
116
● Positive Pointwise Mutual Information (PPMI): PMI is a common measure for the strength of
association between two words. It is defined as the log ratio between the joint probability of two
words w and c and the product of their marginal probabilities:
a. PMI(w,c) = log [ P(w,c) / (P(w)P(c)) ]
b. PPMI(w, c) = max(PMI(w,c), 0)
117
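A sketch of PPMI computed from a word–context count matrix (illustrative; rows are words, columns are contexts):

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix: max(log[P(w,c) / (P(w)P(c))], 0), with zeros where counts are zero."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal probability of each word
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal probability of each context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0            # zero out cells with no co-occurrence
    return np.maximum(pmi, 0.0)
```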

Weitere ähnliche Inhalte

Was ist angesagt?

Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopiwan_rg
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageRoelof Pieters
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector spaceAbdullah Khan Zehady
 
word embeddings and applications to machine translation and sentiment analysis
word embeddings and applications to machine translation and sentiment analysisword embeddings and applications to machine translation and sentiment analysis
word embeddings and applications to machine translation and sentiment analysisMostapha Benhenda
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information RetrievalRoelof Pieters
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 
Deep Learning and Text Mining
Deep Learning and Text MiningDeep Learning and Text Mining
Deep Learning and Text MiningWill Stanton
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 

Was ist angesagt? (20)

Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
word embeddings and applications to machine translation and sentiment analysis
word embeddings and applications to machine translation and sentiment analysisword embeddings and applications to machine translation and sentiment analysis
word embeddings and applications to machine translation and sentiment analysis
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
Deep Learning and Text Mining
Deep Learning and Text MiningDeep Learning and Text Mining
Deep Learning and Text Mining
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 

Ähnlich wie Deep Learning Bangalore meet up

Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsBenjamin Le
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscanYan Xu
 
presentation2-180202073525.pptx
presentation2-180202073525.pptxpresentation2-180202073525.pptx
presentation2-180202073525.pptxKtonNguyn2
 
A note on word embedding
A note on word embeddingA note on word embedding
A note on word embeddingKhang Pham
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
machine learning.pptx
machine learning.pptxmachine learning.pptx
machine learning.pptxAbdusSadik
 
Word2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowWord2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowBruno Gonçalves
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)H K Yoon
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language ModelBreaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language ModelSsu-Rui Lee
 
Supervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesSupervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesDaniil Mirylenka
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxavinashBajpayee1
 

Ähnlich wie Deep Learning Bangalore meet up (20)

Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
 
CNN for modeling sentence
CNN for modeling sentenceCNN for modeling sentence
CNN for modeling sentence
 
Lecture1.pptx
Lecture1.pptxLecture1.pptx
Lecture1.pptx
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscan
 
presentation2-180202073525.pptx
presentation2-180202073525.pptxpresentation2-180202073525.pptx
presentation2-180202073525.pptx
 
Science in text mining
Science in text miningScience in text mining
Science in text mining
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Word2 vec
Word2 vecWord2 vec
Word2 vec
 
A note on word embedding
A note on word embeddingA note on word embedding
A note on word embedding
 
wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
Word embedding
Word embedding Word embedding
Word embedding
 
machine learning.pptx
machine learning.pptxmachine learning.pptx
machine learning.pptx
 
Word2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowWord2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlow
 
Word2vec and Friends
Word2vec and FriendsWord2vec and Friends
Word2vec and Friends
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language ModelBreaking the Softmax Bottleneck: a high-rank RNN Language Model
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
 
Supervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesSupervised Prediction of Graph Summaries
Supervised Prediction of Graph Summaries
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptx
 

Kürzlich hochgeladen

Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 

Kürzlich hochgeladen (20)

Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 

Deep Learning Bangalore meet up

  • 1. Representation Learning of Text : Word Vectors Anuj Gupta Satyam Saxena @anujgupta82, @Satyam8989 anujgupta82@gmail.com, satyamiitj89@gmail.com
  • 2. Outline • Session 1 •Introduction •Bigram model •Skip Gram model •CBOW model •Evaluation •Speed Up •Session 2 •Glove •T-SNE •Secret Ingredients 2
  • 3. Introduction Example of NLP tasks : Easy • Spell Checking • Keyword Search • Finding Synonyms Medium • Parsing information from websites, documents, etc. 3
  • 4. 4 Hard • Machine Translation (e.g. Translate Chinese text to English) • Semantic Analysis (What is the meaning of query statement?) • Co-reference (e.g. What does "he" or "it" refer to given a document?) • Question Answering (e.g. Answering Jeopardy questions). The first and arguably most important common denominator across all NLP tasks is : how we represent text as input to our models.
  • 5. • Machine does not understand text. • We need numeric representation • An integral part of any NLP pipeline. • Unlike images (RGB matrix), for text there is no obvious way. Legacy Techniques* • Bag of words • N-gram • TF-IDF 5* Details in appendix
  • 6. Bottom Line • More often than not, how rich your input representation is has huge bearing on the quality of your downstream ML models. • For NLP, archaic techniques treat words as atomic symbols. Thus every 2 words are equally apart. • They don’t have any notion of either syntactic or semantic similarity between parts of language. • This is one of the chief reasons for poor/mediocre performance of NLP based models. But this has changed dramatically in past few years 6
  • 7. Distributional & Distributed Representations 7
  • 8. Distributional representations • Linguistic aspect. • Based on co-occurrence/ context • Distributional hypothesis: linguistic units with similar distributions have similar meanings. • The distributional property is usually induced from document or context or textual vicinity (like sliding window). 8
  • 9. Distributed representations • Compact, dense and low dimensional representation. • Differs from distributional representations as the constraint is to seek efficient dense representation, not just to capture the co-occurrence similarity. • Each single component of vector representation does not have any meaning of its own. • The interpretable features (for example, word contexts in case of word2vec) are hidden and distributed among uninterpretable vector components. 9
  • 10. • Embedding: Mapping between space with one dimension per linguistic unit (word, character, phrase, sentence, document ) to a continuous vector space with much lower dimension. “You shall know a word by the company it keeps” - J R Firth • One of the most successful ideas of modern statistical NLP 10
  • 12. Co-occurrence with SVD • Define a word using the words in its context. • Words that co-occur • Building a co-occurrence matrix M. Context = previous word and next word Corpus ={“I like deep learning.” “I like NLP.” “I enjoy flying.”} 12
  • 13. • Imagine we do this for a large corpus of text • row vector xdog describes usage of word dog in the corpus • can be seen as coordinates of point in n-dimensional Euclidean space Rn • Reduce dimensions using SVD = M 13
  • 14. • Given a matrix of m × n dimensionality, construct a m × k matrix, where k << n • M = U Σ VT • U is an m × m orthogonal matrix (UUT = I) • Σ is a m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0, where r = min(m, n)) [σi’s are known as singular values] • V is an n × n orthogonal matrix (VVT = I) • We construct M’ s.t. rank(M’) = k • We compute M’ = U Σ’ V, where Σ’ = Σ with k largest singular values • k captures desired percentage variance • Then, submatrix U v,k is our desired word embedding matrix. 14
  • 15. Result of SVD based Model K = 2 K = 3 15
  • 16. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence Rohde et al. 2005 16
  • 17. Pros & Cons + Simple method + Captures some sense (though weak) of similarity between words. - Matrix is extremely sparse. - Quadratic cost to train (perform SVD) - Drastic imbalance in frequencies can adversely impact quality of embeddings. - Adding new words is expensive. Take home : we worked with statistics of the corpus rather than working with the corpus directly. This will recur in GloVe 17
  • 18. BiGram Model Idea: Directly learn low-dimensional word vectors ? 18
  • 19. Language Models • Filter out good sentences from bad ones. • Good = semantically and syntactically correct. • Modeled this via probability of given sequence of n words Pr (w1, w2, ….., wn) • S1 = “the cat jumped over the dog”, Pr(S1) ~ 1 • S2 = “jumped over the the cat dog”, Pr(S2) ~ 0 19
  • 22. BiGram Model • Objective : given wi , predict wi+1 • Training data: given sequence of n words < w1, w2, ….., wn >, extract bi-gram pairs (wi-1 , wi) • Knowns: • input – output training examples : (wi-1 , wi) • Vocab of training corpus (V) = U (wi) • Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. • Model : shallow net 22
  • 24. • Feed index of wi-1 as input to network. • Use index to lookup embedding matrix. • Perform affine transform on word embedding to get a score vector. • Compute probability for each word. • Set 1-hot vector of wi as target. • Set loss = cross-entropy between probability vector and target vector. Steps 24
  • 27. 27 ●Per word, we have 2 vectors : 1. As row in Embedding layer (E) 2. As column in weights layer (used for afine transformation) ●It’s common to take average of the 2 vectors. ●It’s common to normalise the vectors. Divide by norm. ●An alternative way to compute ŷi : # (wi, wi-1) / # (wj, wi-1) ∀ j∈V ●Use co-occurrence matrix to compute these counts. Remarks
  • 28. I learn best with toy code, that I can play with. - Andrew Trask jupyter notebook 1 28
  • 30. CBOW • Continuous Bag of words. • Proposed by Mikolov et al. in 2013 • Conceptually, very similar to Bi-gram model • In the bigram model, there were 2 key drawbacks: 1. The context was very small – we took only wi-1 , while predicting wi 2. Context is not just preceding words; but following words too. 30
  • 31. • “the brown cat jumped over the dog” Context = the brown cat over the dog Target = jumped • Context window = k words on either side of the word to be predicted. • Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k) • W = total number of unique windows • Each window is sliding block 2c+1 words 31
  • 32. CBOW Model • Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc • Training data: given sequence of n words < w1, w2, ….., wn >, for each window extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc ) • Knowns: • input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc ) • Vocab of training corpus (V) = ∪(wi) • Unknowns: word embeddings. Model as a matrix E |v| x d . d = embedding dimensions. Usually a hyper parameter. 32
  • 34. • Feed indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size k. • Use indexes to lookup embedding matrix. • Average these vectors to get vˆ = (vc−k+vc−1+...+vc+1+vc+k ) / 2m • Perform affine transform on vˆ to get a score vector. • Turn scores in probabilities for each word. • Set 1-hot vector of wc as target. • Set loss = cross-entropy between probability vector and target vector. Steps 34
  • 35. Maths behind the scene • Optimization objective J = - log Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k) • Maximizing Pr() = Minimizing – log Pr() • Let vˆ = (wc−k + . . . + wc−1 + wc+1 + . . . + wc+k )/ 2m • Then, RHS • gradient descent to update all relevant word vectors uc and wj. 35
  • 36. Skip-Gram model • 2nd model proposed by Mikolov et al. in 2013 • Turns CBOW on its head. • CBOW = given context, predict the target word • Skip Gram = given target, predict context • “the brown cat jumped over the dog” Target = jumped Context = the, brown, cat, over, the, dog 36
  • 37. • Objective : given wc , predict wc−k, . . . , wc−1, wc+1, . . . , wc+k • Training data: given sequence of n words < w1, w2, ….., wn >, for each window extract target and context pairs (wc, wc−k) , …, (wc, wc−1) , (wc, wc+1), …, (wc, wc+k) • Knowns: • input – output training examples : (wc, wc−k) , …, (wc, wc−1) , (wc, wc+1), …, (wc, wc+k) • Vocab of training corpus (V) = ∪ (wi) • Unknowns: word embeddings. Model as a matrix E|V| x d . d = embedding dimensions. Usually a hyper parameter. 37
  • 39. • Feed index of xc • Use the index to lookup the embedding matrix and get vc. • Perform affine transform on vc to get a score vector. • Turn scores into probabilities for each word. • Set 1-hot vector of each context word wc−k, . . . , wc+k as target. • Set loss = sum of cross-entropies between the probability vector and each target vector. Steps 39
  • 40. Maths behind the scene • Optimization objective J = - log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | wc) = − Σ−k≤j≤k, j≠0 log Pr(wc+j | wc) (assuming the context words are conditionally independent given wc) • gradient descent to update all relevant word vectors (the output vectors u and the input vector vc). 40
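For concreteness, a small sketch of how (target, context) training pairs are extracted for skip-gram; the commented-out gensim call is one common way to train such a model in practice (exact parameter names may differ across gensim versions):

    # Sketch: extracting (target, context) pairs for skip-gram, window k=2 (illustrative)
    def skipgram_pairs(sentence, k=2):
        words = sentence.split()
        pairs = []
        for c, target in enumerate(words):
            for j in range(max(0, c - k), min(len(words), c + k + 1)):
                if j != c:
                    pairs.append((target, words[j]))
        return pairs

    print(skipgram_pairs("the brown cat jumped over the dog"))

    # In practice a library such as gensim is typically used, e.g. (API varies by version):
    # from gensim.models import Word2Vec
    # model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)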
  • 42. • How to quantitatively evaluate the quality of word vectors? • Intrinsic Evaluation : • Word Vector Analogies • Extrinsic Evaluation : • Downstream NLP task 42
  • 43. Intrinsic Evaluation • Specific intermediate subtasks • Easy to compute. • Analogy completion: • a:b :: c:d, find d, e.g. man:woman :: king:? • Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions • Discarding the input words from the search! • Problem: What if the information is there but not linear? 43
  • 45. Extrinsic Evaluation • Real task at hand • Ex: Sentiment analysis. • Not very robust. • End result is a function of whole process and not just embeddings. • Process: • Data pipelines • Algorithm(s) • Fine tuning • Quality of dataset 45
  • 47. Bottleneck • Recall, to calculate probability, we use softmax. The denominator is a sum across the entire vocab. • Further, this is calculated for every window. • Too expensive. • A single update of parameters requires iterating over |V|. Our vocab is usually in the millions. 47
  • 48. To approximate the probability, don't use the entire vocab. There are 2 popular lines of attack to achieve this: • Modify the structure of the softmax • Hierarchical Softmax • Sampling techniques : don't use the entire vocabulary to compute the sum • Negative sampling 48
  • 49. ● Arrange words in vocab as leaf units of a balanced binary tree. ● |V| leaves ⇒ |V| - 1 internal nodes ● Each leaf node has a unique path from root to the leaf ● Probability of a word (leaf node Lw) = probability of the path from the root node to leaf Lw ● No output vector representation for words, unlike softmax. ● Instead every internal node has a d-dimensional vector associated with it - v’n(w, j) Hierarchical Softmax n(w, j) means the j-th unit on the path from root to the word w
  • 50. ● Product of probabilities over nodes in the path ● Each probability is computed using a sigmoid ● Inside it we check whether the (j+1)-th node on the path is the left child of the j-th node or not; the sign of the sigmoid's argument flips accordingly ● v’n(w, j)T h : dot product between the vector on the hidden layer and the vector for the inner node in consideration.
  • 51. ● p(w = w2) ● We start at the root and navigate to the leaf w2, multiplying the branch probabilities along the way ● e.g. for a path that goes left, left, right: p(w = w2) = σ(v’n(w2,1)T h) · σ(v’n(w2,2)T h) · σ(−v’n(w2,3)T h) ● Example
  • 52. ● Cost: O(|V|) to O(log |V|) ● In practice, use a Huffman tree: frequent words get shorter paths, reducing the expected cost further
  • 53. Negative Sampling ● Given (w, c) : word and context ● Let P(D=1|w,c) be the probability that (w, c) came from the corpus data. ● P(D=0|w,c) = probability that (w, c) didn't come from the corpus data. ● Let's model P(D=1|w,c) with a sigmoid: P(D=1|w,c) = σ(uwT vc) = 1 / (1 + exp(−uwT vc)) ● Objective function (J): ○ maximize P(D=1|w,c) if (w, c) is in the corpus data. ○ maximize P(D=0|w,c) if (w, c) is not in the corpus data. ● We take a simple maximum likelihood approach over these two probabilities.
  • 54. θ denotes the parameters of the model, in our case U and V, the input and output word vectors. Taking log on both sides:
  • 55. ● Now, maximizing the log likelihood = minimizing the negative log likelihood. ● D̃ is a "false" or negative "corpus" with wrong sentences - "jumped cat dog the the over" ● Generate D̃ on the fly by randomly sampling negatives from the word bank. ● For skip-gram, our new objective function for observing the context word wc−m+j given the center word wc would be : (contrast with the regular softmax loss for skip-gram)
  • 56. ● Likewise for CBOW, our new objective function for observing the center word uc given the context vector v̂ ● In the above formulation, {u˜k | k = 1 . . . K} are sampled from Pn(w). ● Best Pn(w) = unigram distribution raised to the power of 3/4 ● Usually K = 20-30 works well. (contrast with the regular softmax loss for CBOW)
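A minimal numpy sketch of the negative-sampling loss for a single (center, context) pair, with negatives drawn from the unigram distribution raised to 3/4 (vocabulary size, dimensions and counts below are made up for illustration):

    # Sketch: negative-sampling loss for one (center, context) pair (illustrative)
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    V, d, K = 1000, 50, 20
    U = 0.01 * np.random.randn(V, d)          # "output" vectors
    W = 0.01 * np.random.randn(V, d)          # "input" vectors
    counts = np.random.randint(1, 100, size=V)
    Pn = counts ** 0.75
    Pn = Pn / Pn.sum()                        # unigram distribution raised to 3/4

    def neg_sampling_loss(center, context):
        neg = np.random.choice(V, size=K, p=Pn)             # K noise words
        pos_term = -np.log(sigmoid(U[context] @ W[center]))  # true pair: push together
        neg_term = -np.log(sigmoid(-U[neg] @ W[center])).sum()  # noise pairs: push apart
        return pos_term + neg_term

    loss = neg_sampling_loss(center=5, context=17)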
  • 57. GloVe
  • 58. Global matrix factorization methods ● Use co-occurrence counts ● Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert) + Fast training + Efficient usage of statistics + Captures word similarity - Do badly on analogy tasks - Disproportionate importance given to large counts 58
  • 59. Local context window method ● Use a window to determine the context of a word ● Ex: Skip-gram/CBOW (Mikolov et al.), NNLM (Bengio et al.), HLBL (Mnih & Hinton), C&W (Collobert & Weston) + Captures word similarity. + Also performs better on analogy tasks - Training slows down as corpus size grows - Inefficient usage of statistics 59
  • 60. Combining the best of both worlds ● Glove model tries to combine the two major model families :- ○ Global matrix factorization (co-occurrence counts) ○ Local context window (context comes from window) = Co-occurrence counts with context distance 60
  • 61. Co-occurrence counts with context distance ● Uses context distance : weight each word in the context window by its distance from the center word ● This ensures nearby words have more influence than far-off ones. ● Sentence -> “I like NLP” ○ Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5 ○ Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0 ○ Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0 ● Corpus C: I like NLP. I like cricket. Co-occurrence matrix for C 61
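A small sketch that reproduces the distance-weighted counts above for the toy corpus (the window size of 2 is an illustrative choice):

    # Sketch: distance-weighted co-occurrence counts for the toy corpus (illustrative)
    from collections import defaultdict

    corpus = [["I", "like", "NLP"], ["I", "like", "cricket"]]
    window = 2
    X = defaultdict(float)

    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    X[(w, sent[j])] += 1.0 / abs(i - j)   # closer words contribute more

    print(X[("I", "like")], X[("I", "NLP")])   # 2.0 and 0.5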
  • 62. Issues with Co-occurrence Matrix ● Long tail distribution ● Frequent words contribute disproportionately (use weight function to fix this) ● Use log for normalization ● Avoid log 0 : Add 1 to each Xij 62
  • 63. Intuition for Glove ●Think of matrix factorization algorithms used in recommendation systems. ●Latent Factor models ○ Find features that describe the characteristics of rated objects. ○ Item characteristics and user preferences are described using vectors which are called factor vectors ○ Assumption: Ratings can be inferred from a model put together from a smaller number of parameters 63
  • 64. Latent Factor models ● Dot product estimates the user's interest in the item : r̂ui = qiT pu ○ where qi : factor vector for item i, pu : factor vector for user u, r̂ui : estimated interest of user u in item i ● How to compute vectors for items and users ? 64
  • 65. Matrix Factorization ● rui : known rating of user u for item i ● predicted rating : r̂ui = qiT pu ● Similarly, the GloVe model tries to model the co-occurrence counts with the following equation : wiT w̃j + bi + b̃j = log(Xij) 65
  • 66. Weighting function f(x) = (x / xmax)^α if x < xmax, else 1 ● Properties of f(X) ○ vanishes at 0 i.e. f(0) = 0 ○ monotonically increasing ○ f(x) should be relatively small for large values of x ● Empirically α = 0.75, xmax = 100 works best 66
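A one-function sketch of this weighting function with the values quoted above (α = 0.75, xmax = 100):

    # Sketch of the GloVe weighting function f(x) with alpha=0.75, x_max=100
    def f(x, x_max=100.0, alpha=0.75):
        return (x / x_max) ** alpha if x < x_max else 1.0

    print(f(0), f(10), f(1000))   # 0.0, ~0.178, 1.0 (large counts are capped)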
  • 67. Loss Function J = Σi,j f(Xij) (wiT w̃j + bi + b̃j − log Xij)² ● Scalable. ● Fast training ○ Training time doesn't depend on the corpus size ○ Always fitting to a |V| x |V| matrix. ● Good performance with a small corpus, and small vectors. 67
  • 68. ● Input : ○ Xij (|V| x |V| matrix) : co-occurrence matrix ● Parameters ○ W (|V| x |D| matrix) & W˜ (|V| x |D| matrix) : ■ wi and w̃j are the representations of the ith & jth word from the W and W˜ matrices respectively. ○ bi (|V| x 1) column vector : variable for incorporating bias terms ○ bj (1 x |V|) row vector : variable for incorporating bias terms 68 Training
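A minimal sketch of one SGD step on the GloVe objective for a single non-zero entry Xij (the GloVe paper itself uses AdaGrad; the learning rate, initialisation and the constant factor folded into it are illustrative assumptions):

    # Sketch: one plain-SGD step on the GloVe objective for entry (i, j) (illustrative)
    import numpy as np

    V, D = 2000, 50
    W  = 0.01 * np.random.randn(V, D)    # word vectors
    Wt = 0.01 * np.random.randn(V, D)    # context ("tilde") vectors
    b, bt = np.zeros(V), np.zeros(V)     # bias terms
    lr, x_max, alpha = 0.05, 100.0, 0.75

    def glove_step(i, j, x_ij):
        weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
        # inner term of the loss: w_i . w~_j + b_i + b~_j - log(X_ij)
        diff = W[i] @ Wt[j] + b[i] + bt[j] - np.log(x_ij)
        grad = weight * diff                 # gradients up to a constant folded into lr
        w_i, w_j = W[i].copy(), Wt[j].copy()
        W[i]  -= lr * grad * w_j
        Wt[j] -= lr * grad * w_i
        b[i]  -= lr * grad
        bt[j] -= lr * grad
        return weight * diff ** 2            # this pair's contribution to J

    loss_ij = glove_step(3, 17, x_ij=12.0)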
  • 69. ● Train on Wikipedia data ●|V| = 2000 ● Window size = 3 ● Iterations = 10000 ●D = 50 ●Learn two representations for each word in |V|. ●reg = 0.01 ●Use momentum optimizer with momentum=0.9. 69 Quick Experiment
  • 70. Results - months & centuries 70
  • 75. t-SNE
  • 76. Objective ● Given a collection of N high-dimensional objects x1, x2, …. xN. ● How can we get a feel for how these objects are (relatively) arranged ? 76
  • 77. Introduction ● Build a map (low-dimensional) such that distances between map points reflect “similarities” in the data : ● Minimize some objective function that measures the discrepancy between similarities in the data and similarities in the map 77
  • 79. Principal component analysis ● PCA mainly tries to preserve large pairwise distances in the map. ●Is that what we want ? 79
  • 80. Goals ● Preserve distances ● Preserve the neighborhood of each point 80
  • 81. t-SNE High dimension ● Measure pairwise similarities between high-dimensional objects xi and xj 81
  • 82. t-SNE Lower dimension ●Measure pairwise similarities between low dimensional map points 82
  • 83. t-SNE ●We have measure of similarity of data points in High Dimension ●We have measure of similarity of data points in Low Dimension ●We need a distance measure between the two. ●Once we have distance measure, all we want is : to minimize it 83
  • 84. One possible choice - KL divergence ● It’s a measure of how one probability distribution diverges from a second expected probability distribution 84
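A small sketch of KL divergence between two discrete distributions, which is the quantity t-SNE minimises between the high-dimensional similarities P and the map similarities Q:

    # Sketch: KL divergence between two discrete distributions P and Q
    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        return np.sum(p * np.log((p + eps) / (q + eps)))

    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0, identical distributions
    print(kl_divergence([0.9, 0.1], [0.1, 0.9]))   # > 0, and the measure is asymmetric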
  • 85. KL divergence applied to t-SNE Objective function (C) ● We want nearby points in high-D to remain nearby in low-D ○ If that is not the case, then ■ pij will be large (because the points are nearby in high-D) ■ but qij will be small (because the points are far apart in the map) ■ This results in a larger penalty ■ In contrast, if both pij and qij are large : lower penalty 85
  • 86. KL divergence applied to t-SNE ● Likewise, we want far-away points in high-D to remain (relatively) far away in low-D ○ If that is not the case, then ■ pij will be small (because the points are far apart in high-D) ■ but qij will be large (because the points are nearby in the map) ■ This results in only a small penalty ● t-SNE mainly preserves the local similarity structure of the data 86
  • 87. t-Distributed Stochastic Neighbor Embedding ●Move points around to minimize : 87
  • 88. Why a Student t-Distribution ? ● t-SNE tries to retain the local structure of the data in the map ● Result : dissimilar points have to be modelled as far apart in the map ● The Student t-distribution looks like a Gaussian but has much heavier tails, which lets those dissimilar points sit far apart in the map without incurring a large penalty 88 ● Local structures preserved ● Global structure is lost
  • 89. Deciding the effective number of neighbours ● We need to decide the radii in different parts of the space, so that we can keep the effective number of neighbours about constant. ● A big radius leads to a high entropy for the distribution over neighbors of i. ● A small radius leads to a low entropy. ● So decide what entropy you want and then find the radius that produces that entropy. ● It's easier to specify 2^entropy ○ This is called the perplexity ○ It is the effective number of neighbors. 89
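A sketch of how the perplexity (2^entropy) of the conditional distribution p(j|i) depends on the radius sigma for a single point; t-SNE binary-searches sigma per point to hit the user-chosen perplexity (the distances below are made up):

    # Sketch: perplexity of p(j|i) for one point, given a radius sigma (illustrative)
    import numpy as np

    def perplexity(distances_i, sigma):
        # p(j|i) is a Gaussian kernel over distances from point i, controlled by sigma
        p = np.exp(-np.asarray(distances_i) ** 2 / (2 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        return 2 ** entropy            # "effective number of neighbours"

    print(perplexity([1.0, 1.1, 1.2, 5.0, 6.0], sigma=0.5))   # roughly 3 close neighbours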
  • 91. Hyper parameters really matter: Playing with perplexity ● Projected 100 data points, clearly separated into two clusters, with t-SNE ● Applied t-SNE with different values of perplexity ● With perplexity = 2, local variations in the data dominate ● With perplexity in the range 5-50, as suggested in the paper, plots still capture some structure in the data 91
  • 92. Hyper parameters really matter: Playing with #iterations ● Perplexity set to 30.0 ● Applied tSNE with different number of iterations ● Takeaway : different datasets may require different number of iterations 92
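These experiments are easy to reproduce with scikit-learn's TSNE; the sketch below just varies perplexity on random data (parameter names such as n_iter vs max_iter differ across sklearn versions, so treat this as indicative):

    # Sketch: running t-SNE with different perplexities via scikit-learn
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.randn(200, 50)                     # e.g. 200 word vectors of dimension 50
    for perplexity in (2, 5, 30, 50):
        Y = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
        print(perplexity, Y.shape)                   # (200, 2) low-dimensional map each time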
  • 93. Cluster sizes can be misleading ● Use t-SNE to plot two clusters with different standard deviations ● Bottom line : we cannot read cluster sizes off t-SNE plots 93
  • 94. Distances in t-SNE plots ● At lower perplexity clusters look equidistant ● At perplexity=50, tSNE captures some notion of global geometry in the data ● 50 data points in each sub cluster 94
  • 95. Distances in t-SNE plots ● tSNE is not able to capture global geometry even at perplexity=50. ● key take away : well separated clusters may not mean anything in tSNE. ● 200 data points in each sub cluster 95
  • 96. Random noise doesn’t always look random ● For this experiment, we generated random points from gaussian distribution ● Plots with lower perplexity, showing misleading structures in the data 96
  • 97. You can see some shapes sometimes ● Axis-aligned gaussian distribution ● For certain values of perplexity, long clusters look almost correct. ● t-SNE tends to expand regions which are denser 97
  • 99. 99 At heart they are all the same !! ● It has been shown that in essence GloVe and word2vec are no different from traditional methods like PCA, LSA etc. (Levy et al. 2015 call them DSMs) ● GloVe ⋍ PCA/LSA is straightforward (both factorize a global counts matrix) ● word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015) ● They show that in essence word2vec also factorizes a word-context matrix (shifted PMI)
  • 100. 100 ●Despite this “equality” of algorithm, word2vec is still known to do better on several tasks. ●Why ? ○Levy et al. 2015 show : magic lies in Hyperparameters
  • 101. 101 Hyperparameters ●Pre-processing ○ Dynamic context window ○ Subsampling frequent words ○ Deleting rare words ●Post-processing ○ Adding context words ○ Vector normalization
  • 102. Pre-processing ● Dynamic Context window ○ In DSM, the context window is unweighted & of constant size. ○ GloVe & SGNS give more weightage to closer terms ○ SGNS - even the window size can be dynamic, taking a value between 1 & the maximum window size. ● Subsampling frequent words ○ SGNS dilutes frequent words by randomly removing words whose frequency f is higher than some threshold t, with probability p = 1 - √(t / f) ● Deleting rare words ○ In SGNS, rare words are also deleted before creating context windows. 102
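A small sketch of the subsampling rule for frequent words (the threshold t = 1e-5 is the value commonly quoted for SGNS; the frequencies below are made up):

    # Sketch: SGNS-style subsampling of frequent words, discard probability 1 - sqrt(t/f)
    import math, random

    def keep_word(freq, t=1e-5):
        """freq = relative frequency of the word in the corpus."""
        if freq <= t:
            return True
        p_discard = 1.0 - math.sqrt(t / freq)
        return random.random() > p_discard

    print(keep_word(0.05))   # very frequent word such as "the": usually dropped
    print(keep_word(1e-6))   # rare word: always kept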
  • 103. Post-processing ●Adding context vectors ○ Glove adds word vectors and the context vectors for the final representation. ●Vector normalization ○ All vectors can be normalized to unit length 103
  • 104. Key Take Home ● Hyperparameters vs Algorithms ○ Hyper parameter settings are more important than the algorithm choice ○ No single algorithm consistently outperforms the others ● Hyperparameters vs more data ○ Training on a larger corpus helps on some tasks ○ In many cases, tuning hyperparameters is more beneficial 104
  • 105. References Idea of word vectors is not new. • Learning representations by back-propagating errors (Rumelhart et al. 1986) • A neural probabilistic language model (Bengio et al., 2003) • NLP from Scratch (Collobert & Weston, 2008) • Word2Vec (Mikolov et al. 2013) •Sebastian Ruder’s 3 part Blog series •Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher •word2vec Parameter Learning Explained by X Rong 105
  • 106. References • GloVe : •https://nlp.stanford.edu/pubs/glove.pdf • https://www.youtube.com/watch?v=tRsSi_sqXjI • http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/ • https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf • t-SNE: •http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf • http://distill.pub/2016/misread-tsne/ • https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne • https://youtu.be/RJVL80Gg3lA • KL Divergence • http://tdhopper.com/blog/2015/Sep/04/cross-entropy-and-kl-divergence/ 106
  • 107. References • Cross Entropy : • https://www.youtube.com/watch?v=tRsSi_sqXjI • http://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/ • Softmax: • https://en.wikipedia.org/wiki/Softmax_function • http://cs231n.github.io/linear-classify/#softmax • Tensor Flow • 1.0 API docs • CS20SI 107
  • 110. Bag of Words • Vocab = set of all the words in corpus • Document = Words in document w.r.t vocab with multiplicity Sentence 1: "The cat sat on the hat" Sentence 2: "The dog ate the cat and the hat” Vocab = { the, cat, sat, on, hat, dog, ate, and } Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 } Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1} 110
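A small sketch that reproduces the two bag-of-words vectors above:

    # Sketch: bag-of-words vectors for the two example sentences
    from collections import Counter

    vocab = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

    def bow(sentence):
        counts = Counter(w.lower() for w in sentence.split())
        return [counts[w] for w in vocab]

    print(bow("The cat sat on the hat"))           # [2, 1, 1, 1, 1, 0, 0, 0]
    print(bow("The dog ate the cat and the hat"))  # [3, 1, 0, 0, 1, 1, 1, 1]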
  • 111. Pros & Cons + Quick and Simple - Too simple - Orderless - No notion of syntactic/semantic similarity 111
  • 112. N-gram model • Vocab = set of all n-grams in corpus • Document = n-grams in document w.r.t vocab with multiplicity For bigram: Sentence 1: "The cat sat on the hat" Sentence 2: "The dog ate the cat and the hat” Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and, and the} Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0} Sentence 2 : { 1, 0, 0, 0, 0, 1, 1, 1, 1, 1} 112
  • 113. Pros & Cons + Tries to incorporate order of words - Very large vocab set - No notion of syntactic/semantic similarity 113
  • 114. Term Frequency–Inverse Document Frequency (TF-IDF) • Captures importance of a word to a document in a corpus. • Importance increases proportionally to the number of times a word appears in the document; but is offset by the frequency of the word in the corpus. • TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). • IDF(t) = log (Total number of documents / Number of documents with term t in it). • TF-IDF (t) = TF(t) * IDF(t) 114
  • 116. Example • Document D1 contains 100 words. • cat appears 3 times in D1 • TF(cat) = 3 / 100 = 0.03 • Corpus contains 10 million documents • cat appears in 1000 documents • IDF(cat) = log (10,000,000 / 1,000) = 4 • TF-IDF (cat) = 0.03 * 4 = 0.12 115
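A tiny sketch verifying the corrected numbers above (log base 10, as in the slide):

    # Sketch verifying the worked TF-IDF example
    import math

    def tf(term_count, doc_length):
        return term_count / doc_length

    def idf(num_docs, docs_with_term):
        return math.log10(num_docs / docs_with_term)

    tf_cat = tf(3, 100)                       # 0.03
    idf_cat = idf(10_000_000, 1_000)          # 4.0
    print(tf_cat * idf_cat)                   # 0.12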
  • 117. Pros & Cons • Pros: • Easy to compute • Has some basic metric to extract the most descriptive terms in a document • Thus, can easily compute the similarity between 2 documents using it • Disadvantages: • Based on the bag-of-words (BoW) model, therefore it does not capture position in text, semantics, co-occurrences in different documents, etc. • Thus, TF-IDF is only useful as a lexical-level feature (presence/absence). • Cannot capture semantics (unlike topic models, word embeddings) 116
  ● Positive Pointwise Mutual Information (PPMI): PMI is a common measure for the strength of association between two words. It is defined as the log ratio between the joint probability of two words w and c and the product of their marginal probabilities: a. PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ] b. PPMI(w, c) = max(PMI(w, c), 0) 117
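A small sketch of computing PPMI from a word-context count matrix (the 3x3 matrix is made up for illustration):

    # Sketch: PPMI from a word-context co-occurrence matrix (illustrative)
    import numpy as np

    def ppmi(X, eps=1e-12):
        X = np.asarray(X, dtype=float)
        p_wc = X / X.sum()                          # joint probabilities P(w, c)
        p_w = p_wc.sum(axis=1, keepdims=True)       # marginal over contexts, P(w)
        p_c = p_wc.sum(axis=0, keepdims=True)       # marginal over words, P(c)
        pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
        return np.maximum(pmi, 0.0)                 # clip negative PMI to zero

    X = np.array([[2, 1, 0],
                  [1, 0, 3],
                  [0, 2, 1]])
    print(ppmi(X).round(2))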