2. What is LDA?
• LDA stands for Latent Dirichlet Allocation.
• It essentially combines the distribution of words in each topic k (let's say 50 topics) with the probability of
each topic k occurring in each document d (let's say 5,000 documents).
• Mechanism: it uses a special kind of distribution called the Dirichlet distribution, which
is a multivariate generalization of the Beta distribution's probability
density function.
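A minimal sketch of what a Dirichlet draw looks like, using only the standard library (the standard Gamma-normalization construction; the alpha values here are illustrative):

```python
import random

def sample_dirichlet(alphas):
    """Draw one sample from a Dirichlet distribution.

    Standard construction: draw independent Gamma(alpha_i, 1) variates
    and normalize them so they sum to 1.
    """
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# A symmetric Dirichlet over 5 topics. Each sample is a valid
# probability vector (non-negative, sums to 1), i.e. exactly the kind
# of per-document topic mixture LDA assumes.
theta = sample_dirichlet([0.1] * 5)
print(theta)
```

With small alphas (e.g. 0.1) most of the mass lands on a few topics, which is why LDA documents tend to be about only a handful of topics.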
3. LDA in layman terms
Sentence 1: I spent the evening watching football.
Sentence 2: I ate nachos and guacamole.
Sentence 3: I spent the evening watching football while eating nachos and guacamole.
LDA might say something like:
Sentence 1 is 100% Topic 1.
Sentence 2 is 100% Topic 2.
Sentence 3 is 65% Topic 1, 35% Topic 2.
But it also tells us that
Topic 1 is about football (50%) and evening (50%), while
Topic 2 is about nachos (50%) and guacamole (50%).
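The example above is really LDA's generative story run backwards. A toy sketch of that story, using the slide's two topics (all probabilities and names here are illustrative, and this shows generation, not inference):

```python
import random

def generate_document(doc_topic_mix, topic_word_dists, n_words):
    """Generate one toy document following LDA's generative story:
    for each word slot, pick a topic from the document's topic mixture,
    then pick a word from that topic's word distribution."""
    words = []
    topic_names = list(topic_word_dists)
    for _ in range(n_words):
        topic = random.choices(
            topic_names,
            weights=[doc_topic_mix[t] for t in topic_names],
        )[0]
        word_dist = topic_word_dists[topic]
        words.append(random.choices(
            list(word_dist), weights=list(word_dist.values())
        )[0])
    return words

# Topics matching the slide: Topic 1 = football/evening,
# Topic 2 = nachos/guacamole.
topics = {
    "topic1": {"football": 0.5, "evening": 0.5},
    "topic2": {"nachos": 0.5, "guacamole": 0.5},
}
# A "Sentence 3"-like document: 65% Topic 1, 35% Topic 2.
doc = generate_document({"topic1": 0.65, "topic2": 0.35}, topics, 10)
print(doc)
```

LDA inference is the reverse: given only the words, recover the topic mixtures and word distributions.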
8. Packages used in Python
• sudo pip install nltk
• sudo pip install gensim
• sudo pip install stop-words
9. Stop words
• Stop words are commonly occurring words that don't contribute to topic
modelling:
• the, and, or
• However, removing stop words sometimes hurts topic modelling.
• For example, "Thor The Ragnarok" is a single topic, but if we apply stop-word removal, "The"
will be stripped out of it.
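A minimal stop-word filter illustrating both the benefit and the downside from the slide (the stop list here is a tiny illustrative subset, not the full list shipped by nltk or the stop-words package):

```python
# Tiny illustrative stop list, not a real library's list.
STOP_WORDS = {"the", "and", "or", "a", "an", "i", "while"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# The usual win: filler words disappear before topic modelling.
print(remove_stop_words("I spent the evening watching football"))
# -> ['spent', 'evening', 'watching', 'football']

# The downside from the slide: a title loses part of its name.
print(remove_stop_words("Thor The Ragnarok"))
# -> ['thor', 'ragnarok']
```

In practice this is why stop lists are often customized per corpus rather than applied blindly.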
10. Porter's Stemmer algorithm
• A common NLP technique to reduce topically similar words to their root. For example, "stemming," "stemmer,"
and "stemmed" all have similar meanings; stemming reduces those terms to "stem."
• Important for topic modeling, which would otherwise view those terms as separate entities and reduce
their importance in the model.
• It is essentially a set of suffix-rewriting rules for reducing a word:
• sses -> ss
• ies -> i
• ational -> ate
• tional -> tion
• s -> (removed)
• When rules conflict, the one matching the longest suffix wins.
• Using it without customization for your corpus can be a bad idea.
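The rules above can be sketched directly, including the longest-rule-wins tie-break. This is only the handful of rules from the slide, not the full Porter algorithm; in practice use nltk's PorterStemmer:

```python
# Toy subset of Porter-style suffix rules (illustrative, not complete).
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("s", ""),  # a bare trailing "s" is simply removed
]

def toy_stem(word):
    """Apply the matching rule with the longest suffix, if any."""
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    suf, rep = max(matches, key=lambda m: len(m[0]))
    return word[: -len(suf)] + rep

print(toy_stem("caresses"))    # sses -> ss, giving "caress"
print(toy_stem("ponies"))      # ies -> i, giving "poni"
print(toy_stem("relational"))  # "ational" beats "tional": "relate"
print(toy_stem("cats"))        # trailing s removed: "cat"
```

Note "relational": both "ational" and "tional" match, and the longer suffix wins, exactly as the slide states.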
12. Lemmatization
• It goes one step further than stemming.
• It obtains grammatically correct root words and distinguishes words by their word
sense with the use of a vocabulary (e.g., "type" can mean to write or a category).
• It is a much more difficult and expensive process than stemming.
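A toy dictionary-based lemmatizer to illustrate the difference; a real lemmatizer (e.g. nltk's WordNetLemmatizer) consults a full vocabulary such as WordNet. The lexicon entries and part-of-speech tags here are purely illustrative:

```python
# Toy lexicon mapping (word, part of speech) -> lemma.
# A real lemmatizer uses a full vocabulary, not a hand-written dict.
LEXICON = {
    ("typing", "verb"): "type",  # "type" in the sense of writing
    ("types", "noun"): "type",   # "type" in the sense of category
    ("better", "adj"): "good",   # no suffix rule could produce this
    ("ate", "verb"): "eat",      # irregular forms need a vocabulary
}

def lemmatize(word, pos):
    """Look up the lemma for a (word, POS) pair; fall back to the word."""
    return LEXICON.get((word, pos), word)

print(lemmatize("better", "adj"))  # -> "good"; a stemmer leaves "better" as-is
print(lemmatize("ate", "verb"))    # -> "eat"
```

Irregular forms like "better" -> "good" are exactly why lemmatization needs a vocabulary and costs more than stemming's suffix rules.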
17. lda2vec: what really happens?
https://arxiv.org/pdf/1605.02019.pdf
The lda2vec model adds in skip-grams:
a word predicts another word in the same window,
as in word2vec, but there is also a context vector
that changes only at the document level, as in LDA.
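A schematic sketch of that context vector (all dimensions, values, and variable names here are made up for illustration): the prediction context is the sum of a word vector, as in word2vec, and a document vector built as a topic-weighted mixture, so it changes only per document, as in LDA:

```python
import math

def softmax(xs):
    """Normalize raw weights into a probability distribution."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

word_vector = [0.2, -0.1, 0.4]          # pivot word embedding (word2vec part)
topic_vectors = [                        # one embedding per topic (LDA part)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
# The document's topic weights, normalized to a mixture over topics.
doc_topic_weights = softmax([2.0, 0.5])

# Document vector = topic-weighted mixture of topic vectors;
# it is shared by every word position in this document.
doc_vector = [
    sum(w * t[i] for w, t in zip(doc_topic_weights, topic_vectors))
    for i in range(len(word_vector))
]

# Context vector used to predict nearby words in the skip-gram window.
context_vector = [w + d for w, d in zip(word_vector, doc_vector)]
print(context_vector)
```

The sparse, interpretable topic mixture (`doc_topic_weights`) is what lets lda2vec read out human-readable topics while still training with word2vec-style skip-gram objectives.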
18. Lda2Vec â Pytorch code
• Source: https://github.com/TropComplique/lda2vec-pytorch
• Go to 20newsgroups/.
• Run get_windows.ipynb to prepare the data.
• Run python train.py for training.
• Run explore_trained_model.ipynb.
• To use this on your own data you need to edit get_windows.ipynb. There are also
hyperparameters in 20newsgroups/train.py, utils/training.py, and utils/lda2vec_loss.py.