2. What is LDA?
• LDA stands for Latent Dirichlet Allocation.
• It essentially combines the distribution of words in each topic k (let's say 50 topics) with the probability of
each topic k occurring in each document d (let's say 5,000 documents).
• Mechanism: it uses a special kind of distribution called the Dirichlet distribution, which
is a multivariate generalization of the Beta distribution's probability
density function.
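A minimal sketch of what a Dirichlet draw looks like, using only the standard library (the standard Gamma-normalization construction; the alpha values here are illustrative):

```python
import random

def sample_dirichlet(alphas):
    """Draw one sample from a Dirichlet distribution.

    Standard construction: draw independent Gamma(alpha_i, 1) variates
    and normalize them so they sum to 1.
    """
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# A symmetric Dirichlet over 5 topics. Each sample is a valid
# probability vector (non-negative, sums to 1), i.e. exactly the kind
# of per-document topic mixture LDA assumes.
theta = sample_dirichlet([0.1] * 5)
print(theta)
```

With small alphas (e.g. 0.1) most of the mass lands on a few topics, which is why LDA documents tend to be about only a handful of topics.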
3. LDA in layman terms
Sentence 1: I spent the evening watching football.
Sentence 2: I ate nachos and guacamole.
Sentence 3: I spent the evening watching football while eating nachos and guacamole.
LDA might say something like:
Sentence 1 is 100% Topic 1.
Sentence 2 is 100% Topic 2.
Sentence 3 is 65% Topic 1, 35% Topic 2.
But it also tells us that
Topic 1 is about football (50%) and evening (50%), while
Topic 2 is about nachos (50%) and guacamole (50%).
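The example above is really LDA's generative story run backwards. A toy sketch of that story, using the slide's two topics (all probabilities and names here are illustrative, and this shows generation, not inference):

```python
import random

def generate_document(doc_topic_mix, topic_word_dists, n_words):
    """Generate one toy document following LDA's generative story:
    for each word slot, pick a topic from the document's topic mixture,
    then pick a word from that topic's word distribution."""
    words = []
    topic_names = list(topic_word_dists)
    for _ in range(n_words):
        topic = random.choices(
            topic_names,
            weights=[doc_topic_mix[t] for t in topic_names],
        )[0]
        word_dist = topic_word_dists[topic]
        words.append(random.choices(
            list(word_dist), weights=list(word_dist.values())
        )[0])
    return words

# Topics matching the slide: Topic 1 = football/evening,
# Topic 2 = nachos/guacamole.
topics = {
    "topic1": {"football": 0.5, "evening": 0.5},
    "topic2": {"nachos": 0.5, "guacamole": 0.5},
}
# A "Sentence 3"-like document: 65% Topic 1, 35% Topic 2.
doc = generate_document({"topic1": 0.65, "topic2": 0.35}, topics, 10)
print(doc)
```

LDA inference is the reverse: given only the words, recover the topic mixtures and word distributions.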
8. Packages used in Python
• sudo pip install nltk
• sudo pip install gensim
• sudo pip install stop-words
9. Stop words
• Stop words are commonly occurring words that don't contribute to topic
modelling:
• the, and, or
• However, removing stop words sometimes hurts topic modelling.
• For example, "Thor The Ragnarok" is a single topic, but if we apply stop-word removal, "The"
will be stripped out of it.
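A minimal stop-word filter illustrating both the benefit and the downside from the slide (the stop list here is a tiny illustrative subset, not the full list shipped by nltk or the stop-words package):

```python
# Tiny illustrative stop list, not a real library's list.
STOP_WORDS = {"the", "and", "or", "a", "an", "i", "while"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# The usual win: filler words disappear before topic modelling.
print(remove_stop_words("I spent the evening watching football"))
# -> ['spent', 'evening', 'watching', 'football']

# The downside from the slide: a title loses part of its name.
print(remove_stop_words("Thor The Ragnarok"))
# -> ['thor', 'ragnarok']
```

In practice this is why stop lists are often customized per corpus rather than applied blindly.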
10. Porter's Stemmer algorithm
• A common NLP technique to reduce topically similar words to their root. For example, "stemming," "stemmer,"
and "stemmed" all have similar meanings; stemming reduces those terms to "stem."
• Important for topic modeling, which would otherwise view those terms as separate entities and reduce
their importance in the model.
• It is essentially a set of suffix-rewriting rules for reducing a word:
• sses -> ss
• ies -> i
• ational -> ate
• tional -> tion
• s -> (removed)
• When rules conflict, the one matching the longest suffix wins.
• Using it without customization for your corpus can be a bad idea.
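The rules above can be sketched directly, including the longest-rule-wins tie-break. This is only the handful of rules from the slide, not the full Porter algorithm; in practice use nltk's PorterStemmer:

```python
# Toy subset of Porter-style suffix rules (illustrative, not complete).
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("s", ""),  # a bare trailing "s" is simply removed
]

def toy_stem(word):
    """Apply the matching rule with the longest suffix, if any."""
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    suf, rep = max(matches, key=lambda m: len(m[0]))
    return word[: -len(suf)] + rep

print(toy_stem("caresses"))    # sses -> ss, giving "caress"
print(toy_stem("ponies"))      # ies -> i, giving "poni"
print(toy_stem("relational"))  # "ational" beats "tional": "relate"
print(toy_stem("cats"))        # trailing s removed: "cat"
```

Note "relational": both "ational" and "tional" match, and the longer suffix wins, exactly as the slide states.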
12. Lemmatization
• It goes one step further than stemming.
• It obtains grammatically correct root words and distinguishes words by their word
sense with the use of a vocabulary (e.g., "type" can mean to write or a category).
• It is a much more difficult and expensive process than stemming.
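A toy dictionary-based lemmatizer to illustrate the difference; a real lemmatizer (e.g. nltk's WordNetLemmatizer) consults a full vocabulary such as WordNet. The lexicon entries and part-of-speech tags here are purely illustrative:

```python
# Toy lexicon mapping (word, part of speech) -> lemma.
# A real lemmatizer uses a full vocabulary, not a hand-written dict.
LEXICON = {
    ("typing", "verb"): "type",  # "type" in the sense of writing
    ("types", "noun"): "type",   # "type" in the sense of category
    ("better", "adj"): "good",   # no suffix rule could produce this
    ("ate", "verb"): "eat",      # irregular forms need a vocabulary
}

def lemmatize(word, pos):
    """Look up the lemma for a (word, POS) pair; fall back to the word."""
    return LEXICON.get((word, pos), word)

print(lemmatize("better", "adj"))  # -> "good"; a stemmer leaves "better" as-is
print(lemmatize("ate", "verb"))    # -> "eat"
```

Irregular forms like "better" -> "good" are exactly why lemmatization needs a vocabulary and costs more than stemming's suffix rules.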
17. lda2vec: what really happens?
https://arxiv.org/pdf/1605.02019.pdf
The lda2vec model adds in skip-grams:
a word predicts another word in the same window,
as in word2vec, but there is also a context vector
that changes only at the document level, as in LDA.
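A schematic sketch of that context vector (all dimensions, values, and variable names here are made up for illustration): the prediction context is the sum of a word vector, as in word2vec, and a document vector built as a topic-weighted mixture, so it changes only per document, as in LDA:

```python
import math

def softmax(xs):
    """Normalize raw weights into a probability distribution."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

word_vector = [0.2, -0.1, 0.4]          # pivot word embedding (word2vec part)
topic_vectors = [                        # one embedding per topic (LDA part)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
# The document's topic weights, normalized to a mixture over topics.
doc_topic_weights = softmax([2.0, 0.5])

# Document vector = topic-weighted mixture of topic vectors;
# it is shared by every word position in this document.
doc_vector = [
    sum(w * t[i] for w, t in zip(doc_topic_weights, topic_vectors))
    for i in range(len(word_vector))
]

# Context vector used to predict nearby words in the skip-gram window.
context_vector = [w + d for w, d in zip(word_vector, doc_vector)]
print(context_vector)
```

The sparse, interpretable topic mixture (`doc_topic_weights`) is what lets lda2vec read out human-readable topics while still training with word2vec-style skip-gram objectives.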
18. Lda2Vec â Pytorch code
• Source: https://github.com/TropComplique/lda2vec-pytorch
• Go to 20newsgroups/.
• Run get_windows.ipynb to prepare the data.
• Run python train.py for training.
• Run explore_trained_model.ipynb.
• To use this on your own data you need to edit get_windows.ipynb. There are also
hyperparameters in 20newsgroups/train.py, utils/training.py, and utils/lda2vec_loss.py.