Physical and Health Education, University of Tokyo
D1 Hiroyuki Kuromiya
1
● I am a first-year Ph.D. student in Physical and Health Education.
● Despite the name of my course, I am currently working on learning analytics of research-based active learning.
● The data I have to analyze are often in text format. That is why I attend this class.
Hiroyuki Kuromiya
Hiroyuki Kuromiya
2
Today, I am going to introduce 5 papers about topic modeling.
● Indexing by Latent Semantic Analysis (Deerwester+, 1990)
● Probabilistic Latent Semantic Indexing (Hofmann, 1999)
● Latent Dirichlet Allocation (Blei+, 2003)
● Gaussian LDA for Topic Models with Word Embeddings (Das+, 2015)
● What is Wrong with Topic Modeling? (Agrawal+, 2018)
3
Since I don't have enough time to cover the whole contents of each paper, I will focus on the 5 questions listed below.
● What is their motivation?
● What is the key point of the paper?
● What is their model?
● How are the parameters estimated?
● What are the deficiencies of their model?
4
5
“a topic model is a type of statistical
model for discovering the abstract
"topics" that occur in a collection of
documents.”
(Wikipedia, “Topic model”, accessed on
May 3, 2018)
(岩田具治『トピックモデル』 [Tomoharu Iwata, Topic Models], 2015, p. vii)
6
The vector space model (VSM) is one of the most popular families of information retrieval techniques.
A VSM is characterised by three ingredients:
1. a transform function (also called a local term weight, such as term frequency)
2. a term weighting scheme (also called a global term weight, such as inverse document frequency)
3. a similarity measure, such as cosine distance
Semantic distance is represented as spatial distance (a minimal sketch follows after this slide).
Hofmann (1999). probabilistic latent semantic indexing, section 5.1
The vector space model for scoring
7
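To make the three ingredients concrete, here is a minimal Python sketch (the toy documents and query are hypothetical, not from the paper): raw term frequency as the local weight, inverse document frequency as the global weight, and cosine similarity as the comparison.

```python
import numpy as np

# Hypothetical toy corpus.
docs = ["access to documents in a database",
        "information theory for computer science",
        "computer based information retrieval"]
vocab = sorted({w for d in docs for w in d.split()})

# 1. Local term weight: raw term frequency.
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# 2. Global term weight: inverse document frequency.
idf = np.log(len(docs) / (tf > 0).sum(axis=0))

X = tf * idf  # weighted document vectors (one row per document)

# 3. Similarity measure: cosine similarity between the query and each document.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

query = "computer based information look up"
q = np.array([query.split().count(w) for w in vocab], dtype=float) * idf
print([round(cosine(q, row), 3) for row in X])
```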
Deerwester, S., Dumais, S. T., Furnas, G. W.,
Landauer, T. K., & Harshman, R. (1990)
8
● Deerwester belonged to the Graduate Library School, University of Chicago.
● The aim of the study was to improve information retrieval systems.
● They thought there was a fundamental problem in existing retrieval techniques that try to match the words of queries against the words of documents.
Scott Deerwester (1956-)
9
If the query is "IDF in computer-based information look-up", we think that documents 1 and 3 are relevant. However, a simple term-matching method would return documents 2 and 3.
Document 1 would not be returned because of the synonymy of "look-up", and document 2 would be returned because of the polysemy of "information".
        access  document  retrieval  information  theory  database  indexing  computer
Doc1      1        1          1                              1         1
Doc2                                      1          1                              1
Doc3                          1           1                                         1
10
They introduced a "semantic space" wherein terms and documents that are closely associated are placed near one another. By using the semantic space,
● we can get rid of obscuring noise in the data, and
● we can get at the conceptual content that users are really seeking.
11
Considering representational richness, explicit representation of both terms and documents, and computational tractability, they proposed two-mode factor analysis, i.e. Singular Value Decomposition (SVD).
X = T0 S0 D0^T
where X is the t-by-d term-document matrix (rows = terms, columns = documents), T0 is t-by-m, S0 is the m-by-m diagonal matrix of singular values, and D0^T is m-by-d.
12
Suppose that the u's are the eigenvectors of A A^T and the v's are the eigenvectors of A^T A.
Since those matrices are both symmetric, their eigenvectors can be chosen orthonormal.
The simple fact that A v_i = σ_i u_i (for singular values σ_i) leads to A V = U Σ.
It tells us that A = U Σ V^(-1); considering that V is an orthonormal matrix (V^(-1) = V^T), it becomes A = U Σ V^T.
Strang, Gilbert, et al. Introduction to Linear Algebra. Vol. 4. Wellesley, MA: Wellesley-Cambridge Press, 2009.
13
● It starts from an arbitrary rectangular matrix (cf. one-mode factor analysis requires A to be a square matrix).
● It allows us to approximate the original matrix using smaller matrices.
It is important that the derived k-dimensional factor space does not reconstruct the original term space perfectly, because this is exactly what removes the noise from the original data (cf. Python code for SVD; a sketch follows after this slide).
14
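As a rough illustration of the point above, here is a minimal numpy sketch of truncated SVD; the matrix is a random stand-in for a term-document matrix, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(8, 5)).astype(float)   # stand-in term-document matrix (t x d)

# Full SVD: X = T0 @ diag(S0) @ D0t
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: the rank-k factor space.
k = 2
X_hat = T0[:, :k] @ np.diag(S0[:k]) @ D0t[:k, :]

# The reconstruction is deliberately imperfect; the residual is treated as noise.
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```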
1. Using T and D, construct the semantic space.
2. Find a representation for the query by folding it in as a pseudo-document: q̂ = q^T T S^(-1), where q is the query's raw term vector.
3. Calculate the cosine distance between the query and the documents (see the sketch after this slide).
15
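A minimal sketch of steps 1-3, assuming the folding-in formula q̂ = qᵀ T S⁻¹ from Deerwester et al. (1990); the matrix, the query, and the choice of k are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(8, 5)).astype(float)   # stand-in term-document matrix (t x d)
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T_k, S_k, D_k = T0[:, :k], S0[:k], D0t[:k, :].T   # D_k: one k-dim vector per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Fold the query in as a pseudo-document: q_hat = q^T T S^(-1)
q = np.zeros(X.shape[0]); q[[2, 7]] = 1.0         # hypothetical query containing terms 2 and 7
q_hat = q @ T_k @ np.diag(1.0 / S_k)

scores = [cosine(q_hat, d_vec) for d_vec in D_k]  # similarity to each document
print(np.argsort(scores)[::-1])                   # documents ranked by similarity
```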
● Precision of the LSI method lies well above that obtained with term matching, SMART, and Voorhees.
● The average difference in precision between LSI and the term-matching method is .06, which represents a 13% improvement over raw term matching.
16
● Retrieval requires an exhaustive comparison of the query vector against all stored document vectors.
● The initial SVD analysis is time-consuming, and it is hard to update.
● The latent factors lack a statistical foundation.
"Roughly speaking, these factors may be thought of as artificial concepts"
(Section 4.1)
17
Thomas Hofmann (1999)
18
● He belonged to the International Computer Science Institute, Berkeley, CA.
● In order for computers to interact more naturally with humans, natural-language queries are needed.
● Although LSA has been applied with remarkable success in different domains, it does not have a satisfactory statistical foundation.
Thomas Hofmann (1968-)
19
He presented a novel approach to LSA that has a solid statistical foundation based on the likelihood principle and a proper generative model of the data.
● He used a statistical latent class model called the "aspect model".
● The model is fitted with tempered (annealed) EM, where the annealing is adjusted by minimizing word perplexity.
20
The aspect model is a latent variable model for general co-occurrence data which associates an unobserved class variable z with each observation.
1. Select a document d with probability P(d)
2. Pick a latent class z with probability P(z|d)
3. Generate a word w with probability P(w|z)
The latent classes z are called "aspects".
21
The joint probability of a single (document, word) pair is, by the product rule and marginalization over z,
P(d, w) = P(d) P(w|d),  where  P(w|d) = Σ_z P(w|z) P(z|d).
Hence, the joint probability of the whole data set is the product of P(d, w) over all observed pairs.
22
There are three sets of parameters in the aspect model: p(z), p(w|z), and p(d|z) (equivalently p(d), p(z|d), and p(w|z) in the asymmetric formulation).
We use the Expectation-Maximization (EM) algorithm to estimate them.
The EM algorithm is the standard procedure for maximum likelihood estimation in latent variable models.
Before explaining the EM algorithm, let me first try direct maximum likelihood estimation.
23
Let n(d, w) be the term frequency of w in document d. The likelihood of the model is
L = Π_{d,w} P(d, w)^{n(d,w)}.
Hence, the log-likelihood is
log L = Σ_{d,w} n(d, w) log P(d, w) = Σ_{d,w} n(d, w) log [ P(d) Σ_z P(w|z) P(z|d) ].
I want to maximize this log-likelihood, but the log-sum structure in the equation is hard to differentiate directly.
24
By Jensen's inequality, the log-likelihood is lower-bounded by
F = Σ_{d,w} n(d, w) Σ_z P(z|d, w) log [ P(d) P(z|d) P(w|z) / P(z|d, w) ].
Maximizing F with Lagrange multipliers (to keep the probabilities normalized) leads to the M-step updates:
P(w|z) ∝ Σ_d n(d, w) P(z|d, w),   P(z|d) ∝ Σ_w n(d, w) P(z|d, w),   P(d) ∝ Σ_w n(d, w).
25
1. Initialize the parameters p(d), p(w|z), p(z|d)
2. E-step
Calculate p(z|d,w) from the current parameters
3. M-step
Update the parameters p(d), p(w|z), p(z|d) using the p(z|d,w) just calculated
4. Re-calculate the log-likelihood; repeat the E-step and M-step until |new log-likelihood - old log-likelihood| < ε
Python code for PLSA implementation (a sketch follows below)
26
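Since the slide only links to an implementation, here is a minimal EM sketch for the asymmetric aspect model with parameters P(d), P(z|d), P(w|z); the variable names and the synthetic count matrix are my own, and no tempering or convergence check is included.

```python
import numpy as np

def plsa_em(N, K, n_iter=50, seed=0):
    """Fit the aspect model to a term-frequency matrix N (documents x words) by EM."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    p_d = N.sum(axis=1) / N.sum()  # P(d); its ML estimate is fixed by the counts
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) P(z|d,w)
        exp_counts = N[:, None, :] * post
        p_w_z = exp_counts.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = exp_counts.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_d, p_z_d, p_w_z

# Tiny synthetic example.
N = np.random.default_rng(1).poisson(1.0, size=(6, 12)).astype(float)
p_d, p_z_d, p_w_z = plsa_em(N, K=2)
print(p_z_d.round(2))
```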
● It is interesting to see that pLSA captures two different senses of "flight" and of "love" in its topics; it distinguishes the polysemy of these words.
● The experiments consistently validate the advantages of pLSI over LSI.
27
Some points are derived from Blei et al. (2003).
● pLSA is prone to overfitting.
○ They use tempered EM, an improved version of the EM algorithm, to avoid overfitting, but it is not a fundamental solution.
● There is no statistical foundation at the level of documents.
○ In pLSA, each document is represented as a set of mixing proportions over topics, and there is no generative probabilistic model for these numbers.
○ As a result, the number of parameters in the model grows linearly with the size of the corpus, which leads to overfitting.
○ In addition, it is not clear how to assign probability to a document outside of the training set.
28
David M. Blei, Andrew Y. Ng, Michael I. Jordan (2003)
29
● Blei was in the Computer Science Division, University of California, Berkeley.
● This paper considers the problem of modeling text corpora and other collections of discrete data.
● They thought pLSA was incomplete because it provides no probabilistic model at the level of documents.
30
exchangeability of both words and documents
● LSI and pLSI are based on the "bag-of-words" assumption -- that the order of words in a document can be neglected -- but it is less often stated that documents are exchangeable as well as words.
● de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution.
This leads to latent Dirichlet allocation, in which topics are treated as infinitely exchangeable within a document.
31
LDA assumes the following generative process for each document in a corpus (see the simulation sketch after this slide).
1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dirichlet(α)
3. For each of the N words w_n:
a. Choose a topic z_n ~ Multinomial(θ)
b. Choose a word w_n from p(w_n | z_n, β), a multinomial conditioned on the topic z_n
Note that θ is a document-level variable, sampled once per document.
The joint distribution is given by
p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β).
32
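A small simulation of this generative process, just to make the sampling steps concrete; the vocabulary, the hyperparameters, and the per-topic word distributions are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["ball", "game", "team", "gene", "cell", "protein"]
K, alpha, xi = 2, np.array([1.0, 1.0]), 8          # topics, Dirichlet prior, Poisson mean
beta = rng.dirichlet(np.ones(len(vocab)), size=K)  # per-topic word distributions

def generate_document():
    N = rng.poisson(xi)                        # 1. document length  ~ Poisson(xi)
    theta = rng.dirichlet(alpha)               # 2. topic proportions ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)             # 3a. topic ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=beta[z])  # 3b. word  ~ p(w | z, beta)
        words.append(vocab[w])
    return words

print(generate_document())
```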
The Dirichlet distribution is conjugate to the multinomial distribution.
The figure on the right shows the Dirichlet distribution for different values of α. It matches our natural intuition about proportions: for example, if α = (5, 2, 2), θ1 tends to be high (see the quick check after this slide).
https://cs.stanford.edu
33
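A quick numerical check of the α = (5, 2, 2) example: the mean of Dirichlet(5, 2, 2) is (5/9, 2/9, 2/9), so θ1 is the largest component on average.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet([5, 2, 2], size=100_000)
print(theta.mean(axis=0))  # roughly [0.56, 0.22, 0.22]: theta_1 tends to be high
```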
The key inferential problem is to compute the posterior distribution of the hidden variables given a document.
Integrating over θ and summing over z, we obtain the marginal distribution of a document:
p(w | α, β) = ∫ p(θ | α) ( Π_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ.
This is intractable due to the coupling between θ and β in the summation over latent topics, so approximate inference is used to estimate the parameters.
34
1. E-step: find the optimizing values of the variational parameters γ and φ
a. Variational inference for γ and φ
i. Initialize γ and φ
ii. Repeat the coordinate updates until convergence
2. M-step: maximize the resulting lower bound on the log-likelihood with respect to the model parameters α and β.
source code for python (a gensim-based sketch follows below)
35
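The exact variational updates are easiest to follow in the paper itself; in practice one would usually fit LDA with an existing library. A minimal sketch with gensim follows (the corpus is hypothetical, and note that gensim implements an online variational Bayes variant rather than the paper's batch variational EM).

```python
from gensim import corpora
from gensim.models import LdaModel

# Hypothetical tokenized documents.
texts = [["ball", "game", "team"], ["gene", "cell", "protein"],
         ["team", "game", "score"], ["cell", "protein", "dna"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words counts per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=20, random_state=0)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```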
● LDA consistently performs better than the other methods: unigram, mixture of unigrams, and pLSI.
● For the classification task, performance is improved with LDA features.
● For the collaborative filtering task (EachMovie), the best predictive performance was obtained by the LDA model.
36
● Order effects (cf. Agrawal et al., 2018)
Different topics are generated if the training data is shuffled, since the internal weights are updated via a stochastic sampling process. Such effects introduce a systematic error into any study.
● Topic coherence (cf. Das et al., 2015)
A prior preference for semantic coherence is not encoded in the model, so some topics may look incoherent to human evaluators.
● Cannot handle out-of-vocabulary (OOV) words (cf. Das et al., 2015)
37
38
Let us write the aspect model in matrix notation. Define the matrices
U = (P(d_i | z_k))_{i,k},  V = (P(w_j | z_k))_{j,k},  Σ = diag(P(z_k)).
Then the joint probability model P can be written as the matrix product P = U Σ V^T.
Although there is a fundamental difference between LSA and pLSA, pLSA can thus also be seen as a dimensionality reduction method.
39
An M-dimensional multinomial distribution can be represented as a point on the (M-1)-dimensional simplex of all possible multinomials.
Since the dimensionality of the sub-simplex spanned by the aspects (the probabilistic latent semantic space) is K-1, as opposed to M-1 for the complete probability simplex, pLSA can also be thought of as dimensionality reduction.
40
The topic simplex for three topics is embedded in the word simplex for three words.
The pLSI model induces an empirical distribution on the topic simplex, denoted by the x marks; LDA places a smooth distribution on the topic simplex, denoted by the contour lines.
41
The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.
You can easily see that LDA assumes a generative model at the level of documents.
(d) LDA model
42
Rajarshi Das, Manzil Zaheer, Chris Dyer (2015)
43
● Das was a second-year Ph.D. student in the School of Computer Science, Carnegie Mellon University.
● They propose a new technique for topic modeling that uses word embeddings (Mikolov, 2013).
44
According to the distributional hypothesis, words occurring in similar contexts tend to have similar meanings.
This has given rise to data-driven learning of word vectors that capture lexical and semantic properties (e.g. word2vec).
The authors assume that, rather than consisting of sequences of word types, documents consist of sequences of word embeddings.
45
Since our observations are no longer discrete values but continuous vectors in an M-dimensional space, we characterize each topic k as a multivariate Gaussian distribution with mean μ_k and covariance Σ_k.
The generative process can thus be summarized as follows.
46
1. for k = 1 to K:
a. Draw the topic covariance Σ_k ~ W^(-1)(Ψ, ν)  (inverse-Wishart)
b. Draw the topic mean μ_k ~ N(μ_0, Σ_k / κ)
2. for each document d in the corpus D:
a. Draw the topic distribution θ_d ~ Dirichlet(α)
b. for each word:
i. Draw a topic z ~ Categorical(θ_d)
ii. Draw the embedded word vector v ~ N(μ_z, Σ_z)
(see the simulation sketch after this slide)
47
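A small simulation of this generative process, assuming the inverse-Wishart prior on topic covariances and the Gaussian prior on topic means listed above; the embedding dimension, hyperparameters, and document lengths are made up for illustration.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
M, K, n_docs = 3, 2, 4                     # embedding dim, topics, documents
nu, Psi = M + 2, np.eye(M)                 # inverse-Wishart hyperparameters
mu0, kappa = np.zeros(M), 0.1              # Gaussian prior on topic means
alpha = np.ones(K)

# 1. Draw topic parameters.
Sigma = [invwishart.rvs(df=nu, scale=Psi, random_state=k) for k in range(K)]
mu = [rng.multivariate_normal(mu0, Sigma[k] / kappa) for k in range(K)]

# 2. Draw documents as sequences of embedded word vectors.
docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                         # topic distribution
    words = []
    for _ in range(rng.poisson(5) + 1):                  # document length chosen arbitrarily here
        z = rng.choice(K, p=theta)                       # topic assignment
        v = rng.multivariate_normal(mu[z], Sigma[z])     # embedded word vector
        words.append(v)
    docs.append(np.array(words))

print(docs[0].shape)   # (number of words in doc 0, M)
```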
We wish to infer the posterior distribution over the topic parameters, the topic proportions, and the topic assignments of individual words.
A collapsed Gibbs sampler is used to infer them.
The sampling can be made faster by maintaining a Cholesky decomposition of the covariance matrices.
source code for python
48
● To measure topic coherence, they follow prior work and compute the Pointwise Mutual Information (PMI) of topic words (a sketch follows below).
● Gaussian LDA is a clear winner, achieving a 275% higher score on average.
49
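A minimal sketch of PMI-based topic coherence; document-level co-occurrence counting is one common convention and not necessarily the exact scheme used in the paper, and the reference corpus here is hypothetical.

```python
import itertools
import math

def topic_pmi(top_words, reference_docs, eps=1e-12):
    """Average PMI over all pairs of a topic's top words, using document co-occurrence."""
    n = len(reference_docs)
    docs = [set(d) for d in reference_docs]

    def p(*words):
        # Fraction of reference documents containing all the given words.
        return sum(all(w in d for w in words) for d in docs) / n

    pairs = list(itertools.combinations(top_words, 2))
    scores = [math.log((p(w1, w2) + eps) / ((p(w1) + eps) * (p(w2) + eps)))
              for w1, w2 in pairs]
    return sum(scores) / len(scores)

# Hypothetical reference corpus and topic.
corpus = [["ball", "game", "team"], ["gene", "cell", "protein"],
          ["team", "ball", "score"], ["cell", "dna", "protein"]]
print(round(topic_pmi(["ball", "game", "team"], corpus), 3))
```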
● They select a subset of documents and replace words in those documents with synonyms that have not occurred in the corpus before.
● Compared with a recently proposed extension of LDA that can handle unseen words (infvoc), Gaussian LDA performs better here, too.
50
Amritanshu Agrawal, Wei Fu, Tim Menzies (2018)
51
This is just a summary of the paper. I did not have enough time to read it closely because I spent a lot of time trying to understand the parameter inference parts of pLSA and LDA. Sorry for the lack of planning.
● Motivation: the current great challenge in software analytics is understanding unstructured data.
● Key point: tuning proper parameters to fix "order effects" in LDA.
● Model: they propose LDADE, a search-based software engineering tool which uses Differential Evolution (DE) to tune LDA's parameters.
● Results: LDADE's tunings dramatically reduce cluster instability and lead to improved performance for supervised as well as unsupervised learning.
52
● Since 1990, topic modeling has been in constant demand, although the social background and researchers' motivations have changed.
● Topic models are easy to extend with other probabilistic components; newer models have become progressively more complex.
● The parameter estimation methods have been evolving so that they can deal with these more flexible models.
● The parameter inference parts were very difficult for me to understand. I will need some mathematical training, especially in optimization, which builds on linear algebra and probability theory.
53