2. ● I'm a first-year student in the Ph.D. course in Physical and Health Education.
● Despite the name of my course, I am currently working on learning analytics of research-based active learning.
● The data I have to analyze are often in text format. That's why I am attending this class.
Hiroyuki Kuromiya
3. Today, I am going to introduce five papers on topic modeling.
● Indexing by Latent Semantic Analysis (Deerwester+, 1990)
● Probabilistic Latent Semantic Indexing (Hofmann, 1999)
● Latent Dirichlet Allocation (Blei+, 2003)
● Gaussian LDA for Topic Models with Word Embeddings (Das+, 2015)
● What is Wrong with Topic Modeling? (Agrawal+, 2018)
4. Since I don't have enough time to cover the whole contents of each paper, I want to focus on the five questions listed below.
● What is their motivation?
● What is the key point of their paper?
● What is their model?
● How do they estimate the parameters?
● What are the deficiencies of their model?
6. “a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.”
(Wikipedia, “Topic model”, accessed on May 3, 2018)
(Tomoharu Iwata, Topic Models [トピックモデル], 2015, p. vii)
7. VSM is one of the most popular families of information retrieval techniques.
VSM is characterised by three ingredients:
1. a transform function (also called a local term weight, such as term frequency)
2. a term weighting scheme (also called a global term weight, such as inverse document frequency)
3. a similarity measure, such as cosine similarity
We represent semantic distance as spatial distance, as sketched below.
Hofmann (1999), Probabilistic Latent Semantic Indexing, Section 5.1; cf. “The vector space model for scoring”
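As a minimal illustration (not taken from any of the papers; the toy corpus, the query, and the weighting choices are all made up), here is a VSM sketch in Python with term frequency as the local weight, inverse document frequency as the global weight, and cosine similarity:

```python
import numpy as np

# Toy corpus; documents and vocabulary are made up for illustration.
docs = [
    "human computer interaction",
    "graph of trees",
    "computer system interface",
]
vocab = sorted({w for d in docs for w in d.split()})

# Local weight: raw term frequency.
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Global weight: inverse document frequency.
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)
X = tf * idf  # tf-idf matrix, documents x terms

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

query = "computer interface"
q = np.array([query.split().count(w) for w in vocab], dtype=float) * idf
scores = [cosine(q, X[i]) for i in range(len(docs))]
print(sorted(zip(scores, docs), reverse=True))  # documents ranked by similarity
```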
9. ● Deerwester belonged to the Graduate Library School, University of Chicago.
● The aim of the study was to improve information retrieval systems.
● They thought that there was a fundamental problem in existing retrieval techniques, which try to match the words of queries with the words of documents.
Scott Deerwester (1956-)
10. If the query is “IDF in computer-based information look-up”, we think that documents 1 and 3 are relevant. However, a simple term-matching method would return documents 2 and 3.
Document 1 would not be returned because of the synonymy effect of “look-up”, and document 2 would be returned because of the polysemy effect of “information”.
(Table: a term-document incidence matrix over the terms access, document, retrieval, information, theory, database, indexing, computer; Doc1 marks five of the terms, Doc2 and Doc3 mark three each.)
11. They introduced a “semantic space” wherein terms and documents that are closely associated are placed near one another. By using the “semantic space”:
● we can get rid of obscuring noise in the data
● we can capture the conceptual content that users are really seeking
12. Considering representational richness, explicit representation of both terms and documents, and computational tractability, they proposed two-mode factor analysis, i.e., Singular Value Decomposition (SVD).
(Figure: the $t \times d$ term-document matrix $X$, with rows indexed by terms and columns by documents, factored as $X = T_0 S_0 D_0^T$, where $T_0$ is $t \times m$, $S_0$ is $m \times m$, and $D_0^T$ is $m \times d$.)
13. Suppose that the $u$'s are the eigenvectors of $AA^T$ and the $v$'s are the eigenvectors of $A^TA$. Since those matrices are both symmetric, their eigenvectors can be chosen orthonormal.
The simple fact that $(AA^T)(Av_i) = A(A^TA)v_i = \sigma_i^2(Av_i)$ leads to $Av_i = \sigma_i u_i$.
It tells us that $AV = U\Sigma$.
Considering $V$ is an orthonormal matrix ($V^{-1} = V^T$), it becomes $A = U\Sigma V^T$.
Strang, Gilbert. Introduction to Linear Algebra. 4th ed. Wellesley, MA: Wellesley-Cambridge Press, 2009.
14. ● It begins with an arbitrary rectangular matrix (cf. one-mode factor analysis requires $A$ to be a square matrix).
● It allows us to approximate the original matrix using smaller matrices.
It is important that the derived k-dimensional factor space does not reconstruct the original term space perfectly, because this means getting rid of the noise in the original data (cf. the Python SVD sketch below).
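A minimal NumPy sketch of rank-k truncation (a random matrix stands in for a real term-document matrix; the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 5))          # stand-in term-document matrix (t x d)

# Full SVD: X = T0 @ diag(S0) @ D0t
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: a rank-k "semantic space".
k = 2
Xk = T0[:, :k] @ np.diag(S0[:k]) @ D0t[:k, :]

# The approximation is deliberately imperfect; the residual is the "noise"
# that the k-dimensional factor space throws away.
print(np.linalg.norm(X - Xk) / np.linalg.norm(X))
```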
15. 1. Using $T$ and $D$, construct the semantic space.
2. Find a representation for the query by folding it into the space: $\hat{q} = q^T T_k S_k^{-1}$, where $q$ is the query's term-frequency vector.
3. Calculate the cosine similarity between the query and each document.
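A sketch of these three steps (self-contained; the matrix and query are made up, and documents are compared by their coordinates in the k-dimensional space):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((8, 5))                          # toy term-document matrix (t x d)

# 1. Construct the semantic space from the truncated SVD.
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
Tk, Sk, Dk = T0[:, :k], np.diag(S0[:k]), D0t[:k, :].T  # Dk: document coords (d x k)

# 2. Fold the query (a term-frequency vector) into the space: q_hat = q^T Tk Sk^-1
q = np.zeros(8)
q[[0, 3]] = 1.0                                 # toy query containing terms 0 and 3
q_hat = q @ Tk @ np.linalg.inv(Sk)

# 3. Rank documents by cosine similarity in the semantic space.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(q_hat, Dk[j]) for j in range(Dk.shape[0])]
print(np.argsort(scores)[::-1])                 # document indices, most relevant first
```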
16. ● The precision of the LSI method lies well above that obtained with term matching, SMART, and the Voorhees method.
● The average difference in precision between the LSI and the term-matching method is .06, which represents a 13% improvement over raw term matching.
17. ● Exhaustive comparison of a query vector against all stored document vectors is required.
● The initial SVD analysis is time-consuming, and the model is hard to update.
● There is no statistical foundation for the latent factors: “Roughly speaking, these factors may be thought of as artificial concepts” (Section 4.1).
19. ● He belonged to the International Computer Science Institute, Berkeley, CA.
● In order for computers to interact more naturally with humans, natural-language queries are needed.
● Although LSA has been applied with remarkable success in different domains, it does not have a satisfactory statistical foundation.
Thomas Hofmann (1968-)
20. He presented a novel approach to LSA that has a solid statistical foundation based on the likelihood principle and a proper generative model of the data.
● He used a statistical latent class model called the “aspect model”.
● The model is fitted by an annealed variant of EM (tempered EM), with word perplexity used to control generalization.
21. The aspect model is a latent variable model for general co-occurrence data which associates an unobserved class variable $z$ with each observation.
1. Select a document $d$ with probability $P(d)$
2. Pick a latent class $z$ with probability $P(z|d)$
3. Generate a word $w$ with probability $P(w|z)$
We call the $P(z|d)$ “aspects”.
22. The probability of a single word-document pair is
$P(d, w) = P(d)\,P(w|d)$ (product rule), where $P(w|d) = \sum_z P(w|z)\,P(z|d)$ (marginalization).
Hence, the joint probability of the whole data set is
$P(\mathcal{D}) = \prod_d \prod_w P(d, w)^{n(d,w)}$, where $n(d, w)$ is the number of times $w$ occurs in $d$.
23. There are three sets of parameters in the aspect model: $P(z)$, $P(d|z)$, and $P(w|z)$.
We use the Expectation-Maximization (EM) algorithm to estimate them.
The EM algorithm is the standard procedure for maximum likelihood estimation in latent variable models.
Before explaining the EM algorithm, let me first try ordinary maximum likelihood estimation.
24. Let $n(d, w)$ be the frequency of term $w$ in document $d$. The likelihood of the model is
$L = \prod_d \prod_w P(d, w)^{n(d,w)}$.
Hence, the log-likelihood is written as
$\log L = \sum_d \sum_w n(d, w) \log\Big[P(d) \sum_z P(w|z)\,P(z|d)\Big]$.
I want to maximize this log-likelihood, but the log-of-a-sum structure in the equation is hard to differentiate.
26. 1. Initialize the parameters $P(z)$, $P(d|z)$, $P(w|z)$.
2. E-step
Calculate $P(z|d,w) = \dfrac{P(z)\,P(d|z)\,P(w|z)}{\sum_{z'} P(z')\,P(d|z')\,P(w|z')}$ from the current parameters.
3. M-step
Update the parameters $P(z)$, $P(d|z)$, $P(w|z)$ using the $P(z|d,w)$ that has just been calculated.
4. Recalculate the log-likelihood. Repeat the E-step and M-step until |new log-likelihood - old log-likelihood| < ε.
Python code for PLSA implementation
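The linked implementation is not reproduced here; below is a minimal NumPy sketch of this EM loop under the symmetric parameterization $P(z)$, $P(d|z)$, $P(w|z)$, with a made-up toy count matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, K = 6, 10, 2
N = rng.integers(0, 5, size=(D, W)).astype(float)  # toy term-frequency matrix n(d, w)

# 1. Initialize parameters (rows normalized to sum to 1).
pz = np.full(K, 1.0 / K)
pdz = rng.random((K, D)); pdz /= pdz.sum(axis=1, keepdims=True)
pwz = rng.random((K, W)); pwz /= pwz.sum(axis=1, keepdims=True)

def loglik():
    pdw = np.einsum("k,kd,kw->dw", pz, pdz, pwz)   # P(d, w)
    return np.sum(N * np.log(pdw + 1e-12))

old = -np.inf
for it in range(200):
    # 2. E-step: posterior p(z|d,w) for every (d, w) pair.
    post = np.einsum("k,kd,kw->kdw", pz, pdz, pwz)
    post /= post.sum(axis=0, keepdims=True) + 1e-12
    # 3. M-step: re-estimate parameters from expected counts.
    nk = np.einsum("dw,kdw->k", N, post)           # expected count per topic
    pdz = np.einsum("dw,kdw->kd", N, post) / nk[:, None]
    pwz = np.einsum("dw,kdw->kw", N, post) / nk[:, None]
    pz = nk / nk.sum()
    # 4. Check convergence of the log-likelihood.
    new = loglik()
    if abs(new - old) < 1e-6:
        break
    old = new
print(it, new)
```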
27. ● It is interesting to see that pLSA captures two different senses of “flight” and “love” in its topics; it can distinguish the polysemy of words.
● The experiments consistently validate the advantages of pLSI over LSI.
28. Some points are derived from Blei et al. (2003).
● pLSA is prone to overfitting.
○ They use tempered EM, an improved version of the EM algorithm, to mitigate overfitting, but it is not a fundamental solution.
● There is no statistical foundation at the level of documents.
○ In pLSA, each document is represented as a vector of mixing proportions for topics, and there is no generative probabilistic model for these numbers.
○ As a result, the number of parameters in the model grows linearly with the size of the corpus, which leads to overfitting.
○ Plus, it is not clear how to assign probability to a document outside of the training set.
29. David M. Blei, Andrew Y. Ng, Michael I. Jordan (2003)
30. ● Blei was in the Computer Science Division, University of California, Berkeley.
● This paper considers the problem of modeling text corpora and other collections of discrete data.
● They thought pLSA was incomplete because it provides no probabilistic model at the level of documents.
31. Exchangeability of both words and documents
● LSI and pLSI are based on the “bag-of-words” assumption -- that the order of words in a document can be neglected -- but it is less often stated that documents are assumed exchangeable as well as words.
● de Finetti (1990) establishes that any collection of infinitely exchangeable random variables has a representation as a mixture distribution.
This leads to latent Dirichlet allocation, in which words are generated by topics that are infinitely exchangeable within a document.
32. LDA assumes the following generative process for each document in a corpus.
1. Choose $N \sim \mathrm{Poisson}(\xi)$
2. Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$
3. For each of the $N$ words $w_n$:
a. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
b. Choose a word $w_n$ from $p(w_n|z_n, \beta)$, a multinomial conditioned on the topic $z_n$
Note that $\theta$ is a document-level variable, sampled once per document.
The joint distribution is given by
$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\,p(w_n|z_n, \beta)$.
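A small NumPy sketch of this generative process (the hyperparameters, vocabulary size, and Poisson rate are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8                                # number of topics, vocabulary size
alpha = np.full(K, 0.5)                    # Dirichlet hyperparameter (made up)
beta = rng.dirichlet(np.ones(V), size=K)   # K x V topic-word distributions

# 1. Choose document length N ~ Poisson(xi)
N = rng.poisson(lam=10)
# 2. Choose topic proportions theta ~ Dirichlet(alpha)
theta = rng.dirichlet(alpha)
doc = []
for n in range(N):
    # 3a. Choose a topic z_n ~ Multinomial(theta)
    z = rng.choice(K, p=theta)
    # 3b. Choose a word w_n ~ Multinomial(beta[z])
    doc.append(rng.choice(V, p=beta[z]))
print(theta, doc)
```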
33. The Dirichlet distribution is conjugate to the multinomial distribution.
The figure on the right shows the Dirichlet distribution at different values of α. You can see that it encodes natural prior intuitions: for example, if α = (5, 2, 2), θ₁ will tend to be high.
https://cs.stanford.edu
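This intuition is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.dirichlet([5, 2, 2], size=10_000)
print(samples.mean(axis=0))   # roughly [0.56, 0.22, 0.22]: theta_1 tends to be high
```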
34. The key inferential problem is to compute the posterior distribution of the hidden variables given a document.
Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:
$p(\mathbf{w}|\alpha, \beta) = \int p(\theta|\alpha) \Big(\prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\Big)\,d\theta$.
This is intractable due to the coupling between $\theta$ and $\beta$ in the summation over latent topics. Thus we apply approximate inference to estimate the parameters.
35. 1. E-step: find the optimizing values of the variational parameters $\gamma$ and $\phi$ by variational inference:
i. Initialize $\phi_{ni} = 1/k$ and $\gamma_i = \alpha_i + N/k$
ii. Repeat until convergence: $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$, then $\gamma_i = \alpha_i + \sum_n \phi_{ni}$
2. M-step: maximize the resulting lower bound on the log-likelihood with respect to the model parameters $\alpha$ and $\beta$.
source code for python
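The linked source code is not reproduced here; below is a minimal sketch of the E-step inner loop for a single document, implementing the two updates above (the toy inputs are made up):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K, V = 3, 8
alpha = np.full(K, 0.1)                     # made-up hyperparameters
beta = rng.dirichlet(np.ones(V), size=K)    # K x V topic-word matrix
words = [0, 3, 3, 7]                        # word indices of one toy document

# i. Initialize phi uniformly and gamma = alpha + N/K.
N = len(words)
phi = np.full((N, K), 1.0 / K)
gamma = alpha + N / K

# ii. Repeat the coordinate-ascent updates until gamma converges.
for _ in range(100):
    phi = beta[:, words].T * np.exp(digamma(gamma))  # phi_ni ∝ beta_{i,w_n} e^{Ψ(γ_i)}
    phi /= phi.sum(axis=1, keepdims=True)
    new_gamma = alpha + phi.sum(axis=0)              # gamma_i = alpha_i + Σ_n phi_ni
    if np.abs(new_gamma - gamma).max() < 1e-6:
        break
    gamma = new_gamma
print(gamma / gamma.sum())    # approximate topic proportions for the document
```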
36. ● LDA consistently performs better than the other methods: the unigram model, the mixture of unigrams, and pLSI.
● For the classification task, performance is improved with LDA features.
● For the collaborative filtering task (EachMovie), the best predictive performance was obtained by the LDA model.
37. ● Order effects (cf. Agrawal et al., 2018)
Different topics are generated if the training data is shuffled, since the internal weights are updated via a stochastic sampling process. Such effects introduce a systematic error into any study.
● Topic coherence (cf. Das et al., 2015)
No prior preference for semantic coherence is encoded in the model, so some topics can look accidental to human evaluators.
● It cannot handle out-of-vocabulary (OOV) words (cf. Das et al., 2015).
39. Let us write the aspect model in matrix notation. Define the matrices $\hat{U} = (P(d_i|z_k))_{i,k}$, $\hat{\Sigma} = \mathrm{diag}(P(z_k))_k$, and $\hat{V} = (P(w_j|z_k))_{j,k}$.
The joint probability model $P$ can then be written as the matrix product $P = \hat{U}\hat{\Sigma}\hat{V}^T$.
Although there is a fundamental difference between LSA and pLSA, pLSA can likewise be seen as a dimensionality-reduction method.
40. An $M$-dimensional multinomial distribution can be represented as a point on an $(M-1)$-dimensional simplex of all possible multinomials.
Since the dimensionality of the sub-simplex (the probabilistic latent semantic space) is $K-1$, as opposed to $M-1$ for the complete probability simplex, pLSA can also be thought of as dimensionality reduction.
41. The topic simplex for three topics is embedded in the word simplex for three words.
The pLSI model induces an empirical distribution on the topic simplex, denoted by x. LDA places a smooth distribution on the topic simplex, denoted by the contour lines.
42. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.
You can easily see that LDA assumes a generative model at the level of documents.
(Figure (d): the LDA model)
44. ● Das is a second-year Ph.D. student in the School of Computer Science, Carnegie Mellon University.
● They propose a new technique for topic modeling that uses word embeddings (Mikolov et al., 2013).
45. According to the distributional hypothesis, words occurring in similar contexts tend to have similar meanings.
This has given rise to data-driven learning of word vectors that capture lexical and semantic properties (e.g., word2vec).
We assume that, rather than consisting of sequences of word types, documents consist of sequences of word embeddings.
46. Since our observations are no longer discrete values but continuous vectors in an $M$-dimensional space, we characterize each topic $k$ as a multivariate Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$.
The generative process can thus be summarized as follows.
47. 1. For $k = 1$ to $K$:
a. Draw the topic covariance $\Sigma_k \sim \mathcal{W}^{-1}(\Psi, \nu)$
b. Draw the topic mean $\mu_k \sim \mathcal{N}(\mu_0, \frac{1}{\kappa}\Sigma_k)$
2. For each document $d$ in a corpus $D$:
a. Draw the topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
b. For each word index $n$:
i. Draw a topic $z_{d,n} \sim \mathrm{Categorical}(\theta_d)$
ii. Draw an embedded vector $v_{d,n} \sim \mathcal{N}(\mu_{z_{d,n}}, \Sigma_{z_{d,n}})$
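A compact NumPy/SciPy sketch of this process (the embedding dimension and all hyperparameters are made up):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
K, M, nu, kappa = 3, 4, 7, 0.1     # topics, embedding dim, IW dof, scaling (made up)
Psi, mu0, alpha = np.eye(M), np.zeros(M), np.full(K, 0.5)

# 1. Draw each topic's Gaussian parameters from the Normal-Inverse-Wishart prior.
Sigma = invwishart.rvs(df=nu, scale=Psi, size=K, random_state=0)
mu = [rng.multivariate_normal(mu0, S / kappa) for S in Sigma]

# 2. Draw one toy document of five embedded "words".
theta = rng.dirichlet(alpha)                              # a. topic distribution
doc = []
for _ in range(5):
    z = rng.choice(K, p=theta)                            # b-i. topic assignment
    doc.append(rng.multivariate_normal(mu[z], Sigma[z]))  # b-ii. embedded vector
print(theta)
```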
48. We wish to infer the posterior distribution over the topic parameters, the topic proportions, and the topic assignments of individual words.
We use a collapsed Gibbs sampler to infer them.
We can make the sampling faster by using a Cholesky decomposition of the covariance matrix.
source code for python
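As an illustration of the Cholesky point (a generic sketch, not the paper's rank-one update machinery): once $\Sigma = LL^T$ is factorized, each Gaussian log-density evaluation needs only a triangular solve rather than a matrix inversion.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
M = 5
A = rng.random((M, M))
sigma = A @ A.T + M * np.eye(M)    # toy symmetric positive-definite covariance
mu = rng.random(M)
v = rng.random(M)                  # a word embedding

# Factor once: sigma = L @ L.T. Reused across many density evaluations.
L = np.linalg.cholesky(sigma)

def gaussian_logpdf(v, mu, L):
    # Solve L y = (v - mu) instead of inverting sigma: y^T y = (v-mu)^T sigma^-1 (v-mu).
    y = solve_triangular(L, v - mu, lower=True)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet + y @ y)

print(gaussian_logpdf(v, mu, L))
```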
49. ● To measure topic coherence, we compute the Pointwise Mutual Information (PMI) of topic words.
● It can be seen that Gaussian LDA is a clear winner, achieving a 275% higher score on average.
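A minimal sketch of PMI-based coherence (my own simplification: document-level co-occurrence counts, a made-up toy corpus, and an arbitrary smoothing constant):

```python
import numpy as np
from itertools import combinations

# Toy corpus: each document is a set of word types.
docs = [{"bank", "money", "loan"}, {"river", "bank", "water"},
        {"money", "loan", "credit"}, {"river", "water", "stream"}]
D = len(docs)

def pmi(wi, wj):
    # Estimate probabilities from document co-occurrence counts.
    p_i = sum(wi in d for d in docs) / D
    p_j = sum(wj in d for d in docs) / D
    p_ij = sum(wi in d and wj in d for d in docs) / D
    return np.log((p_ij + 1e-12) / (p_i * p_j))

def coherence(topic_words):
    # Average PMI over all pairs of the topic's top words.
    return np.mean([pmi(a, b) for a, b in combinations(topic_words, 2)])

print(coherence(["money", "loan", "credit"]))   # coherent topic: high score
print(coherence(["money", "river", "stream"]))  # incoherent topic: low score
```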
50. ● We select a subset of documents and replace words of those documents with synonyms that have not occurred in the corpus before.
● Compared with a recently proposed extension of LDA that can handle unseen words (infvoc), Gaussian LDA performs better here, too.
52. This is just a summary of the paper. I didn't have enough time to read it closely because I spent a lot of time trying to understand the parameter-inference parts of pLSA and LDA. Sorry for my poor planning.
● Motivation: the current great challenge in software analytics is understanding unstructured data.
● Key point: tuning proper parameters to fix “order effects” in LDA.
● Model: they propose LDADE, a search-based software engineering tool which uses Differential Evolution (DE) to tune LDA's parameters.
● Results: LDADE's tunings dramatically reduce cluster instability and lead to improved performance for supervised as well as unsupervised learning.
53. ● Since 1990, topic modeling has been in constant demand, although the social background and researchers' motivations have changed.
● Topic models are easy to extend or combine with other probabilistic models; the models have grown more complex over time.
● The methods for estimating parameters have evolved so that they can handle more flexible models.
● It was very difficult for me to understand the parameter-inference parts. I will need some mathematical training, especially in optimization, which builds on linear algebra and probability theory.