Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representations of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representations of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style, achieving better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics capture intuitive themes using fewer dimensions than conventional topic modeling approaches.
For more information, please visit http://people.cs.vt.edu/parang/ or contact Parang at firstname [at] cs.vt.edu
Slides: Concurrent Inference of Topic Models and Distributed Vector Representations
1. Concurrent Inference of Topic Models
and Distributed Vector Representations
Debakar Shamanta¹, Sheikh Motahar Naim¹, Parang Saraf², Naren Ramakrishnan², and M. Shahriar Hossain²
¹ Dept of CS, University of Texas at El Paso, El Paso, TX 79968
² Dept of CS, Virginia Tech, Arlington, VA 22203
Presented By: Parang Saraf
2. Background - I
• A document collection comprises different elements
– Some elements are given, e.g., words, documents, and labels
– Some are hidden (latent), e.g., topics
• These elements can be represented with local or distributed features (neural networks)
3. Background - II
• Local vs. Distributed Representations
– Local Representations
• Each neuron represents exactly one entity
• Ex: "PKDD in Porto"
• Representations (concatenation of one-hot vocabulary and color vectors), one 0/1 slot per entity:
– Slot order: [ PKDD in Porto Red Blue Green ]
– PKDD : [ 1 0 0 1 0 0 ]
– in : [ 0 1 0 0 1 0 ]
– Porto: [ 0 0 1 0 0 1 ]
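A minimal Python sketch of this local (one-hot) encoding; the word-to-color pairing follows the slide, and the helper name is illustrative:

```python
import numpy as np

# Local (one-hot) encoding: one slot per entity, so the vector length
# equals the total number of entities (3 words + 3 colors = 6).
vocab = ["PKDD", "in", "Porto"]
colors = ["Red", "Blue", "Green"]

def local_vector(word, color):
    """Concatenate one-hot vocabulary and color vectors."""
    v = np.zeros(len(vocab) + len(colors), dtype=int)
    v[vocab.index(word)] = 1
    v[len(vocab) + colors.index(color)] = 1
    return v

print(local_vector("PKDD", "Red"))     # [1 0 0 1 0 0]
print(local_vector("in", "Blue"))      # [0 1 0 0 1 0]
print(local_vector("Porto", "Green"))  # [0 0 1 0 0 1]
```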
4. Background - III
• Local vs. Distributed Representations
– Distributed Representations
• Each neuron takes part in representing one or more pieces of information
• Ex: "PKDD in Porto"
• Representations (concatenation of 2-bit vocabulary and color vectors), each bit shared across entities:
– [ 0/1 0/1 0/1 0/1 ]
– PKDD : [ 0 1 0 1 ]
– in : [ 1 0 1 0 ]
– Porto: [ 1 1 1 1 ]
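The same toy example in code, using the 2-bit codes from the slide (the word-to-color pairing is again assumed):

```python
import numpy as np

# Distributed encoding: every bit is shared across entities, so 2 bits
# can distinguish up to 4 entities (4 dims total vs. 6 for one-hot above).
word_codes = {"PKDD": [0, 1], "in": [1, 0], "Porto": [1, 1]}
color_codes = {"Red": [0, 1], "Blue": [1, 0], "Green": [1, 1]}

def distributed_vector(word, color):
    """Concatenate 2-bit vocabulary and color codes."""
    return np.array(word_codes[word] + color_codes[color])

print(distributed_vector("PKDD", "Red"))     # [0 1 0 1]
print(distributed_vector("in", "Blue"))      # [1 0 1 0]
print(distributed_vector("Porto", "Green"))  # [1 1 1 1]
```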
5. Background - IV
• Distributed representations have better generalization capabilities
– Each feature captures facts from the entire dataset
Ref: Hinton, Geoffrey E. "Distributed representations." (1984).
6. Problem Statement - I
• So far, the literature has achieved distributed representations only for given (labeled) elements
– But what about inferred entities like topics?
• Distributed representations for topics are difficult to obtain since the topics are not readily available
• We present a mechanism to generate distributed representations of both given and latent elements
7. Problem Statement - II
• But why do we need distributed representations for both given and inferred entities?
– So that we can represent them in the same space
– This allows for comparison and other types of analysis
9. Word2Vec / Doc2Vec
• Tomas Mikolov et al. at Google released a "shallow" neural network based model that generates better distributed word representations by trading model complexity for efficiency
– Requires learning from larger datasets
– Trained on 100 billion words from the Google News dataset
– Gensim provides a Python implementation
– You can train it on your own data (see the sketch below)
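A minimal training sketch with Gensim, assuming Gensim 4.x (where the embedding size parameter is named vector_size); the toy corpus is illustrative:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences; substitute your own data.
sentences = [
    ["topic", "models", "uncover", "latent", "themes"],
    ["neural", "networks", "learn", "distributed", "representations"],
    ["topic", "vectors", "live", "in", "the", "same", "space", "as", "words"],
]

# Train a small skip-gram model (sg=1); vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

print(model.wv["topic"][:5])                   # first 5 dimensions of one word vector
print(model.wv.most_similar("topic", topn=3))  # nearest neighbors by cosine similarity
```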
10. Word2Vec / Doc2Vec Insights
Ref: Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
11. Word2Vec / Doc2Vec Insights
Ref: Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
• Works only with given entities and not with inferred ones
12. Proposed Solution
• We can do all of this
PLUS
• Generate similarly meaningful representations for inferred entities
– For example, topics
– In the same space as words, documents, labels, etc.
13. Proposed Solution
• In this paper we propose a framework that
1. Determines the topics of each document using a neural network
2. Simultaneously computes distributed representations of topics in the same space as documents and words
3. Generates the distributed vectors using a smaller number of dimensions than the actual text feature space
15. Evaluation Strategies
• Q1: Can our framework establish relationships between distributed representations of topics and documents?
• Q2: Are the generated topic vectors expressive enough to capture similarity between topics and to distinguish differences between them?
• Q3: How do our topic modeling results compare with the results produced by other topic modeling algorithms?
• Q4: Do the generated topics bring documents with similar domain-specific themes together?
• Q5: How does the runtime of the proposed framework scale?
17. Evaluation Question 1
• Question: Can our framework establish relationships between distributed representations of topics and documents?
• Topic–document relationships should be stronger for two documents of the same topic than for documents from different topics
18. Evaluation Question 1
• Given a topic vector T_i of topic t_i, and a set D_{t_j} of vectors for the documents assigned to topic t_j, we compute alignment using the following formula:
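The slide shows the formula only as an image; a plausible reconstruction, assuming alignment is the average cosine similarity between the topic vector and each document vector (an assumption, not confirmed against the paper):

\[ \mathrm{align}(t_i, t_j) \;=\; \frac{1}{|D_{t_j}|} \sum_{d \in D_{t_j}} \frac{T_i \cdot d}{\lVert T_i \rVert \, \lVert d \rVert} \]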
20. Evaluation Question 2
• Are the generated topic vectors expressive enough to capture similarity between topics and to distinguish differences between them?
– Take the generated topic vectors and run hierarchical clustering on them (a sketch follows below)
• Similar topics should appear close by
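A minimal version of this check with SciPy, assuming the topic vectors are the rows of a NumPy array (random stand-ins here):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Stand-in topic vectors: one row per topic, one column per dimension.
rng = np.random.default_rng(0)
topic_vectors = rng.normal(size=(10, 50))

# Agglomerative clustering with average linkage on cosine distance.
Z = linkage(topic_vectors, method="average", metric="cosine")

# Similar topics end up on neighboring leaves of the dendrogram.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf order: topic indices arranged by similarity
```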
22. Evaluation Question 3
• How do our topic modeling results compare with the results produced by other topic modeling algorithms?
– Compare with LDA, and
– NTM: the closest resemblance to our work; works with pre-computed word vectors
Ref: Cao, Ziqiang, et al. "A Novel Neural Topic Model and Its Supervised Extension." Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
23. Evaluation Question 3
• Evaluation methods used to evaluate clustering results when ground-truth labels are available (see the example below):
– Adjusted Rand Index (ARI)
• Measures the agreement between two topic assignments, corrected for chance
• Higher values are better
– Normalized Mutual Information (NMI)
• Measures the mutual information shared by two topic assignments, normalized so that 1 means identical assignments
• Higher values are better
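Both metrics are available in scikit-learn; a minimal example with made-up topic assignments:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Ground-truth topic labels vs. labels produced by a topic model (toy data).
true_labels  = [0, 0, 1, 1, 2, 2]
model_labels = [0, 0, 1, 2, 2, 2]

print(adjusted_rand_score(true_labels, model_labels))           # 1.0 = perfect, ~0 = random
print(normalized_mutual_info_score(true_labels, model_labels))  # in [0, 1]; higher is better
```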
25. Evaluation Question 3
• Evaluation methods used to evaluate clustering results when ground-truth labels are not available (a sketch follows below):
– Dunn Index (DI)
• Measures the separation between groups of vectors relative to their spread
• Larger values are better
– Average Silhouette Coefficient (ASC)
• Measures both cohesion and separation of groups
• Higher values are better
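scikit-learn provides the silhouette coefficient directly; the Dunn index is not in scikit-learn, so the sketch computes a simple version by hand (assumed definition: smallest inter-cluster distance divided by largest cluster diameter):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

# Stand-in document vectors and topic assignments (toy data).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(60, 10))
labels = rng.integers(0, 3, size=60)

# Average Silhouette Coefficient: cohesion vs. separation, in [-1, 1].
print(silhouette_score(vectors, labels))

def dunn_index(X, y):
    """Smallest inter-cluster distance divided by largest cluster diameter."""
    clusters = [X[y == c] for c in np.unique(y)]
    min_between = min(cdist(a, b).min()
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    max_diameter = max(cdist(c, c).max() for c in clusters)
    return min_between / max_diameter

print(dunn_index(vectors, labels))
```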
26. Evaluation Question 3
NTM missing: both the Dunn Index and the Average Silhouette Coefficient require document vectors, but NTM does not use document vectors; it uses only pre-computed word vectors
27. Evaluation Question 4
• Do the generated topics bring documents with similar domain-specific themes together?
• Use the PubMed dataset, which comes with MeSH terms
– It is expected that two documents on the same topic will have more MeSH terms in common than documents on different topics
28. Evaluation Question 4
Pick the top n MeSH terms for two documents (a sketch follows below):
1. Same-topic documents: the number of common MeSH terms increases with larger n
2. Different-topic documents: overlapping terms are largely absent, especially at smaller n
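A minimal sketch of this overlap check; the MeSH term lists are made up for illustration:

```python
# Toy MeSH term lists, assumed ordered by relevance (illustrative only).
doc_a = ["Neoplasms", "Apoptosis", "Cell Line", "Mice", "Humans"]
doc_b = ["Neoplasms", "Apoptosis", "Mice", "Humans", "Prognosis"]              # same topic as doc_a
doc_c = ["Myocardium", "Echocardiography", "Heart Failure", "Humans", "Aged"]  # different topic

def top_n_overlap(terms1, terms2, n):
    """Count MeSH terms shared between the top-n lists of two documents."""
    return len(set(terms1[:n]) & set(terms2[:n]))

# Same-topic overlap grows with n; different-topic overlap stays near zero.
for n in (2, 3, 5):
    print(n, top_n_overlap(doc_a, doc_b, n), top_n_overlap(doc_a, doc_c, n))
```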
29. Evaluation Question 5
• How does the runtime of the proposed framework scale with
– the size of the distributed representations,
– an increasing number of documents, and
– an increasing number of topics?
31. In Summary
• The framework generates distributed representations for both given as well as inferred entities
• Generating representations in the same hyperspace for both given and hidden entities is crucial:
– It opens the door to performing different types of analysis
32. Take it for a Spin !
• Data and software source code are available here:
http://dal.cs.utep.edu/projects/tvec/