Topic modeling techniques have been applied in many scenarios
in recent years, spanning textual content as well as many
other data sources. Existing research in this field
continuously aims to improve the accuracy and coherence of
the results. Some recent works propose new methods that incorporate
the semantic relations between words into the topic modeling
process by employing vector embeddings over knowledge
bases.
In our recent paper presented at the AAAI Spring Symposium 2019, held at Stanford University, we studied how knowledge graph embeddings affect topic modeling performance
on textual content. In particular, the objective of the
work is to determine which aspects of knowledge graph embeddings
have a significant and positive impact on the accuracy
of the extracted topics.
We improve the state of the art by integrating some advanced graph embedding approaches (specifically designed for knowledge graphs) within the topic extraction process.
We also study how the knowledge base can be expanded by using dataset-specific relations between the words.
We implemented the method and validated it through
a set of experiments with two variations of the knowledge
base, seven embedding methods, and two methods for incorporating
the embeddings into the topic modeling framework, also
considering different parametrizations of topic number and embedding
dimensionality.
Besides the specific technical results, the work also aims to show the potential of integrating statistical methods with knowledge-centric methods. The full extent of the impact of these techniques will be explored in future work.
The details of the work are reported in the paper, which is available online here, and in the slides, also available online (on SlideShare).
3. Key: Feature selection
To extract novel knowledge, it is crucial to find an appropriate way to
describe the source content. Features can be:
• Syntactic: user profiles; tags and hashtags; bag of words (BOW)
• Semantic: entity extraction; semantic features on images
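As a concrete illustration of the simplest syntactic feature above, a bag-of-words representation can be built in a few lines. This is a minimal sketch; the function name and toy documents are assumptions for illustration, not from the paper:

```python
from collections import Counter

def bag_of_words(documents):
    """Build a shared vocabulary and one word-count vector per document."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

docs = ["cute kittens and cute hamsters", "kittens eat broccoli"]
vocab, vectors = bag_of_words(docs)
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the input format most topic models expect.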
4. • Topic Model: a statistical model used to discover the
abstract "latent" topics of a given content
• Example usage areas include information retrieval, classification,
collaborative filtering, …
• The best-known topic model is LDA (Latent Dirichlet Allocation)
• Usually represented in plate notation
Topics as new features: Why not?
5. • Topic modeling has been relatively successful using purely
statistical approaches
• An unsupervised method of representing a corpus as a set
of topics (each document as a distribution over the set of topics)
Topic Modeling
6. Edwin Chen
"Introduction to Latent Dirichlet Allocation" (2011)
Given the sentences
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
LDA might produce something like
• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
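The kind of output shown above can be produced by a collapsed Gibbs sampler, the classic inference procedure for LDA. The following is an illustrative, minimal sketch over the toy sentences, not the implementation used in the paper; function name, hyperparameters, and iteration count are assumptions:

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # z[d][n]: current topic assignment of the n-th token of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]; wi = w2i[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # conditional p(z = t | rest) up to a constant
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = t
                        break
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # smoothed per-document topic distributions (each row sums to 1)
    return [[(ndk[d][t] + alpha) / (len(docs[d]) + n_topics * alpha)
             for t in range(n_topics)] for d in range(len(docs))]

docs = [["broccoli", "bananas", "eat"],
        ["banana", "spinach", "smoothie", "breakfast"],
        ["chinchillas", "kittens", "cute"],
        ["sister", "kitten", "adopted"],
        ["cute", "hamster", "broccoli", "munching"]]
theta = lda_gibbs(docs, n_topics=2)  # one topic distribution per sentence
```

With so little data the exact proportions vary by seed, but the structure matches the example: each sentence gets a distribution over the two topics.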
7. • Improve the state of the art of topic modeling by integrating
embedding methods over knowledge graphs
• Explore possible extensions on the Knowledge Graph to create a
better structure for the knowledge embedding process
• Further explore the parametrization to clarify the effects of the
most relevant parameters on topic modeling
Objective
8. Background (1): Representation Learning
• Process of encoding knowledge into low-dimensional vectors
• Used for Machine Learning/Deep Learning tasks over graphs
• Supervised / Unsupervised
• Text embedding is a representation learning (RL) technique that encodes
textual content into vectors of real numbers
• Graph embeddings do the same on network models
9. Background (2): Embedding Nodes
Find embeddings of nodes in d dimensions so that "similar" nodes in
the graph have embeddings that are close together.
[Figure: input graph mapped to its output embedding space]
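The "close together" criterion is usually checked with a vector similarity such as cosine. A minimal sketch with hypothetical, hand-written 3-dimensional embeddings (the values are made up, not trained):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# hypothetical node embeddings; A and B are meant to be "similar" nodes
emb = {"A": [1.0, 0.1, 0.0], "B": [0.9, 0.2, 0.1], "C": [0.0, 1.0, 0.9]}

def nearest(node):
    """The other node whose embedding is closest (by cosine) to `node`'s."""
    return max((n for n in emb if n != node),
               key=lambda n: cosine(emb[node], emb[n]))
```

A good embedding of the input graph would make `nearest` agree with graph-level similarity, e.g. shared neighborhoods.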
10. Background (3): Knowledge Graphs
• Ontological representation of collected, structured, and organized
information as a collective knowledge source; describes real-world
entities and the relations between them.
• Examples: DBpedia, Freebase, WordNet, Google Knowledge Graph.
• WordNet is an online lexical database for the English language in which
words are linked by semantic relations.
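Structurally, a knowledge graph of this kind reduces to (head, relation, tail) triples. The tiny store below is illustrative only: the synset identifiers mimic WordNet's naming convention but are hard-coded here, not loaded from WordNet:

```python
# hypothetical WordNet-style triples: (head synset, relation, tail synset)
triples = [
    ("dog.n.01", "hypernym", "canine.n.02"),
    ("canine.n.02", "hypernym", "carnivore.n.01"),
    ("dog.n.01", "member_meronym", "pack.n.06"),
]

def related(entity, relation):
    """All tail entities reachable from `entity` via `relation`."""
    return [t for h, r, t in triples if h == entity and r == relation]
```

Embedding methods such as those on the next slide consume exactly this triple format.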
11. Embedding Methods on Knowledge Graphs
• TransE (2013) – uses addition as the translation operator
• TransH (2014) – extends TransE by modeling relations as
hyperplanes
• DistMult (2014) – uses multiplication as the translation operator
• PTransE (2015) – extends TransE with paths of multiple relations
• TransR (2015) – extends TransE by creating separate semantic
spaces for entities and relations
• HolE (2016) – uses circular correlation as the translation operator
• Analogy (2017) – optimizes the representations with the analogical
properties of entities and relations
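TransE, the baseline the other methods extend, scores a triple (h, r, t) by the distance between h + r and t: a plausible triple has a translation vector r that carries the head embedding onto the tail. A minimal sketch with toy, untrained 2-dimensional embeddings:

```python
import math

def transe_score(h, r, t):
    """TransE energy ||h + r - t||_2: lower means the triple (h, r, t)
    is considered more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2
                         for hi, ri, ti in zip(h, r, t)))

# toy embeddings (illustrative values, not trained)
h = [0.2, 0.5]   # head entity
r = [0.3, -0.1]  # relation vector
t = [0.5, 0.4]   # tail entity
good = transe_score(h, r, t)           # h + r lands exactly on t
bad = transe_score(h, r, [2.0, -1.0])  # a mismatched tail scores worse
```

Training minimizes this energy for observed triples while pushing it up for corrupted ones; the later methods in the list above mainly change this scoring function.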
12. Embedding Methods Comparison
[Table comparing the methods' scoring functions and parameter counts]
Legend: d: embedding dimension, ne: number of entities, nr: number of relations,
h: head entity, r: relation, t: tail entity, wr: vector representation of r, p: path
13. Related Work
• KGE-LDA
• A knowledge-based topic model
• Combines LDA with entity embeddings obtained from knowledge
graphs using TransE
• Proposes two models (Model A and Model B) for incorporating
the embeddings into the topic model
14. Our Experiment
• Text corpus: 20-Newsgroups (20NG) public dataset
• 18,800 documents
• 21K distinct words
• WordNet18 (WN18) as the knowledge graph
• 115K triples
• 40K entities
• 18 types of relations
15. Parameter Exploration and Evaluation
• Topic Number
• Embedding Dimension
• Topic Coherence
A quantitative measure to evaluate the topic models by their coherence
• Document Classification Scores
The accuracy of document classification using the topic model's output
features as input
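Of the two evaluation measures, topic coherence can be computed directly from document co-occurrence counts; the UMass variant is a common choice. The sketch below is illustrative (function name and toy corpus are assumptions, and the slides do not specify which coherence measure was used):

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over word
    pairs, where D counts the documents containing the given words.
    Higher (less negative) values indicate a more coherent topic."""
    def doc_count(*words):
        return sum(all(w in doc for w in words) for doc in docs)
    return sum(math.log((doc_count(wi, wj) + 1) / doc_count(wj))
               for wi, wj in combinations(topic_words, 2))

# toy corpus: each document is a set of words
docs = [{"broccoli", "bananas", "breakfast"},
        {"broccoli", "munching", "hamster"},
        {"kittens", "cute", "chinchillas"}]
food_topic = umass_coherence(["broccoli", "bananas"], docs)
mixed_topic = umass_coherence(["broccoli", "kittens"], docs)
```

Words that actually co-occur in documents ("broccoli", "bananas") score higher than an arbitrary pairing, which is exactly why coherence is a useful proxy for topic quality.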
23. Execution Time: Embedding Dimension and
Topics
• The runtime duration with
respect to the topic number,
embedding dimension, and
incorporation model
• Topic Number has a higher
impact on runtime than the
embedding dimension
24. Conclusions
• First attempt to systematically integrate knowledge bases (KBs) into
topic analysis
• A content-based approach to extend the knowledge graph, transforming
it into a domain-specific network in order to improve the embeddings
• Extended parametrization (topic number and embedding
dimension)
Future:
• Grid-search for parameter optimization
• Improvement of knowledge graph extension process