Topic modeling techniques have been applied in many scenarios
in recent years, spanning textual content as well as many
other data sources. Existing research in this field
continuously aims to improve the accuracy and coherence of
the results. Some recent works propose new methods that incorporate
the semantic relations between words into the topic modeling
process by employing vector embeddings over knowledge
bases.
In our recent paper presented at the AAAI Spring Symposium 2019, held at Stanford University, we studied how knowledge graph embeddings affect topic modeling performance
on textual content. In particular, the objective of the
work is to determine which aspects of knowledge graph embeddings
have a significant and positive impact on the accuracy
of the extracted topics.
We improve the state of the art by integrating some advanced graph embedding approaches (specifically designed for knowledge graphs) within the topic extraction process.
We also study how the knowledge base can be expanded by using dataset-specific relations between the words.
We implemented the method and validated it through
a set of experiments with two variations of the knowledge
base, seven embedding methods, and two methods for incorporating
the embeddings into the topic modeling framework, also
considering different parametrizations of topic number and embedding
dimensionality.
Besides the specific technical results, the work also aims to show the potential of integrating statistical methods with knowledge-centric methods. The full extent of the impact of these techniques will be explored in future work.
The details of the work are reported in the paper, which is available online here, and in the slides, also available online (on SlideShare).
3. Key: Feature selection
To extract novel knowledge, it is crucial to find an appropriate way to
describe the source content. Features can be:
• Syntactic: user profiles; tags and hashtags; bag of words (BOW)
• Semantic: entity extraction; semantic features on images
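As a concrete illustration of the simplest syntactic feature above, a bag-of-words representation can be built in a few lines. This is a minimal sketch; the function name and toy documents are assumptions for illustration, not from the paper:

```python
from collections import Counter

def bag_of_words(documents):
    """Build a shared vocabulary and one word-count vector per document."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

docs = ["cute kittens and cute hamsters", "kittens eat broccoli"]
vocab, vectors = bag_of_words(docs)
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the input format most topic models expect.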
4. • Topic Model: a statistical model used to discover the
abstract "latent" topics of a given content
• Example usage areas include information retrieval, classification,
collaborative filtering, …
• The best-known topic model is LDA (Latent Dirichlet Allocation)
• Usually represented in plate notation
Topics as new features: Why not?
5. • Topic modeling has been relatively successful using purely
statistical approaches
• An unsupervised method of representing a corpus as a set
of topics (each document as a distribution over the set of topics)
Topic Modeling
6. Edwin Chen
"Introduction to Latent Dirichlet Allocation" (2011)
Given the sentences
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
LDA might produce something like
• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
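The kind of output shown above can be produced by a collapsed Gibbs sampler, the classic inference procedure for LDA. The following is an illustrative, minimal sketch over the toy sentences, not the implementation used in the paper; function name, hyperparameters, and iteration count are assumptions:

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # z[d][n]: current topic assignment of the n-th token of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]; wi = w2i[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # conditional p(z = t | rest) up to a constant
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = t
                        break
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # smoothed per-document topic distributions (each row sums to 1)
    return [[(ndk[d][t] + alpha) / (len(docs[d]) + n_topics * alpha)
             for t in range(n_topics)] for d in range(len(docs))]

docs = [["broccoli", "bananas", "eat"],
        ["banana", "spinach", "smoothie", "breakfast"],
        ["chinchillas", "kittens", "cute"],
        ["sister", "kitten", "adopted"],
        ["cute", "hamster", "broccoli", "munching"]]
theta = lda_gibbs(docs, n_topics=2)  # one topic distribution per sentence
```

With so little data the exact proportions vary by seed, but the structure matches the example: each sentence gets a distribution over the two topics.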
7. • Improve the state of the art of topic modeling by integrating
embedding methods over knowledge graphs
• Explore possible extensions on the Knowledge Graph to create a
better structure for the knowledge embedding process
• Further explore the parametrization to clarify the effects of the
most relevant parameters on topic modeling
Objective
8. Background (1): Representation Learning
• Process of encoding knowledge into low-dimensional vectors
• Used for Machine Learning/Deep Learning tasks over graphs
• Supervised / Unsupervised
• Text embedding is a representation learning (RL) technique that encodes
textual content into vectors of real numbers
• Graph embeddings do the same on network models
9. Background (2): Embedding Nodes
Find embeddings of nodes in d dimensions so that "similar" nodes in
the graph have embeddings that are close together.
[Figure: input graph mapped to its output embedding space]
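The "close together" criterion is usually checked with a vector similarity such as cosine. A minimal sketch with hypothetical, hand-written 3-dimensional embeddings (the values are made up, not trained):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# hypothetical node embeddings; A and B are meant to be "similar" nodes
emb = {"A": [1.0, 0.1, 0.0], "B": [0.9, 0.2, 0.1], "C": [0.0, 1.0, 0.9]}

def nearest(node):
    """The other node whose embedding is closest (by cosine) to `node`'s."""
    return max((n for n in emb if n != node),
               key=lambda n: cosine(emb[node], emb[n]))
```

A good embedding of the input graph would make `nearest` agree with graph-level similarity, e.g. shared neighborhoods.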
10. Background (3): Knowledge Graphs
• Ontological representation of collected, structured, and organized
information as a collective knowledge source; describes real-world
entities and the relations between them.
• Examples: DBpedia, Freebase, WordNet, Google Knowledge Graph.
• WordNet is an online lexical database for the English language in which
words are linked by semantic relations.
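Structurally, a knowledge graph of this kind reduces to (head, relation, tail) triples. The tiny store below is illustrative only: the synset identifiers mimic WordNet's naming convention but are hard-coded here, not loaded from WordNet:

```python
# hypothetical WordNet-style triples: (head synset, relation, tail synset)
triples = [
    ("dog.n.01", "hypernym", "canine.n.02"),
    ("canine.n.02", "hypernym", "carnivore.n.01"),
    ("dog.n.01", "member_meronym", "pack.n.06"),
]

def related(entity, relation):
    """All tail entities reachable from `entity` via `relation`."""
    return [t for h, r, t in triples if h == entity and r == relation]
```

Embedding methods such as those on the next slide consume exactly this triple format.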
11. Embedding Methods on Knowledge Graphs
• TransE (2013) – uses addition as the translation operator
• TransH (2014) – extends TransE by modeling relations as
hyperplanes
• DistMult (2014) – uses multiplication as the translation operator
• PTransE (2015) – extends TransE with paths of multiple relations
• TransR (2015) – extends TransE by creating separate semantic
spaces for entities and relations
• HolE (2016) – uses circular correlation as the translation operator
• Analogy (2017) – optimizes the representations with the analogical
properties of entities and relations
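TransE, the baseline the other methods extend, scores a triple (h, r, t) by the distance between h + r and t: a plausible triple has a translation vector r that carries the head embedding onto the tail. A minimal sketch with toy, untrained 2-dimensional embeddings:

```python
import math

def transe_score(h, r, t):
    """TransE energy ||h + r - t||_2: lower means the triple (h, r, t)
    is considered more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2
                         for hi, ri, ti in zip(h, r, t)))

# toy embeddings (illustrative values, not trained)
h = [0.2, 0.5]   # head entity
r = [0.3, -0.1]  # relation vector
t = [0.5, 0.4]   # tail entity
good = transe_score(h, r, t)           # h + r lands exactly on t
bad = transe_score(h, r, [2.0, -1.0])  # a mismatched tail scores worse
```

Training minimizes this energy for observed triples while pushing it up for corrupted ones; the later methods in the list above mainly change this scoring function.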
12. Embedding Methods Comparison
[Table comparing the methods' scoring functions and parameter counts]
Legend: d: embedding dimension, ne: number of entities, nr: number of relations,
h: head entity, r: relation, t: tail entity, wr: vector representation of r, p: path
13. Related Work
• KGE-LDA
• A knowledge-based topic model
• Combines LDA with entity embeddings obtained from knowledge
graphs using TransE
• Proposes two models (Model A and Model B) for incorporating
the embeddings into the topic model
14. Our Experiment
• Text corpus: 20-Newsgroups (20NG) public dataset
• 18,800 documents
• 21K distinct words
• WordNet18 (WN18) as the knowledge graph
• 115K triples
• 40K entities
• 18 types of relations
15. Parameter Exploration and Evaluation
• Topic Number
• Embedding Dimension
• Topic Coherence
A quantitative measure to evaluate the topic models by their coherence
• Document Classification Scores
The accuracy of document classification using the topic model's output
features as input
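Of the two evaluation measures, topic coherence can be computed directly from document co-occurrence counts; the UMass variant is a common choice. The sketch below is illustrative (function name and toy corpus are assumptions, and the slides do not specify which coherence measure was used):

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over word
    pairs, where D counts the documents containing the given words.
    Higher (less negative) values indicate a more coherent topic."""
    def doc_count(*words):
        return sum(all(w in doc for w in words) for doc in docs)
    return sum(math.log((doc_count(wi, wj) + 1) / doc_count(wj))
               for wi, wj in combinations(topic_words, 2))

# toy corpus: each document is a set of words
docs = [{"broccoli", "bananas", "breakfast"},
        {"broccoli", "munching", "hamster"},
        {"kittens", "cute", "chinchillas"}]
food_topic = umass_coherence(["broccoli", "bananas"], docs)
mixed_topic = umass_coherence(["broccoli", "kittens"], docs)
```

Words that actually co-occur in documents ("broccoli", "bananas") score higher than an arbitrary pairing, which is exactly why coherence is a useful proxy for topic quality.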
23. Execution Time: Embedding Dimension and
Topics
• The runtime duration with
respect to the topic number,
embedding dimension, and
incorporation model
• Topic Number has a higher
impact on runtime than the
embedding dimension
24. Conclusions
• First attempt to systematically integrate knowledge bases (KBs) into
topic analysis
• A content-based approach to extend the knowledge graph, transforming
it into a domain-specific network in order to improve the embeddings
• Extended parametrization (topic number and embedding
dimension)
Future:
• Grid-search for parameter optimization
• Improvement of knowledge graph extension process