Wikipedia Infobox Type Prediction using Embeddings
Russa Biswas, Rima Türker, Farshad Bakhshandegan-Moghaddam, Maria Koutraki, Harald Sack
FIZ Karlsruhe & AIFB, KIT Karlsruhe
Motivation
● Wikipedia infobox type information contributes to the RDF types in knowledge graphs such as DBpedia.
[Figure: a Wikipedia infobox type mapped to the corresponding DBpedia RDF type]
Motivation
● The Wikipedia 2016 version has about 2,000 infobox types, and their frequency distribution follows Zipf's law.
● 70% of the Wikipedia infobox type information is missing!
Challenges:
● Difficulty in assigning infobox types
● Incompleteness
● Inconsistency
Goal
● Given a Wikipedia article, the goal is to predict its infobox type information.
○ The effect of minimal information, such as the table of contents and the named entities, on the prediction method is studied.
Approach
● Reduce the problem to a text classification problem!
○ Word and entity embeddings are used to generate the feature vectors.
Features
● The abstract, the table of contents, and the named entities from the abstract are used as features!
[Figure: a Wikipedia article with its abstract, table of contents, named entities, and infobox highlighted]
Feature Vectors - Embeddings
● Abstract
○ Word vectors for each word from the Google pre-trained Word2vec model [Mikolov, T., 2013]
● Table of Contents
○ Word vectors for each word of the contents from the Google pre-trained Word2vec model
● Named Entities
○ Entity vectors for each entity in the abstract from the DBpedia pre-trained RDF2vec model [Ristoski, P., Paulheim, H., 2016]
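A common way to turn per-word embeddings into a fixed-length document vector is to average the word vectors; the slides do not state the exact aggregation, so this is a minimal sketch under that assumption, with a made-up 3-dimensional toy vocabulary (the talk uses 300-dimensional Google Word2vec vectors):

```python
# Sketch: build a document vector by averaging word vectors.
# The toy embedding table below is invented for illustration only.
TOY_EMBEDDINGS = {
    "bush":   [0.9, 0.1, 0.0],
    "family": [0.8, 0.2, 0.1],
    "member": [0.1, 0.7, 0.3],
}

def document_vector(tokens, embeddings, dim=3):
    """Average the embeddings of known tokens; zero vector if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

doc = document_vector(["a", "member", "of", "the", "bush", "family"], TOY_EMBEDDINGS)
```

Out-of-vocabulary tokens ("a", "of", "the" here) are simply skipped, which mirrors how pre-trained models handle unknown words.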
Text Classification - Random Forest
● The infobox type prediction problem can be reduced to a text classification problem with the infobox types as labels!
[Figure: labeled document vectors L1 … Ln fed into a Random Forest classifier]
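With document vectors as features and infobox types as labels, the Random Forest setup reduces to a standard scikit-learn pipeline. A minimal sketch, assuming scikit-learn is available; the 2-dimensional document vectors and the two type labels are toy stand-ins for the real 300-dimensional vectors and 30 types:

```python
# Sketch: infobox type prediction as text classification with a Random Forest.
from sklearn.ensemble import RandomForestClassifier

# Toy document vectors: one well-separated cluster per (hypothetical) type.
X = [
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # "person"-like documents
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # "settlement"-like documents
]
y = ["person", "person", "person", "settlement", "settlement", "settlement"]

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

# Predict the infobox type of an unseen document vector.
prediction = clf.predict([[0.88, 0.12]])[0]
```

In the talk's setting the label set is the 30 most popular infobox types and evaluation uses cross-validation or an 80/20 split, as in Table 1.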
Text Classification - Convolutional Neural Network
[Figure: CNN architecture of Kim (2014) applied to the example sentence "A member of the Bush family"]
● 128 filters, each sliding over a window of 3 words at a time
Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
Experimental Setup - Dataset
● The Wikipedia 2016 and DBpedia 2016 versions are used.
● 5,000 articles per infobox type for the top 30 most popular infoboxes.
Pre-trained Models
● Google pre-trained Word2vec model
○ Trained on 100 billion words from Google News articles
○ Vector size = 300
● RDF2vec [Ristoski, P., Paulheim, H., 2016]
○ Trained on DBpedia
○ Vector size = 200
Results

| Feature Set | RF (CV), Embedding | RF (80/20 split), Embedding | CNN, Embedding | RF (CV), TF-IDF | RF (80/20 split), TF-IDF |
|---|---|---|---|---|---|
| Table of Contents | 65% | 65.8% | 76.5% | 38% | 32.3% |
| Abstract | 86% | 86.4% | 95.1% | 80% | 80.4% |
| Named Entities | 45% | 45.6% | 62% | 34% | 28.9% |
| Table of Contents + Abstract | 88% | 88% | 96.1% | 83% | 83.9% |
| Named Entities + Abstract | 86% | 86.1% | 78.4% | 70% | 71.2% |

Table 1: Performance of the classifiers (micro F1 score) over different feature sets
Ground Truth
Manually generated ground truth for the 30 types:
Step 1: Extract all articles without infoboxes from the Wikipedia 2016 version.
Step 2: Check whether these articles have an infobox type in the Wikipedia 2018 version.
Data generated: 32,000 articles in the Wikipedia 2018 version are found to have new infobox types among these top 30 types.
Results on unseen data:

| Feature Set | Random Forest |
|---|---|
| TOC + Abstract | 53.7% |

Table 2: Performance of the classifier on the ground truth, measured in micro F1
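The two-step ground-truth construction amounts to intersecting two article sets: those without an infobox in the 2016 dump, and those carrying an infobox type in the 2018 dump. A minimal sketch; the article titles and type assignments below are invented for illustration:

```python
# Sketch of the ground-truth construction; all data here is hypothetical.

# Step 1: articles without an infobox in the 2016 dump.
no_infobox_2016 = {"Article_A", "Article_B", "Article_C"}

# Step 2: infobox types observed in the 2018 dump.
infobox_2018 = {"Article_A": "person", "Article_C": "settlement", "Article_D": "film"}

# Ground truth: articles that gained an infobox type between the two versions.
ground_truth = {
    title: infobox_2018[title]
    for title in no_infobox_2016
    if title in infobox_2018
}
```

Applied to the real dumps and filtered to the top 30 types, this procedure yields the 32,000 labeled articles used as unseen test data.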
Conclusion and Future Work
● The table of contents contains little yet important information for predicting the infobox type.
● Classification using embeddings performs better than TF-IDF, since the embeddings capture the semantic similarity of words.
● Embedding the named entities together with the words, using network embedding algorithms such as LINE or PTE, might improve the results of the classification process.
Thank you for your attention!
Contacts
Homepage: https://www.fiz-karlsruhe.de/de/forschung/information-service-engineering/mitarbeiter-ise/russa-biswas.html
Email: russa.biswas@fiz-karlsruhe.de
: russa_biswas