Wikipedia Infobox Type Prediction using Embeddings
Russa Biswas, Rima Türker, Farshad Bakhshandegan-Moghaddam, Maria Koutraki, Harald Sack
FIZ Karlsruhe & AIFB, KIT Karlsruhe
Motivation
● Wikipedia infobox type information contributes to the RDF types in knowledge graphs such as DBpedia.
[Figure: a Wikipedia infobox type mapped to the corresponding DBpedia RDF type]
Motivation
● The Wikipedia 2016 version has about 2,000 infobox types, and their frequency distribution follows Zipf's law.
● 70% of the Wikipedia infobox type information is missing!
Challenges:
● Difficulty in assigning infobox types
● Incompleteness
● Inconsistency
Goal
● Given a Wikipedia article, the goal is to predict its infobox type information.
○ The effect of minimal information, such as the table of contents and the named entities, on the prediction method is studied.
Approach
● Reduce the problem to a text classification problem!
○ Word and entity embeddings are used to generate the feature vectors.
Features
● The abstract, the table of contents, and the named entities from the abstract are used as features!
[Figure: a Wikipedia article with its abstract, table of contents, named entities, and infobox highlighted]
Feature Vectors - Embeddings
● Abstract
○ Word vectors for each word from the Google pre-trained Word2vec model [Mikolov, T., 2013]
● Table of Contents
○ Word vectors for each word of the contents from the Google pre-trained Word2vec model
● Named Entities
○ Entity vectors for each entity in the abstract from the DBpedia pre-trained RDF2vec model [Ristoski, P., Paulheim, H., 2016]
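A common way to turn per-word embeddings into a fixed-length document vector is to average the word vectors; the slides do not state the exact aggregation, so this is a minimal sketch under that assumption, with a made-up 3-dimensional toy vocabulary (the talk uses 300-dimensional Google Word2vec vectors):

```python
# Sketch: build a document vector by averaging word vectors.
# The toy embedding table below is invented for illustration only.
TOY_EMBEDDINGS = {
    "bush":   [0.9, 0.1, 0.0],
    "family": [0.8, 0.2, 0.1],
    "member": [0.1, 0.7, 0.3],
}

def document_vector(tokens, embeddings, dim=3):
    """Average the embeddings of known tokens; zero vector if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

doc = document_vector(["a", "member", "of", "the", "bush", "family"], TOY_EMBEDDINGS)
```

Out-of-vocabulary tokens ("a", "of", "the" here) are simply skipped, which mirrors how pre-trained models handle unknown words.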
Text Classification - Random Forest
● The infobox type prediction problem can be reduced to a text classification problem with the infobox types as labels!
[Figure: labeled document vectors L1 … Ln fed into a Random Forest classifier]
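With document vectors as features and infobox types as labels, the Random Forest setup reduces to a standard scikit-learn pipeline. A minimal sketch, assuming scikit-learn is available; the 2-dimensional document vectors and the two type labels are toy stand-ins for the real 300-dimensional vectors and 30 types:

```python
# Sketch: infobox type prediction as text classification with a Random Forest.
from sklearn.ensemble import RandomForestClassifier

# Toy document vectors: one well-separated cluster per (hypothetical) type.
X = [
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # "person"-like documents
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # "settlement"-like documents
]
y = ["person", "person", "person", "settlement", "settlement", "settlement"]

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

# Predict the infobox type of an unseen document vector.
prediction = clf.predict([[0.88, 0.12]])[0]
```

In the talk's setting the label set is the 30 most popular infobox types and evaluation uses cross-validation or an 80/20 split, as in Table 1.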
Text Classification - Convolutional Neural Network
[Figure: CNN architecture of Kim (2014) applied to the example sentence "A member of the Bush family"]
● 128 filters, each sliding over a window of 3 words at a time
Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
Experimental Setup - Dataset
● The Wikipedia 2016 and DBpedia 2016 versions are used.
● 5,000 articles per infobox type for the top 30 most popular infoboxes.
Pre-trained Models
● Google pre-trained Word2vec model
○ Trained on 100 billion words from Google News articles
○ Vector size = 300
● RDF2vec [Ristoski, P., Paulheim, H., 2016]
○ Trained on DBpedia
○ Vector size = 200
Results

| Feature Set | RF (CV), Embedding | RF (80/20 split), Embedding | CNN, Embedding | RF (CV), TF-IDF | RF (80/20 split), TF-IDF |
|---|---|---|---|---|---|
| Table of Contents | 65% | 65.8% | 76.5% | 38% | 32.3% |
| Abstract | 86% | 86.4% | 95.1% | 80% | 80.4% |
| Named Entities | 45% | 45.6% | 62% | 34% | 28.9% |
| Table of Contents + Abstract | 88% | 88% | 96.1% | 83% | 83.9% |
| Named Entities + Abstract | 86% | 86.1% | 78.4% | 70% | 71.2% |

Table 1: Performance of the classifiers (micro F1 score) over different feature sets
Ground Truth
Manually generated ground truth for the 30 types:
Step 1: Extract all articles without infoboxes from the Wikipedia 2016 version.
Step 2: Check whether these articles have an infobox type in the Wikipedia 2018 version.
Data generated: 32,000 articles in the Wikipedia 2018 version are found to have new infobox types among these top 30 types.
Results on unseen data:

| Feature Set | Random Forest |
|---|---|
| TOC + Abstract | 53.7% |

Table 2: Performance of the classifier on the ground truth, measured in micro F1
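The two-step ground-truth construction amounts to intersecting two article sets: those without an infobox in the 2016 dump, and those carrying an infobox type in the 2018 dump. A minimal sketch; the article titles and type assignments below are invented for illustration:

```python
# Sketch of the ground-truth construction; all data here is hypothetical.

# Step 1: articles without an infobox in the 2016 dump.
no_infobox_2016 = {"Article_A", "Article_B", "Article_C"}

# Step 2: infobox types observed in the 2018 dump.
infobox_2018 = {"Article_A": "person", "Article_C": "settlement", "Article_D": "film"}

# Ground truth: articles that gained an infobox type between the two versions.
ground_truth = {
    title: infobox_2018[title]
    for title in no_infobox_2016
    if title in infobox_2018
}
```

Applied to the real dumps and filtered to the top 30 types, this procedure yields the 32,000 labeled articles used as unseen test data.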
Conclusion and Future Work
● The table of contents contains little yet important information for predicting the infobox type.
● Classification using embeddings performs better than TF-IDF, since the embeddings capture the semantic similarity of words.
● Embedding the named entities together with the words, using network embedding algorithms such as LINE or PTE, might improve the results of the classification process.
Thank you for your attention!
Contacts
Homepage: https://www.fiz-karlsruhe.de/de/forschung/information-service-engineering/mitarbeiter-ise/russa-biswas.html
Email: russa.biswas@fiz-karlsruhe.de
: russa_biswas