Russa Biswas et al., FIZ Karlsruhe & AIFB, KIT Karlsruhe
Wikipedia Infobox Type
Prediction using Embeddings
Russa Biswas, Rima Türker, Farshad Bakhshandegan-Moghaddam,
Maria Koutraki, Harald Sack
Motivation
● Wikipedia infobox type information contributes to the RDF type in knowledge graphs (KGs) such as DBpedia
[Figure: Wikipedia infobox type and the corresponding DBpedia RDF type]
● The 2016 version of Wikipedia has 2000 infobox types, and their distribution follows Zipf's law
● 70% of the Wikipedia infobox type information is missing!
Challenges:
● Difficulty in assigning infobox types
● Incomplete
● Inconsistent
Goal
● Given a Wikipedia article, the goal is to predict its infobox type
○ The effect of minimal information, such as the table of contents and named entities, on the
prediction is studied.
Approach
● Reduce the problem to a text classification problem!
○ Word and entity embeddings are used to generate the feature vectors
Features
● Abstract, Table of Contents and Named Entities from Abstract are used as features!
[Figure: a Wikipedia article with its abstract, table of contents, named entities and infobox marked]
Feature Vectors - Embeddings
● Abstract
○ Word vectors for each word from Google pre-trained Word2vec model
[Mikolov, T., 2013]
● Table of Contents
○ Word vectors for each word of the content from Google pre-trained Word2vec model
● Named Entities
○ Entity vectors for each entity in the abstract from DBpedia pre-trained RDF2vec model
[Ristoski, P., Paulheim, H., 2016]
Text Classification - Document Vector Generation
● The infobox type prediction problem can be reduced to a text classification problem with the
infobox types as labels!
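A minimal sketch of how a document vector could be built from per-word vectors. The slides do not state the aggregation method, so averaging is an assumption here, and the toy embedding table and the helper name `document_vector` are hypothetical (the actual models use 300-dimensional Word2vec and 200-dimensional RDF2vec vectors).

```python
from typing import Dict, List

# Toy 3-dimensional embeddings with made-up values (assumption for illustration).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "american": [1.0, 0.0, 2.0],
    "politician": [3.0, 2.0, 0.0],
}


def document_vector(tokens: List[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of all tokens that have an embedding."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok.lower())
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # no known token: return the all-zero vector
    return [t / count for t in total]


doc = document_vector(["American", "politician", "unknownword"], TOY_EMBEDDINGS, 3)
print(doc)
```

The resulting fixed-length vector is what a classifier such as a Random Forest can consume, regardless of how many words the abstract or table of contents contains.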
Text Classification - Random Forest
[Figure: document vectors with their labels (L1, L2, L3, …, Ln) are passed to a Random Forest, which outputs the predicted labels]
Text Classification - Convolutional Neural Network
Example input to the CNN: "A member of the Bush family"
Kim Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. 2014 Aug 25.
Filter size = 128, sliding over a window of 3 words at a time
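A rough sketch of what sliding over a window of 3 words means, following the general idea of Kim (2014): one filter is applied to every window of 3 consecutive word vectors, and the strongest response is kept (max-over-time pooling). Only a single filter is shown; the value 128 from the slide is not modelled. The 2-dimensional word vectors, the filter weights and the function names are all hypothetical.

```python
from typing import List


def conv_feature_map(sentence: List[List[float]],
                     filt: List[List[float]],
                     bias: float = 0.0) -> List[float]:
    """Apply one filter to every window of len(filt) consecutive word vectors."""
    window = len(filt)
    feature_map = []
    for start in range(len(sentence) - window + 1):
        s = bias
        for offset in range(window):
            word_vec = sentence[start + offset]
            s += sum(w * x for w, x in zip(filt[offset], word_vec))
        feature_map.append(max(0.0, s))  # ReLU non-linearity
    return feature_map


def max_over_time(feature_map: List[float]) -> float:
    """Keep only the strongest response of the filter over the sentence."""
    return max(feature_map)


# Six toy word vectors standing in for "A member of the Bush family".
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0], [1.0, 2.0]]
# One filter spanning 3 words (made-up weights).
filt = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

fmap = conv_feature_map(sentence, filt)
feature = max_over_time(fmap)
print(fmap, feature)
```

A six-word input yields four window positions; with many filters, the pooled values form the feature vector passed to the final classification layer.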
Experimental Setup - Dataset
● The 2016 versions of Wikipedia and DBpedia are used
● 5000 articles per infobox type for the 30 most popular infobox types
Pretrained Models
● Google pre-trained Word2vec model
○ Trained on about 100 billion words from Google News articles
○ Vector size = 300
● RDF2Vec [Ristoski, P., Paulheim, H., 2016]
○ DBpedia
○ Vector size = 200
Results
Feature Set | With Embedding: Random Forest (CV) | With Embedding: Random Forest (Split 80% - 20%) | With Embedding: CNN | Without Embedding (TF-IDF): Random Forest (CV) | Without Embedding (TF-IDF): Random Forest (Split 80% - 20%)
Table of Contents | 65% | 65.8% | 76.5% | 38% | 32.3%
Abstract | 86% | 86.4% | 95.1% | 80% | 80.4%
Named Entities | 45% | 45.6% | 62% | 34% | 28.9%
Table of Contents + Abstract | 88% | 88% | 96.1% | 83% | 83.9%
Named Entities + Abstract | 86% | 86.1% | 78.4% | 70% | 71.2%
Table 1: Performance of the classifiers, measured as micro F1 score, over different feature sets
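For reference, a small sketch of how a micro F1 score is computed: true positives, false positives and false negatives are pooled over all labels before precision and recall are calculated. The labels and predictions below are made-up toy data, not results from the slides, and the function name `micro_f1` is hypothetical.

```python
from typing import List


def micro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with three infobox types (made-up data).
y_true = ["person", "settlement", "film", "person"]
y_pred = ["person", "film", "film", "person"]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every article receives exactly one label, as in this toy example, micro F1 coincides with plain accuracy, because each misclassification counts once as a false positive and once as a false negative.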
Ground Truth
Manually generated ground truth for 30 types
Step 1: Extract all the articles without infoboxes from the 2016 version of Wikipedia
Step 2: Check whether these articles have an infobox type in the 2018 version of Wikipedia
Data generated: 32,000 articles were found to have a new infobox type in the 2018 version of Wikipedia
for these top 30 types
Results on unseen data:
Feature Set | Random Forest
TOC + Abstract | 53.7%
Table 2: Performance of the classifier on the ground truth, measured in micro F1
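Steps 1 and 2 above can be sketched as a comparison of two snapshots. This is only an illustration under assumptions: each snapshot is represented as a hypothetical dictionary from article title to infobox type (`None` when the article has no infobox), and the titles, types and the function name `build_ground_truth` are made up.

```python
from typing import Dict, Optional, Set


def build_ground_truth(infobox_2016: Dict[str, Optional[str]],
                       infobox_2018: Dict[str, Optional[str]],
                       target_types: Set[str]) -> Dict[str, str]:
    """Collect articles that had no infobox in 2016 but a target type in 2018."""
    ground_truth = {}
    for title, old_type in infobox_2016.items():
        if old_type is not None:
            continue  # Step 1: keep only articles without an infobox in 2016
        new_type = infobox_2018.get(title)
        if new_type in target_types:
            ground_truth[title] = new_type  # Step 2: type appeared by 2018
    return ground_truth


# Toy snapshots (hypothetical data).
snapshot_2016 = {"Article A": None, "Article B": "person",
                 "Article C": None, "Article D": None}
snapshot_2018 = {"Article A": "film", "Article B": "person",
                 "Article C": None, "Article D": "rare type"}
top_types = {"person", "film"}

ground_truth = build_ground_truth(snapshot_2016, snapshot_2018, top_types)
print(ground_truth)
```

Articles that already had an infobox in 2016, still lack one in 2018, or gained a type outside the selected set are left out of the ground truth.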
Conclusion and Future Work
● The table of contents carries little information, yet it is important for predicting the infobox
type.
● Classification using embeddings performs better than TF-IDF, since embeddings capture
the semantic similarity of words.
● Embedding the named entities together with the words, using network embedding
algorithms such as LINE or PTE, might improve the results of the classification process
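A toy illustration of the second point, under stated assumptions: in a term-based representation such as TF-IDF, two documents that share no words have zero cosine similarity whatever the weights are, whereas dense embeddings can place related words close together. Plain term counts stand in for TF-IDF here (IDF weighting only rescales dimensions and cannot create overlap), and the 2-dimensional embedding values are made up.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


VOCAB = ["actor", "actress"]


def term_counts(tokens: List[str]) -> List[float]:
    """Term-based vector: one dimension per vocabulary word."""
    return [float(tokens.count(word)) for word in VOCAB]


# Hypothetical dense embeddings in which the two related words lie close together.
TOY_EMBEDDINGS = {"actor": [0.9, 0.1], "actress": [0.8, 0.2]}

sparse_sim = cosine(term_counts(["actor"]), term_counts(["actress"]))
dense_sim = cosine(TOY_EMBEDDINGS["actor"], TOY_EMBEDDINGS["actress"])
print(sparse_sim, dense_sim)
```

The term-based similarity is exactly zero, while the embedding-based similarity is close to one for these toy vectors.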
Thank you for your attention!
Contacts
Homepage: https://www.fiz-karlsruhe.de/de/forschung/information-service-engineering/mitarbeiter-ise/russa-biswas.html
Email id : russa.biswas@fiz-karlsruhe.de
: russa_biswas
