Using Linked Data to Mine
RDF from Wikipedia’s Tables
http://emunoz.org/wikitables
Emir Muñoz
Fujitsu (Ireland) Limited
National University of Ireland Galway
Joint work with A. Hogan and A. Mileo
WSDM 2014 @ New York City, February 24-28
Emir M. - WSDM, New York City, USA, 27th February, 2014 2
MOTIVATION
(1/10)
MOTIVATION
The tables embedded in Wikipedia articles contain rich,
semi-structured encyclopaedic content
… BUT we cannot query all that content…
A query example:
(2/10)
Wikipedia tables or tables in the body are ignored
[Borrowed from Entity Linking tutorial]
Results at
25-02-2014
First result
Second result
10
Airlines
Third result
19
Airlines
• Same query in SPARQL over DBpedia
MOTIVATION
SELECT ?p ?o WHERE
{ <http://dbpedia.org/resource/Airbus_A380> ?p ?o . }
FAIL
(7/10)
No evidence of A380
• We automatically extract facts (as RDF)
from Wikipedia tables using KBs
MOTIVATION
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
(10/10)
• As far as we know, DBpedia and YAGO
ignore tables in the article body
– They focus mainly on infoboxes
• Languages such as R2RML can express
custom mappings from relational database
tables to RDF
– Each row as a subject, each column as a
predicate and each cell as an object
– Needs a mapping definition
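The row/column/cell mapping described above can be sketched in a few lines. This is only an illustration of the idea, not R2RML itself: the `http://example.org/` base URI, the function name `table_to_triples`, and the sample rows are all assumptions made for the example.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def table_to_triples(header: List[str], rows: List[List[str]],
                     base: str = "http://example.org/") -> List[Triple]:
    """Naive mapping: each row is a subject, each column a predicate,
    each non-empty cell an object (kept here as a plain string)."""
    triples: List[Triple] = []
    for i, row in enumerate(rows):
        subject = f"{base}row/{i}"  # hypothetical subject URI scheme
        for col, cell in zip(header, row):
            if cell == "":
                continue  # empty cells produce no triple
            predicate = f"{base}col/{col.strip().lower().replace(' ', '_')}"
            triples.append((subject, predicate, cell))
    return triples


# Made-up table content, for illustration only.
header = ["Airline", "First flight"]
rows = [["Airline A", "value 1"], ["Airline B", ""]]
triples = table_to_triples(header, rows)
for t in triples:
    print(t)
```

A real mapping still needs the explicit definition mentioned above: which column identifies the subject, and which vocabulary term each column maps to.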
EXTRACTING RDF FROM TABLES (1/4)
• [Limaye et al. 2010; Mulwad et al. 2010, 2013]
presented approaches using an in-house KB and
small datasets for validation
– Entity recognition/disambiguation
– Determine types for each column
– Determine relationships between columns
• We focus on Wikipedia tables, running our
algorithms over the entire corpus with
“row-centric” features for Machine
Learning models
EXTRACTING RDF FROM TABLES (2/4)
EXTRACTING RDF FROM TABLES
• Extraction of two types of relationships
– Between the main entity and cells in the same column,
e.g., “Manchester United F.C.” and “David de Gea”
– Between entities in different columns but same row
(3/4)
dbp:currentClub
dbp:position
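A minimal sketch of how the two kinds of entity pairs could be enumerated for one table. The function name `candidate_pairs`, the use of `None` for cells without an entity, and the example row are assumptions for illustration; only the two entity names come from the slide.

```python
from itertools import combinations
from typing import List, Optional, Tuple


def candidate_pairs(main_entity: str,
                    rows: List[List[Optional[str]]]) -> List[Tuple[str, str]]:
    """Enumerate (main entity, cell entity) pairs and
    (entity, entity) pairs taken from different columns of the same row."""
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        entities = [cell for cell in row if cell is not None]
        # Type 1: main entity of the article vs. each entity in the row.
        pairs.extend((main_entity, e) for e in entities)
        # Type 2: entities in different columns of the same row.
        pairs.extend(combinations(entities, 2))
    return pairs


# Hypothetical one-row squad table: player, position, and a cell with no entity.
rows = [["David_de_Gea", "Goalkeeper", None]]
pairs = candidate_pairs("Manchester_United_F.C.", rows)
print(pairs)
```

Each pair is only a candidate. Which predicate (if any) links the two entities is decided later, by looking them up in the knowledge base.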
EXTRACTING RDF FROM TABLES (4/4)
• Wikipedia dump from February 13th 2013
• Table taxonomy
WIKITABLES SURVEY (1/2)
1.14 million tables
• Table model
– Input: a source of tables (a set of tables)
• E.g., a Wikipedia article
• Each table in the source is modeled as
a matrix
• We normalize the tables and convert
each HTML table into a matrix
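One way to turn an HTML table into a rectangular matrix, using only Python's standard `html.parser`. The normalization rules here (repeating a cell across its `colspan`, padding short rows with empty strings, ignoring `rowspan` and nested tables) are simplifying assumptions for the sketch, not necessarily the rules used in this work.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableToMatrix(HTMLParser):
    """Collects the text of <td>/<th> cells row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.matrix: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            self._span = max(1, int(span)) if span.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self._row.extend([text] * self._span)         # repeat across colspan
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.matrix.append(self._row)
            self._row, self._cell = None, None


def html_table_to_matrix(html: str) -> List[List[str]]:
    parser = TableToMatrix()
    parser.feed(html)
    parser.close()
    width = max((len(r) for r in parser.matrix), default=0)
    # Pad short rows so every row has the same number of columns.
    return [r + [""] * (width - len(r)) for r in parser.matrix]


html = ("<table><tr><th colspan='2'>Name</th><th>Year</th></tr>"
        "<tr><td><a href='/wiki/A'>A</a></td><td> B </td></tr></table>")
matrix = html_table_to_matrix(html)
print(matrix)
```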
WIKITABLES SURVEY (2/2)
• To extract RDF from Wikitables we rely on
a reference knowledge base
– DBpedia, version 3.8
MINING RDF FROM WIKITABLES
Extract links in the cells
Mapping links to DBpedia
Lookups on DBpedia to find
relationships between entities
in the same row
Candidate
relationships
Wikipedia
table
(1/6)
• We aim to discover:
– Relations between entities on the same row
– Relations between entities in the table and the
protagonist of the article
• Map the links inside the cells to RDF
resources
• Get candidate relationships from the KB
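The link-mapping step can be approximated with a simple URL rewrite, since DBpedia resource URIs follow the titles of English Wikipedia articles. The sketch below assumes exactly that convention; the function name and the namespace filter are illustrative, and it does not resolve redirects or handle every encoding corner case.

```python
from typing import Optional
from urllib.parse import unquote, urlparse

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
# Assumed list of non-article namespaces to skip; not exhaustive.
NON_ARTICLE_PREFIXES = ("File:", "Category:", "Help:", "Template:",
                        "Special:", "Wikipedia:")


def wikipedia_link_to_dbpedia(href: str) -> Optional[str]:
    """Map a Wikipedia article link to a DBpedia resource URI, or None."""
    path = urlparse(href).path          # drops any #fragment and ?query
    if not path.startswith("/wiki/"):
        return None
    title = unquote(path[len("/wiki/"):])
    if not title or title.startswith(NON_ARTICLE_PREFIXES):
        return None
    return DBPEDIA_RESOURCE + title.replace(" ", "_")


print(wikipedia_link_to_dbpedia("http://en.wikipedia.org/wiki/Airbus_A380"))
print(wikipedia_link_to_dbpedia("/wiki/David_de_Gea#Career"))
print(wikipedia_link_to_dbpedia("/wiki/File:Example.png"))
```

Cells whose links cannot be mapped this way yield no entity and therefore no candidate relationships.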
MINING RDF FROM WIKITABLES
SELECT DISTINCT ?p1 ?p2
WHERE { { <e1> ?p1 <e2> } UNION { <e2> ?p2 <e1> } }
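The lookup can be sketched in two parts: a helper that fills the query template above for a concrete entity pair, and an in-memory equivalent that is convenient for testing without an endpoint. The function names and the toy knowledge base (with shortened `dbr:`/`dbp:` names) are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def build_query(e1: str, e2: str) -> str:
    """Fill the UNION query template for one pair of entity URIs."""
    return ("SELECT DISTINCT ?p1 ?p2 WHERE { "
            f"{{ <{e1}> ?p1 <{e2}> }} UNION {{ <{e2}> ?p2 <{e1}> }} }}")


def candidate_relations(kb: Iterable[Triple], e1: str,
                        e2: str) -> Tuple[Set[str], Set[str]]:
    """Return (predicates from e1 to e2, predicates from e2 to e1)."""
    triples = list(kb)
    forward = {p for (s, p, o) in triples if s == e1 and o == e2}
    backward = {p for (s, p, o) in triples if s == e2 and o == e1}
    return forward, backward


# Toy knowledge base, for illustration only.
kb = [
    ("dbr:David_de_Gea", "dbp:currentClub", "dbr:Manchester_United_F.C."),
    ("dbr:David_de_Gea", "dbp:position", "dbr:Goalkeeper"),
]
query = build_query("http://dbpedia.org/resource/Manchester_United_F.C.",
                    "http://dbpedia.org/resource/David_de_Gea")
forward, backward = candidate_relations(
    kb, "dbr:Manchester_United_F.C.", "dbr:David_de_Gea")
print(query)
print(forward, backward)
```

Keeping the two directions separate matters: a predicate found only in the backward direction means the table entity is the subject and the other entity the object.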
(2/6)
• We detected some weak relationships
• … We need more filtering for relationships
MINING RDF FROM WIKITABLES
dbp:currentClub
dbp:youthClubs
(3/6)
• Features at different levels used to train
Machine Learning models
• Article features (e.g., # of tables)
• Table features (e.g., #rows, #columns, ratios)
• Cell features (e.g., # of entities, string length, has
format)
• Column features (e.g., # of entities, # of unique
entities)
• Predicate/Column features (e.g., string similarity, # of
rows where relation holds)
• Predicate features (e.g., triple count, count unique)
• Triple features (e.g., is the table from article or body)
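A few of the features listed above are easy to compute once a table is a matrix of entity identifiers. The sketch below covers one table-level, one column-level, and one predicate/column-level feature; the function names, the use of `None` for cells without an entity, and the toy data are assumptions, and the full feature set used for training is larger.

```python
from typing import Dict, List, Optional, Set, Tuple

Matrix = List[List[Optional[str]]]
Triple = Tuple[str, str, str]


def table_features(matrix: Matrix) -> Dict[str, float]:
    """Table level: number of rows, number of columns, and their ratio."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    ratio = n_rows / n_cols if n_cols else 0.0
    return {"rows": n_rows, "cols": n_cols, "row_col_ratio": ratio}


def column_features(matrix: Matrix, col: int) -> Dict[str, int]:
    """Column level: number of entities and number of unique entities."""
    entities = [row[col] for row in matrix if row[col] is not None]
    return {"entities": len(entities), "unique_entities": len(set(entities))}


def rows_where_relation_holds(matrix: Matrix, col_a: int, col_b: int,
                              predicate: str, kb: Set[Triple]) -> int:
    """Predicate/column level: rows whose (col_a, predicate, col_b) triple
    already exists in the knowledge base."""
    return sum(1 for row in matrix
               if (row[col_a], predicate, row[col_b]) in kb)


# Toy data, for illustration only.
matrix = [["e1", "c1"], ["e2", "c1"], ["e3", None]]
kb = {("e1", "p", "c1")}
print(table_features(matrix))
print(column_features(matrix, 1))
print(rows_where_relation_holds(matrix, 0, 1, "p", kb))
```

A predicate that already holds in many rows of a column pair is a stronger candidate for the remaining rows than one that holds only once.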
MINING RDF FROM WIKITABLES (4/6)
• The experimentation set-up
– Wikipedia dump from February 2013
– DBpedia dump version 3.8
– 8 machines (ca. 2005) with 4GB of RAM,
2.2GHz single-core processors
• After 12 days we got 34.9 million unique
triples not in DBpedia
• We manually annotated a sample of 750
triples to train the ML models
MINING RDF FROM WIKITABLES (5/6)
MINING RDF FROM WIKITABLES (6/6)
            Bagging DT   Simple Logistic   SVM
accuracy    78.1%        78.53%            72.6%
precision   81.5%        79.62%            72.4%
recall      77.4%        79.01%            75.8%
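For reference, the three reported metrics follow the standard definitions for a binary classifier (here, "correct triple" versus "incorrect triple"). The sketch below uses made-up confusion counts purely to show the arithmetic; they are not the counts behind the table above.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```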
• In this work we aimed to
– Interpret the semantics of tables using KBs
– Enrich KBs with new facts mined from tables
• With the best model we got 7.9 million
unique novel triples
• We do not yet
– Consider literals/string values in the cells
– Exploit the domain/range of predicates
– Test other KBs like Freebase and YAGO
CONCLUSION
• Most of the related papers use some
knowledge base, such as DBpedia
– They can benefit from new RDF triples
extracted from Wikipedia tables
• We can use the similarity proposed in
Knowledge-based graph document modeling, by
Schuhmacher and Ponzetto, to improve the
relation extraction
• And use the paper Trust, but Verify: Predicting
Contribution Quality for Knowledge Base Construction
and Curation, by Chun How Tan et al., to assess the
quality of the output triples
CONTRAST WITH OTHER PAPERS
Thank you!
Emir Muñoz
SVM our third best model 
http://emunoz.org/wikitables

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Using Linked Data to Mine RDF from Wikipedia's Tables

  • 1. Using Linked Data to Mine RDF from Wikipedia’s Tables http://emunoz.org/wikitables Emir Muñoz Fujitsu (Ireland) Limited National University of Ireland Galway Joint work with A. Hogan and A. Mileo WSDM 2014 @ New York City, February 24-28
  • 2. Emir M. - WSDM, New York City, USA, 27th February, 2014 2 MOTIVATION (1/10)
  • 3. MOTIVATION (2/10) The tables embedded in Wikipedia articles contain rich, semi-structured encyclopaedic content … BUT we cannot query all that content… A query example: Wikipedia tables or tables in the body are ignored [Borrowed from Entity Linking tutorial]
  • 4. Results at 25-02-2014
  • 5. First result
  • 6. Second result: 10 airlines
  • 7. Third result: 19 airlines
  • 8. MOTIVATION (7/10) • Same query in SPARQL over DBpedia: SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Airbus_A380> ?p ?o . } FAIL
  • 10. No evidence of A380
  • 11. MOTIVATION (10/10) • We perform automatic fact extraction (RDF) from Wikipedia tables using KBs. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 12. EXTRACTING RDF FROM TABLES (1/4) • As far as we know, DBpedia and YAGO ignore tables in the article body – They mainly focus on infoboxes • Languages such as R2RML can express custom mappings from relational database tables to RDF – Each row as a subject, each column as a predicate and each cell as an object – Needs a mapping definition
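The row/column/cell convention mentioned above can be sketched in a few lines. This is not R2RML itself, only a hand-rolled illustration of the idea; the function name `rows_to_triples`, the `subject_column` parameter and the `http://example.org/` namespace are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def rows_to_triples(
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    subject_column: int = 0,
    base: str = "http://example.org/",  # hypothetical namespace, not from the talk
) -> List[Triple]:
    """Turn each row into a subject, each column into a predicate, each cell into an object."""
    triples: List[Triple] = []
    for row in rows:
        # The cell in the subject column names the row's subject.
        subject = base + "resource/" + row[subject_column].replace(" ", "_")
        for col, (name, value) in enumerate(zip(header, row)):
            if col == subject_column or value == "":
                continue  # skip the subject cell itself and empty cells
            predicate = base + "property/" + name.replace(" ", "_")
            triples.append((subject, predicate, value))
    return triples
```

The sketch also shows why a mapping definition is needed: something has to say which column holds the subject and which predicate each header stands for.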
  • 13. EXTRACTING RDF FROM TABLES (2/4) • [Limaye et al. 2010; Mulwad et al. 2010 & 2013] presented approaches using an in-house KB and small datasets for validation – Entity recognition/disambiguation – Determine types for each column – Determine relationships between columns • We focus on Wikipedia tables, running our algorithms over the entire corpus with “row-centric” features for Machine Learning models
  • 14. EXTRACTING RDF FROM TABLES (3/4) • Extraction of two types of relationships – Between the main entity and cells in the same column, e.g., “Manchester United F.C.” and “David de Gea” – Between entities in different columns but the same row. Example predicates shown: dbp:currentClub, dbp:position
  • 15. EXTRACTING RDF FROM TABLES (4/4)
  • 16. WIKITABLES SURVEY (1/2) • Wikipedia dump from February 13th, 2013 • Table taxonomy • 1.14 million tables
  • 17. WIKITABLES SURVEY (2/2) • Table model – Input: a source of tables (a set of tables) • E.g., a Wikipedia article • Each table in the source is modeled as a matrix • We normalize the tables and convert each HTML table into a matrix
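As a rough illustration of turning an HTML table into a matrix, the sketch below expands `colspan` and `rowspan` so that every row ends up with the same number of cells. It is one plausible normalization, not necessarily the one used in the talk; the names `TableToMatrix` and `html_table_to_matrix` are made up for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableToMatrix(HTMLParser):
    """Collects the cells of a single, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []       # each row: list of (text, colspan, rowspan)
        self._cell = None    # text fragments of the cell currently open
        self._spans = (1, 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._spans = (int(a.get("colspan") or 1), int(a.get("rowspan") or 1))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self.rows[-1].append((text, *self._spans))
            self._cell = None


def html_table_to_matrix(html):
    """Return a rectangular list-of-lists, repeating the text of spanned cells."""
    parser = TableToMatrix()
    parser.feed(html)
    grid = {}
    for r, row in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Repeating the text of a spanned cell in every slot it covers keeps each row self-contained, which suits a row-by-row analysis; other choices (for example, leaving the extra slots empty) are equally possible.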
  • 18. MINING RDF FROM WIKITABLES (1/6) • To extract RDF from Wikitables we rely on a reference knowledge base – DBpedia, version 3.8 • Pipeline: Wikipedia table → Extract links in the cells → Map links to DBpedia → Lookups on DBpedia to find relationships between entities in the same row → Candidate relationships
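The link-mapping step can be approximated with the common convention that a DBpedia resource URI is `http://dbpedia.org/resource/` followed by the English Wikipedia article title. The sketch below assumes that convention; the function name `wiki_link_to_dbpedia` is hypothetical, redirects are not resolved, and the colon check is a crude way to skip namespaces such as `File:` (it would also skip legitimate titles that contain a colon).

```python
from typing import Optional
from urllib.parse import unquote, urlparse

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wiki_link_to_dbpedia(href: str) -> Optional[str]:
    """Map a Wikipedia article link found in a table cell to a DBpedia resource URI."""
    path = urlparse(href).path  # drops any #fragment and ?query part
    if not path.startswith("/wiki/"):
        return None  # not an internal article link
    title = unquote(path[len("/wiki/"):]).replace(" ", "_")
    if not title or ":" in title:
        return None  # empty title or a namespaced page (assumption: skip these)
    return DBPEDIA_RESOURCE + title
```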
  • 19. MINING RDF FROM WIKITABLES (2/6) • We aim to discover: – Relations between entities on the same row – Relations between entities in the table and the protagonist of the article • Map the links inside the cells to RDF resources • Get candidate relationships from the KB: SELECT DISTINCT ?p1 ?p2 WHERE { { <e1> ?p1 <e2> } UNION { <e2> ?p2 <e1> } }
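To make the candidate lookup concrete, the sketch below builds the query above for a pair of entities and, separately, emulates the same lookup over a small in-memory set of triples so that it runs without a SPARQL endpoint. The function names and the toy triples are illustrative assumptions, not part of the original system.

```python
from itertools import combinations


def candidate_query(e1: str, e2: str) -> str:
    """Build the SPARQL query that asks for predicates linking e1 and e2 in either direction."""
    return (
        "SELECT DISTINCT ?p1 ?p2 WHERE { "
        f"{{ <{e1}> ?p1 <{e2}> }} UNION {{ <{e2}> ?p2 <{e1}> }} }}"
    )


def candidate_relations(row_entities, kb):
    """Emulate the lookup locally: kb is a set of (subject, predicate, object) triples."""
    found = []
    for e1, e2 in combinations(row_entities, 2):
        for s, p, o in kb:
            # A triple is a candidate if it links the pair in either direction.
            if (s, o) == (e1, e2) or (s, o) == (e2, e1):
                found.append((s, p, o))
    return sorted(found)


# Toy knowledge base for illustration only.
toy_kb = {
    ("dbr:David_de_Gea", "dbp:currentClub", "dbr:Manchester_United_F.C."),
    ("dbr:X", "dbp:p", "dbr:Y"),
}
```

Calling `candidate_relations(["dbr:Manchester_United_F.C.", "dbr:David_de_Gea"], toy_kb)` returns the single `dbp:currentClub` triple; a predicate found this way for some rows is what the slides call a candidate relationship for the column pair.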
  • 20. MINING RDF FROM WIKITABLES (3/6) • We detected some weak relationships • … We need more filtering for relationships. Example predicates shown: dbp:currentClub, dbp:youthClubs
  • 21. MINING RDF FROM WIKITABLES (4/6) • Features at different levels used to train Machine Learning models • Article features (e.g., # of tables) • Table features (e.g., #rows, #columns, ratios) • Cell features (e.g., # of entities, string length, has format) • Column features (e.g., # of entities, # of unique entities) • Predicate/Column features (e.g., string similarity, # of rows where relation holds) • Predicate features (e.g., triple count, count unique) • Triple features (e.g., is the table from article or body)
  • 22. MINING RDF FROM WIKITABLES (5/6) • The experimental set-up – Wikipedia dump from February 2013 – DBpedia dump version 3.8 – 8 machines (ca. 2005) with 4GB of RAM, 2.2GHz single-core processors • After 12 days we got 34.9 million unique triples not in DBpedia • We manually annotated a sample of 750 triples to train the ML models
  • 23. MINING RDF FROM WIKITABLES (6/6)
      Metric      Bagging DT   Simple Logistic   SVM
      accuracy    78.1%        78.53%            72.6%
      precision   81.5%        79.62%            72.4%
      recall      77.4%        79.01%            75.8%
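For readers less familiar with the three metrics, the sketch below computes them from the counts of a binary confusion matrix using the standard definitions. The counts in the usage note are invented for illustration and are not taken from the talk, and the slides do not say whether the reported figures are per-class or averaged.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision and recall for the positive class of a binary classifier."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all predictions that are right
        "precision": tp / (tp + fp),    # share of predicted positives that are right
        "recall": tp / (tp + fn),       # share of actual positives that are found
    }
```

With the made-up counts `binary_metrics(tp=8, fp=2, tn=7, fn=3)`, accuracy is 15/20, precision is 8/10 and recall is 8/11.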
  • 24. CONCLUSION • In this work we aimed to – Interpret the semantics of tables using KBs – Enrich KBs with new facts mined from tables • With the best model we got 7.9 million unique novel triples • We still don’t – Consider literals/string values in the cells – Exploit domain/range of predicates – Test other KBs like Freebase and YAGO
  • 25. CONTRAST WITH OTHER PAPERS • Most of the related papers use some knowledge base, such as DBpedia – They can benefit from new RDF triples extracted from Wikipedia tables • We can use the similarity proposed in “Knowledge-based graph document modeling”, by Schuhmacher and Ponzetto, to improve the relation extraction • And use the paper “Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation”, Chun How et al., to determine the correctness of the output triples
  • 26. Thank you! Emir Muñoz (SVM: our third best model) http://emunoz.org/wikitables