DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia


Determining the semantic relatedness (i.e., the strength of a relation) of two resources in DBpedia (or other Linked Data sources) is a problem that quite a few approaches have addressed in recent years. However, there are no large-scale benchmark datasets for comparing such approaches, so it remains an open problem to determine which approaches work better than others. Furthermore, large-scale datasets for training machine-learning-based approaches are not available. DBpediaNYD is a large-scale synthetic silver standard benchmark dataset that contains symmetric and asymmetric similarity values obtained using a web search engine.


DBpediaNYD –
A Silver Standard Benchmark Dataset
for Semantic Relatedness in DBpedia

10/22/13 Heiko Paulheim

Motivation
• There are quite a few approaches to entity ranking/statement weighting on Linked Data, and on DBpedia in particular
• Examples:
– Franz et al. (2009) – Tensor Decomposition
– Meij et al. (2009) – Machine Learning
– Mirizzi et al. (2010) – Web Search Engines
– Mulay and Kumar (2011) – Machine Learning
– Hees et al. (2012) – Crowd Sourcing
– Nunes et al. (2012) – Social Network Analysis


Motivation
• However,
– none of those have been competitively evaluated
– none of those have been evaluated at large scale
• Evaluation so far relies on
– small private data sets
– user studies
• Approaches using Machine Learning
– require training data
– which is expensive to obtain


The Dataset
• Large-scale dataset (several thousand instances)
– statements with strengths
• Strength value: Normalized Google Distance (NGD)
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
• NGD has been shown to correlate with human strength associations
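
The formula itself is not preserved in this transcript. As a minimal sketch, the function below implements the standard Normalized Google Distance as defined by Cilibrasi and Vitányi, using the symbols f(x), f(y), f(x,y) and M from the list above. The function and parameter names are illustrative and do not come from the slides.

```python
import math


def ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Standard (symmetric) Normalized Google Distance from raw hit counts.

    fx, fy : number of search results containing x, respectively y
    fxy    : number of search results containing both x and y
    m      : number of pages in the search engine index
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("hit counts must be positive to take logarithms")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(m) - min(log_fx, log_fy)
    return numerator / denominator
```

Swapping x and y leaves the result unchanged, because only the maximum and minimum of the two single-term counts enter the formula. A value of 0 means the two terms always occur together; larger values mean a weaker association.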


The Dataset
• NGD is a symmetric value
– the NYD dataset also contains asymmetric values
• Asymmetric Normalized Google Distance
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
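
The asymmetric formula is not preserved in this transcript either. The sketch below shows one plausible directional variant, which is an assumption and not necessarily the definition used for DBpediaNYD: it drops the max/min of the symmetric NGD and uses f(x) in the numerator and f(y) in the denominator. The function name `ngd_asym` is hypothetical.

```python
import math


def ngd_asym(fx: float, fy: float, fxy: float, m: float) -> float:
    """Directional distance x -> y (ASSUMED form, not taken from the slides).

    Uses log f(x) - log f(x,y) in the numerator and log M - log f(y) in the
    denominator, so swapping x and y generally changes the value.
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("hit counts must be positive to take logarithms")
    numerator = math.log(fx) - math.log(fxy)
    denominator = math.log(m) - math.log(fy)
    return numerator / denominator
```

Under this assumed form, the direction starting from the more frequent term equals the symmetric NGD, while the direction starting from the rarer term is smaller. That pattern matches the John Lennon / Yoko Ono example given later in the deck, but it does not prove that this is the formula the author used.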


Constructing the Dataset
• We sampled 10,000 statements
– with DBpedia resources as subject and object (e.g., no type statements, no literals)
– with dbpedia or dbpprop predicate
• ...and computed symmetric/asymmetric NGD
– using the labels as search strings
– using Yahoo BOSS
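
The counting step above can be sketched as follows, assuming three queries per statement: one for each label and one for both labels together. The search backend is passed in as a callable, because the actual Yahoo BOSS request format is not shown in the slides; `collect_counts`, `hit_count`, and the quoted-phrase query syntax are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def collect_counts(
    statements: Iterable[Pair],
    hit_count: Callable[[str], int],
) -> Dict[Pair, Tuple[int, int, int]]:
    """Return (f(x), f(y), f(x,y)) for each (subject label, object label) pair."""
    counts: Dict[Pair, Tuple[int, int, int]] = {}
    for x, y in statements:
        fx = hit_count(f'"{x}"')           # results containing x
        fy = hit_count(f'"{y}"')           # results containing y
        fxy = hit_count(f'"{x}" "{y}"')    # results containing both x and y
        counts[(x, y)] = (fx, fy, fxy)
    return counts


# Stand-in backend that records queries instead of calling a search engine.
calls: List[str] = []


def fake_hit_count(query: str) -> int:
    calls.append(query)
    return 42  # placeholder value, not a real hit count


counts = collect_counts([("John Lennon", "Yoko Ono")], fake_hit_count)
```

Injecting the backend keeps the sketch independent of any particular search API and makes the number of calls easy to check.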


The Dataset
• Random sample of 10,000 statements
– i.e., 30,000 search engine calls (0.80 USD per 1,000 calls → 24 USD)
• 3,058 pairs of resources had to be discarded
– f(x)<f(x,y) or f(y)<f(x,y)
– search engines sometimes don't count properly :-(
• Result:
– 6,942 weighted statements (symmetric)
– 13,884 weighted statements (asymmetric)
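
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below also includes the consistency check used to discard pairs. The constant and function names are illustrative, and the factor of three calls per statement is inferred from 30,000 calls for 10,000 statements.

```python
def is_consistent(fx: int, fy: int, fxy: int) -> bool:
    """A joint count can never exceed either single-term count."""
    return fx >= fxy and fy >= fxy


SAMPLED_STATEMENTS = 10_000
DISCARDED_PAIRS = 3_058
CALLS_PER_STATEMENT = 3            # f(x), f(y), f(x,y)
PRICE_PER_1000_CALLS_USD = 0.80

total_calls = SAMPLED_STATEMENTS * CALLS_PER_STATEMENT
cost_usd = round(total_calls / 1000 * PRICE_PER_1000_CALLS_USD, 2)

# One symmetric value per remaining pair, two directed values per pair.
symmetric_statements = SAMPLED_STATEMENTS - DISCARDED_PAIRS
asymmetric_statements = 2 * symmetric_statements
```

A pair fails `is_consistent` exactly when f(x)<f(x,y) or f(y)<f(x,y), which is the discard condition stated on the slide.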


The Dataset
• Example:
– dbpedia:John_Lennon and dbpedia:Yoko_Ono
• Distances:
– symmetric: 0.18
– John Lennon → Yoko Ono: 0.18
– Yoko Ono → John Lennon: 0.03
• Explanation:
– Yoko Ono is famous for being John Lennon's wife, and is most often mentioned in that context
– John Lennon is more famous for being a member of the Beatles


Example: the DBpedia FindRelated Service
• We trained two regression SVMs (LibSVM) based on DBpediaNYD
– one for symmetric, one for asymmetric distances
– the service finds the most related among a resource's linked resources
• Example results:
• http://wiki.dbpedia.org/FindRelated
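
The slides do not describe the features or parameters of the trained models, so the sketch below covers only the final ranking step: given any function that predicts a distance between two resources, return the linked resources with the smallest predicted distance. `most_related`, `predict_distance`, and the toy values are hypothetical and are not output of the actual service.

```python
from typing import Callable, Iterable, List, Tuple


def most_related(
    resource: str,
    linked: Iterable[str],
    predict_distance: Callable[[str, str], float],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank linked resources by predicted distance, smallest (most related) first."""
    scored = [(other, predict_distance(resource, other)) for other in linked]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy stand-in for a trained regression model (made-up distances).
toy_distances = {"A": 0.5, "B": 0.1, "C": 0.3}
top_two = most_related("X", toy_distances, lambda _r, o: toy_distances[o], k=2)
```

With an asymmetric model, `predict_distance(resource, other)` and `predict_distance(other, resource)` may differ, so the argument order determines which direction is ranked.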


Conclusion and Outlook
• DBpediaNYD allows for large-scale evaluation
– it is a silver standard rather than a gold standard
– it does not replace manually created gold standards
• Future work
– validate DBpediaNYD with users
– compare search engines


Something Completely Different
• Challenges enumerated in the workshop intro this morning
– “Logical inference on noisy data”
• Talk on “Type Inference on Noisy RDF Data”
– the approach was actually applied in DBpedia 3.9
– Friday, 3:15, Bayside 204A




