Crowdsourcing Linked Data Quality Assessment
Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer
and Jens Lehmann
@ISWC2013

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association

www.kit.edu
Motivation
Varying quality of Linked Data (LD) sources
Some quality issues require interpretation
that humans can perform easily, e.g.:
dbpedia:Dave_Dobbyn dbprop:dateOfBirth "3".

Solution: Include human verification in the
process of LD quality assessment
Direct application: Detecting patterns in errors
may make it possible to identify (and correct) faulty
extraction mechanisms

28.10.2013

Acosta et al. – Crowdsourcing Linked Data Quality Assessment

Institut für Angewandte Informatik und Formale
Beschreibungsverfahren (AIFB)
Research questions
RQ1: Is it possible to detect quality issues in LD data sets
via crowdsourcing mechanisms?

RQ2: What type of crowd is most suitable for each type of
quality issue?

RQ3: Which types of errors are made by lay users and
experts when assessing RDF triples?
Related work
Crowdsourcing & Linked Data
• ZenCrowd: entity resolution
• CrowdMAP: ontology alignment
• GWAP for LD
• DBpedia
• Assessing LD mappings

Web of data quality assessment
• Quality characteristics of LD data sources
• (Automatic), (Semi-automatic), (Manual): WIQA, Sieve

Our work
OUR APPROACH

Methodology

[Diagram: Dataset → {s p o .} → Correct, or Incorrect + Quality issue; the flow is annotated with steps 1 to 3]

Steps to implement the methodology
1. Selecting LD quality issues to crowdsource
2. Selecting the appropriate crowdsourcing approaches
3. Designing and generating the interfaces to present the data
   to the crowd
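
The outcome sketched in the methodology diagram (a triple is either correct, or incorrect together with a quality issue) could be represented roughly as follows. This is a minimal sketch, not the authors' implementation: the names `Triple`, `Assessment`, and `QualityIssue` are hypothetical, and the enum values simply restate the three issue categories used in this talk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QualityIssue(Enum):
    # The three issue categories named in the slides.
    INCORRECT_OBJECT = "incorrect object"
    INCORRECT_DATATYPE_OR_LANGUAGE = "incorrect data type or language tag"
    INCORRECT_LINK = "incorrect link"


@dataclass(frozen=True)
class Triple:
    # Hypothetical container for one {s p o .} statement.
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Assessment:
    # Hypothetical record of one judgement on a triple.
    triple: Triple
    correct: bool
    issue: Optional[QualityIssue] = None  # set only for incorrect triples

    def __post_init__(self) -> None:
        # Enforce "Correct" vs. "Incorrect + Quality issue" from the diagram.
        if self.correct and self.issue is not None:
            raise ValueError("a correct triple carries no quality issue")
        if not self.correct and self.issue is None:
            raise ValueError("an incorrect triple needs a quality issue")


triple = Triple("dbpedia:Dave_Dobbyn", "dbprop:dateOfBirth", '"3"')
flagged = Assessment(triple, correct=False, issue=QualityIssue.INCORRECT_OBJECT)
```

Keeping the issue optional but validated means an incomplete judgement fails early instead of silently entering the results.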

1. Selecting LD quality issues to crowdsource

Three categories of quality problems occur
in DBpedia [Zaveri2013] and can be crowdsourced:
Incorrect object
  Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth "3".

Incorrect data type or language tags
  Example: dbpedia:Torishima_Izu_Islands foaf:name "[literal missing]"@en.

Incorrect link to "external Web pages"
  Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink
  <http://cedarlakedvd.com/>
2. Selecting appropriate crowdsourcing approaches (1)

Find stage: Contest
• LD experts
• Difficult task
• Final prize
• TripleCheckMate [Kontoskostas2013]

Verify stage: Microtasks
• Workers
• Easy task
• Micropayments
• MTurk (http://mturk.com)

Adapted from [Bernstein2010]
3. Presenting the data to the crowd

Microtask interfaces: MTurk tasks
Incorrect object
• Selection of foaf:name or rdfs:label to extract
  human-readable descriptions
• Values extracted automatically
  from Wikipedia infoboxes
• Link to the Wikipedia article via
  foaf:isPrimaryTopicOf

Incorrect data type or language tag

Incorrect outlink
• Preview of external pages via
  an HTML iframe
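
One way to pick the human-readable description for a task interface is sketched below. It is an illustration under stated assumptions: the property dictionary layout, the function names, the preference order between foaf:name and rdfs:label, and the fallback to the local name are all hypothetical and not taken from the slides.

```python
from typing import Dict, List, Optional

# Predicates tried in this order; the order itself is an assumption.
LABEL_PREDICATES = ("foaf:name", "rdfs:label")


def human_readable_label(resource: str, properties: Dict[str, List[str]]) -> str:
    """Return a display label for a resource shown to crowd workers."""
    for predicate in LABEL_PREDICATES:
        values = properties.get(predicate, [])
        if values:
            return values[0]
    # Fallback (assumption): derive a label from the prefixed name.
    return resource.split(":", 1)[-1].replace("_", " ")


def wikipedia_link(properties: Dict[str, List[str]]) -> Optional[str]:
    """Return the Wikipedia article linked via foaf:isPrimaryTopicOf, if any."""
    values = properties.get("foaf:isPrimaryTopicOf", [])
    return values[0] if values else None


# Illustrative property values, not data from the experiment.
props = {
    "rdfs:label": ["Dave Dobbyn"],
    "foaf:isPrimaryTopicOf": ["http://en.wikipedia.org/wiki/Dave_Dobbyn"],
}
label = human_readable_label("dbpedia:Dave_Dobbyn", props)
article = wikipedia_link(props)
```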

EXPERIMENTAL STUDY

Experimental design
• Crowdsourcing approaches:
• Find stage: Contest with LD experts
• Verify stage: Microtasks (5 assignments)

• Creation of a gold standard:
• Two of the authors of this paper (MA, AZ) generated the gold
standard for all the triples obtained from the contest
• Each author independently evaluated the triples
• Conflicts were resolved via mutual agreement

• Metric: precision
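
As a reminder of the metric, precision is the share of flagged items that the gold standard confirms, TP / (TP + FP). The sketch below assumes that "positive" means "the crowd marked the triple as incorrect"; that reading, the function name, and the boolean encoding are assumptions for illustration.

```python
from typing import Sequence


def precision(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Precision = TP / (TP + FP), where True means "triple is incorrect"."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # True positives: flagged by the crowd and confirmed by the gold standard.
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    # False positives: flagged by the crowd but not confirmed.
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    if tp + fp == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return tp / (tp + fp)


# Toy data: three triples flagged, two of them confirmed by the gold standard.
crowd = [True, True, True, False]
gold_standard = [True, False, True, True]
score = precision(crowd, gold_standard)
```

Note that the unflagged fourth triple does not affect the score: precision ignores what the crowd missed.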

Overall results

                                   LD Experts               Microtask workers
Number of distinct participants    50                       80
Total time                         3 weeks (predefined)     4 days
Total triples evaluated            1,512                    1,073
Total cost                         ~ US$ 400 (predefined)   ~ US$ 43

Precision results: Incorrect object task
• MTurk workers can be used to reduce the error rate of LD experts in
  the Find stage

Triples compared   LD Experts   MTurk (majority voting: n=5)
509                0.7151       0.8977

• 117 DBpedia triples had predicates related to dates with
  incorrect/incomplete values:
  "2005 Six Nations Championship" Date 12 .
• 52 DBpedia triples had erroneous values from the source:
  "English (programming language)" Influenced by ? .
  • Experts classified all these triples as incorrect
  • Workers compared values against Wikipedia and successfully classified
    these triples as "correct"
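
The MTurk column aggregates five assignments per triple by majority voting. A minimal sketch of that aggregation follows; the answer strings and the decision to raise an error on a tie are assumptions, since the slides do not say how ties were handled.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the assignments for one triple."""
    if not answers:
        raise ValueError("at least one answer is required")
    ranked = Counter(answers).most_common()
    # Tie handling is an assumption: refuse to pick a winner arbitrarily.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("no majority: top answers are tied")
    return ranked[0][0]


# Five hypothetical assignments for a single triple (n=5).
votes = ["incorrect", "incorrect", "correct", "incorrect", "correct"]
decision = majority_vote(votes)
```

With two answer options and an odd number of assignments a tie cannot occur; the check only matters if a third option such as "I don't know" is offered.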

Precision results: Incorrect data type task

Triples compared   LD Experts   MTurk (majority voting: n=5)
341                0.8270       0.4752

[Bar chart: number of triples (y-axis, 0 to 140) per data type (x-axis), with series Experts TP, Experts FP, Crowd TP, Crowd FP. Data types: Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified / URI]

Precision results: Incorrect link task

Triples compared   Baseline   LD Experts   MTurk (n=5 majority voting)
223                0.2598     0.1525       0.9412

• We analyzed the 189 misclassifications by the experts:
  [Pie chart with segments Freebase links, Wikipedia images, and External links
  (segment shares of 11%, 39%, and 50%; assignment to segments not shown)]

• The 6% of misclassifications by the workers correspond to
  pages in a language other than English.
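
The counts quoted on this slide can be cross-checked against the precision values with simple arithmetic. The sketch treats 1 - precision as the misclassified share, which holds only if all 223 compared triples count toward the precision denominator; that reading is an assumption, and the variable names are hypothetical.

```python
# Figures copied from the slide.
triples_compared = 223
expert_precision = 0.1525
worker_precision = 0.9412

# Expert misclassifications implied by the precision value.
expert_misclassified = round(triples_compared * (1 - expert_precision))

# Share of worker misclassifications, as a rounded percentage.
worker_error_percent = round((1 - worker_precision) * 100)

print(expert_misclassified)   # 189
print(worker_error_percent)   # 6
```

Both values agree with the 189 misclassifications and the 6% stated above.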
Final discussion
RQ1: Is it possible to detect quality issues in LD data sets via
crowdsourcing mechanisms?

Both forms of crowdsourcing can be applied to detect certain
LD quality issues
RQ2: What type of crowd is most suitable for each type of quality issue?

The effort of LD experts must be applied to tasks
demanding domain-specific skills. The MTurk crowd was
exceptionally good at performing data comparisons.
RQ3: Which types of errors are made by lay users and experts?

Lay users do not have the skills to solve domain-specific
tasks, while experts' performance is very low on tasks that
demand extra effort (e.g., checking an external page).
CONCLUSIONS & FUTURE WORK

Conclusions & Future Work
A crowdsourcing methodology for LD quality assessment:
Find stage: LD experts
Verify stage: MTurk workers

Crowdsourcing approaches are feasible for detecting the
studied quality issues
Application: Detecting patterns in errors to fix the extraction
mechanisms

Future Work
Conducting new experiments (other quality issues and domains)
Integration of the crowd into curation processes and tools
References & Acknowledgements
[Bernstein2010] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman,
D. R. Karger, D. Crowell, and K. Panovich. Soylent: a word processor with a crowd
inside. In Proceedings of the 23rd annual ACM symposium on User interface
software and technology, UIST '10, pages 313–322, New York, NY, USA, 2010. ACM.

[Kontoskostas2013] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate:
A Tool for Crowdsourcing the Quality Assessment of Linked Data. Knowledge
Engineering and the Semantic Web, 2013.

[Zaveri2013] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer.
Quality assessment methodologies for linked open data. Under
review, http://www.semantic-web-journal.net/content/quality-assessmentmethodologies-linked-open-data.

Approach
Find stage: Contest (LD experts, difficult task, final prize) with TripleCheckMate
Verify stage: Microtasks (workers, easy task, micropayments) with MTurk
MTurk tasks: Incorrect object, Incorrect data type, Incorrect outlink

Results: Precision

                          Object values   Data types   Interlinks
Linked Data experts       0.7151          0.8270       0.1525
MTurk (majority voting)   0.8977          0.4752       0.9412


QUESTIONS?
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Kürzlich hochgeladen (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Crowdsourcing Linked Data Quality Assessment

  • 1. Crowdsourcing Linked Data Quality Assessment Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer and Jens Lehmann @ISWC2013 KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu
  • 2. Motivation: Varying quality of Linked Data sources. Some quality issues require interpretation that humans can easily perform, e.g. dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”. Solution: Include human verification in the process of LD quality assessment. Direct application: Detecting patterns in errors may allow the extraction mechanisms to be identified (and corrected). 28.10.2013 Acosta et al. – Crowdsourcing Linked Data Quality Assessment, Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB)
  • 3. Research questions RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing mechanisms? RQ2: What type of crowd is most suitable for each type of quality issue? RQ3: Which types of errors are made by lay users and experts when assessing RDF triples?
  • 4. Related work. Crowdsourcing & Linked Data: ZenCrowd (entity resolution), CrowdMAP (ontology alignment), GWAP for LD, assessing LD mappings (DBpedia). Web of data quality assessment: quality characteristics of LD data sources; automatic, semi-automatic and manual approaches; WIQA, Sieve. Our work.
  • 5. OUR APPROACH
  • 6. Methodology: The triples {s p o .} of a dataset are assessed and classified as either correct or incorrect, with the quality issue attached to incorrect ones. Steps to implement the methodology: (1) Selecting LD quality issues to crowdsource; (2) Selecting the appropriate crowdsourcing approaches; (3) Designing and generating the interfaces to present the data to the crowd.
  • 7. Step 1: Selecting LD quality issues to crowdsource. Three categories of quality problems occur in DBpedia [Zaveri2013] and can be crowdsourced: (a) Incorrect object. Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”. (b) Incorrect data type or language tags. Example: dbpedia:Torishima_Izu_Islands foaf:name “ ”@en. (c) Incorrect link to “external Web pages”. Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>
  • 8. Step 2: Selecting appropriate crowdsourcing approaches (1). Find stage: contest among LD experts; difficult task; final prize; tool: TripleCheckMate [Kontoskostas2013]. Verify stage: microtasks for workers; easy task; micropayments; platform: MTurk (http://mturk.com). Adapted from [Bernstein2010].
  • 9. Step 3: Presenting the data to the crowd. Microtask interfaces: MTurk tasks. Incorrect object: selection of foaf:name or rdfs:label to extract human-readable descriptions; values extracted automatically from Wikipedia infoboxes; link to the Wikipedia article via foaf:isPrimaryTopicOf. Incorrect data type or language tag. Incorrect outlink: preview of external pages by implementing an HTML iframe.
  • 10. EXPERIMENTAL STUDY
  • 11. Experimental design. Crowdsourcing approaches: Find stage: contest with LD experts; Verify stage: microtasks (5 assignments). Creation of a gold standard: two of the authors of this paper (MA, AZ) generated the gold standard for all the triples obtained from the contest; each author independently evaluated the triples; conflicts were resolved via mutual agreement. Metric: precision.
  • 12. Overall results (LD experts vs. microtask workers). Number of distinct participants: 50 vs. 80. Total time: 3 weeks (predefined) vs. 4 days. Total triples evaluated: 1,512 vs. 1,073. Total cost: ~US$ 400 (predefined) vs. ~US$ 43.
  • 13. Precision results: Incorrect object task. MTurk workers can be used to reduce the error rates of LD experts for the Find stage. Triples compared: 509; precision of LD experts: 0.7151; precision of MTurk (majority voting, n=5): 0.8977. 117 DBpedia triples had predicates related to dates with incorrect/incomplete values, e.g. ”2005 Six Nations Championship” Date 12 . 52 DBpedia triples had erroneous values from the source, e.g. ”English (programming language)” Influenced by ? . Experts classified all these triples as incorrect. Workers compared values against Wikipedia and successfully classified these triples as “correct”.
  • 14. Precision results: Incorrect data type task. Triples compared: 341; precision of LD experts: 0.8270; precision of MTurk (majority voting, n=5): 0.4752. [Bar chart: number of triples (0 to 140) per data type, with series Experts TP, Experts FP, Crowd TP and Crowd FP; data types: Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified / URI]
  • 15. Precision results: Incorrect link task. Triples compared: 223; precision of the baseline: 0.2598; precision of LD experts: 0.1525; precision of MTurk (majority voting, n=5): 0.9412. We analyzed the 189 misclassifications by the experts. [Pie chart: shares of 11%, 39% and 50% over the categories Freebase links, Wikipedia images and external links] The 6% misclassifications by the workers correspond to pages in a language other than English.
  • 16. Final discussion. RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing mechanisms? Both forms of crowdsourcing can be applied to detect certain LD quality issues. RQ2: What type of crowd is most suitable for each type of quality issue? The effort of LD experts should be applied to tasks that demand domain-specific skills. The MTurk crowd was exceptionally good at performing data comparisons. RQ3: Which types of errors are made by lay users and experts? Lay users do not have the skills to solve domain-specific tasks, while experts' performance is very low on tasks that demand extra effort (e.g., checking an external page).
  • 17. CONCLUSIONS & FUTURE WORK
  • 18. Conclusions & Future Work. A crowdsourcing methodology for LD quality assessment: Find stage: LD experts; Verify stage: MTurk workers. Crowdsourcing approaches are feasible for detecting the studied quality issues. Application: detecting patterns in errors to fix the extraction mechanisms. Future work: conducting new experiments (other quality issues and domains); integration of the crowd into curation processes and tools.
  • 19. References & Acknowledgements. [Bernstein2010] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich. Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, pages 313–322, New York, NY, USA, 2010. ACM. [Kontoskostas2013] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. Knowledge Engineering and the Semantic Web, 2013. [Zaveri2013] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment methodologies for linked open data. Under review, http://www.semantic-web-journal.net/content/quality-assessmentmethodologies-linked-open-data.
  • 20. Approach. Find stage: contest among LD experts (difficult task, final prize) using TripleCheckMate. Verify stage: microtasks for workers (easy task, micropayments) on MTurk, with MTurk tasks for incorrect object, incorrect data type and incorrect outlink. Results (precision). Object values: Linked Data experts 0.7151, MTurk (majority voting) 0.8977. Data types: Linked Data experts 0.8270, MTurk (majority voting) 0.4752. Interlinks: Linked Data experts 0.1525, MTurk (majority voting) 0.9412. QUESTIONS?
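
The Verify stage in the slides aggregates five assignments per triple by majority voting. A minimal sketch of that aggregation, assuming each assignment yields either "correct" or "incorrect"; the function name `majority_vote`, the triple identifiers and the worker answers are hypothetical and not taken from the study:

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the assignments for one triple."""
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(answers)
    # most_common(1) returns a list with the single (label, count) pair
    # that has the highest count.
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical worker answers: two triples, five assignments each.
assignments = {
    "triple-1": ["incorrect", "incorrect", "correct", "incorrect", "correct"],
    "triple-2": ["correct", "correct", "correct", "incorrect", "correct"],
}

verdicts = {triple: majority_vote(votes) for triple, votes in assignments.items()}
print(verdicts)
```

With an odd number of assignments and two possible answers, a tie cannot occur, which is one plausible reason for choosing n=5; a setup with more answer options would need an explicit tie-breaking rule.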

Editor's notes

  1. As we know, the Linking Open Data cloud is a great source of data. However, the varying quality of Linked Data sets often poses serious problems for developers aiming to consume and integrate LD in their applications. Keeping aside the factual flaws of the original sources, several quality issues are introduced during the RDFication process. Solution: Include human verification in the process of LD quality assessment in order to detect the quality issues that cannot be easily detected by other means. Direct application: Detecting patterns in errors may make it possible to identify (and correct) the extraction mechanisms in order
  2. TP = a triple that is identified as “incorrect” by the crowd and is indeed incorrect. FP = a triple that is identified as “incorrect” by the crowd but is actually correct in the data set.
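
Given these definitions, precision is TP / (TP + FP), computed over the triples that the crowd flagged as incorrect. A small sketch under that assumption; the function name `precision` and the triple identifiers are hypothetical, and the gold standard is modelled simply as the set of triples that are truly incorrect:

```python
def precision(flagged_incorrect, gold_incorrect):
    """Precision = TP / (TP + FP) over the triples flagged as incorrect."""
    flagged = set(flagged_incorrect)
    if not flagged:
        raise ValueError("precision is undefined when nothing was flagged")
    # TP: flagged as incorrect and incorrect according to the gold standard.
    true_positives = len(flagged & set(gold_incorrect))
    # FP: flagged as incorrect although the gold standard says correct.
    false_positives = len(flagged) - true_positives
    return true_positives / (true_positives + false_positives)


# Hypothetical data: the crowd flags four triples, three of them rightly.
flagged_by_crowd = {"t1", "t2", "t3", "t4"}
truly_incorrect = {"t1", "t2", "t3", "t5"}

p = precision(flagged_by_crowd, truly_incorrect)
print(p)
```

Triples that are incorrect but were never flagged (here `t5`) do not affect the result; they would only matter for recall, which the slides do not report.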