{RDF} Data Quality Assessment
Connecting the pieces...
Dimitris Kontokostas
Senior Knowledge Engineer
Connected Data London 2018 - Nov 7th 2018
About me...
● Data geek, software engineer & open source enthusiast
● PhD in knowledge extraction and quality assessment
● Involved in graph-related standardization activities (ShEx/SHACL)
● Author of the RDFUnit Java library
● Co-author of “Validating RDF Data” book
● Working on the GeoPhy Real Estate Knowledge Graph
Overview
● Attempt to define data quality
● Identify data quality issues
● Means for tackling them
What is
data quality?
What is
??? quality?
Quality of life...
Quality of OS
Data Quality is multidimensional
Which one is better?
ex:Foo
    a dbo:Person ;
    dbo:birthDate "2000-01-01"^^xsd:date .
ex:Bar
    a foaf:Person ;
    foaf:age 18 .
ex:Baz
    wkd:p31 wk:Q5 ;
    wkd:p569 "2000-01-01"^^xsd:date .
Would you use this information for …
ex:Chickenpox
    a ex:InfectiousDisease ;
    ex:symptoms "rash", "fever", "headache" ;
    ex:treatWithVaccine ex:VaricellaVaccine .
ex:VaricellaVaccine
    a ex:Vaccine ;
    ex:treats ex:Chickenpox, ex:HerpesZoster .
- a visualization?
- a disease website?
- automated treatment?
Data Quality is fitness for use
Data Quality Dimension themes
Accessibility: accessing & retrieving data, in whole or in part
Contextual: depends on the use-case context or consumer preference
Intrinsic: independent of context
Representational: related to data design
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
Accessibility Dimensions
Availability: can you access the data?
Licence: can you use the data?
Performance: can you get the data in reasonable time?
Contextual Dimensions
Relevancy: does it cover your needs?
Trustworthiness: do you trust the publisher?
Understandability: do you understand the data? Is there documentation?
Timeliness: is the data stale or up to date?
Intrinsic Dimensions
Syntactically valid: are there any syntax errors?
Semantically accurate: are there outliers, misused labels?
Consistent: are there inconsistencies?
Concise: are there duplicates, ambiguity and/or NULLs?
Complete: are records or values missing?
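Two of the intrinsic dimensions above can be turned into simple numbers. The sketch below is only an illustration, not a method from the talk: it assumes the data has been loaded as plain (subject, predicate, object) tuples, for example from a serialized dump, and the sample triples and function names are made up.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "dbo:birthDate", "2000-01-01"),
    ("ex:Bar", "rdf:type", "dbo:Person"),
    ("ex:Baz", "rdf:type", "dbo:Person"),
    ("ex:Baz", "dbo:birthDate", "2000-01-01"),
    ("ex:Baz", "dbo:birthDate", "2000-01-01"),  # exact duplicate
]


def completeness(triples, cls, prop):
    """Share of instances of `cls` that have at least one value for `prop`."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    if not instances:
        return 1.0  # nothing to check, so nothing is missing
    covered = {s for s, p, _ in triples if p == prop and s in instances}
    return len(covered) / len(instances)


def duplicates(triples):
    """Triples that occur more than once, with their counts."""
    return {t: n for t, n in Counter(triples).items() if n > 1}
```

Here two of the three persons have a birth date, so completeness for dbo:birthDate is 2/3, and the repeated ex:Baz statement is reported once with a count of 2. Which threshold counts as "good enough" depends on the use case.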
Representational Dimensions
Interoperable: are terms/labels/vocabularies reused?
Interpretable: is it self-descriptive?
Versatile: is it provided in multiple formats / languages?
How good do you need it to get?
There are great costs in:
> assessing the quality of a dataset
> improving the quality of a dataset
Cost is highly dependent on whether:
> data & assets are outside of your control
> data & assets are within your control
> data & assets are bought
Having more or less quality than you need impacts costs and/or the product
[Figure: quality plotted against cost ($)]
Where things
can go wrong
Where data
can go wrong
Where data can go wrong
Source data
Master schema
Mappings
Validation Rules
Identity Resolution
Data Fusion
Source data
can be (semi) unstructured
can be messy
cannot fit into a/the schema
Master schema
Incorrect modeling
Incomplete modeling
Inaccurate translation
> to OWL, RDFS, ShEx, SHACL, etc.
Undesired expressivity
> RDFS, OWL DL/RL/Full, etc.
Mappings
Incorrect mapping
errors scale with the source size (up to millions)
Incomplete mapping
Software bugs
conversion scripts, ETL code, etc
Model sync
> port schema updates
Validation Rules
Incorrect translation
> birthDate max cardinality 1
> birthDate min cardinality 1
Syntax error & typos
> dirthDate must be xsd:date
Model sync
> port schema updates
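The rule failures listed above can be sketched in a few lines. This is an illustration under assumptions, not the speaker's tooling: data is modelled as (subject, predicate, value, datatype) tuples, and all function names and sample values are hypothetical. In practice such rules would usually be written in SHACL or ShEx and run by a validation engine.

```python
from collections import defaultdict

# Hypothetical data: (subject, predicate, value, datatype) tuples.
DATA = [
    ("ex:Foo", "dbo:birthDate", "2000-01-01", "xsd:date"),
    ("ex:Bar", "dbo:birthDate", "2000-01-01", "xsd:date"),
    ("ex:Bar", "dbo:birthDate", "2001-02-03", "xsd:date"),
    ("ex:Baz", "dbo:birthDate", "yesterday", "xsd:string"),
]


def check_max_cardinality(data, prop, limit=1):
    """Subjects that have more than `limit` distinct values for `prop`."""
    values = defaultdict(set)
    for s, p, v, _ in data:
        if p == prop:
            values[s].add(v)
    return sorted(s for s, vs in values.items() if len(vs) > limit)


def check_datatype(data, prop, expected):
    """Subjects with a value for `prop` whose datatype is not `expected`."""
    return sorted({s for s, p, _, dt in data if p == prop and dt != expected})


def rule_is_exercised(data, prop):
    """Guard against typos in rules: does `prop` occur in the data at all?"""
    return any(p == prop for _, p, _, _ in data)
```

A datatype rule written against the misspelled dbo:dirthDate reports no violations at all, because it never matches anything. A check like rule_is_exercised is one cheap way to notice that a rule is silently dead.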
Evolution & quality
See http://aligned-project.eu
Sounds good so
far… now what?
Strategies for managing quality
Data testers
> explicit / implicit roles
Crowdsourcing
> field experts vs MTurk
Executable validation rules
> SHACL, ShEx, OWL
See Acosta et al., Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study
Kontokostas et al., TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data (demo)
> Needs good tool support
> Generic tools missing
> Validation engines improved
Validate closer to the source of the error
See Dimou et al., Assessing and Refining Mappings to RDF to Improve Dataset Quality
Kontokostas et al., Semantically Enhanced Quality Assurance in the JURION Business Use Case
> Always in the K range
> Scales with source size
> Errors scale as well
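A toy comparison shows why validating at the mapping level pays off. Everything here is hypothetical: the mapping format, the property names and the helper functions are assumptions made for illustration, not the approach of the cited papers.

```python
# Hypothetical schema: the properties the master schema defines.
SCHEMA_PROPERTIES = {"dbo:birthDate", "foaf:name"}

# Hypothetical mapping from source columns to properties; one has a typo.
MAPPING = {"name": "foaf:name", "born": "dbo:dirthDate"}


def validate_mapping(mapping, schema_properties):
    """Return mapping entries that target a property the schema lacks."""
    return sorted(
        (col, prop) for col, prop in mapping.items()
        if prop not in schema_properties
    )


def apply_mapping(rows, mapping):
    """Generate (subject, property, value) triples from source rows."""
    return [
        (f"ex:row{i}", mapping[col], value)
        for i, row in enumerate(rows)
        for col, value in row.items()
        if col in mapping
    ]


def validate_triples(triples, schema_properties):
    """Return generated triples that use a property the schema lacks."""
    return [t for t in triples if t[1] not in schema_properties]


rows = [{"name": f"Person {i}", "born": "2000-01-01"} for i in range(1000)]
mapping_errors = validate_mapping(MAPPING, SCHEMA_PROPERTIES)
data_errors = validate_triples(apply_mapping(rows, MAPPING), SCHEMA_PROPERTIES)
print(len(mapping_errors), len(data_errors))  # prints "1 1000"
```

The same typo shows up once when the mapping is checked and a thousand times when the generated data is checked. The number of mapping rules stays small, while the number of data errors grows with the source.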
Automate, automate & automate...
Schema.ttl:
ex:name
    a rdf:Property ;
    rdfs:range rdf:langString .
Data.ttl:
ex:Foo
    a dbo:Person ;
    # ex:name "Foo @en" .    (wrong: the language tag is inside the string)
    ex:name "Foo"@en .
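A check for this kind of mistake is easy to automate. The sketch below is a heuristic under assumptions: each literal is already split into its lexical form and its language tag (None when absent), and the function name and regular expression are illustrative only. It will also flag legitimate strings that happen to end in something like "@en".

```python
import re

# A trailing "@xx" or "@xx-YY" inside the string suggests a misplaced tag.
MISPLACED_TAG = re.compile(r"\s*@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*\s*$")


def check_lang_string(lexical_form, lang):
    """Return a list of problems for one value of a langString property.

    `lang` is the literal's language tag, or None if it has none.
    """
    problems = []
    if lang is None:
        problems.append("missing language tag")
    if MISPLACED_TAG.search(lexical_form):
        problems.append("language tag appears inside the string")
    return problems
```

For the slide's example, check_lang_string("Foo @en", None) reports both problems, while check_lang_string("Foo", "en") returns an empty list.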
CI/CD is your best friend
Treat data as code
> Jenkins, Travis, GitLab, TeamCity, ...
Trigger validation on every (single) change
> Fail the build until data issues are fixed
Create (data) integration tests
Just like in software…
> Green CI ≠ no errors/bugs
> Green CI => Not enough tests
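Failing the build on data issues can be as simple as turning a violation count into an exit code. This is a minimal sketch with made-up data and names; a real job would load the dataset under test and would more likely call a SHACL or ShEx engine than hand-written checks.

```python
def count_violations(triples, required):
    """Count instances that lack a property their class requires.

    `required` maps a class to the properties its instances must have.
    """
    types, props = {}, {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        else:
            props.setdefault(s, set()).add(p)
    violations = 0
    for s, classes in types.items():
        for cls in classes:
            for prop in required.get(cls, ()):
                if prop not in props.get(s, set()):
                    violations += 1
    return violations


def exit_code(triples, required):
    """Return a process exit code: 0 when clean, 1 when violations exist."""
    n = count_violations(triples, required)
    print(f"{n} violation(s) found")
    return 1 if n else 0


# Hypothetical inline data and rules, for illustration only.
SAMPLE = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "dbo:birthDate", "2000-01-01"),
    ("ex:Bar", "rdf:type", "dbo:Person"),
]
REQUIRED = {"dbo:Person": ["dbo:birthDate"]}

# In a CI job the script would end with: sys.exit(exit_code(SAMPLE, REQUIRED))
```

With these rules ex:Bar has no birth date, so the exit code is 1 and the build fails. With an empty rule set the same data passes, which is the point of the slide: a green build says only that the existing tests passed.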
Recap
> Data quality is fitness for use
> Can be assessed with multiple dimensions
> Identify the quality you need
> Also look for errors in the schema, the rules and the mappings
> Validate closer to the error source
> Automate as much as possible
Thank you! Questions?
@jimkont
kontokostas.com
slideshare.net/jimkont

Weitere ähnliche Inhalte

Was ist angesagt?

What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?andimou
 
Fried data summit data quality data analytics together
Fried data summit data quality data analytics togetherFried data summit data quality data analytics together
Fried data summit data quality data analytics togetherJeff Fried
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsConnected Data World
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
Big Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesBig Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesSrinath Srinivasa
 
Adding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIsAdding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIsMichael Petychakis
 
Quality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databasesQuality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databasesieeechennai
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...semanticsconference
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphGraphAware
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine LearningDatabricks
 
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge ScientistEthics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge ScientistStratos Kontopoulos
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...semanticsconference
 
The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...George Anadiotis
 
RDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 frameworkRDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 frameworkKhan Mostafa
 
Integrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionIntegrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionJuan Sequeda
 
Iterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refineIterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refineMartin Magdinier
 

Was ist angesagt? (20)

What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?
 
Fried data summit data quality data analytics together
Fried data summit data quality data analytics togetherFried data summit data quality data analytics together
Fried data summit data quality data analytics together
 
Semantic web an overview and projects
Semantic web   an  overview and projectsSemantic web   an  overview and projects
Semantic web an overview and projects
 
Sebastian Hellmann
Sebastian HellmannSebastian Hellmann
Sebastian Hellmann
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analytics
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Big Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesBig Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and Opportunities
 
Adding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIsAdding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIs
 
Quality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databasesQuality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databases
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge Graph
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine Learning
 
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge ScientistEthics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
 
Ethical solutions services
Ethical solutions servicesEthical solutions services
Ethical solutions services
 
The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...
 
RDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 frameworkRDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 framework
 
Integrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionIntegrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A Reflection
 
Iterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refineIterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refine
 

Ähnlich wie RDF Data Quality Assessment - connecting the pieces

Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentAmrapali Zaveri, PhD
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesCarl Anderson
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyRTTS
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Lucidworks
 
Spark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit
 
Data Quality
Data QualityData Quality
Data QualityVijaya K
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...Pieter De Leenheer
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
 
Fried data summit big data for lob content
Fried data summit big data for lob contentFried data summit big data for lob content
Fried data summit big data for lob contentJeff Fried
 
Data quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data qualityData quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data qualityJaveriaGauhar
 
STC Information Topology
STC Information TopologySTC Information Topology
STC Information TopologyTyrinAvery1
 
TechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdfTechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdfdebbieholley1
 
Human vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfHuman vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfDawn Anderson MSc DigM
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Lucidworks
 

Ähnlich wie RDF Data Quality Assessment - connecting the pieces (20)

Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practices
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Spark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren Nathan
 
Data Quality
Data QualityData Quality
Data Quality
 
Data Quality
Data QualityData Quality
Data Quality
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Fried data summit big data for lob content
Fried data summit big data for lob contentFried data summit big data for lob content
Fried data summit big data for lob content
 
Data quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data qualityData quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data quality
 
Measurement And Validation
Measurement And ValidationMeasurement And Validation
Measurement And Validation
 
KBART update ER&L 2009
KBART update ER&L 2009KBART update ER&L 2009
KBART update ER&L 2009
 
ER&L KBART Update
ER&L KBART UpdateER&L KBART Update
ER&L KBART Update
 
STC Information Topology
STC Information TopologySTC Information Topology
STC Information Topology
 
TechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdfTechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdf
 
Human vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfHuman vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdf
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 

Mehr von Connected Data World

Systems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van HarmelenSystems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van HarmelenConnected Data World
 
Graph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora LassilaGraph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora LassilaConnected Data World
 
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...Connected Data World
 
How to get started with Graph Machine Learning
How to get started with Graph Machine LearningHow to get started with Graph Machine Learning
How to get started with Graph Machine LearningConnected Data World
 
The years of the graph: The future of the future is here
The years of the graph: The future of the future is hereThe years of the graph: The future of the future is here
The years of the graph: The future of the future is hereConnected Data World
 
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2Connected Data World
 
From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3Connected Data World
 
In Search of the Universal Data Model
In Search of the Universal Data ModelIn Search of the Universal Data Model
In Search of the Universal Data ModelConnected Data World
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseConnected Data World
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Connected Data World
 
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Connected Data World
 
Semantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleSemantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleConnected Data World
 
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Connected Data World
 
Schema, Google & The Future of the Web
Schema, Google & The Future of the WebSchema, Google & The Future of the Web
Schema, Google & The Future of the WebConnected Data World
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsConnected Data World
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsConnected Data World
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
 
Graph for Good: Empowering your NGO
Graph for Good: Empowering your NGOGraph for Good: Empowering your NGO
Graph for Good: Empowering your NGOConnected Data World
 

Mehr von Connected Data World (20)

Systems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van HarmelenSystems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van Harmelen
 
Graph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora LassilaGraph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora Lassila
 
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
 
How to get started with Graph Machine Learning
How to get started with Graph Machine LearningHow to get started with Graph Machine Learning
How to get started with Graph Machine Learning
 
Graphs in sustainable finance
Graphs in sustainable financeGraphs in sustainable finance
Graphs in sustainable finance
 
The years of the graph: The future of the future is here
The years of the graph: The future of the future is hereThe years of the graph: The future of the future is here
The years of the graph: The future of the future is here
 
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
 
From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3
 
In Search of the Universal Data Model
In Search of the Universal Data ModelIn Search of the Universal Data Model
In Search of the Universal Data Model
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
 
Graph Realities
Graph RealitiesGraph Realities
Graph Realities
 
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
 
Semantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleSemantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scale
 
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
 
Schema, Google & The Future of the Web
Schema, Google & The Future of the WebSchema, Google & The Future of the Web
Schema, Google & The Future of the Web
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property Graphs
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
 
Graph for Good: Empowering your NGO
Graph for Good: Empowering your NGOGraph for Good: Empowering your NGO
Graph for Good: Empowering your NGO
 

RDF Data Quality Assessment - connecting the pieces

  • 1. {RDF} Data Quality Assessment Connecting the pieces... Dimitris Kontokostas Senior Knowledge Engineer Connected Data London 2018 - Nov 7th 2018
  • 2. About me... ● Data geek, software engineer & open source enthusiast ● PhD in knowledge extraction and quality assessment ● Involved in graph-related standardization activities (ShEx/SHACL) ● Author of the RDFUnit Java library ● Co-author of “Validating RDF Data” book ● Working on the GeoPhy Real Estate Knowledge Graph
  • 3. Overview ● Attempt to define data quality ● Identify data quality issues ● Means for tackling them
  • 9. Which one is better? ex:Foo a dbo:Person ; dbo:birthDate "2000-01-01"^^xsd:date . ex:Bar a foaf:Person ; foaf:age 18 . ex:Baz wkd:p31 wk:Q5 ; wkd:p569 "2000-01-01"^^xsd:date .
  • 10. Would you use this information for … ex:Chickenpox a ex:InfectiousDisease ; ex:symptoms "rash", "fever", "headache" ; ex:treatWithVaccine ex:VaricellaVaccine . ex:VaricellaVaccine a ex:Vaccine ; ex:treats ex:Chickenpox, ex:HerpesZoster . - a visualization? - a disease website? - automated treatment?
  • 11. Data Quality is fitness for use
  • 12. Data Quality Dimension themes Accessibility: accessing & retrieving data, in whole or in part Contextual: depends on the use-case context or consumer preference Intrinsic: independent of context Representational: related to data design See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 13. Accessibility Dimensions Availability: can you access the data? Licence: can you use the data? Performance: can you get the data in reasonable time? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 14. Contextual Dimensions Relevancy: does it cover your needs? Trustworthiness: do you trust the publisher? Understandability: do you understand the data? Is there documentation? Timeliness: is the data stale or up to date? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 15. Intrinsic Dimensions Syntactically valid: are there any syntax errors? Semantically accurate: are there outliers, misused labels? Consistent: are there inconsistencies? Concise: are there duplicates and/or ambiguity, NULLs? Complete: are records or values missing? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 16. Representational Dimensions Interoperable: are terms/labels/vocabularies reused? Interpretable: is it self-descriptive? Versatile: is it provided in multiple formats / languages? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 17. How good do you need it to get? There are great costs in: > assessing the quality of a dataset > improving the quality of a dataset Cost is highly dependent on whether: > data & assets are outside of your control > data & assets are within your control > data & assets are bought Delivering more or less quality than you need impacts costs and/or the product (chart: quality vs. cost in $)
  • 20. Where data can go wrong Source data Master schema Mappings Validation Rules Identity Resolution Data Fusion
  • 21. Source data can be (semi) unstructured can be messy cannot fit into a/the schema
  • 22. Master schema Incorrect modeling Incomplete modeling Inaccurate translation > to owl, rdfs, ShEx, SHACL, etc Undesired expressivity > RDFS, OWL: DL/RL/FULL, etc
  • 23. Mappings Incorrect mapping errors scale to the source size (up to millions) Incomplete mapping Software bugs conversion scripts, ETL code, etc Model sync > port schema updates
  • 24. Validation Rules Incorrect translation > birthDate max cardinality 1 > birthDate min cardinality 1 Syntax error & typos > dirthDate must be xsd:date Model sync > port schema updates
  • 27. Strategies for managing quality Data testers > explicit / implicit roles Crowdsourcing > field experts vs MTurk Executable validation rules > SHACL, ShEx, OWL See Acosta et al. Detecting Linked Data quality issues via crowdsourcing: A DBpedia study Kontokostas et al. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data / Demo > Needs good tool support > Generic tools missing > Validation engines improved
  • 28. Validate closer to the source of the error See Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality Kontokostas et al. Semantically Enhanced Quality Assurance in the JURION Business Use Case > Always in the K range > Scales with source size > Errors scale as well
  • 29. Automate, automate & automate... ex:name a rdf:Property ; rdfs:range rdf:langString . Schema.ttl ex:Foo a dbo:Person ; ex:name "Foo @en" . Data.ttl
  • 30. Automate, automate & automate... ex:name a rdf:Property ; rdfs:range rdf:langString . Schema.ttl ex:Foo a dbo:Person ; ex:name "Foo @en" . ex:name "Foo"@en . Data.ttl
  • 31. CI/CD is your best friend Treat data as code > Jenkins, Travis, GitLab, TeamCity, ... Trigger validation on every (single) change > Fail the build until data issues are fixed Create (data) integration tests Just like in software… > Green CI <> No Errors/bugs > Green CI => Not enough tests
  • 32. Recap > Data quality is fitness for use > Can be assessed with multiple dimensions > Identify the quality you need > Also look for errors in the schema, the rules and the mappings > Validate closer to the error source > Automate as much as possible
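
The Schema.ttl / Data.ttl pair on slides 29 and 30 suggests a check that can be automated: a property whose range is rdf:langString should only carry language-tagged literals. The following is a minimal sketch of that idea in plain Python, not the RDFUnit implementation. The `Literal` class, the `find_lang_violations` function, and the in-memory tuple representation of triples are assumptions made for illustration; a real pipeline would parse the Turtle files with an RDF library.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANG_STRING = "rdf:langString"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal: lexical value plus optional language tag."""
    value: str
    lang: Optional[str] = None


def find_lang_violations(ranges, triples):
    """Return triples whose predicate has range rdf:langString but whose
    object is not a language-tagged literal."""
    violations = []
    for subject, predicate, obj in triples:
        if ranges.get(predicate) != RDF_LANG_STRING:
            continue  # no langString constraint declared for this property
        if not isinstance(obj, Literal) or not obj.lang:
            violations.append((subject, predicate, obj))
    return violations


# Schema.ttl from the slide: ex:name rdfs:range rdf:langString
ranges = {"ex:name": RDF_LANG_STRING}

# Data.ttl from slide 29: the tag ended up inside the string, so there is no language tag
bad_data = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "ex:name", Literal("Foo @en")),
]

# The "Foo"@en form shown on slide 30
good_data = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "ex:name", Literal("Foo", lang="en")),
]

bad_violations = find_lang_violations(ranges, bad_data)
good_violations = find_lang_violations(ranges, good_data)
```

Here `bad_violations` holds the single `ex:name` triple whose language tag leaked into the string, while `good_violations` is empty.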
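
The "birthDate max cardinality 1" rule from slide 24 and the "fail the build" advice from slide 31 can be combined in a small sketch. This is only an illustration under stated assumptions: `check_max_count` and `ci_exit_code` are hypothetical helper names, triples are plain tuples rather than parsed RDF, and in practice such a rule would more likely be written in SHACL or ShEx and run by a validation engine.

```python
from collections import defaultdict


def check_max_count(triples, predicate, max_count=1):
    """Return {subject: sorted values} for subjects that have more than
    max_count distinct values for the given predicate."""
    values = defaultdict(set)  # an RDF graph is a set, so duplicate triples collapse
    for subject, pred, obj in triples:
        if pred == predicate:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > max_count}


def ci_exit_code(violations):
    """Map validation results to a process exit code: non-zero fails the build."""
    return 1 if violations else 0


# Illustrative data: ex:Foo has two birth dates, ex:Bar has one
data = [
    ("ex:Foo", "dbo:birthDate", "2000-01-01"),
    ("ex:Foo", "dbo:birthDate", "2001-01-01"),
    ("ex:Bar", "dbo:birthDate", "1990-05-17"),
]

violations = check_max_count(data, "dbo:birthDate")
exit_code = ci_exit_code(violations)
```

A CI job (Jenkins, Travis, GitLab, and so on) could run a script like this on every change and pass `exit_code` to `sys.exit`, so the build stays red until the data issue is fixed.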