Towards a Vocabulary for Data Quality Management (DQM) in Semantic Web Architectures
(Research in Progress)

Christian Fürber and Martin Hepp
christian@fuerber.com, mhepp@computer.org

Presentation @ 1st International Workshop on Linked Web Data Management,
March 25th, 2011, Uppsala, Sweden

Part 1: What's the Problem?



C. Fürber, M. Hepp:                         2
Towards a Vocabulary for DQM
In SemWeb Architectures
Various Data Quality Problems
• Inconsistent duplicates
• Invalid characters
• Missing classification
• Incorrect reference
• Approximate duplicates
• Character alignment violation
• Word transpositions
• Invalid substrings
• Mistyping / misspelling errors
• Cardinality violation
• Missing values
• Referential integrity violation
• Misfielded values
• Unique value violation
• False values
• Functional dependency violation
• Out of range values
• Imprecise values
• Existence of homonyms
• Meaningless values
• Incorrect classification
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements
• Untyped literals
• Outdated values

Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/


The Problem
• Negative population
• Weird population values
• Invalid URLs

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql



Part 2: What are high-quality data?



What is Data Quality?
• Data's "fitness for use by data consumers" (Wang, Strong 1996)
• "Conformance to specification" (Kahn et al. 2002)
• "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001)

• Requirements serve as the "benchmark"
Perspective-Neutral Data Quality


Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.



Quality Requirements and the Problem
• Requirement: Population cannot be negative. Problem: negative population
• Requirement: Population is indicated by numeric values. Problem: weird population values
• Requirement: URLs usually start with http://, https://, etc. Problem: invalid URLs

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
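
The three requirements on this slide can be turned into executable checks. The following is a minimal Python sketch, not the authors' implementation: the sample records, the field names, and the helpers `is_numeric` and `violations` are hypothetical, and only http and https are accepted as URL schemes even though the slide's "etc." leaves room for more.

```python
from urllib.parse import urlparse

# Hypothetical records; names and values are illustrative only and are
# not taken from the SPARQL endpoint mentioned on the slide.
records = [
    {"name": "A", "population": "1200", "homepage": "http://example.org/a"},
    {"name": "B", "population": "-50", "homepage": "https://example.org/b"},
    {"name": "C", "population": "unknown", "homepage": "example.org/c"},
]


def is_numeric(value):
    """Return True if the string can be parsed as a number.

    Note: float() also accepts strings such as "nan" or "inf"; a stricter
    check may be needed for real data.
    """
    try:
        float(value)
        return True
    except ValueError:
        return False


def violations(record):
    """List the quality requirements that a single record violates."""
    found = []
    population = record["population"]
    if not is_numeric(population):
        found.append("population is not numeric")
    elif float(population) < 0:
        found.append("population is negative")
    # Assumption: only http and https count as valid schemes here.
    if urlparse(record["homepage"]).scheme not in ("http", "https"):
        found.append("URL does not start with http:// or https://")
    return found


report = {record["name"]: violations(record) for record in records}
for name, problems in report.items():
    print(name, problems)
```

Each check encodes one requirement, so the same data can be judged against a different set of requirements simply by swapping the checks.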



Satisfying Quality Requirements
• Problem 1: Expressing quality requirements (the desired state as seen by individuals, groups, standards, etc.)
• Problem 2: Harmonizing requirements
• Problem 3: Satisfying requirements (status quo = desired state)
Part 3: Research Goal



Major Research Goal
 • Represent Quality-Relevant information for
   automated…
                       – Data Quality Monitoring
                       – Data Quality Assessment
                       – Data Cleansing
                       – Filtering of High Quality Data

                                 …in a standardized vocabulary.


Motives for DQM-Vocabulary
• Help people explicitly express data quality requirements in the "same language" at Web scale
• Support the creation of consensual agreements on quality requirements
• Reduce effort for DQM activities
• Raise transparency about assumed quality requirements
• Enable consistency checks among quality requirements
Part 4: Our Approach



Basic Architecture (layers, top to bottom)
• Problem Classification, Assessment Scores, HQ Data Retrieval, Cleansed Data
• SPARQL-Query-Engine, DQM-Vocabulary
• Knowledgebase
• Data Acquisition: RDB A, RDB B

Main Concepts of DQM-Vocabulary
• Classify quality problems
• Express requirements
• Annotate quality scores
• Express cleansing tasks
• Account for task-dependent requirements
Data Quality Problem Types: Source for Potential Requirements
• Inconsistent duplicates
• Invalid characters
• Missing classification
• Incorrect reference
• Character alignment violation
• Approximate duplicates
• Word transpositions
• Invalid substrings
• Mistyping / misspelling errors
• Cardinality violation
• Missing values
• Referential integrity violation
• Misfielded values
• Unique value violation
• False values
• Functional dependency violation
• Out of range values
• Imprecise values
• Existence of homonyms
• Meaningless values
• Incorrect classification
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements
• Outdated values
Data Quality Requirements
• Syntactical rules
• Semantic rules
• Redundancy rules
• Completeness rules
• Timeliness rules




Quality-Influencing Artifacts
• Data (current focus of DQM-Vocabulary)




Design Alternatives: Statements about Classes & Properties

(1) Using classes and properties as subjects

(2) Using datatype properties with xsd:anyURI

(3) Mapping class and property URIs to new URIs
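
To make the three alternatives concrete, the sketch below writes one quality statement about a property in each style, using plain Python tuples as (subject, predicate, object) triples. It is only an illustration under stated assumptions: the `ex:` predicate names, the rule identifier, the example URIs, and the `Literal` helper are hypothetical and are not taken from the DQM-Vocabulary.

```python
from collections import namedtuple

# Minimal stand-in for a typed RDF literal (hypothetical helper).
Literal = namedtuple("Literal", ["value", "datatype"])

XSD_ANYURI = "http://www.w3.org/2001/XMLSchema#anyURI"
FOO_COUNTRY = "http://example.org/foo#country"  # hypothetical property URI

# (1) The property URI itself is the subject of the quality statement.
alt1 = [(FOO_COUNTRY, "ex:hasRule", "ex:rule1")]

# (2) The rule is the subject; the property is only named inside a
#     literal value typed as xsd:anyURI.
alt2 = [("ex:rule1", "ex:testedProperty", Literal(FOO_COUNTRY, XSD_ANYURI))]

# (3) A newly minted URI stands in for the property and becomes the
#     subject; the mapping back to the original URI is kept separately.
uri_map = {FOO_COUNTRY: "http://example.org/dq/country"}
alt3 = [(uri_map[FOO_COUNTRY], "ex:hasRule", "ex:rule1")]


def resources_used_as_subject(triples):
    """Return the set of resources that appear in subject position."""
    return {subject for subject, _, _ in triples}


for label, triples in (("1", alt1), ("2", alt2), ("3", alt3)):
    print(label, FOO_COUNTRY in resources_used_as_subject(triples))
```

Only alternative (1) places the original property URI in subject position; alternatives (2) and (3) refer to it indirectly, through a literal or through a substitute URI.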


Part 5: Application Examples



Example 1: Legal Value Rule (1/3)


Which instances have illegal values for property foo:country?




Example 1: Legal Value Rule (2/3)
Diagram (legend: class, instance, literal value)
• Class: dqm:LegalValueRule
• Instance: foo:LegalValueRule_1
• Literal values attached to foo:LegalValueRule_1: "tref:Countries", "tref:countryName", "foo:Countries", "foo:countryName"
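
One plausible reading of this rule is that a value of foo:countryName on an instance of foo:Countries is legal only if the same value occurs as tref:countryName on some instance of tref:Countries. The Python sketch below assumes that reading (tref as the trusted reference, foo as the tested data). The sample triples, the dictionary keys of `rule`, and the helper names are hypothetical, and the sketch stands in for a SPARQL query rather than reproducing the one used in the talk.

```python
# Triples as (subject, predicate, object); all instance data is made up.
triples = [
    ("tref:c1", "rdf:type", "tref:Countries"),
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", "rdf:type", "tref:Countries"),
    ("tref:c2", "tref:countryName", "Sweden"),
    ("foo:x1", "rdf:type", "foo:Countries"),
    ("foo:x1", "foo:countryName", "Germany"),
    ("foo:x2", "rdf:type", "foo:Countries"),
    ("foo:x2", "foo:countryName", "Sweeden"),  # misspelled on purpose
]

# The rule holds class and property names as plain strings, mirroring the
# literal values on the slide. The key names are assumptions.
rule = {
    "id": "foo:LegalValueRule_1",
    "trusted_class": "tref:Countries",
    "trusted_property": "tref:countryName",
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
}


def values_of(triples, cls, prop):
    """Return (instance, value) pairs of prop for all instances of cls."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    return [(s, o) for s, p, o in triples if p == prop and s in instances]


def illegal_instances(triples, rule):
    """Return tested (instance, value) pairs whose value is not legal."""
    legal = {
        value
        for _, value in values_of(
            triples, rule["trusted_class"], rule["trusted_property"]
        )
    }
    tested = values_of(triples, rule["tested_class"], rule["tested_property"])
    return [(s, value) for s, value in tested if value not in legal]


print(illegal_instances(triples, rule))
```

Because the rule is itself data, the same `illegal_instances` function can serve any number of legal value rules without code changes.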



Example 1: Legal Value Rule (3/3)




Example 2: DQ-Assessment (1/2)


How syntactically accurate are all properties that are subject to LegalValueRules?
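
A simple way to answer this question is to report, per property, the share of tested values that are legal. The sketch below assumes that definition of syntactic accuracy; the talk's actual metric is not shown in the slides. The property names other than foo:countryName, the sample values, and the function name are hypothetical.

```python
def syntactic_accuracy(tested_values, legal_values):
    """Share of tested values that appear among the legal values.

    Returns None when there is nothing to assess, so that an empty
    property is not reported as perfectly accurate.
    """
    if not tested_values:
        return None
    legal = set(legal_values)
    matching = sum(1 for value in tested_values if value in legal)
    return matching / len(tested_values)


# Hypothetical input: one entry per property that has a legal value rule.
rules = {
    "foo:countryName": {
        "tested": ["Germany", "Sweeden", "Sweden", "Germany"],
        "legal": ["Germany", "Sweden"],
    },
    "foo:currency": {
        "tested": ["EUR", "SEK"],
        "legal": ["EUR", "SEK", "USD"],
    },
}

scores = {
    prop: syntactic_accuracy(entry["tested"], entry["legal"])
    for prop, entry in rules.items()
}
for prop, score in scores.items():
    print(prop, score)
```

Scores computed this way could then be attached to the data as quality annotations, which is one of the vocabulary's stated concepts.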




Example 2: DQ-Assessment (2/2)




Part 6: Conclusions & Planned Work


Advantages of DQM-Vocabulary

• Minimizes human effort for DQM
• Web-scale sharing/reuse of data quality requirements
• Consistency checks among data quality requirements
• Transparency about applied data quality rules
Limitations
• Representation of complex functional dependency rules and derivation rules
• Limited experience with real-world data sets
• Currently no dedicated concepts for classes and properties
• Research still in progress


Future Work
• Evaluation of design alternatives
• Development of a processing framework
• Representation of more complex functional dependency rules / derivation rules
• Extension of the DQM-Vocabulary
• Evaluation on real-world data sets
• Publication at http://semwebquality.org
Christian Fürber
Researcher
E-Business & Web Science Research Group

Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany

skype: c.fuerber
email: christian@fuerber.com
web: http://www.unibw.de/ebusiness
homepage: http://www.fuerber.com
twitter: http://www.twitter.com/cfuerber




Paper available at http://bit.ly/gYEDdQ