This presentation is supplementary material for the research paper "Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business" (authored by Anastasija Nikiforova and Natalija Kozmina), presented at the International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021, Tartu, Estonia (web-based).
Read paper here -> Nikiforova, A., & Kozmina, N. (2021, November). Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business. In 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 66-73). IEEE -> https://ieeexplore.ieee.org/abstract/document/9660802?casa_token=LFJa20LrXAwAAAAA:wVwhTcCPWqxdloAvDQ3-l98KkkLx70xzG3zNvIIkJbC6wvJ4VxwX_VGc3mmW_7c1T-QJlOtTiao
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business
1. STAKEHOLDER-CENTRED IDENTIFICATION OF
DATA QUALITY ISSUES:
KNOWLEDGE THAT CAN SAVE YOUR BUSINESS
The International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021)
November 15-16, 2021. Tartu, Estonia (web-based)
Anastasija Nikiforova, Natalija Kozmina
"Innovative Information Technologies" Laboratory, Programming Department
Faculty of Computing, University of Latvia
2. AIM & RESEARCH QUESTIONS
(RQ1) What are the main data quality issues to be considered when conducting data quality analysis?
(RQ2) What do users with advanced data quality knowledge think of a list of defined data quality issues and requirements as
a result of the literature analysis, i.e., are all these issues important in their view?
(RQ3) Are the data quality requirements identified while answering previous RQs valid for real-world data?
(RQ4) What is the list of data quality requirements to be included in the data quality analysis and in the
specification of the data quality tool?
The goal of this study is to determine the most common data quality issues (i.e.,
defects) that affect users' experience with data and their reuse, as well as intent for
their use in the future, potentially resulting in financial losses for businesses.
19% of businesses lost customers in 2019 because they used inaccurate or incomplete data
("Global Marketing Alliance, The cost of bad data: have you done the math?", 2020)
The 2020 edition of the "Magic Quadrant for Data Quality Solutions" found that organizations estimate the average cost of poor data quality at more than $12 million per year
(Gartner Magic Quadrant for Data Quality Solutions, 2020)
3. RELATED RESEARCH
General studies on data and information quality define different dimensions of quality and their groupings:
- The key data quality dimensions are not universally agreed upon*;
- There is no agreement on their meanings and usability**;
- Each dimension can be supplied with one or more metrics that vary from one solution to another;
- The number of data quality dimensions, their definitions and their grouping are often useful only for a particular solution.
Question: how can a particular dimension (and which one?) be related to a particular use case?
* «… This state of affairs has led to much confusion within the data quality community and is even more bewildering for those who are new to the discipline and more importantly to business stakeholders…» (DAMA UK, 2018)
** In different proposals, dimensions of the same name can have different semantics and vice versa. (Batini, 2016)
5. DATA QUALITY DIMENSIONS: 2-STEP IDENTIFICATION
Step Ia: results of the literature review
(Laranjeiro et al., 2015) - 22 studies
(Scannapieco et al., 2002) - 6 studies
(ISO/IEC, 2008)
(Torchiano et al., 2017)
(Rafique et al., 2012)
(Askham et al., 2013)
(Utamachant et al., 2018)
(Wang and Strong, 1996)
Step Ib: results of the brainstorming session, identifying and removing duplicates (30 DQ-users)
1. accuracy/correctness
2. objectivity
3. reputation/traceability
4. believability/credibility
5. timeliness
6. completeness
7. relevancy
8. value-added
9. interpretability
10. access security
11. currentness
12. representational consistency
13. consistency/concise representation
14. accessibility
15. precision
16. efficiency
17. recoverability
18. portability
19. response time
20. adequacy
21. confidentiality (privacy, security)
22. understandability (ease of understanding, interpretability)
Step II: results of DELPHI analysis (12 experts)
1. accuracy/correctness
2. traceability
3. believability/credibility
4. timeliness, currentness
5. completeness
6. consistency
7. accessibility
8. confidentiality/privacy, security
9. understandability (ease of understanding, clarity, interpretability)
6. DATA QUALITY DIMENSIONS AND ASSOCIATED DATA QUALITY ISSUES IDENTIFIED (PART I)
Dimension* | Level (DT/DS) | Data quality issue associated
accuracy/correctness | DT | Incorrect/inaccurate values that do not belong to the domain; Misspelling; Precision; Special characters; Duplicates/uniqueness violations; Incorrect references; Different aggregation levels
traceability | DS, DT | untraceable
believability/credibility | DS | non-credible
timeliness, currentness | DS, DT | Outdated temporal data
completeness | DT | Missing value
... | ... | ...
*For the definition of each dimension used, please refer to the article.
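A few of the data-level issues listed above (missing values, uniqueness violations, special characters) can be illustrated with a minimal Python sketch. This is not the authors' tool or method: the sample records, the field names, the function names and the definition of a "special character" are all assumptions made only for illustration.

```python
import re

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": "1", "name": "Riga", "value": "10"},
    {"id": "2", "name": "Tartu", "value": ""},
    {"id": "2", "name": "Liep?ja#", "value": "7"},
]

def missing_values(rows, field):
    """Completeness: indexes of rows where the field is absent, None or blank."""
    return [i for i, r in enumerate(rows)
            if not str(r.get(field) or "").strip()]

def duplicate_values(rows, field):
    """Accuracy: values that occur more than once (uniqueness violation)."""
    seen, dupes = set(), set()
    for r in rows:
        v = r.get(field)
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return dupes

# Assumption: anything other than letters, digits, underscore, whitespace,
# hyphen, apostrophe and full stop counts as a "special character".
SPECIAL = re.compile(r"[^\w\s\-'.]")

def special_characters(rows, field):
    """Indexes of rows whose field contains a special character."""
    return [i for i, r in enumerate(rows)
            if SPECIAL.search(str(r.get(field, "")))]

print(missing_values(records, "value"))      # rows with a missing value
print(duplicate_values(records, "id"))       # repeated identifiers
print(special_characters(records, "name"))   # rows with special characters
```

In a real analysis the allowed character set and the key fields would depend on the data set and on the stakeholder's use case.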
7. DATA QUALITY DIMENSIONS AND ASSOCIATED DATA QUALITY ISSUES IDENTIFIED (PART II)
Dimension | Level (DT/DS) | Data quality issue associated
... | ... | ...
consistency | DS, DT | Different representations (intra-relational constraint); Different word orderings between values of one attribute; Use of synonyms / multiple notation for one object in scope of one attribute; Use of synonyms / multiple notation for one object in scope of different datasets; Different encoding formats, Wrong data type; Different aggregation levels; Different units; Special characters
accessibility | DS | Special characters; Misspelling, Different encoding formats; Different aggregation levels; Different units; Use of synonyms / multiple notation for one object in scope of different datasets; Bulk download
confidentiality/privacy, security | DS | unsecure / non-confidential
understandability (ease of understanding, clarity, interpretability) | DS, DT | unclear
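Two of the consistency issues above, different word orderings and multiple notations for one object within an attribute, could be detected along the following lines. This is a hedged sketch only: the function names, the sample values and the synonym mapping are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def word_order_variants(values):
    """Group values that consist of the same words in a different order."""
    groups = defaultdict(set)
    for v in values:
        # Order-insensitive key: lower-cased words, sorted alphabetically.
        key = tuple(sorted(v.lower().replace(",", " ").split()))
        groups[key].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

def multiple_notations(values, synonyms):
    """Find objects written in more than one notation.

    `synonyms` is an assumed, hand-made mapping notation -> canonical name.
    """
    used = defaultdict(set)
    for v in values:
        used[synonyms.get(v, v)].add(v)
    return {canon: sorted(forms) for canon, forms in used.items()
            if len(forms) > 1}

# Hypothetical attribute values.
print(word_order_variants(["Smith John", "John Smith", "Jane Doe"]))
print(multiple_notations(["LV", "Latvia", "Estonia"], {"LV": "Latvia"}))
```

The synonym mapping is the part that needs domain knowledge; the same function could be applied across data sets by pooling the values of the corresponding attributes.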
8. DATA QUALITY DIMENSIONS: STEP III
Step I: results of the literature review
(Laranjeiro et al., 2015) - 22 studies
(Scannapieco et al., 2002) - 6 studies
(ISO/IEC, 2008)
(Torchiano et al., 2017)
(Rafique et al., 2012)
(Askham et al., 2013)
(Utamachant et al., 2018)
(Wang and Strong, 1996)
Step II: results of the brainstorming session, identifying and removing duplicates (30 DQ-users)
1. accuracy/correctness
2. objectivity
3. reputation/traceability
4. believability/credibility
5. timeliness
6. completeness
7. relevancy
8. value-added
9. interpretability
10. access security
11. currentness
12. representational consistency
13. consistency/concise representation
14. accessibility
15. precision
16. efficiency
17. recoverability
18. portability
19. response time
20. adequacy
21. confidentiality (privacy, security)
22. understandability (ease of understanding, interpretability)
Step III: results of DELPHI analysis (12 experts)
1. accuracy/correctness
2. traceability
3. believability/credibility
4. timeliness, currentness
5. completeness
6. consistency
7. accessibility
8. confidentiality/privacy, security
9. understandability (ease of understanding, clarity, interpretability)
9. RESULTS OF APPLYING DATA QUALITY REQUIREMENTS TO OPEN GOVERNMENT DATA (part I)
Data quality problem in question | Frequency of checks (datasets) | Frequency of issues in DS (#defective data sets/#total) | Frequency of issues (#defective parameters/#total)
QD1: Incorrect/inaccurate values that do not belong to the domain | 40.00% | 16.67% | 15.38%
QD1: Misspelling | 86.67% | 7.69% | 3.33%
QD1: Precision | 40.00% | 0 | 0
QD1: Special characters | 10% | 13.33% | 25.93%
QD1: Duplicates / uniqueness violations | 93.33% | 28.57% | 18.18%
QD1: Incorrect references | 80.00% | 16.67% | 13.33%
QD1: Different aggregation levels | 80.00% | 16.67% | 13.33%
QD2: Traceability (DT) | 66.67% | 0 | 0
QD2: Traceability (DS) | 93.33% | 14.29% | 6.67%
QD3: Believability/credibility | 100% | 13.33% | 2.27%
QD4: Outdated temporal data (DT) | 93.33% | 7.14% | 10.00%
QD4: Outdated temporal data (DS) | 93.33% | 64.29% | 28.82%
QD5: Completeness | 93.33% | 64.29% | 28.82%
... | ... | ... | ...
10. Data quality problem in question | Frequency of checks (datasets) | Frequency of issues in DS (#defective data sets/#total) | Frequency of issues (#defective parameters/#total)
QD6: Different representations (intra-relational constraint) | 86.67% | 61.54% | 61.90%
QD6: Different word orderings between values of one attribute | 93.33% | 42.86% | 25.00%
QD6: Use of synonyms / multiple notation for one object in scope of one attribute | 86.67% | 61.54% | 61.90%
QD6: Use of synonyms / multiple notation for one object in different datasets | 93.33% | 50.00% | 26.32%
QD6: Different encoding formats | 80.00% | 0 | 0
QD6: Wrong data type | 86.67% | 7.69% | 0.80%
QD6: Different aggregation levels | 46.67% | 57.14% | 25.93%
QD6: Different units | 53.33% | 25.00% | 21.74%
QD6: Special characters | 46.67% | 57.14% | 25.93%
QD7: Special characters | 86.67% | 7.69% | 8.57%
QD7: Misspelling | 90.00% | 6.67% | 8.33%
QD7: Different encoding formats | 33.33% | 0 | 0
QD7: Different aggregation levels | 80.00% | 8.33% | 10.00%
QD7: Different units | 80.00% | 16.67% | 21.74%
QD7: Use of synonyms / multiple notation for one object in scope of different datasets | 86.67% | 30.77% | 21.74%
QD7: Bulk download | 100.00% | 20.00% | 20.00%
QD8: Confidentiality/privacy, security | 0 | 0 | 0
QD9: Understandability (DT) | 100.00% | 20.00% | 11.76%
QD9: Understandability (DS) | 100.00% | 66.67% | 25.93%
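The three frequency columns reported for each data quality problem can be computed from per-data-set check results roughly as sketched below. The slides do not state the underlying counts, so the sketch assumes that "#total" means the data sets (or parameters) on which the check was actually performed. Under that reading, a row such as 86.67% / 7.69% would correspond to 26 of 30 data sets checked and 2 of those 26 found defective. The class, field and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    dataset: str
    applicable: bool           # was the check performed on this data set?
    params_checked: int = 0    # parameters (columns) examined
    params_defective: int = 0  # parameters in which the issue was found

def pct(part, total):
    """Percentage rounded to two decimals; 0.0 when there is nothing to count."""
    return round(100.0 * part / total, 2) if total else 0.0

def summarise(results):
    """Compute the three frequencies for one data quality problem."""
    checked = [r for r in results if r.applicable]
    defective_sets = sum(1 for r in checked if r.params_defective > 0)
    total_params = sum(r.params_checked for r in checked)
    defective_params = sum(r.params_defective for r in checked)
    return {
        # share of all data sets on which the check could be performed
        "frequency_of_checks": pct(len(checked), len(results)),
        # assumption: denominator is the number of data sets checked
        "defective_datasets": pct(defective_sets, len(checked)),
        # assumption: denominator is the number of parameters checked
        "defective_parameters": pct(defective_params, total_params),
    }

# Hypothetical results for one check across three data sets.
sample = [
    CheckResult("a", True, 4, 1),
    CheckResult("b", True, 6, 0),
    CheckResult("c", False),
]
print(summarise(sample))
```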
11. RESULTS
✓ This study has raised and answered 4 research questions:
✓ the list of main data quality issues to be considered when conducting data quality analysis was identified in the course of the literature analysis and then filtered during the brainstorming session.
✓ through the DELPHI analysis with 12 experts, the list was reduced to 9 data quality dimensions and 15 data quality issues mapped to each other, with the data quality issues divided into two categories depending on their level, i.e., data and data set levels.
✓ the validity of the data quality issues identified was examined by applying the list of data quality requirements set in RQ1 and RQ2 to 30 real open government data sets from the Latvian open government data portal.
✓ 14 data quality issues to be transformed into requirements for the web-based tool under development have been identified, with 6 more that appear only in some cases (<10% of data sets) to be considered for implementation.
12. CONCLUSIONS I
✓ The concept and topic of "data quality" has attracted researchers for more than three decades, and its popularity will certainly not decline in the future: data are an integral part of our lives and business, and with the popularity of open government data their value is now higher than ever.
✓ The paradigm according to which data quality control and management are performed in closed systems is no longer valid.
✓ This leads to the modification of existing data quality dimensions and the development of new ones, along with their classification, data quality issues, etc.
13. CONCLUSIONS II
✓ The results showed that most of the defects are representative of the OGD available to each stakeholder.
✓ The OGD have data quality issues which, as demonstrated by OGD-related studies, have a negative impact on users' readiness and willingness to re-use these data for their own purposes, such as innovative services and solutions.
✓ Let's keep in mind that data are worth reusing only if they are usable in terms of both their value and their quality; otherwise they bring losses to businesses.
✓ Further studies on the topic include the development of the web-based data quality analysis tool, for which the knowledge obtained during this study will serve as a specification of the functionality to be covered.
14. DATA AVAILABILITY
Data are available in Open Access (under CC-BY) -> DOI: https://doi.org/10.5281/zenodo.4604656
https://www.eosc-hub.eu/open-science-info
15. THANK YOU FOR YOUR ATTENTION!
QUESTIONS?
For more information, see ResearchGate
See also anastasijanikiforova.com
For questions or any other queries, contact
me via email - Anastasija.Nikiforova@lu.lv