This is supplementary material for the following article: Bicevskis, J., Nikiforova, A., Bicevska, Z., Oditis, I., & Karnitis, G. (2019, October). A step towards a data quality theory. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 303-308). IEEE.
Data quality issues have been topical for many decades. However, a unified data quality theory has not been proposed yet, since many concepts associated with the term “data quality” are not straightforward enough. The paper proposes a user-oriented data quality theory based on clearly defined concepts. The concepts are defined by using three groups of domain-specific languages (DSLs): (1) the first group uses the concept of a data object to describe the data to be analysed, (2) the second group describes the data quality requirements, and (3) the third group describes the process of data quality evaluation. The proposed idea proved to be simple enough, but at the same time very effective in identifying data defects, despite the different structures of data sets and the complexity of data. Approbation of the approach demonstrated several advantages: (a) a graphical data quality model allows defining of data quality even by non-IT and non-data quality professionals, (b) data quality model is not related to the information system that has accumulated data, i.e., this approach lets users analyse the “third-party” data, and (c) data quality can be described at least at two levels of abstraction - informally, using natural language, or formally, including executable program routines or SQL statements.
1. A STEP TOWARDS A DATA QUALITY THEORY
The Third International Workshop on Data Science Engineering and its Applications (DSEA 2019)
In conjunction with
The Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS-2019)
Granada, Spain. October 22-25, 2019.
Janis Bicevskis, Anastasija Nikiforova, Zane Bicevska, Ivo Oditis, Girts Karnitis
Faculty of Computing, University of Latvia
Anastasija.Nikiforova@lu.lv
2. Def. I: «Quality» is a desirable goal to be achieved
through management of the production process.
Def. II: «Data quality» is a relative concept, largely
dependent on specific requirements resulting from the
data use.
late 60’s
data quality issues were first researched by statisticians, who mainly
proposed a mathematical theory for handling duplicates in statistical
data sets
late 80’s
the data quality issue attracted management researchers
early 90’s
computer science researchers began their own research, focusing on the data
stored in databases and data warehouses, examining how to define,
measure and improve the quality of different types of data, and relating the concept
of “data quality” to the “data quality dimension”
nowadays
almost 30 years later, with data everywhere and their volume growing
significantly, the issue is still topical and relevant but, unfortunately, has
not yet been solved
2017
Organizations believe poor data quality to be responsible
for an average of $15 million per year in losses (Gartner)
Data quality weaknesses can lead to huge losses
The aggregate economic impact from applications based
on open data across the EU27 economy is estimated to
be €140 billion annually.
2016
Decisions resulting from bad data cost the US economy
$3.1 trillion per year (IBM)
A BRIEF INSIGHT INTO THE HISTORY
3. [open] data are usually used by a wide audience that
may not have deep knowledge of IT or data quality
a solution should be simple enough,
giving individual users the possibility to take part in
the analysis of «third-party» data
for their own purposes
DATA QUALITY
Solution: previously proposed user-oriented data object-driven
approach
(Bicevskis, Bicevska, Nikiforova, Oditis, 2018), (Nikiforova, 2019)
!!! The same data may be
of sufficient quality in one case
BUT
completely useless under other
circumstances.
4. RELATED RESEARCH
Problem I: the need to involve data quality experts at every stage of the data quality analysis
process.
Solution: data object-driven approach to data quality evaluation (Bicevskis, Bicevska, Nikiforova, Oditis, 2018)
Problem II: absence of a data quality theory.
* «… This state of affairs has led to much confusion within the data quality
community and is even more bewildering for those who are new to the
discipline and more importantly to business stakeholders…» (DAMA UK,
2018)
** In different proposals, dimensions of the same name can have different
semantics and vice versa. (Batini, 2016)
Example I: (Kerr, et al., 2007):
New Zealand’s healthcare data:
6 data quality dimensions,
24 characteristics
69 data quality criteria.
Example II: (Dahbi et al., 2018; Weiskopf et al.,
2013):
2 data quality dimensions: accuracy
and completeness
Most theoretical studies are characterized by a wide range of data and information quality
dimensions:
✘ theoretical data quality studies have not yet provided a unified system of data quality concepts*;
✘ the exact meaning of each dimension and how it should be assessed is still under discussion**;
✘ different proposals often use the same name for semantically different dimensions and
vice versa;
✘ sometimes the difference between some of them is almost unnoticeable;
✘ each dimension can be supplied with one or more metrics that vary from one solution to another;
✘ the set of dimensions and their definitions are often useful only for a particular solution.
Question: How to relate a particular dimension (and which one?) to a particular use case?
5. SUMMARY
This research is of a theoretical nature, the main objectives of which are:
to provide a clear and straightforward definition of data quality concepts to ensure that all stakeholders perceive
them equally,
to provide a language family that will describe the data quality requirements and assess the quality of data, taking
into account the various possible uses of the data and their variability over time.
to provide a formalisation of the previously proposed practical solution to take a step towards a theory of data
quality, which hasn’t been proposed yet, despite numerous attempts.
6. MAIN PRINCIPLES OF THE PROPOSED SOLUTION
TDQM data quality lifecycle: data quality definition, data quality measuring, data quality analysis, data quality improvement
Each specific application can have its own specific DQ checks;
DQ requirements can be formulated on several levels: from informal text in natural language (PIM) to an automatically executable model, SQL statements or program code (PSM);
DQ can be checked at various stages of data processing;
The DQ definition language is a graphical DSL:
• the diagrams are easy to read, create, understand and edit even by non-IT and non-DQ experts;
• syntax and semantics can be easily applied to any new IS.
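The two levels at which a DQ requirement can be formulated, informal text and an executable statement, can be illustrated with a minimal sketch. The requirement text, the `students` table and its columns are hypothetical and chosen only for illustration; they are not taken from the presented solution.

```python
import sqlite3

# Informal level (PIM): the requirement stated in natural language.
INFORMAL_REQUIREMENT = "Every student must have a non-empty name and a programme code."

# Formal level (PSM): the same requirement as an executable SQL statement
# that selects the records violating it. Table and column names are hypothetical.
VIOLATIONS_SQL = """
SELECT id, name, prog_code
FROM students
WHERE name IS NULL OR TRIM(name) = ''
   OR prog_code IS NULL OR TRIM(prog_code) = ''
ORDER BY id
"""


def find_violations(conn):
    """Return the rows that do not satisfy the requirement."""
    return conn.execute(VIOLATIONS_SQL).fetchall()


# Small in-memory data set to run the check against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, prog_code TEXT)")
conn.executemany(
    "INSERT INTO students (id, name, prog_code) VALUES (?, ?, ?)",
    [(1, "Anna", "CS"), (2, "", "CS"), (3, "Peter", None)],
)
violations = find_violations(conn)
print(INFORMAL_REQUIREMENT)
print(violations)
```

The informal sentence can be agreed on with non-IT stakeholders first, and the SQL form added later without changing the meaning of the requirement.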
7. DATA QUALITY MODEL
instead of dimensions
1. DATA OBJECT (DO) - the set of values of the parameters that characterize a real-life object
primary data object - the initial DO whose quality is analysed;
secondary data object - a DO that determines the context for the analysis of the primary DO;
both primary and secondary DOs may contain an unlimited number of data sub-objects.
* Many objects of the same structure form a class of data objects.
** There is usually one primary data object, but the number of secondary data objects is not limited and is determined by the
nature of the primary data object and the specific use case.
[Figure: a data object as a set of parameter values d1, d2, d3, d4, ..., dn]
2. DATA QUALITY REQUIREMENTS - conditions that must be met for a
data object to be considered of high quality.
** May contain: informal or formalized implementation-independent descriptions of conditions
3. DATA QUALITY MEASURING PROCESS - the procedure to be followed to assess the quality of the data
!!! All three components are
defined by using a graphical
domain-specific language
(DSL)**
** Three DSL families were developed as graphical languages based on the
capabilities of the modelling platform DIMOD
9. DATA OBJECT
A DO is a set of attribute values that characterize
one real object.
The address of an attribute value of a
single data object is
<dataObjectName.attributeName>; it is
used at the stage of determining data
quality requirements.
A DO can be described at different levels of abstraction:
from a formal language grammar
to definitions of variables in programming languages.
[Figure «DO»: data object diagram]
inputMessage: studentName (varchar), courseCode (varchar), Assessment (enumerable), Date (date)
Students: Name (varchar), progCode (varchar); sub-object Success: courseCode (varchar), Assessment (enumerable), Date (date)
Programs: Code (varchar), Name (varchar); sub-object Courses: Code (varchar), Name (varchar)
Legend: primary DO, secondary data object, data sub-object
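As an illustration of the lowest level of abstraction mentioned above (definitions of variables in a programming language), the data objects of the diagram could be written down roughly as follows. This is a sketch only: the class names `SuccessEntry`, `Course`, `Student`, `Program` and `InputMessage`, the helper `resolve`, and the sample values are assumptions made for this example, while the attribute names follow the diagram.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class SuccessEntry:  # element of the Success sub-object of Students
    courseCode: str
    Assessment: str
    Date: date


@dataclass
class Course:  # element of the Courses sub-object of Programs
    Code: str
    Name: str


@dataclass
class Student:
    Name: str
    progCode: str
    Success: List[SuccessEntry] = field(default_factory=list)


@dataclass
class Program:
    Code: str
    Name: str
    Courses: List[Course] = field(default_factory=list)


@dataclass
class InputMessage:
    studentName: str
    courseCode: str
    Assessment: str
    Date: date


def resolve(address: str, objects: Dict[str, object]) -> object:
    """Resolve an address of the form dataObjectName.attributeName (hypothetical helper)."""
    object_name, attribute_name = address.split(".", 1)
    return getattr(objects[object_name], attribute_name)


# Sample values are invented for the example.
message = InputMessage("Anna", "DB101", "passed", date(2019, 10, 22))
objects = {"inputMessage": message}
print(resolve("inputMessage.studentName", objects))
```

The `resolve` helper mirrors the `<dataObjectName.attributeName>` address notation for a single data object; addressing instances of a class needs an instance identifier in addition.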
10. QUALITY SPECIFICATION FOR DATA OBJECT’S CLASS
In order to include quality requirements in the
contextual requirements, addresses of the
secondary data object’s parameters are used in
the appropriate conditions:
<secondaryDataObjectName(instanceIdent).attributeName>.
If the secondary data object has to be found
by its attribute values, a search command
similar to that for the primary data object is used:
<instanceIdent = seekInst(secondaryObjectName, expression)>.
When processing a data object class,
instances of the data object class are selected and
the fulfilment of the quality requirements is examined for each individual instance.
The instance processing cycle is determined by users.
The most commonly used options:
If quality is analysed for all instances of a DO:
all class instances are reviewed by changing the address
<dataObjectName(instanceIdent).attributeName>,
which is (a) calculated first by selecting the first instance
using the method <instanceIdent = getFirst(dataObjectName)>,
(b) followed by a transition to the next instance using the
method <instanceIdent = getNext(dataObjectName)>.
If quality is analysed for only one instance of a DO:
a dynamically calculated address
<instanceIdent = seekInst(dataObject, expression)> is used.
If an instance of the DO is found, then (a) a reference to the
DO is inserted into the variable instanceIdent and
(b) the value TRUE is returned to the environment;
otherwise FALSE is returned and NULL is inserted into the variable.
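The two instance-processing options described on this slide can be sketched in a few lines. The class `DataObjectClass`, the use of a list index as the instance identifier, and the use of a Python predicate in place of the expression string are all assumptions of this example; only the operation names getFirst, getNext and seekInst come from the slide.

```python
from typing import Callable, Dict, List, Optional, Tuple

Instance = Dict[str, object]


class DataObjectClass:
    """Hypothetical in-memory stand-in for a class of data objects."""

    def __init__(self, name: str, instances: List[Instance]):
        self.name = name
        self.instances = list(instances)
        self._cursor: Optional[int] = None

    def get_first(self) -> Optional[int]:
        """Select the first instance; None if the class is empty."""
        self._cursor = 0 if self.instances else None
        return self._cursor

    def get_next(self) -> Optional[int]:
        """Move to the next instance; None when no instances are left."""
        if self._cursor is None:
            return None
        nxt = self._cursor + 1
        self._cursor = nxt if nxt < len(self.instances) else None
        return self._cursor

    def seek_inst(self, predicate: Callable[[Instance], bool]) -> Tuple[bool, Optional[int]]:
        """Return (True, identifier) for the first match, otherwise (False, None)."""
        for ident, inst in enumerate(self.instances):
            if predicate(inst):
                return True, ident
        return False, None

    def attr(self, ident: int, attribute: str) -> object:
        """Value at the address dataObjectName(instanceIdent).attributeName."""
        return self.instances[ident][attribute]


students = DataObjectClass("Students", [
    {"Name": "Anna", "progCode": "CS"},
    {"Name": "Peter", "progCode": ""},
])

# Option 1: check a requirement (non-empty progCode) for all instances.
failed = []
ident = students.get_first()
while ident is not None:
    if not students.attr(ident, "progCode"):
        failed.append(students.attr(ident, "Name"))
    ident = students.get_next()

# Option 2: check a single instance located by its attribute values.
found, inst_student = students.seek_inst(lambda s: s["Name"] == "Anna")
missing = students.seek_inst(lambda s: s["Name"] == "Zoe")
print(failed, found, inst_student, missing)
```

Returning a pair from `seek_inst` keeps both effects described on the slide visible: the TRUE/FALSE result and the value placed in the instance identifier.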
11. PRE-CONDITION QUALITY DEFINITIONS
When processing a data object class,
instances of the data object class are selected and
the fulfilment of the quality
requirements is examined for each individual instance.
DQ requirements are defined by using logical
expressions.
The names of DO attributes/fields serve as
operands in the logical expressions.
[Flowchart: each check has a YES and a NO branch]
Check Student
instStudent = seekInst(Students, 'Students.Name = inputMessage.studentName')
Check Course
instProgram = seekInst(Programs, 'Programs.Code = Students(instStudent).progCode')
Check Course
instCourse = seekInst(Programs(instProgram).Courses, 'Courses.Code = inputMessage.courseCode')
Send Message
sendMessage(17, inputMessage.studentName)
Send Message
sendMessage(18, inputMessage.courseCode)
Send Message
sendMessage(19, inputMessage.courseCode)
The pre-condition verifies (bold lines in «DO»):
whether the student to whom inputMessage
applies exists;
whether the student is registered in a training
program;
whether the course specified in inputMessage
belongs to the training program.
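The three pre-condition checks listed above can be sketched as ordinary code. This is an illustration under stated assumptions, not the executable model from the presentation: data objects are plain dictionaries, `seek_inst` takes a Python predicate, the sample data are invented, and the pairing of message codes 18 and 19 with the program and course checks is assumed from the order in which they appear in the diagram.

```python
def seek_inst(instances, predicate):
    """Return the first instance satisfying the predicate, or None (hypothetical helper)."""
    for inst in instances:
        if predicate(inst):
            return inst
    return None


def check_precondition(input_message, students, programs):
    """Return None if all checks pass, otherwise a (message_code, value) pair."""
    # Does the student to whom the input message applies exist?
    student = seek_inst(students, lambda s: s["Name"] == input_message["studentName"])
    if student is None:
        return (17, input_message["studentName"])

    # Is the student registered in a training program? (code 18 is an assumed pairing)
    program = seek_inst(programs, lambda p: p["Code"] == student["progCode"])
    if program is None:
        return (18, input_message["courseCode"])

    # Does the course belong to that program? (code 19 is an assumed pairing)
    course = seek_inst(program["Courses"], lambda c: c["Code"] == input_message["courseCode"])
    if course is None:
        return (19, input_message["courseCode"])
    return None


# Invented sample data.
students = [{"Name": "Anna", "progCode": "CS"}, {"Name": "Peter", "progCode": "XX"}]
programs = [{"Code": "CS", "Name": "Computer Science",
             "Courses": [{"Code": "DB101", "Name": "Databases"}]}]

print(check_precondition({"studentName": "Anna", "courseCode": "DB101"}, students, programs))
print(check_precondition({"studentName": "Zoe", "courseCode": "DB101"}, students, programs))
```

Each check depends on the instance found by the previous one, which is why the sketch stops at the first failed check.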
12. EXAMPLE: POST-CONDITION QUALITY DEFINITIONS
A concrete DO or a class of DOs is used as
input for a quality verification process.
The quality verification process creates a test
protocol.
[Flowchart: each check has a YES and a NO branch]
Seek Student
instStudent = seekInst(Students, 'Students.Name = inputMessage.studentName')
Check Course Insert
instSuccess = seekInst(Students(instStudent).Success, 'Success.courseCode = inputMessage.courseCode')
Check Assessment Insert
Success(instSuccess).Assessment = inputMessage.Assessment
Check Date Insert
Success(instSuccess).Date = inputMessage.Date
Send Message
sendMessage(21, inputMessage.courseCode)
Send Message
sendMessage(22, inputMessage.Assessment)
Send Message
sendMessage(23, inputMessage.Date)
The post-condition is executed after Data_Input and
verifies (thin arrows in Fig. «DO»):
whether a new instance has been added to
the Success sub-object of the Students data object;
whether a new instance with the
corresponding course assessment has been
added to the Success sub-object of the Students
data object;
whether a new instance with the
corresponding exam date has been added to
the Success sub-object of the Students data object.
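The post-condition checks described above, together with the idea of a test protocol, can be sketched as follows. The dictionary-based data objects, the `seek_inst` helper, the sample data, the decision to stop at the first failed check, and the handling of a missing student as an error are assumptions of this example rather than details of the presented model.

```python
from datetime import date


def seek_inst(instances, predicate):
    """Return the first instance satisfying the predicate, or None (hypothetical helper)."""
    for inst in instances:
        if predicate(inst):
            return inst
    return None


def check_postcondition(input_message, students):
    """Return a test protocol: a list of (message_code, value) pairs, empty if all checks pass."""
    student = seek_inst(students, lambda s: s["Name"] == input_message["studentName"])
    if student is None:
        # Assumed behaviour: the pre-condition should already have rejected this input.
        raise ValueError("student not found")

    # Was a Success instance for the course added?
    success = seek_inst(student["Success"],
                        lambda r: r["courseCode"] == input_message["courseCode"])
    if success is None:
        return [(21, input_message["courseCode"])]
    # Does it carry the assessment from the input message?
    if success["Assessment"] != input_message["Assessment"]:
        return [(22, input_message["Assessment"])]
    # Does it carry the exam date from the input message?
    if success["Date"] != input_message["Date"]:
        return [(23, input_message["Date"])]
    return []


# Invented sample data: state of Students after the input was processed.
students = [{
    "Name": "Anna", "progCode": "CS",
    "Success": [{"courseCode": "DB101", "Assessment": "passed", "Date": date(2019, 10, 22)}],
}]
message = {"studentName": "Anna", "courseCode": "DB101",
           "Assessment": "passed", "Date": date(2019, 10, 22)}
protocol = check_postcondition(message, students)
print(protocol)
```

An empty protocol means the stored data match the input message; any entry identifies the failed check by its message code.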
13. EXPERIENCE OF EVALUATION
OF OPEN DATA QUALITY
structured and semi-structured
open data sets provided by
different data publishers;
the data quality requirements
formulated for each data set vary
from very simple to fairly complex.
In total: of 25 data sets, 23 (92%) have at least several data quality issues.
The most frequent data quality issues:
✘ lack of values even for the primary parameters;
✘ doubtful/invalid dates;
✘ issues in interrelated parameters;
✘ multiple notations for the same object;
✘ values that don’t belong to the list of valid values;
✘ contextual data quality issues such as lack of values and conflicting values.
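Several of the issue types listed on this slide (missing values, doubtful or invalid dates, values outside the list of valid values) correspond to simple record-level checks. The sketch below is illustrative only: the field names, the date format, the list of valid assessments and the sample records are assumptions and do not come from the evaluated data sets.

```python
from datetime import date, datetime

# Hypothetical list of valid values for the "assessment" field.
VALID_ASSESSMENTS = {"pass", "fail"}


def parse_date(text):
    """Parse an ISO-style date string; return None if it is missing or not a real date."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def find_issues(record, today):
    """Return a list of data quality issues found in a single record."""
    issues = []
    # Lack of a value for a primary parameter.
    if not (record.get("name") or "").strip():
        issues.append("missing value: name")
    # Invalid or doubtful date.
    exam_date = parse_date(record.get("exam_date"))
    if exam_date is None:
        issues.append("invalid date: exam_date")
    elif exam_date > today:
        issues.append("doubtful date: exam_date is in the future")
    # Value that does not belong to the list of valid values.
    if record.get("assessment") not in VALID_ASSESSMENTS:
        issues.append("value outside valid list: assessment")
    return issues


today = date(2019, 10, 25)
good = {"name": "Anna", "exam_date": "2019-10-22", "assessment": "pass"}
bad = {"name": " ", "exam_date": "2019-02-30", "assessment": "excellent"}
print(find_issues(good, today))
print(find_issues(bad, today))
```

Contextual issues, such as conflicting values across interrelated parameters, would need a secondary data object to compare against and are not covered by this sketch.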
14. RESULTS
The research proposes a data object-driven theory of data quality, which arose from previous studies and eliminates their
lack of formalization.
An end user who is interested in analysing data quality according to their needs is placed at the centre of
the data quality analysis.
The most significant advantages:
all concepts of the proposed data quality theory are straightforward;
the proposed approach is an «external» mechanism that allows describing the DQ and verifying the applicability of data to a
specific use case independently of the IS accumulating and processing the data;
the use of graphical DSLs simplifies the interaction process by allowing multiple stakeholders to be involved;
designing the diagrams is fairly simple, so it is assumed that DQ analysis can be performed even by non-IT and non-DQ experts;
applying the proposed solution to the analysis of “third-party” data sets demonstrates its simplicity and
effectiveness.
15. THANK YOU FOR YOUR ATTENTION!
For more information, see ResearchGate
See also anastasijanikiforova.com
For questions or any other queries, contact me via email - Anastasija.Nikiforova@lu.lv