Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

DATA QUALITY
THE DATA SCIENCE STRUGGLE NOBODY MENTIONS
MAURICE VAN KEULEN
ASSOCIATE PROFESSOR
DATA MANAGEMENT TECHNOLOGY
HEAD OF DATABASE GROUP

 First, a little story
 Data imperfections
 Some general mechanisms
 My research
7 Sept 2017Data quality: the data science struggle nobody mentions 2
WHAT IS THIS TALK ABOUT?

Domain
understanding
Data
understanding
Data
preparation
Modeling
(Analysis)
Deployment
(Use) Data
Evaluation
(Interpretation)
Where the
magic happens

Research on
Pregnancy processes
based on
Electronic Patient Dossiers (EPDs)
of some population of women

The start
1. Data scientist is a parent him/herself
2. Gather records from patient’s EPDs for
pregnancy period from multiple sources
3. Fairly straightforward to identify consults,
tests, scans, conditions
4. Extract and store them

First analysis
and evaluation
1. Analysis with process mining tool
2. Interaction with visualization
3. Interpret results
4. (S)he already sees some interesting patterns

But then
(s)he notices …
and realizes …
 … a broken leg … dozens of specialists …
 Assumption wrong: “all records that belong to
a pregnant woman are related to pregnancy”
 Too many records selected during preparation
 No objective means to ascertain this:
No field ‘related to pregnancy’

and a painstaking
process starts
 Specify complex filter rules
 Inspect samples of (not) selected records
 Repeat!
Quick and
dirty or
thorough?
Never
perfect!
What is good
enough?
How does it
affect results?
fatigue can show
later pregnancy
related or not

Fast forward a bit …
Re-perform analysis
and evaluation
1. Interaction with mining tool and visualization
2. Something strange in the times of consults:
Consult after blood test it prescribed???

Realization
More cleaning
1. Clinician contacted for explanation:
Notes during consult put in EPD in evenings!
2. Modification of EPD record (what is recorded)
≠ actual moment of activity (semantics)
Sequence and duration noise
3. More data cleaning ensues

Complex
Many unexpected surprises in
domain/data understanding
Time-consuming
Most time spent on data preparation
(upto 50-80%)

A data scientist should know and tell you about the
deficiencies in the data and the results
Quick and
dirty or
thorough?
Never
perfect!
What is good
enough?
How does it
affect results?

When presented with analytics results / visualizations,
Let people know you don’t trust any results if they don’t have
good answers to questions like these
How was it
cleaned?
Data quality
problems?
Data
discarded?
How reliable
are these
results?
How does
this affect
the results?

What is it
 Data and specification on parts, substances, etc.
Why is it a problem?
 High requirements on data quality
 Errors and duplicates may be
costly or even pose health risks
Even so, it is a mess
PRODUCT DATA
WHAT IS IT AND WHY IS IT A PROBLEM?

Proposed approach
 Given catalogue / database with data on products
 Gather data on the same products from websites
(many more or less independent sources)
 Consolidate: match, merge and clean
One enriched description
for each product
PRODUCT INFORMATION CLEANING AND ENRICHMENT

PILOT: BALL BEARINGS
1. GIVEN CATALOGUE / DATABASE WITH DATA ON PRODUCTS

PILOT: BALL BEARINGS
2. GATHER DATA ON THE SAME PRODUCTS FROM WEBSITES; 3. CONSOLIDATE
Get product
pages
Extact data
Consolidate
(match,
merge, clean)

PILOT EXPERIENCES: THE DIRT WE FOUND!!

CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
 Data sample looks fine
 Result of analysis looks
perfectly reasonable
 If you don’t look hard
enough
if you don’t properly pay
attention to it
… you will be unaware
… that you are possibly
looking at significantly
erroneous figures!!!

CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID Sales Name
6789 2 Tom
4567 6000 Jon
5678 NULL Nina
… … …
????
Wrong values included
Missing data
Double counting
etc.
Many more problems
at value, record,
schema, source, trust
levels

Data integration very important in Data Science
 Example purpose: Data enrichment
Many names for the same problem
 Record linkage, Entity resolution, Entity linking,
Semantic duplicates, Data coupling, Data fusion, etc.
DATA INTEGRATION: SEMANTIC ENTITY LINKING
ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEM
CustID Name …
1234 P. Jansen …
2345 P. Janssen …
3456 P. Janssen …
… … …
EmpNr Name …
6789 P. Jansen …
4567 P. Jansen …
5678 P. Janssen …
… … …
?
They could all be the same person or all different persons!

Data perspective User perspective
Context
independent
Spelling error
Missing data
Incorrect value
Duplicate data
Inconsistent data format
Syntax violation
Violation of integrity constraints
Heterogeneity of measurement units
Existence of synonyms and
homonyms
Information is inaccessible
Information is insecure
Information is hardly retrievable
Information is difficult to aggregate
Errors in the information
transformation
Context
dependent
Violation of domain constraints
Violation of organization’s business
rules
Violation of company and
government regulations
Violation of constraints provided by
the database administrator
The information is not based on fact
Information is of doubtful credibility
Information presents impartial view
Information is irrelevant to the work
Information is incomplete
Information is compactly represented
Information is hard to manipulate
Information is hard to understand
Information is outdated
CATEGORIZATION OF DATA QUALITY PROBLEMS
SOURCE: P. WOODALL, M. OBERHOFER, A. BOREK, “A CLASSIFICATION OF DATA
QUALITY ASSESSMENT AND IMPROVEMENT METHODS”, JDIQ 3(4), 2014

Data/information quality cannot be measured with one value
 Lot of research on describing desirable properties
 Examples: Accuracy, Correctness, Currency,
Completeness, Relevance, Reliability, etc.
 Metrics: concrete means to estimate score on dims
Quite futile endeavor in my opinion
 200+ such dimensions have been identified,
even more possible metrics
 Little agreement
 Expensive, too complex
DATA QUALITY DIMENSIONS AND METRICS

“The data quality of a data set” is a meaningless notion
 Data quality depends on the purpose!
 What is good enough for one purpose,
may be insufficient for another
Alternative means of measurement
 Define purpose as a set of queries on the data
 Measure quality of the answers
 Purpose determines relevant dimensions & metrics
DATA QUALITY DEPENDS ON THE PURPOSE

Wikipedia: “Process of implementing and developing
technical standards”
 Requires consensus
 Typically quite costly
 Benefits but also downsides (lack of variety,
exceptions, slow evolution, defectors, etc.)
 In essence non-technical solution, but can be
supported by technology
Useful but inherently imperfect itself
STANDARDIZATION

One can do quite a bit of checking
 Profiling of domains: detects improper values
 Profiling keys: values in column unique, expected keys
 Verification of foreign key/primary key relationships
 Verification of constraints and business rules
 Verification of expected dependencies: inclusion,
(conditional) functional dependencies
Detects missing / erroneous rows
 Matching with reference data
 etc.
ERROR DETECTION BY VERIFICATION & PROFILING

What can be automatically discovered?
 Detection of outliers: possible erroneous values
 Similarity matching: possible duplicates
 Inclusion detection: possible inclusion dependencies
➠ possible missing erroneous rows
 Dependency detection: possible (conditional)
functional dependencies
➠ possible erroneous values, possible duplicates
 Prediction: machine learning to predict value based on
other data
➠ possible erroneous values (e.g., categorizations)
There is much much more 
ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY

What can be automatically discovered?
 Detection of outliers: possible erroneous values
 Similarity matching: possible duplicates
 Inclusion detection: possible inclusion dependencies
➠ possible missing erroneous rows
 Dependency detection: possible (conditional)
functional dependencies
➠ possible erroneous values, possible duplicates
 Prediction: machine learning to predict value based on
other data
➠ possible erroneous values (e.g., categorizations)
There is much much more 
ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
Very useful but again inherently imperfect itself

 Manually / semi-automatic / automatic
 Techniques of previous slide can not only be used to
discover errors, but also to automatically clean them;
If sufficiently probable
 duplicate rows can be merged
 erroneous rows deleted
 erroneous values corrected
 Missing rows / values
 Data imputation: fill with predicted values
DATA CLEANING

MY RESEARCH
How do we
humans find
our way in
this mess?

 We have domain knowledge
 We know what to expect
 We know what is likely and what is not
 We compare with what we know and expect
 We doubt
 We (dis)trust
 We learn from others
 We reconsider
 etc.
HOW DO WE HUMANS FIND OUR WAY IN THIS MESS?
Probabilities
Uncertainty
Metadata of context:
source, situation, …
Evidence gathering
User feedback

THREE PRINCIPLES
Let me illustrate 2 & 3 with
an example of combining data

EXAMPLE COMBINING DATA
OBJECTIVE: DETERMINE PREFERRED CUSTOMER (WITH SALES > 100)
Keulen, M. (2012) Managing Uncertainty: The Road
Towards Better Data Interoperability. IT - Information
Technology, 54 (3). pp. 138-146. ISSN 1611-2776
Car brand Sales
B.M.W. 25
Mercedes 32
Renault 10
Car brand Sales
BMW 72
Mercedes-Benz 39
Renault 20
Car brand Sales
Bayerische Motoren Werke 8
Mercedes 35
Renault 15
Car brand Sales
B.M.W. 25
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45

VOILA! SEMANTIC DUPLICATES PROBLEM
Car brand Sales
B.M.W. 25
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
0
‘No preferred customers’

Database
Real world
(of car brands)
Mercedes-Benz 39
72BMW
45Renault
67Mercedes
8
Bayerische
Motoren Werke
25B.M.W.
SalesCar brand ω
d1
d2
d3
d4
d5
d6
o1
o2
o3
o4
SEMANTIC DUPLICATES

MOST DATA IMPERFECTIONS
CAN BE MODELED AS UNCERTAINTY IN DATA
Car brand Sales
B.M.W. 25
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Mercedes 106
Mercedes-Benz 106
1
2
3
4
5
6
X=0
X=0
X=1 Y=0
X=1 Y=1
X=0 4 and 5 different 0.2
X=1 4 and 5 the same 0.8
Y=0 “Mercedes”
correct name
0.5
Y=1 “Mercedes-Benz”
correct name
0.5
B.M.W. / BMW / Bayerische Motoren Werke analogously
Run some duplicate detection
tool (similarity matching)

 Looks like ordinary database
 Several “possible”, “most likely” or “approximate”
answers to queries
 Important: Scalability (big data!)
Sales of “preferred customers”
 SELECT SUM(sales)
FROM carsales
WHERE sales≥ 100
 Answer: 106 (most likely)
WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
Second most likely
answer at 24% with
impact factor 2 in
sales (211 vs 106)

Data integration very important in Data Science
 Example purpose: Data enrichment
Many names for the same problem
 Record linkage, Entity resolution, Entity linking,
Semantic duplicates, Data coupling, Data fusion, etc.
DATA INTEGRATION: SEMANTIC ENTITY LINKING
ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEM
CustID Name …
1234 P. Jansen …
2345 P. Janssen …
3456 P. Janssen …
… … …
EmpNr Name …
6789 P. Jansen …
4567 P. Jansen …
5678 P. Janssen …
… … …
?
They could all be the same person or all different persons!
Remember this slide?

Let’s go for an initial
integration that can readily
and meaningfully be used
Let it improve during use
“Good is good enough”
7 Sept 2017Data quality: the data science struggle nobody mentions
DATA INTEGRATION THE PROBABILISTIC WAY
Use
Gather
evidence
Improve
data quality
Partial data
integration
Enumerate cases for
remaining problems
Store data with
uncertainty in UDBMS
InitialintegrationContinuousimprovement
42

User perceives (part of) source as of doubtful credibility
We humans say “I don’t trust this data”
Rows correct or not, so each with variable & probability
Sources contain conflicting data about (possibly) the same
entities; We humans say “I’m left with some doubt”
We already saw this one: car brand example!
Automatic error discovery tool (AEDT) detects possible
erroneous values or rows
We humans say “I doubt that this is correct”
Alternative rows, each with probability of correctness
TRUST AND DOUBT THE PROBABILISTIC WAY

Little story
 Peak into the life and frustrations of a data scientist
 Awareness for responsible analytics
Data imperfections: What a mess!
 Ball bearings pilot
 Models for data quality (problems)
Some general mechanisms: Still quite a mess!
 Standardization, error detection & discovery, cleaning
… and I talked about my own research
WHAT HAVE I TALKED ABOUT

With this a data scientist can
 not trust certain parts of input data as much as others
 integrate data in a quick and dirty but responsible way
 information about data problems is *in* the data
 only solve data quality problems to the degree needed
 measure data uncertainty and quality
 documentation of data manipulation and its reasons
know how DQ problems affect reliability of the results

(Francis Bacon, 1605)
(Jorge Luis Borges, 1979)

 dripping clock: http://www.gemfive.com
 flower: http://cdn.tinybuddha.com
 big data: http://tr1.cbsistatic.com
 awareness: http://7minutesinthemorning.com
SOURCES OF IMAGES

Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

Similar to Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017 (20)

Recently uploaded

Recently uploaded (20)

Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

Editor's Notes