DATA QUALITY
THE DATA SCIENCE STRUGGLE NOBODY MENTIONS
MAURICE VAN KEULEN
ASSOCIATE PROFESSOR
DATA MANAGEMENT TECHNOLOGY
HEAD OF DATABASE GROUP
WHAT IS THIS TALK ABOUT?
▪ First, a little story
▪ Data imperfections
▪ Some general mechanisms
▪ My research
7 Sept 2017, Data Science MeetUp Twente. Data quality: the data science struggle nobody mentions
[Figure: data science process cycle with the stages Domain understanding, Data understanding, Data preparation, Modeling (Analysis), Evaluation (Interpretation) and Deployment (Use) arranged around the Data. Callout: "Where the magic happens".]
Research on pregnancy processes, based on Electronic Patient Dossiers (EPDs) of some population of women
The start
1. Data scientist is a parent him/herself
2. Gather records from patients’ EPDs for the pregnancy period from multiple sources
3. Fairly straightforward to identify consults, tests, scans, conditions
4. Extract and store them
First analysis and evaluation
1. Analysis with process mining tool
2. Interaction with visualization
3. Interpret results
4. (S)he already sees some interesting patterns
But then (s)he notices … and realizes …
▪ … a broken leg … dozens of specialists …
▪ Assumption wrong: “all records that belong to a pregnant woman are related to pregnancy”
▪ Too many records selected during preparation
▪ No objective means to ascertain this: no field ‘related to pregnancy’
and a painstaking process starts
▪ Specify complex filter rules
▪ Inspect samples of (not) selected records
▪ Repeat!
Quick and dirty or thorough?
Never perfect!
What is good enough?
How does it affect results?
Fatigue can turn out later to be pregnancy related or not
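The specify-inspect-repeat loop above might look roughly like the sketch below. Everything in it is hypothetical: the record fields, the example records, the keyword lists and the helper names `is_pregnancy_related` and `sample_for_inspection` are illustrative assumptions, not the rules used in the actual study.

```python
import random

# Hypothetical EPD records gathered for the pregnancy period (illustrative only).
records = [
    {"id": 1, "specialism": "obstetrics", "text": "20-week ultrasound scan"},
    {"id": 2, "specialism": "orthopedics", "text": "broken leg, cast applied"},
    {"id": 3, "specialism": "general practice", "text": "fatigue, blood test ordered"},
    {"id": 4, "specialism": "midwifery", "text": "routine consult"},
]

# Assumed filter rules; in practice they grow complex and change every round.
INCLUDE_SPECIALISMS = {"obstetrics", "midwifery"}
INCLUDE_KEYWORDS = {"ultrasound", "pregnan"}


def is_pregnancy_related(record):
    """Rule-based guess: there is no 'related to pregnancy' field to check against."""
    if record["specialism"] in INCLUDE_SPECIALISMS:
        return True
    text = record["text"].lower()
    return any(keyword in text for keyword in INCLUDE_KEYWORDS)


def sample_for_inspection(rows, k, seed=0):
    """Draw a reproducible sample that a domain expert inspects by hand."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


selected = [r for r in records if is_pregnancy_related(r)]
rejected = [r for r in records if not is_pregnancy_related(r)]

print("selected:", [r["id"] for r in selected])
print("to inspect among rejected:", [r["id"] for r in sample_for_inspection(rejected, 2)])
```

With these assumed rules the fatigue record ends up among the rejected ones, which is exactly the kind of borderline case that only manual inspection of the samples brings to light.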
Fast forward a bit …
Re-perform analysis and evaluation
1. Interaction with mining tool and visualization
2. Something strange in the times of consults: a consult after the blood test it prescribed???
Realization
More cleaning
1. Clinician contacted for explanation: notes taken during a consult are put in the EPD in the evenings!
2. Modification of EPD record (what is recorded) ≠ actual moment of activity (semantics): sequence and duration noise
3. More data cleaning ensues
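A check for this kind of sequence noise could be sketched as follows. The event log, the field names `recorded_at` and `prescribed_by`, and the function `find_order_violations` are assumptions made for illustration; a real EPD export will look different.

```python
from datetime import datetime

# Hypothetical event log: 'recorded_at' is the EPD modification time,
# which is not necessarily the moment the activity took place.
events = [
    {"id": "consult-1", "type": "consult",
     "recorded_at": datetime(2017, 3, 1, 20, 15)},   # notes entered in the evening
    {"id": "blood-1", "type": "blood test",
     "recorded_at": datetime(2017, 3, 1, 11, 30),
     "prescribed_by": "consult-1"},
]


def find_order_violations(log):
    """Return (activity, prescriber) pairs where the activity is recorded
    before the consult that prescribed it."""
    by_id = {event["id"]: event for event in log}
    violations = []
    for event in log:
        prescriber = by_id.get(event.get("prescribed_by"))
        if prescriber is not None and event["recorded_at"] < prescriber["recorded_at"]:
            violations.append((event["id"], prescriber["id"]))
    return violations


violations = find_order_violations(events)
print(violations)
```

Such a check only flags suspicious orderings; deciding what the real moment of the consult was still needs domain knowledge.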
Complex
Many unexpected surprises in domain/data understanding
Time-consuming
Most time spent on data preparation (up to 50-80%)
A data scientist should know and tell you about the deficiencies in the data and the results
Quick and dirty or thorough?
Never perfect!
What is good enough?
How does it affect results?
When presented with analytics results / visualizations, let people know you don’t trust any results if they don’t have good answers to questions like these:
▪ How was it cleaned?
▪ Data quality problems?
▪ Data discarded?
▪ How reliable are these results?
▪ How does this affect the results?
DATA IMPERFECTIONS
PRODUCT DATA
WHAT IS IT AND WHY IS IT A PROBLEM?
What is it?
▪ Data and specification on parts, substances, etc.
Why is it a problem?
▪ High requirements on data quality
▪ Errors and duplicates may be costly or even pose health risks
Even so, it is a mess
PRODUCT INFORMATION CLEANING AND ENRICHMENT
Proposed approach
▪ Given catalogue / database with data on products
▪ Gather data on the same products from websites (many more or less independent sources)
▪ Consolidate: match, merge and clean
One enriched description for each product
PILOT: BALL BEARINGS
1. GIVEN CATALOGUE / DATABASE WITH DATA ON PRODUCTS
2. GATHER DATA ON THE SAME PRODUCTS FROM WEBSITES; 3. CONSOLIDATE
Get product pages → Extract data → Consolidate (match, merge, clean)
PILOT EXPERIENCES: THE DIRT WE FOUND!!
DATA IMPERFECTIONS
CustID  Sales  Name
1234    6000   John
2345    5000   Mary
3456    12000  Bart
…       …      …
SELECT SUM(Sales)
FROM CustSales
Result: 3423000
▪ Data sample looks fine
▪ Result of analysis looks perfectly reasonable
▪ If you don’t look hard enough, if you don’t properly pay attention to it
  … you will be unaware
  … that you are possibly looking at significantly erroneous figures!!!
Other rows in CustSales:
CustID  Sales  Name
6789    2      Tom
4567    6000   Jon
5678    NULL   Nina
…       …      …
????
▪ Wrong values included
▪ Missing data
▪ Double counting
▪ etc.
Many more problems at value, record, schema, source, trust levels
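To see how quietly these problems hide behind a plausible total, here is a small sketch using Python's built-in `sqlite3` module and only the six rows shown on the slides (the real table, with its total of 3423000, is not available). The comments on which row has which problem are interpretations, not facts from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO CustSales VALUES (?, ?, ?)",
    [
        (1234, 6000, "John"),
        (2345, 5000, "Mary"),
        (3456, 12000, "Bart"),
        (6789, 2, "Tom"),       # suspiciously small: possibly a wrong value
        (4567, 6000, "Jon"),    # possibly the same customer as 1234: double counting
        (5678, None, "Nina"),   # missing value
    ],
)

# SUM silently skips NULL and happily adds wrong and duplicated values.
total = conn.execute("SELECT SUM(Sales) FROM CustSales").fetchone()[0]

# Simple follow-up checks that make the problems visible.
missing = conn.execute(
    "SELECT COUNT(*) FROM CustSales WHERE Sales IS NULL"
).fetchone()[0]
same_sales_pairs = conn.execute(
    "SELECT a.CustID, b.CustID FROM CustSales a JOIN CustSales b "
    "ON a.Sales = b.Sales AND a.CustID < b.CustID"
).fetchall()

print(total, missing, same_sales_pairs)
```

The query returns a number without any warning; only the extra checks reveal that one value is missing and that two customers with equal sales deserve a closer look.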
DATA INTEGRATION: SEMANTIC ENTITY LINKING
ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
Data integration very important in Data Science
▪ Example purpose: Data enrichment
Many names for the same problem
▪ Record linkage, Entity resolution, Entity linking, Semantic duplicates, Data coupling, Data fusion, etc.
CustID  Name        …
1234    P. Jansen   …
2345    P. Janssen  …
3456    P. Janssen  …
…       …           …
EmpNr   Name        …
6789    P. Jansen   …
4567    P. Jansen   …
5678    P. Janssen  …
…       …           …
?
They could all be the same person or all different persons!
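A name-only similarity match on these two tables illustrates why the question mark is justified. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a real matching tool; the threshold of 0.9 is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from itertools import product

customers = {1234: "P. Jansen", 2345: "P. Janssen", 3456: "P. Janssen"}
employees = {6789: "P. Jansen", 4567: "P. Jansen", 5678: "P. Janssen"}

THRESHOLD = 0.9  # assumed cut-off for "possibly the same person"


def name_similarity(a, b):
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


candidates = []
for (cust_id, cust_name), (emp_nr, emp_name) in product(customers.items(), employees.items()):
    score = name_similarity(cust_name, emp_name)
    if score >= THRESHOLD:
        candidates.append((cust_id, emp_nr, round(score, 2)))

print(len(candidates), "candidate matches out of", len(customers) * len(employees))
```

Every customer/employee pair passes the threshold, so the names alone cannot tell whether these rows describe one person or six. More evidence (address, date of birth, and so on) is needed to decide.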
CATEGORIZATION OF DATA QUALITY PROBLEMS
SOURCE: P. WOODALL, M. OBERHOFER, A. BOREK, “A CLASSIFICATION OF DATA QUALITY ASSESSMENT AND IMPROVEMENT METHODS”, JDIQ 3(4), 2014
Context independent, data perspective
▪ Spelling error
▪ Missing data
▪ Incorrect value
▪ Duplicate data
▪ Inconsistent data format
▪ Syntax violation
▪ Violation of integrity constraints
▪ Heterogeneity of measurement units
▪ Existence of synonyms and homonyms
Context independent, user perspective
▪ Information is inaccessible
▪ Information is insecure
▪ Information is hardly retrievable
▪ Information is difficult to aggregate
▪ Errors in the information transformation
Context dependent, data perspective
▪ Violation of domain constraints
▪ Violation of organization’s business rules
▪ Violation of company and government regulations
▪ Violation of constraints provided by the database administrator
Context dependent, user perspective
▪ The information is not based on fact
▪ Information is of doubtful credibility
▪ Information presents impartial view
▪ Information is irrelevant to the work
▪ Information is incomplete
▪ Information is compactly represented
▪ Information is hard to manipulate
▪ Information is hard to understand
▪ Information is outdated
DATA QUALITY DIMENSIONS AND METRICS
Data/information quality cannot be measured with one value
▪ A lot of research on describing desirable properties
▪ Examples: Accuracy, Correctness, Currency, Completeness, Relevance, Reliability, etc.
▪ Metrics: concrete means to estimate score on dimensions
Quite futile endeavor in my opinion
▪ 200+ such dimensions have been identified, even more possible metrics
▪ Little agreement
▪ Expensive, too complex
DATA QUALITY DEPENDS ON THE PURPOSE
“The data quality of a data set” is a meaningless notion
▪ Data quality depends on the purpose!
▪ What is good enough for one purpose may be insufficient for another
Alternative means of measurement
▪ Define purpose as a set of queries on the data
▪ Measure quality of the answers
▪ Purpose determines relevant dimensions & metrics
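One possible reading of "define purpose as a set of queries and measure the quality of the answers" is sketched below. The reference ("gold") data, the two example queries and the scoring function `answer_quality` are all assumptions for illustration; the talk does not prescribe a particular metric.

```python
# Hypothetical reference ("gold") data and a dirty version of the same data.
gold = [("BMW", 105), ("Mercedes", 106), ("Renault", 45)]
dirty = [("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72),
         ("Mercedes", 67), ("Mercedes-Benz", 39), ("Renault", 45)]


# Two purposes, each expressed as a query over (brand, sales) rows.
def total_sales(rows):
    return sum(sales for _, sales in rows)


def preferred_customers(rows):
    return {brand for brand, sales in rows if sales >= 100}


def answer_quality(query, data, reference):
    """Score in [0, 1]: Jaccard overlap for set answers, exact match otherwise."""
    expected, actual = query(reference), query(data)
    if isinstance(expected, set):
        union = expected | actual
        return 1.0 if not union else len(expected & actual) / len(union)
    return 1.0 if expected == actual else 0.0


for query in (total_sales, preferred_customers):
    print(query.__name__, answer_quality(query, dirty, gold))
```

The same dirty data is perfectly adequate for the total but useless for finding preferred customers, which is the point of the slide: quality is only meaningful relative to a purpose.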
SOME GENERAL MECHANISMS
STANDARDIZATION
Wikipedia: “Process of implementing and developing technical standards”
▪ Requires consensus
▪ Typically quite costly
▪ Benefits but also downsides (lack of variety, exceptions, slow evolution, defectors, etc.)
▪ In essence non-technical solution, but can be supported by technology
Useful but inherently imperfect itself
ERROR DETECTION BY VERIFICATION & PROFILING
One can do quite a bit of checking
▪ Profiling of domains: detects improper values
▪ Profiling keys: values in column unique, expected keys
▪ Verification of foreign key/primary key relationships
▪ Verification of constraints and business rules
▪ Verification of expected dependencies: inclusion, (conditional) functional dependencies. Detects missing / erroneous rows
▪ Matching with reference data
▪ etc.
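A few of these checks are easy to sketch in plain Python. The two tables, the two-letter country-code pattern and the "amount must be positive" rule below are hypothetical; they only serve to show one check per bullet.

```python
import re
from collections import Counter

# Hypothetical tables with one deliberate problem per check.
customers = [
    {"cust_id": 1234, "name": "John", "country": "NL"},
    {"cust_id": 2345, "name": "Mary", "country": "Netherlands"},  # improper domain value
    {"cust_id": 2345, "name": "Bart", "country": "DE"},           # duplicate key
]
orders = [
    {"order_id": 1, "cust_id": 1234, "amount": 60.0},
    {"order_id": 2, "cust_id": 9999, "amount": 12.5},   # dangling foreign key
    {"order_id": 3, "cust_id": 2345, "amount": -5.0},   # violates assumed business rule
]


def domain_violations(rows, column, pattern):
    """Rows whose value in `column` does not fully match the expected pattern."""
    regex = re.compile(pattern)
    return [r for r in rows if not regex.fullmatch(str(r[column]))]


def duplicate_keys(rows, column):
    """Key values that occur more than once."""
    counts = Counter(r[column] for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def dangling_foreign_keys(child, fk, parent, pk):
    """Child rows that reference a key missing from the parent table."""
    known = {r[pk] for r in parent}
    return [r for r in child if r[fk] not in known]


def rule_violations(rows, rule):
    """Rows for which a business rule (a predicate) does not hold."""
    return [r for r in rows if not rule(r)]


print(domain_violations(customers, "country", r"[A-Z]{2}"))
print(duplicate_keys(customers, "cust_id"))
print(dangling_foreign_keys(orders, "cust_id", customers, "cust_id"))
print(rule_violations(orders, lambda r: r["amount"] > 0))
```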
ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
What can be automatically discovered?
▪ Detection of outliers: possible erroneous values
▪ Similarity matching: possible duplicates
▪ Inclusion detection: possible inclusion dependencies ➠ possible missing / erroneous rows
▪ Dependency detection: possible (conditional) functional dependencies ➠ possible erroneous values, possible duplicates
▪ Prediction: machine learning to predict value based on other data ➠ possible erroneous values (e.g., categorizations)
There is much, much more ☺
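Two of these discovery techniques can be sketched with the standard library alone: checking a candidate functional dependency and flagging outliers. The outlier test uses the modified z-score based on the median absolute deviation, which is one common choice and not necessarily the one meant in the talk; the data and the threshold of 3.5 are assumptions.

```python
from collections import defaultdict
from statistics import median


def fd_violations(rows, lhs, rhs):
    """Values of `lhs` that map to more than one `rhs` value,
    i.e. violations of the candidate dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


def mad_outliers(values, threshold=3.5):
    """Values whose modified z-score (median absolute deviation) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical data: zip code should determine city; one sales value is tiny.
addresses = [
    {"zip": "7500", "city": "Enschede"},
    {"zip": "7500", "city": "Enschde"},   # possible spelling error
    {"zip": "7600", "city": "Almelo"},
]
sales = [6000, 5000, 12000, 2, 6000, 5500, 6500]

print(fd_violations(addresses, "zip", "city"))
print(mad_outliers(sales))
```

Both functions only produce candidates ("possible" errors). In this sample the value 12000 is flagged alongside 2, although it may well be correct.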
DATA CLEANING
Very useful but again inherently imperfect itself
▪ Manually / semi-automatic / automatic
▪ Techniques of previous slide can not only be used to discover errors, but also to automatically clean them; if sufficiently probable
  ▪ duplicate rows can be merged
  ▪ erroneous rows deleted
  ▪ erroneous values corrected
▪ Missing rows / values
  ▪ Data imputation: fill with predicted values
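The "if sufficiently probable" condition and the imputation step might be sketched like this. The threshold value, the scored candidate pairs and the use of the median as "predicted value" are assumptions; a real pipeline would use a trained predictor and a carefully chosen cut-off.

```python
from statistics import median

AUTO_MERGE_THRESHOLD = 0.95  # assumed cut-off for "sufficiently probable"


def triage_duplicates(candidate_pairs):
    """Split scored duplicate candidates into automatic merges and manual review."""
    auto, review = [], []
    for left, right, probability in candidate_pairs:
        target = auto if probability >= AUTO_MERGE_THRESHOLD else review
        target.append((left, right))
    return auto, review


def impute_missing(values):
    """Fill None with the median of the known values
    (a simple stand-in for a learned predictor)."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


# Hypothetical output of a duplicate detection tool: (id, id, probability).
auto, review = triage_duplicates([(1234, 4567, 0.97), (2345, 3456, 0.60)])
print(auto, review)
print(impute_missing([6000, 5000, None, 12000]))
```

Anything below the threshold stays a human decision, which reflects the manual / semi-automatic / automatic spectrum on the slide.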
MY RESEARCH
HOW DO WE HUMANS FIND OUR WAY IN THIS MESS?
▪ We have domain knowledge
▪ We know what to expect
▪ We know what is likely and what is not
▪ We compare with what we know and expect
▪ We doubt
▪ We (dis)trust
▪ We learn from others
▪ We reconsider
▪ etc.
THREE PRINCIPLES
Probabilities
Uncertainty
Metadata of context: source, situation, …
Evidence gathering
User feedback
Let me illustrate 2 & 3 with an example of combining data
EXAMPLE COMBINING DATA
OBJECTIVE: DETERMINE PREFERRED CUSTOMER (WITH SALES > 100)
Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776
Source table 1
Car brand                 Sales
B.M.W.                    25
Mercedes                  32
Renault                   10
Source table 2
Car brand                 Sales
BMW                       72
Mercedes-Benz             39
Renault                   20
Source table 3
Car brand                 Sales
Bayerische Motoren Werke  8
Mercedes                  35
Renault                   15
Combined table
Car brand                 Sales
B.M.W.                    25
Bayerische Motoren Werke  8
BMW                       72
Mercedes                  67
Mercedes-Benz             39
Renault                   45
VOILA! SEMANTIC DUPLICATES PROBLEM
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
Result: 0
‘No preferred customers’
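The combination step and the resulting "no preferred customers" answer can be reproduced in a few lines. The sketch assumes the three source tables are simply added up per literal brand name, which is what the combined table suggests; the variable names are illustrative.

```python
from collections import Counter

source_1 = {"B.M.W.": 25, "Mercedes": 32, "Renault": 10}
source_2 = {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20}
source_3 = {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15}

# Naive combination: sales are added only when brand names are spelled identically.
combined = Counter()
for source in (source_1, source_2, source_3):
    combined.update(source)

# Preferred customers: sales above 100.
preferred = {brand: sales for brand, sales in combined.items() if sales > 100}

for brand, sales in sorted(combined.items()):
    print(f"{brand:26}{sales}")
print("sum of preferred sales:", sum(preferred.values()))
```

Because "B.M.W.", "BMW" and "Bayerische Motoren Werke" are treated as three different brands, none of them crosses the limit. Note that in standard SQL a `SUM` over zero qualifying rows yields NULL rather than 0; the Python `sum` of an empty collection gives 0 as on the slide.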
SEMANTIC DUPLICATES
[Figure: the database table (columns Car brand, Sales, ω) with rows d1 … d6 (B.M.W. 25, Bayerische Motoren Werke 8, BMW 72, Mercedes 67, Mercedes-Benz 39, Renault 45), linked to objects o1 … o4 in the real world (of car brands).]
MOST DATA IMPERFECTIONS
CAN BE MODELED AS UNCERTAINTY IN DATA
Run some duplicate detection tool (similarity matching)
Row  Car brand                 Sales  Condition
1    B.M.W.                    25
2    Bayerische Motoren Werke  8
3    BMW                       72
4    Mercedes                  67     X=0
5    Mercedes-Benz             39     X=0
6    Renault                   45
     Mercedes                  106    X=1 Y=0
     Mercedes-Benz             106    X=1 Y=1
Variable  Meaning                       Probability
X=0       4 and 5 different             0.2
X=1       4 and 5 the same              0.8
Y=0       “Mercedes” correct name       0.5
Y=1       “Mercedes-Benz” correct name  0.5
B.M.W. / BMW / Bayerische Motoren Werke analogously
WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
▪ Looks like ordinary database
▪ Several “possible”, “most likely” or “approximate” answers to queries
▪ Important: Scalability (big data!)
Sales of “preferred customers”
▪ SELECT SUM(sales)
  FROM carsales
  WHERE sales ≥ 100
▪ Answer: 106 (most likely)
SUM(sales)  P
0           14%
105         6%
106         56%
211         24%
Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
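The answer distribution above can be reproduced by enumerating possible worlds. The probability 0.8 for the two Mercedes rows being the same comes from the earlier slide; the probability 0.3 for the three BMW rows being the same is an assumption, since the slides only say "analogously", and it is chosen here because it matches the published percentages. The sketch also simplifies by treating the BMW rows as either all merged or all separate: partial merges stay below 100 and so would not change the sum.

```python
from collections import defaultdict
from itertools import product

P_MERCEDES_SAME = 0.8  # X=1: rows 4 and 5 are the same brand (from the slides)
P_BMW_SAME = 0.3       # assumption, not stated on the slides


def qualifying_sales(mercedes_same, bmw_same):
    """SUM(sales) over rows with sales >= 100 in one possible world."""
    rows = [45]                                          # Renault
    rows += [67 + 39] if mercedes_same else [67, 39]     # Mercedes / Mercedes-Benz
    rows += [25 + 8 + 72] if bmw_same else [25, 8, 72]   # the three BMW spellings
    return sum(s for s in rows if s >= 100)


distribution = defaultdict(float)
for mercedes_same, bmw_same in product([True, False], repeat=2):
    p_mercedes = P_MERCEDES_SAME if mercedes_same else 1 - P_MERCEDES_SAME
    p_bmw = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
    distribution[qualifying_sales(mercedes_same, bmw_same)] += p_mercedes * p_bmw

for total, p in sorted(distribution.items()):
    print(total, f"{p:.0%}")
```

The name variable Y plays no role here because it changes which name is shown, not the sales figure. Enumerating worlds explicitly does not scale; probabilistic databases avoid it, which is why scalability is listed as important.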
Remember this slide?
DATA INTEGRATION: SEMANTIC ENTITY LINKING
The “P. Jansen” / “P. Janssen” rows in the CustID and EmpNr tables could all be the same person or all different persons!
DATA INTEGRATION THE PROBABILISTIC WAY
Let’s go for an initial integration that can readily and meaningfully be used
Let it improve during use
“Good is good enough”
Initial integration: Partial data integration → Enumerate cases for remaining problems → Store data with uncertainty in UDBMS
Continuous improvement: Use → Gather evidence → Improve data quality
TRUST AND DOUBT THE PROBABILISTIC WAY
User perceives (part of) source as of doubtful credibility
  We humans say “I don’t trust this data”
  Rows correct or not, so each with variable & probability
Sources contain conflicting data about (possibly) the same entities
  We humans say “I’m left with some doubt”
  We already saw this one: car brand example!
Automatic error discovery tool (AEDT) detects possible erroneous values or rows
  We humans say “I doubt that this is correct”
  Alternative rows, each with probability of correctness
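The first case, "rows correct or not, each with variable & probability", could be sketched as below. The rows, their probabilities and the independence between rows are assumptions for illustration; the function name `sum_distribution` is hypothetical.

```python
from itertools import product

# Hypothetical rows from a source of doubtful credibility:
# (sales, probability that the row is correct). Rows are assumed independent.
uncertain_rows = [(6000, 0.9), (5000, 0.9), (2, 0.3)]


def sum_distribution(rows):
    """Enumerate all possible worlds (each row present or absent)
    and accumulate the probability of every possible SUM."""
    dist = {}
    for world in product([True, False], repeat=len(rows)):
        p, total = 1.0, 0
        for present, (sales, prob) in zip(world, rows):
            p *= prob if present else 1 - prob
            total += sales if present else 0
        dist[total] = dist.get(total, 0.0) + p
    return dist


dist = sum_distribution(uncertain_rows)
expected = sum(total * p for total, p in dist.items())
most_likely = max(dist, key=dist.get)

print("expected SUM:", round(expected, 1))
print("most likely SUM:", most_likely)
```

Instead of deleting the distrusted rows or accepting them blindly, the doubt stays in the data and shows up in the answer.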
CONCLUSIONS
WHAT HAVE I TALKED ABOUT
Little story
▪ Peek into the life and frustrations of a data scientist
▪ Awareness for responsible analytics
Data imperfections: What a mess!
▪ Ball bearings pilot
▪ Models for data quality (problems)
Some general mechanisms: Still quite a mess!
▪ Standardization, error detection & discovery, cleaning
… and I talked about my own research
With this a data scientist can
▪ not trust certain parts of input data as much as others
▪ integrate data in a quick and dirty but responsible way
▪ information about data problems is *in* the data
▪ only solve data quality problems to the degree needed
▪ measure data uncertainty and quality
▪ documentation of data manipulation and its reasons
▪ know how DQ problems affect reliability of the results
(Francis Bacon, 1605)
(Jorge Luis Borges, 1979)
SOURCES OF IMAGES
▪ dripping clock: http://www.gemfive.com
▪ flower: http://cdn.tinybuddha.com
▪ big data: http://tr1.cbsistatic.com
▪ awareness: http://7minutesinthemorning.com

 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

  • 1. DATA QUALITY: THE DATA SCIENCE STRUGGLE NOBODY MENTIONS
       Maurice van Keulen, Associate Professor Data Management Technology, Head of Database Group
  • 2. WHAT IS THIS TALK ABOUT?
       - First, a little story
       - Data imperfections
       - Some general mechanisms
       - My research
  • 3. Process cycle diagram around a central “Data”: Domain understanding, Data understanding, Data preparation, Modeling (Analysis), Evaluation (Interpretation), Deployment (Use). Callout: “Where the magic happens”
  • 4. (slide without text)
  • 5. Research on pregnancy processes, based on Electronic Patient Dossiers (EPDs) of some population of women
  • 6. The start
       1. Data scientist is a parent him/herself
       2. Gather records from patients’ EPDs for the pregnancy period from multiple sources
       3. Fairly straightforward to identify consults, tests, scans, conditions
       4. Extract and store them
  • 7. First analysis and evaluation
       1. Analysis with process mining tool
       2. Interaction with visualization
       3. Interpret results
       4. (S)he already sees some interesting patterns
  • 8. But then (s)he notices … and realizes …
       - … a broken leg … dozens of specialists …
       - Assumption wrong: “all records that belong to a pregnant woman are related to pregnancy”
       - Too many records selected during preparation
       - No objective means to ascertain this: there is no field ‘related to pregnancy’
  • 9. … and a painstaking process starts
       - Specify complex filter rules
       - Inspect samples of (not) selected records
       - Repeat!
       Questions: Quick and dirty or thorough? Never perfect! What is good enough? How does it affect results?
       Callout: fatigue can show later; pregnancy related or not?
  • 10. Fast forward a bit … Re-perform analysis and evaluation
       1. Interaction with mining tool and visualization
       2. Something strange in the times of consults: a consult takes place after the blood test it prescribed???
  • 11. Realization: more cleaning
       1. Clinician contacted for explanation: notes taken during a consult are put in the EPD in the evenings!
       2. Modification of EPD record (what is recorded) ≠ actual moment of activity (semantics). Result: sequence and duration noise
       3. More data cleaning ensues
  • 12. Complex: many unexpected surprises in domain/data understanding
       Time-consuming: most time is spent on data preparation (up to 50-80%)
  • 13. A data scientist should know and tell you about the deficiencies in the data and the results
       Questions: Quick and dirty or thorough? Never perfect! What is good enough? How does it affect results?
  • 14. When presented with analytics results / visualizations, let people know you don’t trust any results if they don’t have good answers to questions like these:
       - How was it cleaned?
       - Data quality problems?
       - Data discarded?
       - How reliable are these results?
       - How does this affect the results?
  • 16. PRODUCT DATA: WHAT IS IT AND WHY IS IT A PROBLEM?
       What is it?
       - Data and specification on parts, substances, etc.
       Why is it a problem?
       - High requirements on data quality
       - Errors and duplicates may be costly or even pose health risks
       Even so, it is a mess
  • 17. PRODUCT INFORMATION CLEANING AND ENRICHMENT
       Proposed approach
       - Given catalogue / database with data on products
       - Gather data on the same products from websites (many more or less independent sources)
       - Consolidate: match, merge and clean
       Result: one enriched description for each product
  • 18. PILOT: BALL BEARINGS
       1. Given catalogue / database with data on products
  • 19. PILOT: BALL BEARINGS
       2. Gather data on the same products from websites
       3. Consolidate
       Steps shown: get product pages, extract data, consolidate (match, merge, clean)
  • 20. PILOT EXPERIENCES: THE DIRT WE FOUND!!
  • 21. DATA IMPERFECTIONS
       CustID  Sales  Name
       1234    6000   John
       2345    5000   Mary
       3456    12000  Bart
       …       …      …
       SELECT SUM(Sales) FROM CustSales
       Result: 3423000
       - Data sample looks fine
       - Result of analysis looks perfectly reasonable
       - If you don’t look hard enough, if you don’t properly pay attention to it … you will be unaware … that you are possibly looking at significantly erroneous figures!!!
  • 22. DATA IMPERFECTIONS
       Same sample, query and result (3423000) as on slide 21, but further down the table:
       CustID  Sales  Name
       6789    2      Tom
       4567    6000   Jon
       5678    NULL   Nina
       …       …      …
       ???? Wrong values included, missing data, double counting, etc.
       Many more problems at value, record, schema, source, trust levels
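The point of slides 21 and 22 can be tried out in a few lines. The following is a minimal sketch, assuming the six rows shown on the slides are the whole table (the slides' total of 3423000 covers many more rows that are not shown) and using an in-memory SQLite database purely for illustration; the helper names `load` and `summary` are hypothetical.

```python
import sqlite3

# The six sample rows from slides 21 and 22 (None stands for NULL).
rows = [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (3456, 12000, "Bart"),
    (6789, 2, "Tom"),
    (4567, 6000, "Jon"),
    (5678, None, "Nina"),
]

def load(rows):
    """Create an in-memory CustSales table holding the sample rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
    con.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", rows)
    return con

def summary(con):
    """Run the slide's SUM next to two counts that expose the missing value."""
    total, n_rows, n_with_sales = con.execute(
        "SELECT SUM(Sales), COUNT(*), COUNT(Sales) FROM CustSales"
    ).fetchone()
    return {"total": total, "rows": n_rows, "missing_sales": n_rows - n_with_sales}

print(summary(load(rows)))
```

SUM returns a number without any warning: the NULL is skipped, the suspiciously small value 2 is added, and a possible John/Jon duplicate is counted twice. Only the extra counts hint that something is off.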
  • 23. DATA INTEGRATION: SEMANTIC ENTITY LINKING. ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
       Data integration is very important in Data Science
       - Example purpose: data enrichment
       Many names for the same problem
       - Record linkage, Entity resolution, Entity linking, Semantic duplicates, Data coupling, Data fusion, etc.
       CustID  Name        …
       1234    P. Jansen   …
       2345    P. Janssen  …
       3456    P. Janssen  …
       …       …           …

       EmpNr   Name        …
       6789    P. Jansen   …
       4567    P. Jansen   …
       5678    P. Janssen  …
       …       …           …
       ? They could all be the same person or all different persons!
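To see why slide 23 calls this hard, here is a minimal sketch of naive similarity matching on the two name columns. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a real matching tool; the threshold of 0.9 and the function names are assumptions for illustration only.

```python
from difflib import SequenceMatcher

# Name columns from the two tables on slide 23.
customers = {1234: "P. Jansen", 2345: "P. Janssen", 3456: "P. Janssen"}
employees = {6789: "P. Jansen", 4567: "P. Jansen", 5678: "P. Janssen"}

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_links(left, right, threshold=0.9):
    """Return every (left id, right id, score) pair scoring at or above the threshold."""
    return [
        (lid, rid, round(similarity(lname, rname), 3))
        for lid, lname in left.items()
        for rid, rname in right.items()
        if similarity(lname, rname) >= threshold
    ]

for link in candidate_links(customers, employees):
    print(link)
```

Every customer matches every employee at this threshold, so string similarity alone cannot tell whether these are one person or several. That is the ambiguity the slide points at.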
  • 24. CATEGORIZATION OF DATA QUALITY PROBLEMS
       Context independent, data perspective: spelling error; missing data; incorrect value; duplicate data; inconsistent data format; syntax violation; violation of integrity constraints; heterogeneity of measurement units; existence of synonyms and homonyms
       Context independent, user perspective: information is inaccessible; information is insecure; information is hardly retrievable; information is difficult to aggregate; errors in the information transformation
       Context dependent, data perspective: violation of domain constraints; violation of organization’s business rules; violation of company and government regulations; violation of constraints provided by the database administrator
       Context dependent, user perspective: the information is not based on fact; information is of doubtful credibility; information presents impartial view; information is irrelevant to the work; information is incomplete; information is compactly represented; information is hard to manipulate; information is hard to understand; information is outdated
       Source: P. Woodall, M. Oberhofer, A. Borek, “A classification of data quality assessment and improvement methods”, JDIQ 3(4), 2014
  • 25. DATA QUALITY DIMENSIONS AND METRICS
       Data/information quality cannot be measured with one value
       - A lot of research on describing desirable properties
       - Examples: Accuracy, Correctness, Currency, Completeness, Relevance, Reliability, etc.
       - Metrics: concrete means to estimate the score on dimensions
       Quite a futile endeavor in my opinion
       - 200+ such dimensions have been identified, even more possible metrics
       - Little agreement
       - Expensive, too complex
  • 26. DATA QUALITY DEPENDS ON THE PURPOSE
       “The data quality of a data set” is a meaningless notion
       - Data quality depends on the purpose!
       - What is good enough for one purpose may be insufficient for another
       Alternative means of measurement
       - Define purpose as a set of queries on the data
       - Measure quality of the answers
       - Purpose determines relevant dimensions & metrics
  • 28. STANDARDIZATION
       Wikipedia: “Process of implementing and developing technical standards”
       - Requires consensus
       - Typically quite costly
       - Benefits but also downsides (lack of variety, exceptions, slow evolution, defectors, etc.)
       - In essence a non-technical solution, but can be supported by technology
       Useful but inherently imperfect itself
  • 29. ERROR DETECTION BY VERIFICATION & PROFILING
       One can do quite a bit of checking
       - Profiling of domains: detects improper values
       - Profiling keys: values in column unique, expected keys
       - Verification of foreign key/primary key relationships
       - Verification of constraints and business rules
       - Verification of expected dependencies: inclusion, (conditional) functional dependencies. Detects missing / erroneous rows
       - Matching with reference data
       - etc.
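Three of the checks listed on slide 29 can be sketched in plain Python. The tables, column names and the "sales must be non-negative" rule below are made up for illustration and are not taken from the talk.

```python
from collections import Counter

# Hypothetical sample data with one planted problem per check.
customers = [
    {"CustID": 1234, "Name": "John"},
    {"CustID": 2345, "Name": "Mary"},
    {"CustID": 2345, "Name": "Bart"},    # duplicate key value
]
sales = [
    {"CustID": 1234, "Sales": 6000},
    {"CustID": 9999, "Sales": 5000},     # refers to a customer that does not exist
    {"CustID": 2345, "Sales": -12},      # outside the expected domain
]

def duplicate_keys(rows, key):
    """Key profiling: values that occur more than once in a supposed key column."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def dangling_references(child_rows, fk, parent_rows, pk):
    """Foreign key verification: child rows whose reference has no parent row."""
    parent_keys = {r[pk] for r in parent_rows}
    return [r for r in child_rows if r[fk] not in parent_keys]

def domain_violations(rows, column, is_valid):
    """Domain profiling: rows whose value fails the given validity rule."""
    return [r for r in rows if not is_valid(r[column])]

print(duplicate_keys(customers, "CustID"))
print(dangling_references(sales, "CustID", customers, "CustID"))
print(domain_violations(sales, "Sales", lambda v: v >= 0))
```

Each check only verifies an expectation that somebody wrote down beforehand. Problems nobody anticipated pass unnoticed.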
  • 30. ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
       What can be automatically discovered?
       - Detection of outliers: possible erroneous values
       - Similarity matching: possible duplicates
       - Inclusion detection: possible inclusion dependencies ➠ possible missing / erroneous rows
       - Dependency detection: possible (conditional) functional dependencies ➠ possible erroneous values, possible duplicates
       - Prediction: machine learning to predict a value based on other data ➠ possible erroneous values (e.g., categorizations)
       There is much, much more
  • 31. ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
       Same content as slide 30, with the conclusion: very useful but again inherently imperfect itself
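As an illustration of the first item on slides 30 and 31, the sketch below flags possible outliers with a modified z-score based on the median absolute deviation. This specific technique, the 0.6745 scaling constant and the 3.5 cut-off are a common textbook choice picked for this example; the talk does not prescribe a method. The input is the five non-NULL sales values from slide 22.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score (based on the MAD) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

sales = [6000, 5000, 12000, 2, 6000]
print(mad_outliers(sales))
```

On this sample the check flags both 2 and 12000. Nothing on the slides marks 12000 as an error, so the second flag may well be a false alarm. This is what "possible erroneous values" and "inherently imperfect" mean in practice.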
  • 32. DATA CLEANING
       - Manual / semi-automatic / automatic
       - The techniques of the previous slide can be used not only to discover errors, but also to clean them automatically. If sufficiently probable:
           - duplicate rows can be merged
           - erroneous rows deleted
           - erroneous values corrected
       - Missing rows / values
           - Data imputation: fill with predicted values
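The imputation item on slide 32 can be sketched as follows. Filling with the median of the observed values is just one simple predictor chosen for this example, and the `imputed` flag is a hypothetical addition to keep the manipulation visible in the data.

```python
from statistics import median

def impute_median(rows, column):
    """Fill missing values in `column` with the median of the observed ones, flagging each fill."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [
        {**r, column: fill, "imputed": True} if r[column] is None else {**r, "imputed": False}
        for r in rows
    ]

rows = [
    {"Name": "John", "Sales": 6000},
    {"Name": "Mary", "Sales": 5000},
    {"Name": "Bart", "Sales": 12000},
    {"Name": "Nina", "Sales": None},
]
for row in impute_median(rows, "Sales"):
    print(row)
```

Keeping the flag means a later analysis can still tell measured values from predicted ones, in line with the talk's call to document how the data was cleaned.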
  • 33. MY RESEARCH
       How do we humans find our way in this mess?
  • 34. HOW DO WE HUMANS FIND OUR WAY IN THIS MESS?
       - We have domain knowledge
       - We know what to expect
       - We know what is likely and what is not
       - We compare with what we know and expect
       - We doubt
       - We (dis)trust
       - We learn from others
       - We reconsider
       - etc.
       Corresponding concepts: probabilities; uncertainty; metadata of context (source, situation, …); evidence gathering; user feedback
  • 35. THREE PRINCIPLES
       Let me illustrate 2 & 3 with an example of combining data
  • 36. EXAMPLE COMBINING DATA. OBJECTIVE: DETERMINE PREFERRED CUSTOMER (WITH SALES > 100)
       Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776
       First table:
       Car brand                 Sales
       B.M.W.                    25
       Mercedes                  32
       Renault                   10
       Second table:
       Car brand                 Sales
       BMW                       72
       Mercedes-Benz             39
       Renault                   20
       Third table:
       Car brand                 Sales
       Bayerische Motoren Werke  8
       Mercedes                  35
       Renault                   15
       Combined table:
       Car brand                 Sales
       B.M.W.                    25
       Bayerische Motoren Werke  8
       BMW                       72
       Mercedes                  67
       Mercedes-Benz             39
       Renault                   45
  • 37. VOILA! SEMANTIC DUPLICATES PROBLEM
       Car brand                 Sales
       B.M.W.                    25
       Bayerische Motoren Werke  8
       BMW                       72
       Mercedes                  67
       Mercedes-Benz             39
       Renault                   45
       Preferred customers …
       SELECT SUM(Sales) FROM CarSales WHERE Sales>100
       Result: 0, ‘No preferred customers’
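The combination step of slides 36 and 37 can be reproduced with a short script. This is a sketch under assumptions: the three source tables are merged by exact brand string, an in-memory SQLite database stands in for whatever system was actually used, and the table name `SourceSales` and both function names are invented for the example. Note that SQLite returns NULL rather than 0 for a SUM over zero rows, so the sketch maps NULL to 0 to match the slide.

```python
import sqlite3

# The three source tables from slide 36.
sources = [
    [("B.M.W.", 25), ("Mercedes", 32), ("Renault", 10)],
    [("BMW", 72), ("Mercedes-Benz", 39), ("Renault", 20)],
    [("Bayerische Motoren Werke", 8), ("Mercedes", 35), ("Renault", 15)],
]

def combined_sales():
    """Load all sources and combine them by exact brand name into CarSales."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SourceSales (brand TEXT, sales INTEGER)")
    for rows in sources:
        con.executemany("INSERT INTO SourceSales VALUES (?, ?)", rows)
    con.execute(
        "CREATE TABLE CarSales AS "
        "SELECT brand, SUM(sales) AS sales FROM SourceSales GROUP BY brand"
    )
    return con

def preferred_total(con):
    """Total sales of preferred customers (sales > 100); NULL is mapped to 0."""
    (total,) = con.execute("SELECT SUM(sales) FROM CarSales WHERE sales > 100").fetchone()
    return total or 0

con = combined_sales()
print(dict(con.execute("SELECT brand, sales FROM CarSales")))
print(preferred_total(con))
```

Exact matching merges the Renault rows and the two "Mercedes" rows but keeps the three BMW spellings and "Mercedes-Benz" apart, so no row exceeds 100.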
  • 38. SEMANTIC DUPLICATES
       [Figure: the six database rows d1 to d6 (B.M.W. 25, Bayerische Motoren Werke 8, BMW 72, Mercedes 67, Mercedes-Benz 39, Renault 45) shown next to objects o1 to o4 in the real world ω of car brands]
  • 39. MOST DATA IMPERFECTIONS CAN BE MODELED AS UNCERTAINTY IN DATA
       Run some duplicate detection tool (similarity matching)
       #  Car brand                 Sales  Condition
       1  B.M.W.                    25
       2  Bayerische Motoren Werke  8
       3  BMW                       72
       4  Mercedes                  67     X=0
       5  Mercedes-Benz             39     X=0
       6  Renault                   45
          Mercedes                  106    X=1, Y=0
          Mercedes-Benz             106    X=1, Y=1

       Variable  Meaning                        Probability
       X=0       4 and 5 different              0.2
       X=1       4 and 5 the same               0.8
       Y=0       “Mercedes” correct name        0.5
       Y=1       “Mercedes-Benz” correct name   0.5
       B.M.W. / BMW / Bayerische Motoren Werke analogously
  • 40. WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
       - Looks like an ordinary database
       - Several “possible”, “most likely” or “approximate” answers to queries
       - Important: scalability (big data!)
       Sales of “preferred customers”
       - SELECT SUM(sales) FROM carsales WHERE sales ≥ 100
       - Answer: 106 (most likely)
       SUM(sales)  P
       0           14%
       105         6%
       106         56%
       211         24%
       Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
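The answer table on slide 40 can be reproduced by enumerating possible worlds. The sketch below makes two assumptions that are not stated on the slides: the three BMW spellings are treated as either all the same brand or all different, and the probability that they are the same is taken to be 0.3, the value that reproduces the slide's percentages. The 0.8 for Mercedes is from slide 39. Partial BMW merges are ignored because none of them reaches 100, so they would not change this query's answer.

```python
from itertools import product
from collections import defaultdict

# Uncertain duplicate groups: the rows involved and the probability that they are one brand.
groups = {
    "Mercedes": {"parts": [67, 39], "p_same": 0.8},   # from slide 39
    "BMW": {"parts": [25, 8, 72], "p_same": 0.3},     # assumption, see lead-in
}
certain = {"Renault": 45}
THRESHOLD = 100

def preferred_sum(sales_values):
    """SUM(sales) over rows with sales >= THRESHOLD; 0 when no row qualifies."""
    return sum(v for v in sales_values if v >= THRESHOLD)

def answer_distribution():
    """Enumerate all possible worlds and add up the probability of each query answer."""
    dist = defaultdict(float)
    names = list(groups)
    for choice in product([True, False], repeat=len(names)):
        prob = 1.0
        rows = list(certain.values())
        for name, same in zip(names, choice):
            group = groups[name]
            prob *= group["p_same"] if same else 1 - group["p_same"]
            rows += [sum(group["parts"])] if same else group["parts"]
        dist[preferred_sum(rows)] += prob
    return {answer: round(p, 2) for answer, p in sorted(dist.items())}

print(answer_distribution())
```

Enumeration is exponential in the number of uncertain decisions and only works for toy cases like this one, which is why the slide stresses scalability for a real probabilistic database.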
  • 41. DATA INTEGRATION: SEMANTIC ENTITY LINKING
       Remember this slide? (Repeat of slide 23.)
  • 42. DATA INTEGRATION THE PROBABILISTIC WAY
       Let’s go for an initial integration that can readily and meaningfully be used
       Let it improve during use
       “Good is good enough”
       Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS
       Continuous improvement: use; gather evidence; improve data quality
  • 43. TRUST AND DOUBT THE PROBABILISTIC WAY
       - User perceives (part of) a source as of doubtful credibility. We humans say “I don’t trust this data”. Rows are correct or not, so each gets a variable & probability
       - Sources contain conflicting data about (possibly) the same entities. We humans say “I’m left with some doubt”. We already saw this one: the car brand example!
       - Automatic error discovery tool (AEDT) detects possible erroneous values or rows. We humans say “I doubt that this is correct”. Alternative rows, each with probability of correctness
  • 45. WHAT HAVE I TALKED ABOUT
       Little story
       - Peek into the life and frustrations of a data scientist
       - Awareness for responsible analytics
       Data imperfections: What a mess!
       - Ball bearings pilot
       - Models for data quality (problems)
       Some general mechanisms: Still quite a mess!
       - Standardization, error detection & discovery, cleaning
       … and I talked about my own research
  • 46. With this a data scientist can
       - not trust certain parts of input data as much as others
       - integrate data in a quick and dirty but responsible way
       - information about data problems is *in* the data
       - only solve data quality problems to the degree needed
       - measure data uncertainty and quality
       - documentation of data manipulation and its reasons
       - know how DQ problems affect reliability of the results
  • 47. (Francis Bacon, 1605) (Jorge Luis Borges, 1979)
  • 48. SOURCES OF IMAGES
       - dripping clock: http://www.gemfive.com
       - flower: http://cdn.tinybuddha.com
       - big data: http://tr1.cbsistatic.com
       - awareness: http://7minutesinthemorning.com

Editor's Notes

  1. Or is CRISP the waterfall model of data science? The story is hypothetical but in part based on true events
  2. Wrong diagnosis
  3. Data scientists confirm that they spend half or even three quarters of their time on data preparation! Bio-informatics professor: more than half of the 4 years of a PhD is spent on ‘data fiddling’
  4. This is why I titled my talk “Data quality: the data science struggle nobody mentions” … we should talk about it more!
  5. Mention DUO
  6. Notice that all these are “tables”. COUNT(*) after Mercedes is 5 or 6, and after BMW it is 3, 4, 5 or 6.
  7. TODO: make this slide a bit more explicit / concrete
  8. Naturally we only do this in cases where there is sufficient probability for ambiguities / problems.
  9. Not as simple as I thought … still seems doable with a bit of string matching …
  10. Mention XML here