Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

Presentation about data quality at the second Data Science MeetUp Twente https://www.meetup.com/Data-Meetup-Twente/events/241545781/ on "Responsible Data Analytics", 7 Sep 2017.


  1. DATA QUALITY: THE DATA SCIENCE STRUGGLE NOBODY MENTIONS
     Maurice van Keulen, Associate Professor, Data Management Technology, Head of Database Group
  2. WHAT IS THIS TALK ABOUT?
     • First, a little story
     • Data imperfections
     • Some general mechanisms
     • My research
  3. Diagram labels: Domain understanding; Data understanding; Data preparation; Modeling (Analysis); Deployment (Use); Data; Evaluation (Interpretation). Callout: Where the magic happens
  4. [Image only]
  5. Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women
  6. The start
     1. Data scientist is a parent him/herself
     2. Gather records from patients' EPDs for the pregnancy period from multiple sources
     3. Fairly straightforward to identify consults, tests, scans, conditions
     4. Extract and store them
  7. First analysis and evaluation
     1. Analysis with process mining tool
     2. Interaction with visualization
     3. Interpret results
     4. (S)he already sees some interesting patterns
  8. But then (s)he notices … and realizes …
     • … a broken leg … dozens of specialists …
     • Assumption wrong: “all records that belong to a pregnant woman are related to pregnancy”
     • Too many records selected during preparation
     • No objective means to ascertain this: no field ‘related to pregnancy’
  9. and a painstaking process starts
     • Specify complex filter rules
     • Inspect samples of (not) selected records
     • Repeat!
     Quick and dirty or thorough? Never perfect! What is good enough? How does it affect results?
     Callouts: fatigue; can show later; pregnancy related or not
  10. Fast forward a bit … Re-perform analysis and evaluation
      1. Interaction with mining tool and visualization
      2. Something strange in the times of consults: a consult after the blood test it prescribed???
  11. Realization: more cleaning
      1. Clinician contacted for explanation: notes taken during a consult are put in the EPD in the evenings!
      2. Modification of EPD record (what is recorded) ≠ actual moment of activity (semantics): sequence and duration noise
      3. More data cleaning ensues
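The ordering anomaly in the story (a consult recorded after the blood test it prescribed) is the kind of thing a simple check can surface. The following is a minimal sketch only: the event records, the `prescribed_by` link and the timestamps are all hypothetical and are not taken from the talk.

```python
from datetime import datetime

# Hypothetical EPD events; field names and values are illustrative only.
# "recorded" is the moment the record was modified, not the moment of the activity.
events = [
    {"id": "c1", "kind": "consult", "recorded": datetime(2017, 3, 1, 20, 15), "prescribed_by": None},
    {"id": "t1", "kind": "blood test", "recorded": datetime(2017, 3, 1, 11, 30), "prescribed_by": "c1"},
    {"id": "c2", "kind": "consult", "recorded": datetime(2017, 3, 8, 9, 0), "prescribed_by": None},
    {"id": "t2", "kind": "scan", "recorded": datetime(2017, 3, 8, 10, 0), "prescribed_by": "c2"},
]


def order_violations(events):
    """Return (event id, prescribing event id) pairs where the prescribed
    activity is recorded before the activity that prescribed it."""
    by_id = {e["id"]: e for e in events}
    violations = []
    for e in events:
        parent = by_id.get(e["prescribed_by"])
        if parent is not None and e["recorded"] < parent["recorded"]:
            violations.append((e["id"], parent["id"]))
    return violations


print(order_violations(events))
```

A flagged pair does not say which timestamp is wrong; as in the story, that still requires asking someone who knows how the records are made.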
  12. Complex: many unexpected surprises in domain/data understanding
      Time-consuming: most time spent on data preparation (up to 50-80%)
  13. A data scientist should know and tell you about the deficiencies in the data and the results
      Quick and dirty or thorough? Never perfect! What is good enough? How does it affect results?
  14. When presented with analytics results / visualizations, let people know you don't trust any results if they don't have good answers to questions like these:
      • How was it cleaned?
      • Data quality problems?
      • Data discarded?
      • How reliable are these results?
      • How does this affect the results?
  15. DATA IMPERFECTIONS
  16. PRODUCT DATA: WHAT IS IT AND WHY IS IT A PROBLEM?
      What is it
      • Data and specifications on parts, substances, etc.
      Why is it a problem?
      • High requirements on data quality
      • Errors and duplicates may be costly or even pose health risks
      Even so, it is a mess
  17. PRODUCT INFORMATION CLEANING AND ENRICHMENT
      Proposed approach
      • Given catalogue / database with data on products
      • Gather data on the same products from websites (many more or less independent sources)
      • Consolidate: match, merge and clean
      One enriched description for each product
  18. PILOT: BALL BEARINGS
      1. Given catalogue / database with data on products
  19. PILOT: BALL BEARINGS
      2. Gather data on the same products from websites; 3. Consolidate
      Steps: get product pages; extract data; consolidate (match, merge, clean)
  20. PILOT EXPERIENCES: THE DIRT WE FOUND!!
  21. DATA IMPERFECTIONS
      CustID  Sales  Name
      1234    6000   John
      2345    5000   Mary
      3456    12000  Bart
      …       …      …
      SELECT SUM(Sales) FROM CustSales
      Result: 3423000
      • Data sample looks fine
      • Result of analysis looks perfectly reasonable
      • If you don't look hard enough, if you don't properly pay attention to it … you will be unaware … that you are possibly looking at significantly erroneous figures!!!
  22. DATA IMPERFECTIONS
      CustID  Sales  Name
      1234    6000   John
      2345    5000   Mary
      3456    12000  Bart
      …       …      …
      SELECT SUM(Sales) FROM CustSales
      Result: 3423000 ????
      Further down in the same table:
      CustID  Sales  Name
      6789    2      Tom
      4567    6000   Jon
      5678    NULL   Nina
      …       …      …
      • Wrong values included
      • Missing data
      • Double counting
      • etc.
      Many more problems at value, record, schema, source, trust levels
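To make the point of the two slides above concrete, here is a small sketch that loads only the six rows shown on the slides into an in-memory SQLite table and runs the same aggregate next to a few basic checks. The plausibility threshold `MIN_PLAUSIBLE` is an assumption added for illustration; the slides do not define one.

```python
import sqlite3

# The six rows shown on the slides; None stands for SQL NULL.
rows = [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (3456, 12000, "Bart"),
    (6789, 2, "Tom"),
    (4567, 6000, "Jon"),
    (5678, None, "Nina"),
]
MIN_PLAUSIBLE = 100  # hypothetical lower bound for a believable sales figure

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
conn.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", rows)

# SUM silently skips NULL and happily adds up suspicious values.
total = conn.execute("SELECT SUM(Sales) FROM CustSales").fetchone()[0]

# COUNT(*) counts rows, COUNT(Sales) counts only non-NULL values.
n_rows, n_missing = conn.execute(
    "SELECT COUNT(*), COUNT(*) - COUNT(Sales) FROM CustSales"
).fetchone()

# Values below the assumed threshold are candidates for wrong values.
too_small = conn.execute(
    "SELECT CustID, Sales FROM CustSales WHERE Sales < ?", (MIN_PLAUSIBLE,)
).fetchall()

print("SUM(Sales):", total)
print("rows:", n_rows, "missing Sales:", n_missing)
print("suspiciously small:", too_small)
```

The aggregate alone gives no hint that one value is missing and another is implausibly small; only the extra queries reveal it. Whether "John" and "Jon" are double counted cannot be decided by such checks at all.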
  23. DATA INTEGRATION: SEMANTIC ENTITY LINKING, ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
      Data integration very important in Data Science
      • Example purpose: Data enrichment
      Many names for the same problem
      • Record linkage, Entity resolution, Entity linking, Semantic duplicates, Data coupling, Data fusion, etc.
      CustID  Name        …
      1234    P. Jansen   …
      2345    P. Janssen  …
      3456    P. Janssen  …
      …       …           …
      EmpNr   Name        …
      6789    P. Jansen   …
      4567    P. Jansen   …
      5678    P. Janssen  …
      …       …           …
      ? They could all be the same person or all different persons!
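A quick sketch of why name matching alone cannot settle the question on the slide above. It compares every customer name with every employee name using `difflib.SequenceMatcher` from the Python standard library; the choice of this similarity measure and the `THRESHOLD` value are assumptions for illustration, not the method used in the talk.

```python
from difflib import SequenceMatcher
from itertools import product

# Names as shown on the slide.
customers = [(1234, "P. Jansen"), (2345, "P. Janssen"), (3456, "P. Janssen")]
employees = [(6789, "P. Jansen"), (4567, "P. Jansen"), (5678, "P. Janssen")]
THRESHOLD = 0.9  # hypothetical cut-off for "possibly the same person"


def similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Every customer/employee pair whose names are similar enough.
candidates = [
    (cust_id, emp_nr, round(similarity(c_name, e_name), 3))
    for (cust_id, c_name), (emp_nr, e_name) in product(customers, employees)
    if similarity(c_name, e_name) >= THRESHOLD
]

for cust_id, emp_nr, score in candidates:
    print(cust_id, emp_nr, score)
```

With this data every one of the nine pairs passes the threshold, which is exactly the situation the slide describes: the names are compatible with all records being one person and with all of them being different persons.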
  24. CATEGORIZATION OF DATA QUALITY PROBLEMS
      Context independent, data perspective: Spelling error; Missing data; Incorrect value; Duplicate data; Inconsistent data format; Syntax violation; Violation of integrity constraints; Heterogeneity of measurement units; Existence of synonyms and homonyms
      Context independent, user perspective: Information is inaccessible; Information is insecure; Information is hardly retrievable; Information is difficult to aggregate; Errors in the information transformation
      Context dependent, data perspective: Violation of domain constraints; Violation of organization’s business rules; Violation of company and government regulations; Violation of constraints provided by the database administrator
      Context dependent, user perspective: The information is not based on fact; Information is of doubtful credibility; Information presents impartial view; Information is irrelevant to the work; Information is incomplete; Information is compactly represented; Information is hard to manipulate; Information is hard to understand; Information is outdated
      Source: P. Woodall, M. Oberhofer, A. Borek, “A Classification of Data Quality Assessment and Improvement Methods”, JDIQ 3(4), 2014
  25. DATA QUALITY DIMENSIONS AND METRICS
      Data/information quality cannot be measured with one value
      • Lot of research on describing desirable properties
      • Examples: Accuracy, Correctness, Currency, Completeness, Relevance, Reliability, etc.
      • Metrics: concrete means to estimate the score on dimensions
      Quite futile endeavor in my opinion
      • 200+ such dimensions have been identified, even more possible metrics
      • Little agreement
      • Expensive, too complex
  26. DATA QUALITY DEPENDS ON THE PURPOSE
      “The data quality of a data set” is a meaningless notion
      • Data quality depends on the purpose!
      • What is good enough for one purpose may be insufficient for another
      Alternative means of measurement
      • Define purpose as a set of queries on the data
      • Measure quality of the answers
      • Purpose determines relevant dimensions & metrics
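One possible reading of "define purpose as a set of queries and measure the quality of the answers" is sketched below. It is an illustration under stated assumptions, not the speaker's method: the purpose is two hypothetical queries, quality is taken as the relative error of each answer against a hand-cleaned reference, and the reference values themselves are invented for the example.

```python
# Dirty rows as on the earlier DATA IMPERFECTIONS slide: (CustID, Sales, Name).
dirty = [
    (1234, 6000, "John"), (2345, 5000, "Mary"), (3456, 12000, "Bart"),
    (6789, 2, "Tom"), (4567, 6000, "Jon"), (5678, None, "Nina"),
]
# Hypothetical hand-cleaned reference: Jon merged into John, Tom and Nina corrected.
reference = [
    (1234, 6000, "John"), (2345, 5000, "Mary"), (3456, 12000, "Bart"),
    (6789, 2000, "Tom"), (5678, 3000, "Nina"),
]

# The purpose, expressed as named queries over the rows (both are assumptions).
purpose = {
    "total_sales": lambda rows: sum(s for _, s, _ in rows if s is not None),
    "customer_count": lambda rows: len(rows),
}


def relative_error(answer, truth):
    """Relative deviation of an answer from the reference answer."""
    return abs(answer - truth) / abs(truth)


errors = {
    name: relative_error(query(dirty), query(reference))
    for name, query in purpose.items()
}

for name, err in errors.items():
    print(f"{name}: {err:.1%} off")
```

The same dirty table is only a few percent off for one query and far worse for the other, which is the sense in which quality depends on the purpose.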
  27. SOME GENERAL MECHANISMS
  28. STANDARDIZATION
      Wikipedia: “Process of implementing and developing technical standards”
      • Requires consensus
      • Typically quite costly
      • Benefits but also downsides (lack of variety, exceptions, slow evolution, defectors, etc.)
      • In essence non-technical solution, but can be supported by technology
      Useful but inherently imperfect itself
  29. ERROR DETECTION BY VERIFICATION & PROFILING
      One can do quite a bit of checking
      • Profiling of domains: detects improper values
      • Profiling keys: values in column unique, expected keys
      • Verification of foreign key/primary key relationships
      • Verification of constraints and business rules
      • Verification of expected dependencies: inclusion, (conditional) functional dependencies. Detects missing / erroneous rows
      • Matching with reference data
      • etc.
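Three of the checks listed above (domain profiling, key profiling, foreign key verification) can be sketched in a few lines. The tables, column names and the non-negative-sales rule are hypothetical; real profiling tools do considerably more.

```python
from collections import Counter

# Hypothetical tables; CustID is expected to be a key of customers.
customers = [
    {"CustID": 1234, "Sales": 6000},
    {"CustID": 2345, "Sales": 5000},
    {"CustID": 2345, "Sales": -10},
]
orders = [
    {"OrderID": 1, "CustID": 1234},
    {"OrderID": 2, "CustID": 9999},
]


def duplicate_keys(rows, key):
    """Key profiling: values that occur more than once in an expected key column."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def domain_violations(rows, column, is_valid):
    """Domain profiling: rows whose value falls outside the allowed domain."""
    return [r for r in rows if not is_valid(r[column])]


def dangling_foreign_keys(child_rows, fk, parent_rows, pk):
    """Foreign key verification: child rows that reference no parent row."""
    parent_keys = {r[pk] for r in parent_rows}
    return [r for r in child_rows if r[fk] not in parent_keys]


print(duplicate_keys(customers, "CustID"))
print(domain_violations(customers, "Sales", lambda v: v >= 0))
print(dangling_foreign_keys(orders, "CustID", customers, "CustID"))
```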
  30. ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
      What can be automatically discovered?
      • Detection of outliers: possible erroneous values
      • Similarity matching: possible duplicates
      • Inclusion detection: possible inclusion dependencies ➠ possible missing / erroneous rows
      • Dependency detection: possible (conditional) functional dependencies ➠ possible erroneous values, possible duplicates
      • Prediction: machine learning to predict value based on other data ➠ possible erroneous values (e.g., categorizations)
      There is much much more
  31. ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY (same content as slide 30, with an added conclusion)
      Very useful but again inherently imperfect itself
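As an illustration of the first item, outlier detection, here is a minimal sketch using the median absolute deviation (MAD) and the commonly used modified z-score. The choice of technique, the cut-off `CUTOFF` and the last two sales values are assumptions; the slide does not name a specific method.

```python
from statistics import median

# Sales values from the earlier example, plus two hypothetical ones (5500, 7000).
sales = [6000, 5000, 12000, 2, 6000, 5500, 7000]
CUTOFF = 3.5  # conventional cut-off for the modified z-score


def mad_outliers(values, cutoff=CUTOFF):
    """Return the values whose modified z-score exceeds the cut-off."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median; this score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]


print(mad_outliers(sales))
```

On this data the check flags both the implausible value 2 and the perfectly legitimate 12000, a small demonstration of why such discovery yields only possible errors and is "inherently imperfect itself".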
  32. DATA CLEANING
      • Manually / semi-automatic / automatic
      • Techniques of previous slide can not only be used to discover errors, but also to automatically clean them, if sufficiently probable:
        - duplicate rows can be merged
        - erroneous rows deleted
        - erroneous values corrected
      • Missing rows / values
        - Data imputation: fill with predicted values
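Data imputation can be sketched as follows. This is a deliberately simple stand-in: the "prediction" is just the median of the observed values, whereas the slide leaves the prediction method open. The `imputed` flag is an addition for illustration so that the manipulation stays visible in the data.

```python
from statistics import median

# Rows from the earlier DATA IMPERFECTIONS example; None marks a missing value.
rows = [
    {"CustID": 1234, "Sales": 6000, "Name": "John"},
    {"CustID": 2345, "Sales": 5000, "Name": "Mary"},
    {"CustID": 3456, "Sales": 12000, "Name": "Bart"},
    {"CustID": 6789, "Sales": 2, "Name": "Tom"},
    {"CustID": 4567, "Sales": 6000, "Name": "Jon"},
    {"CustID": 5678, "Sales": None, "Name": "Nina"},
]


def impute_missing(rows, column):
    """Fill missing values in `column` with the median of the observed values
    and record, per row, whether the value was imputed."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [
        {**r, column: fill, "imputed": True} if r[column] is None
        else {**r, "imputed": False}
        for r in rows
    ]


cleaned = impute_missing(rows, "Sales")
print(cleaned[-1])
```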
  33. MY RESEARCH: How do we humans find our way in this mess?
  34. HOW DO WE HUMANS FIND OUR WAY IN THIS MESS?
      • We have domain knowledge
      • We know what to expect
      • We know what is likely and what is not
      • We compare with what we know and expect
      • We doubt
      • We (dis)trust
      • We learn from others
      • We reconsider
      • etc.
      Corresponding concepts: Probabilities; Uncertainty; Metadata of context: source, situation, …; Evidence gathering; User feedback
  35. THREE PRINCIPLES
      Let me illustrate 2 & 3 with an example of combining data
  36. EXAMPLE COMBINING DATA
      Objective: determine preferred customer (with sales > 100)
      Source 1:
      Car brand                 Sales
      B.M.W.                    25
      Mercedes                  32
      Renault                   10
      Source 2:
      Car brand                 Sales
      BMW                       72
      Mercedes-Benz             39
      Renault                   20
      Source 3:
      Car brand                 Sales
      Bayerische Motoren Werke  8
      Mercedes                  35
      Renault                   15
      Combined:
      Car brand                 Sales
      B.M.W.                    25
      Bayerische Motoren Werke  8
      BMW                       72
      Mercedes                  67
      Mercedes-Benz             39
      Renault                   45
      Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776
  37. VOILA! SEMANTIC DUPLICATES PROBLEM
      Car brand                 Sales
      B.M.W.                    25
      Bayerische Motoren Werke  8
      BMW                       72
      Mercedes                  67
      Mercedes-Benz             39
      Renault                   45
      Preferred customers …
      SELECT SUM(Sales) FROM CarSales WHERE Sales>100
      Result: 0, ‘No preferred customers’
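The combination step and the failing query above can be reproduced in a few lines. The three source tables are taken from the slides; the `canonical` mapping that treats all spelling variants as one brand is an assumption added to show the contrast, since the slides treat that question as uncertain.

```python
from collections import Counter

# The three source tables from the slide.
sources = [
    {"B.M.W.": 25, "Mercedes": 32, "Renault": 10},
    {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20},
    {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15},
]

# Naive combination: add up sales per exact brand string.
combined = Counter()
for source in sources:
    combined.update(source)

naive_total = sum(s for s in combined.values() if s > 100)

# Assumption for contrast: every spelling variant denotes the same brand.
canonical = {
    "B.M.W.": "BMW",
    "Bayerische Motoren Werke": "BMW",
    "Mercedes": "Mercedes-Benz",
}
resolved = Counter()
for brand, s in combined.items():
    resolved[canonical.get(brand, brand)] += s

resolved_total = sum(s for s in resolved.values() if s > 100)

print(dict(combined))
print("naive:", naive_total, "resolved:", resolved_total)
```

Exact string matching yields no brand above 100, while merging the variants yields two. Neither answer is certain, which motivates the probabilistic treatment on the following slides.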
  38. SEMANTIC DUPLICATES
      Diagram labels: Database (rows d1 to d6); Real world of car brands (objects o1 to o4); mapping ω
      Car brand                 Sales
      B.M.W.                    25
      Bayerische Motoren Werke  8
      BMW                       72
      Mercedes                  67
      Mercedes-Benz             39
      Renault                   45
  39. MOST DATA IMPERFECTIONS CAN BE MODELED AS UNCERTAINTY IN DATA
      Run some duplicate detection tool (similarity matching)
      Row  Car brand                 Sales  Condition
      1    B.M.W.                    25
      2    Bayerische Motoren Werke  8
      3    BMW                       72
      4    Mercedes                  67     X=0
      5    Mercedes-Benz             39     X=0
      6    Renault                   45
           Mercedes                  106    X=1, Y=0
           Mercedes-Benz             106    X=1, Y=1
      Variable  Meaning                        Probability
      X=0       4 and 5 different              0.2
      X=1       4 and 5 the same               0.8
      Y=0       “Mercedes” correct name        0.5
      Y=1       “Mercedes-Benz” correct name   0.5
      B.M.W. / BMW / Bayerische Motoren Werke analogously
  40. WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
      • Looks like ordinary database
      • Several “possible”, “most likely” or “approximate” answers to queries
      • Important: scalability (big data!)
      Sales of “preferred customers”
      • SELECT SUM(sales) FROM carsales WHERE sales >= 100
      • Answer: 106 (most likely)
      SUM(sales)  P
      0           14%
      105         6%
      106         56%
      211         24%
      Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
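The answer distribution above can be reproduced by enumerating possible worlds. In this sketch the probability 0.8 that "Mercedes" and "Mercedes-Benz" are the same comes from the slides; the probability 0.3 that all three BMW variants are the same is an assumption inferred from the answer table (6% + 24%), because the slides only say "analogously". A real probabilistic database would not enumerate worlds like this; the sketch only shows where the numbers come from.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

P_MERCEDES_SAME = Fraction(8, 10)  # from the slides (X=1)
P_BMW_SAME = Fraction(3, 10)       # assumption, inferred from the answer table


def preferred_sum(sales):
    """SUM(sales) over brands with sales >= 100, as in the slide's query."""
    return sum(s for s in sales if s >= 100)


dist = defaultdict(Fraction)
for mercedes_same, bmw_same in product([True, False], repeat=2):
    p = (P_MERCEDES_SAME if mercedes_same else 1 - P_MERCEDES_SAME) * (
        P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
    )
    sales = [45]  # Renault
    sales += [67 + 39] if mercedes_same else [67, 39]
    # Partial BMW merges (97, 80 or 33) all stay below 100, so lumping every
    # "not all the same" case together does not change the query answer.
    sales += [25 + 8 + 72] if bmw_same else [25, 8, 72]
    dist[preferred_sum(sales)] += p

for answer, p in sorted(dist.items()):
    print(answer, float(p))

most_likely = max(dist, key=dist.get)
print("most likely:", most_likely)
```

The variable Y (which name is correct) does not affect the sum and is therefore left out of the enumeration.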
  41. DATA INTEGRATION: SEMANTIC ENTITY LINKING, ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS (slide 23 shown again)
      Remember this slide?
  42. DATA INTEGRATION THE PROBABILISTIC WAY
      Let’s go for an initial integration that can readily and meaningfully be used
      Let it improve during use
      “Good is good enough”
      Initial integration: Partial data integration; Enumerate cases for remaining problems; Store data with uncertainty in UDBMS
      Continuous improvement: Use; Gather evidence; Improve data quality
  43. TRUST AND DOUBT THE PROBABILISTIC WAY
      • User perceives (part of) source as of doubtful credibility
        We humans say “I don’t trust this data”
        Rows correct or not, so each with variable & probability
      • Sources contain conflicting data about (possibly) the same entities
        We humans say “I’m left with some doubt”
        We already saw this one: car brand example!
      • Automatic error discovery tool (AEDT) detects possible erroneous values or rows
        We humans say “I doubt that this is correct”
        Alternative rows, each with probability of correctness
  44. CONCLUSIONS
  45. WHAT HAVE I TALKED ABOUT
      Little story
      • Peek into the life and frustrations of a data scientist
      • Awareness for responsible analytics
      Data imperfections: What a mess!
      • Ball bearings pilot
      • Models for data quality (problems)
      Some general mechanisms: Still quite a mess!
      • Standardization, error detection & discovery, cleaning
      … and I talked about my own research
  46. With this a data scientist can
      • not trust certain parts of input data as much as others
      • integrate data in a quick and dirty but responsible way
      • information about data problems is *in* the data
        - only solve data quality problems to the degree needed
        - measure data uncertainty and quality
        - documentation of data manipulation and its reasons
      • know how DQ problems affect reliability of the results
  47. (Francis Bacon, 1605) (Jorge Luis Borges, 1979)
  48. SOURCES OF IMAGES
      • dripping clock: http://www.gemfive.com
      • flower: http://cdn.tinybuddha.com
      • big data: http://tr1.cbsistatic.com
      • awareness: http://7minutesinthemorning.com