1. Data Quality
Not your Typical Database Problem
Ahmed Elmagarmid
Executive Director
Qatar Computing Research Institute
2011 © Copyright QCRI. Confidential document.
2. Where are we located?
6. SCIENCE & COMMUNITY
EDUCATION RESEARCH DEVELOPMENT
2.8 percent of GDP to be spent on research annually by 2015
7. Qatar Foundation Research Division
• Qatar Computing Research Institute (QCRI)
• Qatar Energy & Environment Research Institute (QEERI)
• Qatar Biomedical Research Institute (QBRI)
9. QCRI Vision
To make Qatar a global center for
computing research by becoming the
world’s recognized leader in Arabic
language technologies and in key areas
vital to the global growth of Qatari
business and entrepreneurial activity.
10. QCRI Model
Spectrum from Basic Research to Applied Research:
• Academia: individual projects; students move on; theoretical & basic, project-based research
• National Institutions (QCRI): Grand Challenges; grand practical challenges; national and global impact; localized skills & knowledge; large teams and long term; example peers: INRIA, MPI
• Research Parks: commercialization; entrepreneurship; incubation
11. QCRI Ecosystem
QCRI and its ecosystem: QU, Sidra, QBRI, MIT, HKU, QEERI, WikiMedia, QSTP, Aljazeera, QP, ALTIS, Boeing, Energy Co., Google, MEEZA, Yahoo, QSA, IBM, Microsoft
12. QCRI Research Centers
• Arabic Language Technologies
• Social Computing
• Scientific Computing
• Data Analytics
• Cloud Computing
13. QCRI Scientific Advisory Council
• Prof. Rich DeMillo, Georgia Tech, Chair
• Lord Rupert Redesdale, UK House of Lords
• Prof. Joichi Ito, MIT Media Lab Director
• Prof. Ruzena Bajcsy, University of California – Berkeley
• Lew Tucker, Vice President, Cisco
• Prof. Alfred V. Aho, Columbia University
• Prof. Dick Lipton, Georgia Tech
• Yousef Khalidi, Vice President, Microsoft
14. The 60 Doers!
[Figure: staff first names arranged by team: Management and Support Team, Data Analytics, Scientific Computing, Arabic Language Technologies, Cloud Computing, Social Computing]
17. 5-YEAR QCRI MANPOWER PLAN
Period   Manpower   Increase
10-11    21
11-12    34         +13
12-13    82         +48
13-14    102        +20
14-15    110        +8
18. This Talk
Data Quality
19. Data Quality
Enhancing the usability of the acquired data and increasing the confidence in query results
"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group
20. Dirty Data is Expensive
• Real-life data is often dirty: data error rates in industry are 1% - 30% (Redman, 1998)
• Erroneously priced data in retail databases costs US customers $2.5 billion each year
• In 2009, the Obama administration offered $19 billion in grants for health IT, i.e., to improve EMRs
• The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002)
21. Where to start? Data Quality
everywhere!
• Data Entry
• Information Extraction
• Integration from multiple sources
• Standardization and transformation
• Business rules compliance
22. “Academic” Data Cleaning
● Pick a well-understood data problem under some scoping assumptions and solve it independently
– Duplicates
– Functional dependency violations
– Matching dependency violations
– Missing value imputation
● Piecemeal approach to tackle the complexity and sometimes the intractability of the problem
– Repairing violations of FD constraints in special cases (no deletion, left-hand-side changes only, allowing variables, etc.)
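To make one of the isolated problems named on this slide concrete, here is a minimal Python sketch that detects violations of a functional dependency (FD). It is an illustration only, not a method from the talk: the function name `fd_violations`, the sample table, and the FD zip → city are all assumptions made up for the example.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    A functional dependency lhs -> rhs holds when rows with equal
    lhs values always have equal rhs values.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(m[attr] for attr in rhs) for m in members}
        if len(rhs_values) > 1:  # same lhs, different rhs: FD violated
            violations[key] = members
    return violations


# Hypothetical sample data, invented for this illustration.
customers = [
    {"id": 1, "zip": "47906", "city": "West Lafayette"},
    {"id": 2, "zip": "47906", "city": "W. Lafayette"},
    {"id": 3, "zip": "10001", "city": "New York"},
]

bad = fd_violations(customers, lhs=["zip"], rhs=["city"])
for key, members in bad.items():
    print(key, [m["id"] for m in members])
```

Detection is the easy half. Deciding which of the conflicting cells to change is where the scoping assumptions listed above (no deletion, left-hand-side changes only, and so on) come in.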
23. “Academic” Data Cleaning
• Despite their theoretical and algorithmic beauty, these solutions are rarely used
– Problems never exist in isolation
– Fixes to one problem often introduce “other” problems
– Data is usually not accessible to mess with
– Integrity constraints!... What integrity constraints?!!
24. “Practitioner” Data Cleaning
• Will share some scary stories
– “post-it notes” as an expert messaging system
– “written permission” to change value of a record
– Default values and best practices
– “Call John.. He will know what to do”
25. This Talk
● A few data quality challenges and (hopefully) research directions
● Summary of recent efforts at QCRI
26. 10 Data Quality Issues
27. Issue 1: The data trio
DATA
Quality
28. Extraction remains a key source
of data errors
Acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.)
29. Integration aggravates the problem
Linked data as an attempt to live with errors: link as you go
31. Issue 2: Data level or application
level
• Cleaning data tables by trusting the table schema is rarely useful
• Will share a story
– Bellcore with 1800 inter-linked databases
– Rule-based logic for sanity checking
– Post-it messages to communicate between data quality officers, who work in shifts!
– A data cleaning action is meaningless if not tied to business logic or to a process. Should never be against FDs
32. Issue 3: Protect your gain: DQ
Dashboard
● How to protect against going backwards
● How to protect your gains during the cleansing process
● Metrics:
– Minimality principle: widely used, mostly in academic cleaning
– Value of information: to spot the most important problem to fix
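One simple way to picture "protecting against going backwards" is to record a quality metric after every cleaning step and flag any step where it gets worse. The sketch below is an assumption-laden illustration, not the dashboard from the talk: the metric is a plain violation count (neither the minimality principle nor value of information), and the rules, field names, and snapshots are hypothetical.

```python
def violation_count(rows, rules):
    """Count (row, rule) pairs where the rule does not hold."""
    return sum(1 for row in rows for rule in rules if not rule(row))


def regressions(history):
    """Return indices of steps whose metric is worse than the previous step."""
    return [i for i in range(1, len(history)) if history[i] > history[i - 1]]


# Hypothetical sanity rules, invented for this illustration.
rules = [
    lambda r: r["age"] >= 0,   # age must be non-negative
    lambda r: r["zip"] != "",  # zip must be present
]

# Hypothetical snapshots of the same table after successive cleaning steps.
snapshots = [
    [{"age": -1, "zip": ""}, {"age": 30, "zip": "51519"}],
    [{"age": 41, "zip": ""}, {"age": 30, "zip": "51519"}],
    [{"age": 41, "zip": ""}, {"age": 30, "zip": ""}],
]

history = [violation_count(snap, rules) for snap in snapshots]
print(history)               # one metric value per snapshot
print(regressions(history))  # steps where quality went backwards
```

A real progress meter would weight violations by their importance to the business, which is where a value-of-information measure would replace the raw count.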
33. Issue 3: Protect your gain - Ideas
• Root-cause analysis for data cleaning
• Chase problems to the source to reason about “progress”
• Leveraging “Provenance” to design progress meters
34. Issue 4: Data is not an orphan!
● Data Stewards are not imaginary characters! Important data
has stewards and custodians
● Need to go through these guardians first
– Some health care settings require a signed form per changed cell stating the reasons for the change
● Possible approaches:
– How to avoid stewards?
– How to integrate them in the process or minimize their involvement?
35. Issue 5: How clean is clean?
• Quality awareness eats up 10% of the budget [Telecom
Experience]
• How to avoid over-cleaning
• Example: “Bill Forgiveness”, a real-life experience: roaming
charges and cross-carrier calls have a very complicated
business model
• Possible approaches
– Measure cleaning progress
– Clean only to satisfy some application needs
36. Issue 6: Online cleaning a
necessity not a feature
● We live in a complex world → complex applications with 100s
and 1000s of components and parameters
● Clean as you go .. Clean on demand .. Clean opportunistically ..
Can be the only hope
● New concepts:
– Iterative cleaning
– Cleaning dynamic and evolving data
● Off-line cleaning can still benefit historical data but is becoming less and less important
37. Issue 7: Application quality
• Data Quality → Information Quality → Application quality
• Realizes the levels of complexity in current BI apps
• Data usage should influence data cleaning
– “Usage-based” data cleaning
38. Issue 8: SW engineering DQ
• Current focus on discrete values with simple integrity constraints
(FD, uniqueness…)
• We are good at checking if data complies with rules
• Real business rules are often “assertions” expressed in “Turing-complete” languages
• Checking “did we write the assertions right?” becomes a lot harder
• But we also need to ask whether we wrote the right assertions!
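The contrast on this slide can be sketched in a few lines of Python: a business rule written as arbitrary code rather than as a declarative constraint such as an FD. Everything here is hypothetical and invented for illustration: the function `order_is_valid`, the order fields, and the 20% discount threshold.

```python
def order_is_valid(order):
    """Hypothetical business rule expressed as code.

    1. Discounts above 20% need a named approver.
    2. The billed amount must match the discounted total.
    """
    total = sum(item["qty"] * item["unit_price"] for item in order["items"])
    if order["discount"] > 0.2 and not order["approved_by"]:
        return False
    expected = total * (1 - order["discount"])
    return abs(expected - order["billed"]) < 0.01


# Hypothetical records, invented for this illustration.
ok_order = {
    "items": [{"qty": 2, "unit_price": 10.0}],
    "discount": 0.1,
    "approved_by": None,
    "billed": 18.0,
}
unapproved = dict(ok_order, discount=0.3, billed=14.0)
misbilled = dict(ok_order, billed=20.0)

for name, order in [("ok", ok_order), ("unapproved", unapproved), ("misbilled", misbilled)]:
    print(name, order_is_valid(order))
```

Checking data against such a rule is just a function call. Checking the rule itself (is the 20% threshold right, is the tolerance right, should tax be included?) is ordinary software testing, which is the point of the slide.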
39. Issue 9: DQ Theory?
• The ACID properties in transaction management were not only sensible requirements; they also came with algorithms and methods to enforce them during transaction processing
• Does it make sense to do the same for quality? Plausible properties along with actions for maintaining acceptable quality during data manipulation
• Some of these already exist (timeliness, currency, consistency, etc.) but lack methods of enforcement
40. Issue 10: Scale .. Scale
• Terabytes and petabytes of data require new ways to enforce data quality
• Which ball to drop
• Leveraging application semantics and data usage
• Sampling to learn from the few and apply on the masses
• Active learning to replace human feedback (GDR as a solution)
42. GDR – Guided Data Repair
• Scalable ways to involve experts
• Repurposing destructive automatic techniques to guide repairs
• Value of Information measures to generate the most important questions
• Judicious use of active learning from user feedback
[Diagram labels: Input Database Instance; Detect Errors and Violations; Learn and Repair; Clean Database Instance; User Query; Database Results]
44. Probabilistic Data Cleaning
[Diagram labels: Input Database Instance; Error Detection; Uncertain Repair Generation; Possible Clean Database Instances; User Query; Probabilistic Results]
45. Possible Repairs
A possible repair is a clustering of the input tuples

Person
ID   Name    ZIP     Income
P1   Green   51519   30k
P2   Green   51518   32k
P3   Peter   30528   40k
P4   Peter   30528   40k
P5   Gree    51519   55k
P6   Chuck   51519   30k

Possible Repairs (Uncertain Clustering)
X1: {P1}, {P2}, {P3,P4}, {P5}, {P6}
X2: {P1,P2}, {P3,P4}, {P5}, {P6}
X3: {P1,P2,P5}, {P3,P4}, {P6}
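The idea that a possible repair is a clustering of the input tuples can be written down directly. The Python sketch below encodes the three repairs X1, X2, X3 for the Person tuples P1 to P6 on this slide, checks that each one is a valid partition of the tuples, and asks in which repairs two tuples are treated as the same person. The helper names `is_partition` and `same_entity_in` are assumptions for illustration; the slide assigns no probabilities to the repairs, so none are used here.

```python
tuple_ids = ["P1", "P2", "P3", "P4", "P5", "P6"]

# The three possible repairs from the slide, each a clustering of the tuples.
repairs = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}


def is_partition(clusters, ids):
    """True when every id appears in exactly one cluster."""
    members = [t for cluster in clusters for t in cluster]
    return sorted(members) == sorted(ids)


def same_entity_in(a, b):
    """Names of the repairs in which tuples a and b share a cluster."""
    return [
        name
        for name, clusters in repairs.items()
        if any(a in cluster and b in cluster for cluster in clusters)
    ]


for name, clusters in repairs.items():
    print(name, "valid:", is_partition(clusters, tuple_ids), "entities:", len(clusters))

print(same_entity_in("P1", "P2"))
print(same_entity_in("P1", "P5"))
print(same_entity_in("P3", "P4"))
```

Keeping all three clusterings, instead of committing to one, is what lets a query return answers qualified by the repairs that support them.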