This document discusses managing uncertainty in data and data quality problems. It describes how most data quality problems can be modeled as uncertainty in the data. Probabilistic databases can store, query, and analyze data while accounting for this uncertainty. This allows for scalable and "good enough" initial data integration that can improve over time, avoiding excessive "data fiddling". Measuring expected precision and recall provides a way to quantitatively assess quality and know when cleaning efforts should stop.
Managing uncertainty in data - Presentation at Data Science Northeast Netherlands Meetup 14 Jan 2016
1. MANAGING UNCERTAINTY IN DATA
THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY
PROBLEMS
MAURICE VAN KEULEN
2. Paradigms of scientific method
• Empiricism
• Mathematical modeling
• Simulation
A new paradigm: Data-intensive Scientific Discovery
• Combining and analyzing data in novel ways is capable of tackling research questions that could not be answered before
14 Jan 2016Managing uncertainty in data: the key to effective management of data quality
problems
REVOLUTION IN SCIENTIFIC METHOD
Bio-Informatics professor: "PhD of 4 years, 3 years devoted to 'data fiddling'"
3. Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women
• Select consult & treatment records from their EPDs from multiple sources
• After a first analysis one discovers many records not related to pregnancy (e.g., dermatologist consult)
• The assumption that all records belonging to a pregnant woman are related to pregnancy is wrong, and hence so is the selection criterion!
• There is no objective means to ascertain this, such as a field "related to pregnancy"
A FIRST STORY: PREGNANCY RESEARCH
4. • A painstaking process follows of specifying filter rules and manually inspecting samples of results
• Imperfect process, so noisy records remain!
• Wrong diagnoses cause more records to be erroneously in or out → more noisy records
• Then, one looks at a sample and notices something strange in the times of consults: many appear close to each other and in the evening
• Modification time of EPD record (what is recorded) does not reflect actual moment of activity (semantics) → sequence and duration noise
A FIRST STORY: PREGNANCY RESEARCH
5. GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES
• Detect visits from trajectories
  • GPS traces from mobile phones
  • Point-Of-Interest (POI) data harvested from the internet
• Purpose: construct profiles of
  • Customers
  • Products
• for recommendation
  • Holiday homes
  • Greeting cards
6. Substantial amount of money involved in fraud. Ease of committing fraud incites otherwise decent people to do it as well. Danger to society
• Inspect where there is a high risk of fraud
• Example ISZW: labor market, labor circumstances, etc.
• But: government data represents paper reality!
→ Include traces from the internet (social media, web forums): customers, employees, and bystanders leave behind observations and opinions
• But natural language: about which company do they talk?
DATA-DRIVEN FRAUD RISK ANALYSIS
7. • Paris Hilton stayed in the Paris Hilton
• Lady Gaga - Speechless live @ Helsinki 10/13/2010 http://www.youtube.com/watch?v=yREociHyijk . . . @ladygaga also talks about her Grampa who died recently
• Laelith Demonia has just defeated liwanu Hird. Career wins is 575, career losses is 966.
• Adding Win7Beta, Win2008, and Vista x64 and x86 images to munin. #wds
• history should show that bush jr should be in jail or at least never should have been president
NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS
8. • Search (finding the needle in the haystack)
• Information extraction from unstructured sources
  • Natural language processing
  • Web harvesting
  (both produce lower quality structured data)
• Data quality management
• Responsible analytics is (among other things) "Knowing how data quality problems in the source data affect the analytical results"
TECHNOLOGY WE WORK ON
WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE
Equally true for Business Analytics
9. CustID  Sales  Name
   1234    6000   John
   2345    5000   Mary
   3456    12000  Bart
   …       …      …
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
• Sample of data looks fine
• Result of analysis looks perfectly reasonable
• If you don't look hard enough, if you don't properly pay attention to it
  … you will be unaware
  … that you are possibly looking at significantly erroneous figures!!!
10. CustID  Sales  Name
    1234    6000   John
    2345    5000   Mary
    3456    12000  Bart
    …       …      …
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID  Sales  Name
6789    2      Tom
4567    6000   Jon
5678    NULL   Nina
…       …      …
????
Wrong figures included
Missing figures
Double counting
etc.
Many more problems at value, record, schema, source, trust levels
11. Probabilistic database technology can store, query, analyze, and reason with data while taking into account the possible influence of data quality problems on the results
• Treats data quality problems as a fact of life
• Responsible analytics: know deficiencies of results
• Generic and scalable approach and technology
• Nice properties for application: postpone-resolution/cleaning; pay-as-you-go; good-is-good-enough; human-in-the-loop
PROBABILISTIC DATABASES TO THE RESCUE
12. Let's go for an initial integration that can readily and meaningfully be used
"Good is good enough" for meaningful use in many applications (can be achieved 10x earlier)
Let it improve during use
PROBABILISTIC DATA INTEGRATION
[Diagram, two phases. Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS. Continuous improvement: use (analytics); measure quality; improve data quality. Annotations: postpone problems; stop earlier; pay as you go; human in the loop.]
13. COMBINING DATA …
Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776
Source 1:
Car brand                 Sales
B.M.W.                    25
Mercedes                  32
Renault                   10
Source 2:
Car brand                 Sales
BMW                       72
Mercedes-Benz             39
Renault                   20
Source 3:
Car brand                 Sales
Bayerische Motoren Werke  8
Mercedes                  35
Renault                   15
Combined:
Car brand                 Sales
B.M.W.                    25
Bayerische Motoren Werke  8
BMW                       72
Mercedes                  67
Mercedes-Benz             39
Renault                   45
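A minimal Python sketch of how the combined table arises, assuming the sources are merged by summing sales per brand and that brands count as equal only when spelled identically. The names `source_1` to `source_3` and `combine` are illustrative, not from the slides.

```python
from collections import Counter

# The three source tables from the slide, as (brand, sales) pairs.
source_1 = [("B.M.W.", 25), ("Mercedes", 32), ("Renault", 10)]
source_2 = [("BMW", 72), ("Mercedes-Benz", 39), ("Renault", 20)]
source_3 = [("Bayerische Motoren Werke", 8), ("Mercedes", 35), ("Renault", 15)]


def combine(*sources):
    """Sum sales per brand; brands match only on identical spelling."""
    totals = Counter()
    for source in sources:
        for brand, sales in source:
            totals[brand] += sales
    return dict(sorted(totals.items()))


combined = combine(source_1, source_2, source_3)
for brand, sales in combined.items():
    print(f"{brand:<26}{sales:>4}")
```

The three spellings of BMW and the two spellings of Mercedes end up as separate rows, so the combined table has six rows for four real-world brands.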
14. … AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand                 Sales
B.M.W.                    25
Bayerische Motoren Werke  8
BMW                       72
Mercedes                  67
Mercedes-Benz             39
Renault                   45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
0
"No preferred customers"
15. [Diagram: a mapping ω between the database and the real world (of car brands). The six database records d1 to d6 (B.M.W. 25, Bayerische Motoren Werke 8, BMW 72, Mercedes 67, Mercedes-Benz 39, Renault 45) map to four real-world objects o1 to o4.]
SEMANTIC DUPLICATES
16. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA
Row  Car brand                 Sales  Condition
1    B.M.W.                    25
2    Bayerische Motoren Werke  8
3    BMW                       72
4    Mercedes                  67     X=0
5    Mercedes-Benz             39     X=0
6    Renault                   45
     Mercedes                  106    X=1 Y=0
     Mercedes-Benz             106    X=1 Y=1
Variable  Meaning                        Probability
X=0       4 and 5 different              0.2
X=1       4 and 5 the same               0.8
Y=0       "Mercedes" correct name        0.5
Y=1       "Mercedes-Benz" correct name   0.5
B.M.W. / BMW / Bayerische Motoren Werke analogously
Run some duplicate detection tool
17. • Looks like an ordinary database
• Several "possible" answers or approximate answers to queries
• Important: scalability (big data!)
Sales of "preferred customers"
• SELECT SUM(sales)
  FROM carsales
  WHERE sales >= 100
IMPORTANT TOOL: PROBABILISTIC DATABASE
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
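The answer table can be reproduced by enumerating possible worlds, as in the hedged Python sketch below. It is not how a probabilistic database evaluates queries at scale. The 0.8 probability that the two Mercedes rows are the same comes from the slides. The 0.3 probability that the three BMW rows are the same is an assumption: the slides only say "analogously", and 0.3 is the value that yields the percentages shown. The variable Y (which name is correct) does not affect the sum and is left out. All names are illustrative.

```python
from collections import defaultdict
from itertools import product

P_MERCEDES_SAME = 0.8  # from the slide (X=1)
P_BMW_SAME = 0.3       # ASSUMPTION, not stated in the source


def world_rows(mercedes_same, bmw_same):
    """Sales figures present in one possible world."""
    rows = [45]  # Renault is certain
    rows += [106] if mercedes_same else [67, 39]
    rows += [105] if bmw_same else [25, 8, 72]
    return rows


def sum_sales_distribution(threshold=100):
    """Distribution of SUM(sales) WHERE sales >= threshold over all worlds."""
    dist = defaultdict(float)
    for mercedes_same, bmw_same in product([False, True], repeat=2):
        # The two duplicate decisions are treated as independent.
        p_m = P_MERCEDES_SAME if mercedes_same else 1 - P_MERCEDES_SAME
        p_b = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
        rows = world_rows(mercedes_same, bmw_same)
        answer = sum(s for s in rows if s >= threshold)
        dist[answer] += p_m * p_b
    return dict(sorted(dist.items()))


distribution = sum_sales_distribution()
for answer, p in distribution.items():
    print(f"{answer:>4}  {p:.0%}")
```

Enumeration is exponential in the number of uncertain decisions, which is why the slides stress scalability as a requirement for real probabilistic database technology.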
18. Sales of "preferred customers"
• SELECT SUM(sales)
  FROM carsales
  WHERE sales >= 100
• Answer: 106
• Risk = Probability * Impact
• Analyst only bothered with problems that matter
QUERYING AND RELIABILITY ASSESSMENT
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
Risk of substantially wrong answer
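A small Python sketch of "Risk = Probability * Impact" applied to the answer table. The slides do not define impact precisely; taking it as the relative deviation from the most likely answer is an assumption made here for illustration, and the function name `assess` is hypothetical.

```python
# Answer distribution copied from the slide: SUM(sales) -> probability.
distribution = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}


def assess(distribution):
    """Rank alternatives to the most likely answer by probability * impact.

    Impact is the relative deviation from the most likely answer
    (an assumed definition, not taken from the slides).
    """
    best = max(distribution, key=distribution.get)
    if best == 0:
        raise ValueError("relative impact is undefined when the best answer is 0")
    risks = []
    for answer, p in distribution.items():
        if answer == best:
            continue
        impact = abs(answer - best) / best
        risks.append((answer, p, impact, p * impact))
    risks.sort(key=lambda r: r[3], reverse=True)
    return best, risks


best, risks = assess(distribution)
print(f"most likely answer: {best}")
for answer, p, impact, risk in risks:
    print(f"alternative {answer:>4}: P={p:.2f} impact={impact:.2f} risk={risk:.3f}")
```

Under this definition the alternative 211 ranks highest, while 105 is negligible: it is both unlikely and close to 106, so the analyst need not be bothered with it.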
19. BACK TO GEO-SOCIAL RECOMMENDATION
HOW TO MODEL THE GPS TRAJECTORY PROBLEM?
• Smoothing: any jumps and/or sudden sharp angles are suspicious and probably wrong
• Points become estimated points
• Some points are possibly suspicious
• Some are more suspicious than others
→ Model the uncertainty explicitly in the data
20. Fraud risk analysis
• about which company do they talk?
• Indicators become possible indicators
• Fraud risk analysis is statistics / probability theory! Reasoning with possible indicators is very easy. It's just more data
AMBIGUITY IN NATURAL LANGUAGE PROCESSING
AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS
Paris Hilton stayed in the Paris Hilton
Phrase        begin  end  type       ref
Paris         1      1    City       sws.geonames.org/2988507
Paris         1      1    Firstname
Hilton        1      1    Lastname
Paris Hilton  1      2    Person     https://en.wikipedia.org/wiki/Paris_Hilton
Paris Hilton  1      2    Hotel      www.hilton.com/Paris
…             …      …    …
(annotation on the table: "belong together")
21. • Inspired by information retrieval (search engine evaluation)
• Precision = fraction of the given answers that are correct (3/5 = 60%)
• Recall = fraction of the correct answers that are given (3/4 = 75%)
• Expected precision and recall
  • A correct answer is better if the system dares to claim that it is correct with a higher probability
  • Analogously, incorrect answers with a high probability are worse than incorrect answers with a low probability
• Expected precision = (0.8+0.7+0.2) / 2.3 = 74%, where 2.3 = 0.8+0.7+0.5+0.2+0.1 is the total probability of all given answers
• Expected recall = (0.8+0.7+0.2) / 4 = 43%
KNOW WHEN TO STOP CLEANING: MEASURING QUALITY
[Figure: answers A to G; the given answers carry probabilities 80%, 70%, 50%, 20%, and 10%.]
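The four figures on this slide can be checked with a short Python sketch. The probabilities and the correct/incorrect split follow from the slide's formulas (correct answers at 0.8, 0.7, 0.2; incorrect ones at 0.5 and 0.1; four correct answers exist in total). Which letter carries which probability cannot be recovered from the transcript, so answers are represented only as (probability, is_correct) pairs. Function names are illustrative.

```python
# Given answers as (probability, is_correct), derived from the slide's formulas.
returned = [(0.8, True), (0.7, True), (0.5, False), (0.2, True), (0.1, False)]
TOTAL_CORRECT = 4  # correct answers that exist; one of them was not given


def precision_recall(returned, total_correct):
    """Classic precision and recall, ignoring probabilities."""
    correct = sum(1 for _, ok in returned if ok)
    return correct / len(returned), correct / total_correct


def expected_precision_recall(returned, total_correct):
    """Probability-weighted variants: each answer counts for its probability."""
    mass_correct = sum(p for p, ok in returned if ok)
    mass_all = sum(p for p, _ in returned)
    return mass_correct / mass_all, mass_correct / total_correct


precision, recall = precision_recall(returned, TOTAL_CORRECT)
exp_precision, exp_recall = expected_precision_recall(returned, TOTAL_CORRECT)
print(f"precision={precision:.1%} recall={recall:.1%}")
print(f"expected precision={exp_precision:.1%} expected recall={exp_recall:.1%}")
```

Expected recall is exactly 1.7 / 4 = 42.5%, which the slide rounds to 43%. Tracking these two numbers over time gives a quantitative signal for when further cleaning stops paying off.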
22. Data quality: intangible problem with unknown impact
The key to effective management of DQ problems
• Model DQ problems as uncertainty *in* the data
• Probabilistic database technology for scalability
• Postpone resolution/cleaning: pay-as-you-go
• Measure and know when to stop: good-is-good-enough; human-in-the-loop
CONCLUSIONS
Bio-Informatics professor: "PhD of 4 years, 3 years devoted to 'data fiddling'"
If we can reduce the data fiddling by 1 year (33%), the scientist has 2 years instead of 1 for actual research: we make the scientist twice as productive!
Editor's notes
Abstract
Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or "Uncertain Database". Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data
Explain Bio-Informatics
Of course there are others, e.g., BioInformatics
The first two examples showed data quality and semantic problems; if you do NLP you are faced with the same!
Refer back to pregnancy and movie examples: all those issues can be modeled as uncertainty in data. Queries and analytics results will give all possible results, i.e., a handle on the influence on results
With OSINT data, this problem of semantic duplicates is enormous ...
Notice that all these are "tables"
TODO: make this slide somewhat more explicit / concrete