MANAGING UNCERTAINTY IN DATA
THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY PROBLEMS
MAURICE VAN KEULEN
Paradigms of scientific method
‱ Empiricism
‱ Mathematical modeling
‱ Simulation
A new paradigm: Data-intensive Scientific Discovery
‱ Combining and analyzing data in novel ways is capable of tackling research questions that could not be answered before
14 Jan 2016, Managing uncertainty in data: the key to effective management of data quality problems
REVOLUTION IN SCIENTIFIC METHOD
Bio-Informatics professor: “PhD of 4 years, 3 years devoted to ‘data fiddling’”
Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women
‱ Select consult & treatment records from their EPDs from multiple sources
‱ After first analysis one discovers many records not related to pregnancy (e.g., dermatologist consult)
‱ Assumption that all records that belong to a pregnant woman are related to pregnancy is wrong, hence also the selection criterion!
‱ There is no objective means to ascertain this such as a field ‘related to pregnancy’
A FIRST STORY: PREGNANCY RESEARCH
‱ A painstaking process follows: specifying filter rules and manually inspecting samples of results
‱ The process is imperfect, so noisy records remain!
‱ Wrong diagnoses cause more records to be erroneously included or excluded → more noisy records
‱ Then, one looks at a sample and notices something strange in the times of consults: many appear close to each other and in the evening
‱ Modification time of EPD record (what is recorded) does not reflect actual moment of activity (semantics) → sequence and duration noise
A FIRST STORY: PREGNANCY RESEARCH
GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES
‱ Detect visits from trajectories
‱ GPS traces from mobile phones
‱ Point-Of-Interest (POI) data harvested from the internet
‱ Purpose: construct profiles of
‱ Customers
‱ Products
‱ for recommendation
‱ Holiday homes
‱ Greeting cards
Substantial amount of money involved in fraud. Ease of committing fraud incites otherwise decent people to do it as well. Danger to society
‱ Inspect where there is a high risk of fraud
‱ Example ISZW: labor market, labor circumstances, etc.
‱ But: government data represents paper reality!
Include traces from the internet (social media, web forums): customers, employees, and bystanders leave behind observations and opinions
‱ But natural language: which company are they talking about?
DATA-DRIVEN FRAUD RISK ANALYSIS
‱ Paris Hilton stayed in the Paris Hilton
‱ Lady Gaga - Speechless live @ Helsinki 10/13/2010 http://www.youtube.com/watch?v=yREociHyijk . . . @ladygaga also talks about her Grampa who died recently
‱ Laelith Demonia has just defeated liwanu Hird. Career wins is 575, career losses is 966.
‱ Adding Win7Beta, Win2008, and Vista x64 and x86 images to munin. #wds
‱ history should show that bush jr should be in jail or at least never should have been president
NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS
‱ Search (finding the needle in the haystack)
‱ Information extraction from unstructured sources
‱ Natural language processing
‱ Web harvesting
(both produce lower quality structured data)
‱ Data quality management
‱ Responsible analytics is (among other things) “Knowing how data quality problems in the source data affect the analytical results”
TECHNOLOGY WE WORK ON
WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE
Equally true for Business Analytics
CustID  Sales  Name
1234    6000   John
2345    5000   Mary
3456    12000  Bart
...     ...    ...
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
Result: 3423000
‱ Sample of data looks fine
‱ Result of analysis looks perfectly reasonable
‱ If you don’t look hard enough, if you don’t properly pay attention to it
→ you will be unaware
→ that you are possibly looking at significantly erroneous figures!!!
CustID  Sales  Name
1234    6000   John
2345    5000   Mary
3456    12000  Bart
...     ...    ...
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID  Sales  Name
6789    2      Tom
4567    6000   Jon
5678    NULL   Nina
...     ...    ...
????
Wrong figures included
Missing figures
Double counting
etc.
Many more problems at value, record, schema, source, trust levels
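How such imperfections slip into an aggregate unnoticed can be sketched with Python's built-in sqlite3 module. The rows are the ones shown in the CustSales sample. Which of them is a wrong figure, a missing figure, or a double count is an illustrative assumption, and the helper name summarize is hypothetical.

```python
import sqlite3

# Rows from the CustSales sample on the slides. The comments on the last three
# rows are assumptions made for illustration, not statements from the slides.
ROWS = [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (3456, 12000, "Bart"),
    (6789, 2, "Tom"),      # suspiciously small figure (possibly wrong)
    (4567, 6000, "Jon"),   # possibly the same customer as John (double counting)
    (5678, None, "Nina"),  # missing figure, stored as NULL
]


def summarize(rows):
    """Load the rows into an in-memory table and run the slide's SUM query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
    conn.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", rows)
    total, n_rows, n_counted = conn.execute(
        "SELECT SUM(Sales), COUNT(*), COUNT(Sales) FROM CustSales"
    ).fetchone()
    conn.close()
    # SUM and COUNT(Sales) skip NULL values without any warning.
    return {"total": total, "rows": n_rows, "rows_with_sales": n_counted}


summary = summarize(ROWS)
print(summary)
```

The query returns a plausible-looking total although one row contributed nothing and two others may be wrong or counted twice. Only the gap between rows and rows_with_sales hints at a problem.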
Probabilistic database technology can store, query, analyze, and reason with data while taking into account the possible influence of data quality problems on the results
‱ Treats data quality problems as a fact of life
‱ Responsible analytics: know deficiencies of results
‱ Generic and scalable approach and technology
‱ Nice properties for application: postpone resolution/cleaning; pay-as-you-go; good-is-good-enough; human-in-the-loop
PROBABILISTIC DATABASES TO THE RESCUE
Let’s go for an initial integration that can readily and meaningfully be used
“Good is good enough” for meaningful use in many applications (can be achieved 10x earlier)
Let it improve during use
PROBABILISTIC DATA INTEGRATION
[Figure: probabilistic data integration process. Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS. Continuous improvement: use (analytics); measure quality; improve data quality. Annotations: postpone problems; stop earlier; pay as you go; human in the loop.]
COMBINING DATA 

Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776

Source 1
Car brand                 Sales
B.M.W.                    25
Mercedes                  32
Renault                   10

Source 2
Car brand                 Sales
BMW                       72
Mercedes-Benz             39
Renault                   20

Source 3
Car brand                 Sales
Bayerische Motoren Werke  8
Mercedes                  35
Renault                   15

Combined
Car brand                 Sales
B.M.W.                    25
Bayerische Motoren Werke  8
BMW                       72
Mercedes                  67
Mercedes-Benz             39
Renault                   45

 AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand                 Sales
B.M.W.                    25
Bayerische Motoren Werke  8
BMW                       72
Mercedes                  67
Mercedes-Benz             39
Renault                   45

Preferred customers 

SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
Result: 0
‘No preferred customers’
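A minimal sketch of why the query comes back empty, assuming the three source tables are combined by exact brand name (the slides do not say how the merge was done, so that is an assumption). The names SOURCE_1 to SOURCE_3, combine, and preferred_sales are hypothetical.

```python
from collections import Counter

# The three source tables from the slides, keyed by the brand name as spelled
# in each source.
SOURCE_1 = {"B.M.W.": 25, "Mercedes": 32, "Renault": 10}
SOURCE_2 = {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20}
SOURCE_3 = {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15}


def combine(*sources):
    """Add up sales per brand name; only identical spellings are merged."""
    combined = Counter()
    for source in sources:
        combined.update(source)
    return dict(combined)


def preferred_sales(table, threshold=100):
    """Total sales of brands whose sales exceed the threshold (Sales > 100)."""
    return sum(sales for sales in table.values() if sales > threshold)


combined = combine(SOURCE_1, SOURCE_2, SOURCE_3)
print(combined)
print(preferred_sales(combined))
```

Exact-name merging reproduces the combined table: "Mercedes" and "Renault" add up, but the three BMW spellings and "Mercedes-Benz" stay separate rows, so no single row passes the threshold.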
[Figure: mapping ω from the database (records d1 to d6, the six Car brand / Sales rows) to the real world of car brands (objects o1 to o4)]
SEMANTIC DUPLICATES
MOST DATA QUALITY PROBLEMS
CAN BE MODELED AS UNCERTAINTY IN DATA
#  Car brand                 Sales  Condition
1  B.M.W.                    25
2  Bayerische Motoren Werke  8
3  BMW                       72
4  Mercedes                  67     X=0
5  Mercedes-Benz             39     X=0
6  Renault                   45
   Mercedes                  106    X=1 Y=0
   Mercedes-Benz             106    X=1 Y=1

Variable  Meaning                       Probability
X=0       4 and 5 different             0.2
X=1       4 and 5 the same              0.8
Y=0       “Mercedes” correct name       0.5
Y=1       “Mercedes-Benz” correct name  0.5

B.M.W. / BMW / Bayerische Motoren Werke analogously
Run some duplicate detection tool
‱ Looks like ordinary database
‱ Several “possible” answers or approximate answers to queries
‱ Important: Scalability (big data!)
Sales of “preferred customers”
‱ SELECT SUM(sales)
  FROM carsales
  WHERE sales ≄ 100
IMPORTANT TOOL: PROBABILISTIC DATABASE
SUM(sales)  P
0           14%
105         6%
106         56%
211         24%
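The distribution above can be reproduced by enumerating possible worlds. This is a sketch, not how a probabilistic database evaluates the query internally. The probability 0.8 that the two Mercedes rows are the same comes from the slides. The probability 0.3 that the three BMW rows are the same is an assumption, chosen because it is the value consistent with the result table. Partial BMW merges are ignored because none of them reaches 100, and the name variable Y is left out because it does not change the sum.

```python
from collections import defaultdict
from itertools import product

P_MERCEDES_SAME = 0.8  # from the slides: X=1, rows 4 and 5 are the same
P_BMW_SAME = 0.3       # assumption: inferred from the result table, not stated


def sum_distribution(threshold=100):
    """Enumerate the four possible worlds and evaluate
    SELECT SUM(sales) FROM carsales WHERE sales >= threshold in each."""
    dist = defaultdict(float)
    for mercedes_same, bmw_same in product([True, False], repeat=2):
        p_mercedes = P_MERCEDES_SAME if mercedes_same else 1 - P_MERCEDES_SAME
        p_bmw = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
        sales = [45]  # Renault is certain
        sales += [67 + 39] if mercedes_same else [67, 39]
        sales += [25 + 8 + 72] if bmw_same else [25, 8, 72]
        total = sum(s for s in sales if s >= threshold)
        dist[total] += p_mercedes * p_bmw  # the two decisions are independent
    return dict(dist)


dist = sum_distribution()
for value in sorted(dist):
    print(f"{value:>4}  {dist[value]:.0%}")
```

Enumeration grows exponentially with the number of uncertain decisions, which is why the slides stress scalability as a requirement for the database technology.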
Sales of “preferred customers”
‱ SELECT SUM(sales)
  FROM carsales
  WHERE sales ≄ 100
‱ Answer: 106 (the most likely value)
‱ Risk = Probability * Impact
‱ Analyst only bothered with problems that matter
QUERYING AND RELIABILITY ASSESSMENT
SUM(sales)  P
0           14%
105         6%
106         56%
211         24%
Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
Risk of substantially wrong answer
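A small sketch of Risk = Probability * Impact applied to the distribution above. The slides do not define impact numerically, so measuring it as the relative deviation from the most likely answer is an assumption, and the function name risk_report is hypothetical. The sketch also assumes the most likely answer is not 0.

```python
# Answer distribution from the slides: SUM(sales) value -> probability.
DISTRIBUTION = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}


def risk_report(dist):
    """Rank the alternative answers by probability * impact.

    Impact is taken as |alternative - most likely| / most likely, which is an
    assumed measure and requires a non-zero most likely answer.
    """
    best = max(dist, key=dist.get)
    report = []
    for value, probability in dist.items():
        if value == best:
            continue
        impact = abs(value - best) / best
        report.append((value, probability, impact, probability * impact))
    report.sort(key=lambda row: row[3], reverse=True)
    return best, report


best, report = risk_report(DISTRIBUTION)
print("most likely answer:", best)
for value, probability, impact, risk in report:
    print(f"{value:>4}  P={probability:.2f}  impact={impact:.3f}  risk={risk:.4f}")
```

Under this measure the alternative 211 ranks first and 105 last: 105 differs so little from 106 that an analyst need not be bothered with it.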
BACK TO GEO-SOCIAL RECOMMENDATION
HOW TO MODEL THE GPS TRAJECTORY PROBLEM?
‱ Smoothing: any jumps and/or sudden sharp angles are suspicious and probably wrong
‱ Points become estimated points
‱ Some points are possibly suspicious
‱ Some are more suspicious than others
Model the uncertainty explicitly in the data
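One way to make "some are more suspicious than others" concrete is to attach a score to every point instead of deleting outliers. The sketch below is an illustration under assumptions, not the method used in the project: points are planar (x, y) pairs, the score is the turning angle at a point divided by 180 degrees, and the names turning_angle, suspicion_scores, and TRACE are hypothetical.

```python
import math


def turning_angle(prev, cur, nxt):
    """Angle in degrees between the incoming and outgoing direction at cur."""
    angle_in = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
    angle_out = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
    diff = abs(angle_out - angle_in)
    if diff > math.pi:
        diff = 2 * math.pi - diff  # take the smaller of the two angles
    return math.degrees(diff)


def suspicion_scores(points):
    """Score each point in [0, 1]; sharper turns give higher scores.

    The first and last point have no turning angle and are scored 0.
    """
    scores = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        scores[i] = turning_angle(points[i - 1], points[i], points[i + 1]) / 180.0
    return scores


# Hypothetical trace: a straight walk with one point that jumps sideways.
TRACE = [(0, 0), (1, 0), (2, 0), (2.1, 3), (3, 0), (4, 0)]
scores = suspicion_scores(TRACE)
for point, score in zip(TRACE, scores):
    print(point, round(score, 2))
```

The scores are stored with the points as explicit uncertainty, so the decision about which points to trust can be postponed until a query needs it.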
Fraud risk analysis
‱ Which company are they talking about?
‱ Indicators become possible indicators
‱ Fraud risk analysis is statistics / probability theory!
Reasoning with possible indicators is very easy. It’s just more data
AMBIGUITY IN NATURAL LANGUAGE PROCESSING
AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS
Paris Hilton stayed in the Paris Hilton

Phrase        begin  end  type       ref
Paris         1      1    City       sws.geonames.org/2988507
Paris         1      1    Firstname
Hilton        1      1    Lastname
Paris Hilton  1      2    Person     https://en.wikipedia.org/wiki/Paris_Hilton
Paris Hilton  1      2    Hotel      www.hilton.com/Paris
...
“belong together”
‱ Inspired by information retrieval (search engine evaluation)
‱ Precision = ratio of answers that are correct (3/5 = 60%)
‱ Recall = ratio of correct answers given (3/4 = 75%)
‱ Expected precision and recall
  ‱ A correct answer is better if the system dares to claim that it is correct with a higher probability
  ‱ Analogously, incorrect answers with a high probability are worse than incorrect answers with a low probability
‱ Expected precision = (0.8+0.7+0.2) / 2.3 = 74%
‱ Expected recall = (0.8+0.7+0.2) / 4 = 43%
KNOW WHEN TO STOP CLEANING: MEASURING QUALITY
[Figure: answers labeled A to G; the returned answers carry probabilities 80%, 70%, 50%, 20%, 10%]
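The measures on this slide can be checked with a few lines of Python. Which of the five answers are correct is an assumption: it is inferred from the numerator 0.8+0.7+0.2 in the slide's formulas, not from the figure labels. The function names and ANSWERS are hypothetical.

```python
def precision_recall(answers, total_correct):
    """Classic precision and recall; answers is a list of (probability, is_correct)."""
    n_correct = sum(1 for _, is_correct in answers if is_correct)
    return n_correct / len(answers), n_correct / total_correct


def expected_precision_recall(answers, total_correct):
    """Probability-weighted variants: every answer counts for its probability."""
    correct_mass = sum(p for p, is_correct in answers if is_correct)
    total_mass = sum(p for p, _ in answers)
    return correct_mass / total_mass, correct_mass / total_correct


# Five returned answers with the probabilities from the slide. The True/False
# flags are an assumption derived from the slide's formulas.
ANSWERS = [(0.8, True), (0.7, True), (0.5, False), (0.2, True), (0.1, False)]
TOTAL_CORRECT = 4  # number of correct answers that exist (the slide's recall uses 3/4)

precision, recall = precision_recall(ANSWERS, TOTAL_CORRECT)
exp_precision, exp_recall = expected_precision_recall(ANSWERS, TOTAL_CORRECT)
print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"expected precision={exp_precision:.3f} expected recall={exp_recall:.3f}")
```

The expected recall comes out as 1.7 / 4 = 0.425, which the slide rounds to 43%. Because the expected measures reward confident correct answers and penalize confident wrong ones, they can be tracked over time to decide when further cleaning stops paying off.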
Data quality: intangible problem with unknown impact
The key to effective management of DQ problems
‱ Model DQ problems as uncertainty *in* the data
‱ Probabilistic database technology for scalability
‱ Postpone resolution/cleaning: pay-as-you-go
‱ Measure and know when to stop: good-is-good-enough; human-in-the-loop
CONCLUSIONS
Bio-Informatics professor: “PhD of 4 years, 3 years devoted to ‘data fiddling’”
If we can reduce the data fiddling by 1 year (33%), we make the scientist twice as productive!

Managing uncertainty in data - Presentation at Data Science Northeast Netherlands Meetup 14 Jan 2016

  • 1. MANAGING UNCERTAINTY IN DATA THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY PROBLEMS MAURICE VAN KEULEN
  • 2. Paradigms of scientific method  Empiricism  Mathematical modeling  Simulation A new paradigm: Data-intensive Scientific Discovery  Combining and analyzing data in novel ways is capable of tackling research questions that could not be answered before 14 Jan 2016Managing uncertainty in data: the key to effective management of data quality problems 2 REVOLUTION IN SCIENTIFIC METHOD Bio-Informatics professor: “ PhD of 4 years, 3 years devoted to ‘data fiddling’ ”
  • 3. Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women  Select consult & treatment records from their EPDs from multiple sources  After first analysis one discovers many records not related to pregnancy (e.g., dermatologist consult)  Assumption that all records that belong to a pregnant woman are related to pregnancy is wrong, hence also the selection criterion!  There is no objective means to ascertain this such as a field ‘related to pregnancy’ 14 Jan 2016Managing uncertainty in data: the key to effective management of data quality problems 3 A FIRST STORY: PREGNANCY RESEARCH
  • 4.  A painstaking process follows with specifying filter rules and manually inspecting samples of results  Imperfect process so noisy records remain!  Wrong diagnoses cause more records to be erroneously in or out  more noisy records  Then, one looks at a sample and notices something strange in the times of consults: many appear close to each other and in the evening  Modification time of EPD record (what is recorded) does not reflect actual moment of activity (semantics)  sequence and duration noise 14 Jan 2016Managing uncertainty in data: the key to effective management of data quality problems 4 A FIRST STORY: PREGNANCY RESEARCH
  • 5. 14 Jan 2016Managing uncertainty in data: the key to effective management of data quality problems 5 GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES ‱ Detect visits from trajectories ‱ GPS traces from mobile phones ‱ Point-Of-Interest (POI) data harvested from the internet ‱ Purpose: construct profiles of ‱ Customers ‱ Products ‱ for recommendation ‱ Holiday homes ‱ Greeting cards
  • 6. DATA-DRIVEN FRAUD RISK ANALYSIS. A substantial amount of money is involved in fraud. The ease of committing fraud incites otherwise decent people to do it as well. This is a danger to society. Inspect where there is a high risk of fraud. Example ISZW: labor market, labor circumstances, etc. But: government data represents paper reality! Include traces from the internet (social media, web forums): customers, employees, and by-standers leave behind observations and opinions. But this is natural language: which company are they talking about?
  • 7. NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS. Examples: “Paris Hilton stayed in the Paris Hilton”; “Lady Gaga - Speechless live @ Helsinki 10/13/2010 http://www.youtube.com/watch?v=yREociHyijk . . . @ladygaga also talks about her Grampa who died recently”; “Laelith Demonia has just defeated liwanu Hird. Career wins is 575, career losses is 966.”; “Adding Win7Beta, Win2008, and Vista x64 and x86 images to munin. #wds”; “history should show that bush jr should be in jail or at least never should have been president”.
  • 8. TECHNOLOGY WE WORK ON (WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE). Search (finding the needle in the haystack). Information extraction from unstructured sources: natural language processing; web harvesting (both produce lower quality structured data). Data quality management. Responsible analytics is (among other things) “knowing how data quality problems in the source data affect the analytical results”. Equally true for Business Analytics.
  • 9. IMPACT OF DATA IMPERFECTIONS. Sample rows of table CustSales (CustID, Sales, Name): (1234, 6000, John); (2345, 5000, Mary); (3456, 12000, Bart). Query: SELECT SUM(Sales) FROM CustSales. Result: 3423000. The sample of data looks fine and the result of the analysis looks perfectly reasonable. If you don’t look hard enough and don’t properly pay attention to it, you will be unaware that you are possibly looking at significantly erroneous figures!
  • 10. IMPACT OF DATA IMPERFECTIONS. Sample rows of table CustSales (CustID, Sales, Name): (1234, 6000, John); (2345, 5000, Mary); (3456, 12000, Bart). Query: SELECT SUM(Sales) FROM CustSales. Result: 3423000. Further rows: (6789, 2, Tom); (4567, 6000, Jon); (5678, NULL, Nina). ???? Wrong figures included; missing figures; double counting; etc. Many more problems exist at the value, record, schema, source, and trust levels.
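The effect of such rows on the aggregate can be sketched in a few lines. This is a minimal illustration, not the talk's own tooling: it loads only the six rows shown on the slides into an in-memory SQLite database (so the total differs from the 3423000 on the slide), and the "same sales figure, different CustID" check is an assumed, deliberately crude heuristic for spotting possible double counting.

```python
import sqlite3

# In-memory table holding only the six rows visible on the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO CustSales VALUES (?, ?, ?)",
    [
        (1234, 6000, "John"),
        (2345, 5000, "Mary"),
        (3456, 12000, "Bart"),
        (6789, 2, "Tom"),      # suspiciously low figure
        (4567, 6000, "Jon"),   # possibly the same customer as John
        (5678, None, "Nina"),  # missing figure
    ],
)

# SUM silently skips NULL, so the missing figure leaves no trace in the result.
total = conn.execute("SELECT SUM(Sales) FROM CustSales").fetchone()[0]

# How many figures are missing?
missing = conn.execute(
    "SELECT COUNT(*) FROM CustSales WHERE Sales IS NULL"
).fetchone()[0]

# Crude heuristic (an assumption): identical sales under different IDs
# are candidates for double counting.
dup_candidates = conn.execute(
    "SELECT a.Name, b.Name FROM CustSales a JOIN CustSales b "
    "ON a.Sales = b.Sales AND a.CustID < b.CustID"
).fetchall()

print(total, missing, dup_candidates)
```

The query returns a single, plausible-looking number either way; only the extra checks reveal that one figure is missing and that two rows may describe the same customer.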
  • 11. PROBABILISTIC DATABASES TO THE RESCUE. Probabilistic database technology can store, query, analyze, and reason with data while taking into account the possible influence of data quality problems on the results. It treats data quality problems as a fact of life. Responsible analytics: know the deficiencies of results. Generic and scalable approach and technology. Nice properties for application: postpone resolution/cleaning; pay-as-you-go; good-is-good-enough; human-in-the-loop.
  • 12. PROBABILISTIC DATA INTEGRATION. Let’s go for an initial integration that can readily and meaningfully be used. “Good is good enough” for meaningful use in many applications (can be achieved 10x earlier). Let it improve during use. Diagram steps: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS; use (analytics); measure quality; improve data quality. Diagram phases: initial integration; continuous improvement. Callouts: postpone problems; stop earlier; pay as you go; human in the loop.
  • 13. COMBINING DATA. Source: Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776. Source table 1 (Car brand, Sales): B.M.W. 25; Mercedes 32; Renault 10. Source table 2: BMW 72; Mercedes-Benz 39; Renault 20. Source table 3: Bayerische Motoren Werke 8; Mercedes 35; Renault 15. Combined table: B.M.W. 25; Bayerische Motoren Werke 8; BMW 72; Mercedes 67; Mercedes-Benz 39; Renault 45.
  • 14. AND THE PROBLEM OF SEMANTIC DUPLICATES. Combined table (Car brand, Sales): B.M.W. 25; Bayerische Motoren Werke 8; BMW 72; Mercedes 67; Mercedes-Benz 39; Renault 45. Preferred customers: SELECT SUM(Sales) FROM CarSales WHERE Sales>100. Result: 0 (‘No preferred customers’).
  • 15. SEMANTIC DUPLICATES. Diagram: the database (Car brand, Sales) holds records d1 to d6: B.M.W. 25; Bayerische Motoren Werke 8; Mercedes 67; Renault 45; BMW 72; Mercedes-Benz 39. A mapping ω relates these records to objects o1 to o4 in the real world (of car brands).
  • 16. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA. Run some duplicate detection tool. Table (Car brand, Sales, condition): 1: B.M.W. 25; 2: Bayerische Motoren Werke 8; 3: BMW 72; 4: Mercedes 67 (X=0); 5: Mercedes-Benz 39 (X=0); Mercedes 106 (X=1, Y=0); Mercedes-Benz 106 (X=1, Y=1); 6: Renault 45. Random variables: X=0, 4 and 5 different, probability 0.2; X=1, 4 and 5 the same, probability 0.8; Y=0, “Mercedes” is the correct name, probability 0.5; Y=1, “Mercedes-Benz” is the correct name, probability 0.5. B.M.W. / BMW / Bayerische Motoren Werke analogously.
  • 17. IMPORTANT TOOL: PROBABILISTIC DATABASE. Looks like an ordinary database. Several “possible” answers or approximate answers to queries. Important: scalability (big data!). Sales of “preferred customers”: SELECT SUM(sales) FROM carsales WHERE sales ≥ 100. Result (SUM(sales), P): 0, 14%; 105, 6%; 106, 56%; 211, 24%.
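The answer distribution above can be reproduced by enumerating possible worlds. The following is a brute-force sketch, not how a probabilistic DBMS evaluates queries at scale. The 0.8 probability that the two Mercedes records are the same comes from the slides; the 0.3 probability that the three BMW records are the same is an assumption, chosen because it is the value consistent with the result table (0.8 × 0.3 = 24%). Treating the BMW records as "all merged or none merged" is a further simplification, and the name variable Y is left out because it does not affect the sum.

```python
from collections import defaultdict
from itertools import product

THRESHOLD = 100          # "preferred customer" cut-off from the query
P_MERCEDES_SAME = 0.8    # from the slides (X=1)
P_BMW_SAME = 0.3         # assumption, inferred from the result table

def possible_worlds():
    """Yield (probability, list of sales figures) for each possible world."""
    for merc_same, bmw_same in product([True, False], repeat=2):
        p_merc = P_MERCEDES_SAME if merc_same else 1 - P_MERCEDES_SAME
        p_bmw = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
        sales = [45]                                         # Renault
        sales += [67 + 39] if merc_same else [67, 39]        # Mercedes records
        sales += [25 + 8 + 72] if bmw_same else [25, 8, 72]  # BMW records
        yield p_merc * p_bmw, sales

# Evaluate the query in every world and accumulate probability per answer.
distribution = defaultdict(float)
for prob, sales in possible_worlds():
    answer = sum(s for s in sales if s >= THRESHOLD)
    distribution[answer] += prob

for answer in sorted(distribution):
    print(answer, f"{distribution[answer]:.0%}")
```

Enumeration is exponential in the number of uncertain decisions, which is why the slides stress scalability as a key property of the database technology itself.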
  • 18. QUERYING AND RELIABILITY ASSESSMENT. Sales of “preferred customers”: SELECT SUM(sales) FROM carsales WHERE sales ≥ 100. Result (SUM(sales), P): 0, 14%; 105, 6%; 106, 56%; 211, 24%. Answer: 106. Risk = Probability * Impact. The second most likely answer, at 24%, has an impact factor of 2 in sales (211 vs 106): a risk of a substantially wrong answer. The analyst is only bothered with problems that matter.
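A small sketch of "Risk = Probability * Impact" on this distribution. The slides do not define impact precisely; here it is assumed to be the absolute deviation from the most likely answer, which is only one of several reasonable choices.

```python
# Answer distribution from the slide: SUM(sales) -> probability.
distribution = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}

# Report the most likely answer.
best = max(distribution, key=distribution.get)

# Assumed impact measure: absolute deviation from the reported answer.
risks = {
    answer: prob * abs(answer - best)
    for answer, prob in distribution.items()
    if answer != best
}

# Rank alternatives so the analyst sees the ones that matter first.
for answer, risk in sorted(risks.items(), key=lambda item: -item[1]):
    print(f"alternative {answer}: risk {risk:.2f}")
```

Under this measure the alternative 211 carries the highest risk, while 105 is negligible even though it is a different answer, which matches the idea that the analyst should only be bothered with problems that matter.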
  • 19. BACK TO GEO-SOCIAL RECOMMENDATION: HOW TO MODEL THE GPS TRAJECTORY PROBLEM? Smoothing: any jumps and/or sudden sharp angles are suspicious and probably wrong. Points become estimated points. Some points are possibly suspicious. Some are more suspicious than others. Model the uncertainty explicitly in the data.
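One way to make "some points are more suspicious than others" concrete is to attach a suspicion score to every point instead of deleting outliers. The sketch below is an illustration under stated assumptions, not the method used in the project: it works on planar (x, y) coordinates rather than latitude/longitude, looks only at turning angles (not at jumps in distance or speed), and the function names `turning_angle` and `suspicion_scores` are hypothetical.

```python
import math

def turning_angle(p_prev, p, p_next):
    """Change of direction at p in degrees (0 = straight on, 180 = U-turn)."""
    heading_in = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    heading_out = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    diff = abs(heading_out - heading_in)
    if diff > math.pi:            # take the smaller of the two angles
        diff = 2 * math.pi - diff
    return math.degrees(diff)

def suspicion_scores(points):
    """Score in [0, 1] per point; endpoints get 0 (no angle defined)."""
    scores = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        scores[i] = turning_angle(points[i - 1], points[i], points[i + 1]) / 180.0
    return scores

# A straight track with one spike at index 3.
track = [(0, 0), (1, 0), (2, 0), (2.5, 5), (3, 0), (4, 0)]
scores = suspicion_scores(track)
for point, score in zip(track, scores):
    print(point, f"{score:.2f}")
```

Keeping the score next to each point stores the uncertainty in the data itself, so later analysis can weigh points rather than rely on an early keep-or-drop decision.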
  • 20. AMBIGUITY IN NATURAL LANGUAGE PROCESSING AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS. Fraud risk analysis: which company are they talking about? Indicators become possible indicators. Fraud risk analysis is statistics / probability theory! Reasoning with possible indicators is very easy. It’s just more data. Example: “Paris Hilton stayed in the Paris Hilton”. Annotation table (Phrase, begin, end, type, ref): Paris, 1, 1, City, sws.geonames.org/2988507; Paris, 1, 1, Firstname; Hilton, 1, 1, Lastname; Paris Hilton, 1, 2, Person, https://en.wikipedia.org/wiki/Paris_Hilton; Paris Hilton, 1, 2, Hotel, www.hilton.com/Paris. Callout: “belong together”.
  • 21. KNOW WHEN TO STOP CLEANING: MEASURING QUALITY. Inspired by information retrieval (search engine evaluation). Precision = ratio of answers that are correct (3/5 = 60%). Recall = ratio of correct answers given (3/4 = 75%). Expected precision and recall: a correct answer is better if the system dares to claim that it is correct with a higher probability. Analogously, incorrect answers with a high probability are worse than incorrect answers with a low probability. Expected precision = (0.8+0.7+0.2) / 2.3 = 74%. Expected recall = (0.8+0.7+0.2) / 4 = 43%. Figure: answers A B C D E F G, with probabilities 80%, 70%, 50%, 20%, 10%.
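The arithmetic on this slide can be checked with a short sketch. The slide gives the five probabilities and the counts, but which letters are correct is an assumption here: the system is taken to return A to E, and the four truly correct answers are taken to be A, B, D, and F, which reproduces the sums 0.8 + 0.7 + 0.2 used above.

```python
# Answers returned by the system, each with the probability it claims.
returned = {"A": 0.8, "B": 0.7, "C": 0.5, "D": 0.2, "E": 0.1}

# Assumed ground truth: four correct answers, three of which were returned.
truth = {"A", "B", "D", "F"}

correct = [answer for answer in returned if answer in truth]

# Classical measures: every returned answer counts fully.
precision = len(correct) / len(returned)   # 3/5
recall = len(correct) / len(truth)         # 3/4

# Expected measures: every returned answer counts with its probability.
correct_mass = sum(returned[answer] for answer in correct)   # 0.8 + 0.7 + 0.2
expected_precision = correct_mass / sum(returned.values())   # 1.7 / 2.3
expected_recall = correct_mass / len(truth)                  # 1.7 / 4

print(f"precision {precision:.0%}, recall {recall:.0%}")
print(f"expected precision {expected_precision:.0%}, "
      f"expected recall {expected_recall:.0%}")
```

Raising the probability of a correct answer, or lowering that of an incorrect one, improves the expected measures even when the set of returned answers stays the same, which is what makes them usable as a stopping criterion for cleaning.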
  • 22. CONCLUSIONS. Data quality: an intangible problem with unknown impact. The key to effective management of DQ problems: model DQ problems as uncertainty *in* the data; use probabilistic database technology for scalability; postpone resolution/cleaning (pay-as-you-go); measure and know when to stop (good-is-good-enough; human-in-the-loop). Bio-Informatics professor: “PhD of 4 years, 3 years devoted to ‘data fiddling’.” If we can reduce the data fiddling by 1 year (33%), we make the scientist twice as productive!

Editor's notes

  1. Abstract Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or "Uncertain Database". Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc.  We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data
  2. Explain Bio-Informatics
  3. Of course there are others, e.g., BioInformatics
  4. The first two examples showed data quality and semantic problems; if you do NLP, you are faced with the same!
  5. Refer back to pregnancy and movie examples: all those issues can be modeled as uncertainty in data. Queries and analytics results will give all possible results, i.e., a handle on their influence on the results
  6. With OSINT data, this problem of semantic duplicates is enormous ...
  7. Notice that all these are “tables”
  8. TODO: make this slide somewhat more explicit / concrete
  9. Example with product information