SlideShare a Scribd company logo
1 of 25
Download to read offline
Basic concepts and workflow
Validation features
Outlook
Data validation infrastructure: the validate
package
Mark van der Loo and Edwin de Jonge
Statistics Netherlands
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Validate
Goal
To make checking your data against domain knowledge and
technical demands as easy as possible.
Content of this talk
Basic concepts and workflow
Examples of possibilities and syntax
Outlook
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Basic concepts and workflow
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Basic concepts of the validate package
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Example: retailers data
library(validate)
data(retailers)
dat <- retailers[4:6]
head(dat)
## turnover other.rev total.rev
## 1 NA NA 1130
## 2 1607 NA 1607
## 3 6886 -33 6919
## 4 3861 13 3874
## 5 NA 37 5602
## 6 25 NA 25
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Basic workflow
# define rules
v <- validator(turnover >= 0
, turnover + other.rev == total.rev)
# confront with data
cf <- confront(dat, v)
# analyze results
summary(cf)
## rule items passes fails nNA error warning
## 1 V1 60 56 0 4 FALSE FALSE
## 2 V2 60 19 4 37 FALSE FALSE
## expression
## 1 turnover >= 0
## 2 abs(turnover + other.rev - total.rev) < 1e-08
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Plot the validation
barplot(cf, main="retailers")
V1
V2
retailers
0 10 20 30 40 50 60
turnover >= 0
abs(turnover + other.rev − total.rev) < 1e−08
fails passes nNA
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Get all results
# A value For each item (record) and each rule
head(values(cf))
## V1 V2
## [1,] NA NA
## [2,] TRUE NA
## [3,] TRUE FALSE
## [4,] TRUE TRUE
## [5,] NA NA
## [6,] TRUE NA
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Shortcut using check_that()
cf <- check_that(dat
, turnover >= 0
, turnover + other.rev == total.rev)
Or, using the magrittr ‘not-a-pipe’ operator:
dat %>%
check_that(turnover >= 0
, turnover + other.rev == total.rev) %>%
summary()
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Read rules from file
### myrules.txt
# inequalities
turnover >= 0
other.rev >= 0
# balance rule
turnover + other.rev == total.rev
v <- validator(.file="myrules.txt")
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Validation features
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Validating types
Rule
Turnover is a numeric variable
Syntax
# any is.-function is valid.
is.numeric(turnover)
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Validating metadata
Rules
The variable total.rev must be present.
The number of rows must be at least 20
Syntax
# use the "." to access the dataset as a whole
"total.rev" %in% names(.)
nrow(.) >= 20
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Validating aggregates
Rule
Mean turnover must be at least 20% of mean total
revenue
Syntax
mean(total.rev,na.rm=TRUE) /
mean(turnover,na.rm=TRUE) >= 0.2
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Rules that express conditions
Rule
If the turnover is larger than zero, the total revenue must
be larger than zero.
Syntax
# executed for each row
if ( turnover > 0) total.rev > 0
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Functional dependencies
Rule
Two records with the same zip code, must have the same
city and street name.
zip → city + street
Syntax
zip ~ city + street
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Using transient variables
Rule
The turnover must be between 0.1 and 10 times its median
Syntax
# transient variable with the := operator
med := median(turnover,na.rm=TRUE)
turnover > 0.1*med
turnover < 10*med
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Variable groups
Rule
Turnover, other revenue, and total revenue must be
between 0 and 2000.
Syntax
G := var_group(turnover, other.rev, total.rev)
G >= 0
G <= 2000
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Referencing other datasets
Rule
The mean turnover of this year must not be more than 1.1
times last years mean.
Syntax
# use the "$" operator to reference other datasets
v <- validator(
mean(turnover, na.rm=TRUE) <
mean(lastyear$turnover,na.rm=TRUE))
cf <- confront(dat, v, ref=list(lastyear=dat_lastyear))
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Other features
Rules (validator objects)
Select from validator objects using []
Extract or set rule metadata (label, description, timestamp, . . .)
Get affected variable names, rule linkage
Summarize validators
Read/write to yaml format
Confront
Control behaviour on NA
Raise errors, warnings
Set machine rounding limit
vignette("intro","validate")
vignette("rule-files","validate")
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Outlook
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
In the works / ideas
More analyses of rules
More programmability
More (interactive) visualisations
Roxygen-like metadata specification
More support for reporting
. . .
We’d ♥ to hear your comments, suggestions, bugreports
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Validate is just the beginning!
See github.com/data-cleaning
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Literature
Van der Loo (2015) A formal typology of data validation
functions. in United Nations Economic Comission for Europe
Work Session on Statistical Data Editing , Budapest. [pdf]
Di Zio et al (2015) Methodology for data validation. ESSNet
on validation, deliverable. [pdf]
Van der Loo, M. and E. de Jonge (2016). Statistical Data
Cleaning with Applications in R, Wiley (in preparation).
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
Basic concepts and workflow
Validation features
Outlook
Contact, links
Code, bugreports
cran.r-project.org/package=validate
github/data-cleaning/validate
This talk
slideshare.com/markvanderloo
Contact
mark.vanderloo@gmail.com @markvdloo
edwindjonge@gmail.com @edwindjonge
Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

More Related Content

What's hot

Qualitative research method
Qualitative research methodQualitative research method
Qualitative research method
metalkid132
 
Les métriques de la science (ou La bibliométrie pour les nuls)
Les métriques de la science (ou La bibliométrie pour les nuls)Les métriques de la science (ou La bibliométrie pour les nuls)
Les métriques de la science (ou La bibliométrie pour les nuls)
URFIST de Paris
 
Research Design [Creswell]
Research Design [Creswell]Research Design [Creswell]
Research Design [Creswell]
jrgmckinney
 
Bibliometrics - an overview
Bibliometrics - an overviewBibliometrics - an overview
Bibliometrics - an overview
claudia cavicchi
 

What's hot (20)

bibliometrics for beginners
bibliometrics for beginnersbibliometrics for beginners
bibliometrics for beginners
 
An introduction to conducting a systematic literature review for social scien...
An introduction to conducting a systematic literature review for social scien...An introduction to conducting a systematic literature review for social scien...
An introduction to conducting a systematic literature review for social scien...
 
Attention mechanism 소개 자료
Attention mechanism 소개 자료Attention mechanism 소개 자료
Attention mechanism 소개 자료
 
Data standards in bioinformatics
Data standards in bioinformatics	Data standards in bioinformatics
Data standards in bioinformatics
 
Octagonal Fuzzy Transportation Problem Using Different Ranking Method
Octagonal Fuzzy Transportation Problem Using Different Ranking MethodOctagonal Fuzzy Transportation Problem Using Different Ranking Method
Octagonal Fuzzy Transportation Problem Using Different Ranking Method
 
Qualitative research method
Qualitative research methodQualitative research method
Qualitative research method
 
Deep Learning for Recommender Systems with Nick pentreath
Deep Learning for Recommender Systems with Nick pentreathDeep Learning for Recommender Systems with Nick pentreath
Deep Learning for Recommender Systems with Nick pentreath
 
Les métriques de la science (ou La bibliométrie pour les nuls)
Les métriques de la science (ou La bibliométrie pour les nuls)Les métriques de la science (ou La bibliométrie pour les nuls)
Les métriques de la science (ou La bibliométrie pour les nuls)
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: Introduction
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Bba 3274 qm week 6 part 1 regression models
Bba 3274 qm week 6 part 1 regression modelsBba 3274 qm week 6 part 1 regression models
Bba 3274 qm week 6 part 1 regression models
 
Survival Data Analysis for Sekolah Tinggi Ilmu Statistik Jakarta
Survival Data Analysis for Sekolah Tinggi Ilmu Statistik JakartaSurvival Data Analysis for Sekolah Tinggi Ilmu Statistik Jakarta
Survival Data Analysis for Sekolah Tinggi Ilmu Statistik Jakarta
 
OWL and OBO
OWL and OBOOWL and OBO
OWL and OBO
 
차원축소 훑어보기 (PCA, SVD, NMF)
차원축소 훑어보기 (PCA, SVD, NMF)차원축소 훑어보기 (PCA, SVD, NMF)
차원축소 훑어보기 (PCA, SVD, NMF)
 
Multiple Regression
Multiple RegressionMultiple Regression
Multiple Regression
 
Ontology learning
Ontology learningOntology learning
Ontology learning
 
Research Design [Creswell]
Research Design [Creswell]Research Design [Creswell]
Research Design [Creswell]
 
Systematic review
Systematic reviewSystematic review
Systematic review
 
Bibliometrics - an overview
Bibliometrics - an overviewBibliometrics - an overview
Bibliometrics - an overview
 
Mixed models
Mixed modelsMixed models
Mixed models
 

Viewers also liked

ใบงาน วิชาเพศศึกษา
ใบงาน วิชาเพศศึกษาใบงาน วิชาเพศศึกษา
ใบงาน วิชาเพศศึกษา
Tepasoon Songnaa
 
ทักษะในการเล่นวอลเล่ย์บอล
ทักษะในการเล่นวอลเล่ย์บอลทักษะในการเล่นวอลเล่ย์บอล
ทักษะในการเล่นวอลเล่ย์บอล
Tepasoon Songnaa
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการ
Kittiya Youngjarean
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)
Kittiya Youngjarean
 
Leadership as pupose (journal)
Leadership as pupose (journal)Leadership as pupose (journal)
Leadership as pupose (journal)
LoverBhoii
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอล
Tepasoon Songnaa
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศ
Tepasoon Songnaa
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)
Kittiya Youngjarean
 
Digital Printing in Dubai
Digital Printing in DubaiDigital Printing in Dubai
Digital Printing in Dubai
Bruce Clay
 

Viewers also liked (20)

HariACMA
HariACMAHariACMA
HariACMA
 
Family
FamilyFamily
Family
 
ใบงาน วิชาเพศศึกษา
ใบงาน วิชาเพศศึกษาใบงาน วิชาเพศศึกษา
ใบงาน วิชาเพศศึกษา
 
ทักษะในการเล่นวอลเล่ย์บอล
ทักษะในการเล่นวอลเล่ย์บอลทักษะในการเล่นวอลเล่ย์บอล
ทักษะในการเล่นวอลเล่ย์บอล
 
Final start
Final startFinal start
Final start
 
Speed up your digital transformation
Speed up your digital transformationSpeed up your digital transformation
Speed up your digital transformation
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการ
 
Tugas 1
Tugas 1Tugas 1
Tugas 1
 
Enhance Pedadogies
Enhance PedadogiesEnhance Pedadogies
Enhance Pedadogies
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)
 
˹· 1 ǡѻ
˹· 1  ǡѻ˹· 1  ǡѻ
˹· 1 ǡѻ
 
TV3.0 New TV frontiers.
TV3.0 New TV frontiers.TV3.0 New TV frontiers.
TV3.0 New TV frontiers.
 
Leadership as pupose (journal)
Leadership as pupose (journal)Leadership as pupose (journal)
Leadership as pupose (journal)
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอล
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศ
 
Find the words_that_ryhme
Find the words_that_ryhmeFind the words_that_ryhme
Find the words_that_ryhme
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)
 
Bridging the gap between theory and practise
Bridging the gap between theory and practiseBridging the gap between theory and practise
Bridging the gap between theory and practise
 
Ejercicio 7
Ejercicio 7Ejercicio 7
Ejercicio 7
 
Digital Printing in Dubai
Digital Printing in DubaiDigital Printing in Dubai
Digital Printing in Dubai
 

Similar to Data validation infrastructure: the validate package

Elevate workshop programmatic_2014
Elevate workshop programmatic_2014Elevate workshop programmatic_2014
Elevate workshop programmatic_2014
David Scruggs
 
Patterns of Enterprise Application Architecture (by example)
Patterns of Enterprise Application Architecture (by example)Patterns of Enterprise Application Architecture (by example)
Patterns of Enterprise Application Architecture (by example)
Paulo Gandra de Sousa
 
Portfolio For Charles Tontz
Portfolio For Charles TontzPortfolio For Charles Tontz
Portfolio For Charles Tontz
ctontz
 
oracle project accounting | best oracle project accounting training
oracle project accounting | best oracle project accounting trainingoracle project accounting | best oracle project accounting training
oracle project accounting | best oracle project accounting training
OnlineOracleTrainings
 

Similar to Data validation infrastructure: the validate package (20)

Elevate workshop programmatic_2014
Elevate workshop programmatic_2014Elevate workshop programmatic_2014
Elevate workshop programmatic_2014
 
Salesforce1 Platform ELEVATE LA workshop Dec 18, 2013
Salesforce1 Platform ELEVATE LA workshop Dec 18, 2013Salesforce1 Platform ELEVATE LA workshop Dec 18, 2013
Salesforce1 Platform ELEVATE LA workshop Dec 18, 2013
 
PoEAA by Example
PoEAA by ExamplePoEAA by Example
PoEAA by Example
 
Patterns of Enterprise Application Architecture (by example)
Patterns of Enterprise Application Architecture (by example)Patterns of Enterprise Application Architecture (by example)
Patterns of Enterprise Application Architecture (by example)
 
Look Ma, No Apex: Mobile Apps with RemoteObject and Mobile SDK
Look Ma, No Apex: Mobile Apps with RemoteObject and Mobile SDKLook Ma, No Apex: Mobile Apps with RemoteObject and Mobile SDK
Look Ma, No Apex: Mobile Apps with RemoteObject and Mobile SDK
 
Learn to see, measure and automate with value stream management
Learn to see, measure and automate with value stream managementLearn to see, measure and automate with value stream management
Learn to see, measure and automate with value stream management
 
RFP Presentation Example
RFP Presentation ExampleRFP Presentation Example
RFP Presentation Example
 
Tdd,Ioc
Tdd,IocTdd,Ioc
Tdd,Ioc
 
Portfolio For Charles Tontz
Portfolio For Charles TontzPortfolio For Charles Tontz
Portfolio For Charles Tontz
 
Rpa conference new delhi
Rpa conference new delhiRpa conference new delhi
Rpa conference new delhi
 
BPCS Infor ERP LX Implementation Evaluation Q4 2007
BPCS Infor ERP LX Implementation Evaluation Q4 2007BPCS Infor ERP LX Implementation Evaluation Q4 2007
BPCS Infor ERP LX Implementation Evaluation Q4 2007
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development
 
oracle project accounting | best oracle project accounting training
oracle project accounting | best oracle project accounting trainingoracle project accounting | best oracle project accounting training
oracle project accounting | best oracle project accounting training
 
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction ProfilerSplunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
 
Informatica dvo training
Informatica dvo training  Informatica dvo training
Informatica dvo training
 
Quickly Create Data Sets for the Analytics Cloud
Quickly Create Data Sets for the Analytics CloudQuickly Create Data Sets for the Analytics Cloud
Quickly Create Data Sets for the Analytics Cloud
 
measuring and monitoring client side performance / Nir Nahum
measuring and monitoring client side performance / Nir Nahummeasuring and monitoring client side performance / Nir Nahum
measuring and monitoring client side performance / Nir Nahum
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
Bug Hunting with the Salesforce Developer Console
Bug Hunting with the Salesforce Developer ConsoleBug Hunting with the Salesforce Developer Console
Bug Hunting with the Salesforce Developer Console
 
Anusaa_Qlikview
Anusaa_QlikviewAnusaa_Qlikview
Anusaa_Qlikview
 

Recently uploaded

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Recently uploaded (20)

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

Data validation infrastructure: the validate package

  • 1. Basic concepts and workflow Validation features Outlook Data validation infrastructure: the validate package Mark van der Loo and Edwin de Jonge Statistics Netherlands Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 2. Basic concepts and workflow Validation features Outlook Validate Goal To make checking your data against domain knowledge and technical demands as easy as possible. Content of this talk Basic concepts and workflow Examples of possibilities and syntax Outlook Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 3. Basic concepts and workflow Validation features Outlook Basic concepts and workflow Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 4. Basic concepts and workflow Validation features Outlook Basic concepts of the validate package Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 5. Basic concepts and workflow Validation features Outlook Example: retailers data library(validate) data(retailers) dat <- retailers[4:6] head(dat) ## turnover other.rev total.rev ## 1 NA NA 1130 ## 2 1607 NA 1607 ## 3 6886 -33 6919 ## 4 3861 13 3874 ## 5 NA 37 5602 ## 6 25 NA 25 Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 6. Basic concepts and workflow Validation features Outlook Basic workflow # define rules v <- validator(turnover >= 0 , turnover + other.rev == total.rev) # confront with data cf <- confront(dat, v) # analyze results summary(cf) ## rule items passes fails nNA error warning ## 1 V1 60 56 0 4 FALSE FALSE ## 2 V2 60 19 4 37 FALSE FALSE ## expression ## 1 turnover >= 0 ## 2 abs(turnover + other.rev - total.rev) < 1e-08 Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 7. Basic concepts and workflow Validation features Outlook Plot the validation barplot(cf, main="retailers") V1 V2 retailers 0 10 20 30 40 50 60 turnover >= 0 abs(turnover + other.rev − total.rev) < 1e−08 fails passes nNA Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 8. Basic concepts and workflow Validation features Outlook Get all results # A value For each item (record) and each rule head(values(cf)) ## V1 V2 ## [1,] NA NA ## [2,] TRUE NA ## [3,] TRUE FALSE ## [4,] TRUE TRUE ## [5,] NA NA ## [6,] TRUE NA Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 9. Basic concepts and workflow Validation features Outlook Shortcut using check_that() cf <- check_that(dat , turnover >= 0 , turnover + other.rev == total.rev) Or, using the magrittr ‘not-a-pipe’ operator: dat %>% check_that(turnover >= 0 , turnover + other.rev == total.rev) %>% summary() Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 10. Basic concepts and workflow Validation features Outlook Read rules from file ### myrules.txt # inequalities turnover >= 0 other.rev >= 0 # balance rule turnover + other.rev == total.rev v <- validator(.file="myrules.txt") Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 11. Basic concepts and workflow Validation features Outlook Validation features Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 12. Basic concepts and workflow Validation features Outlook Validating types Rule Turnover is a numeric variable Syntax # any is.-function is valid. is.numeric(turnover) Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 13. Basic concepts and workflow Validation features Outlook Validating metadata Rules The variable total.rev must be present. The number of rows must be at least 20 Syntax # use the "." to access the dataset as a whole "total.rev" %in% names(.) nrow(.) >= 20 Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 14. Basic concepts and workflow Validation features Outlook Validating aggregates Rule Mean turnover must be at least 20% of mean total revenue Syntax mean(total.rev,na.rm=TRUE) / mean(turnover,na.rm=TRUE) >= 0.2 Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 15. Basic concepts and workflow Validation features Outlook Rules that express conditions Rule If the turnover is larger than zero, the total revenue must be larger than zero. Syntax # executed for each row if ( turnover > 0) total.rev > 0 Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 16. Basic concepts and workflow Validation features Outlook Functional dependencies Rule Two records with the same zip code, must have the same city and street name. zip → city + street Syntax zip ~ city + street Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 17. Basic concepts and workflow Validation features Outlook Using transient variables Rule The turnover must be between 0.1 and 10 times its median Syntax # transient variable with the := operator med := median(turnover,na.rm=TRUE) turnover > 0.1*med turnover < 10*med Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 18. Basic concepts and workflow Validation features Outlook Variable groups Rule Turnover, other revenue, and total revenue must be between 0 and 2000. Syntax G := var_group(turnover, other.rev, total.rev) G >= 0 G <= 2000 Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 19. Basic concepts and workflow Validation features Outlook Referencing other datasets Rule The mean turnover of this year must not be more than 1.1 times last years mean. Syntax # use the "$" operator to reference other datasets v <- validator( mean(turnover, na.rm=TRUE) < mean(lastyear$turnover,na.rm=TRUE)) cf <- confront(dat, v, ref=list(lastyear=dat_lastyear)) Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 20. Basic concepts and workflow Validation features Outlook Other features Rules (validator objects) Select from validator objects using [] Extract or set rule metadata (label, description, timestamp, . . .) Get affected variable names, rule linkage Summarize validators Read/write to yaml format Confront Control behaviour on NA Raise errors, warnings Set machine rounding limit vignette("intro","validate") vignette("rule-files","validate") Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 21. Basic concepts and workflow Validation features Outlook Outlook Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 22. Basic concepts and workflow Validation features Outlook In the works / ideas More analyses of rules More programmability More (interactive) visualisations Roxygen-like metadata specification More support for reporting . . . We’d ♥ to hear your comments, suggestions, bugreports Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 23. Basic concepts and workflow Validation features Outlook Validate is just the beginning! See github.com/data-cleaning Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 24. Basic concepts and workflow Validation features Outlook Literature Van der Loo (2015) A formal typology of data validation functions. in United Nations Economic Comission for Europe Work Session on Statistical Data Editing , Budapest. [pdf] Di Zio et al (2015) Methodology for data validation. ESSNet on validation, deliverable. [pdf] Van der Loo, M. and E. de Jonge (2016). Statistical Data Cleaning with Applications in R, Wiley (in preparation). Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package
  • 25. Basic concepts and workflow Validation features Outlook Contact, links Code, bugreports cran.r-project.org/package=validate github/data-cleaning/validate This talk slideshare.com/markvanderloo Contact mark.vanderloo@gmail.com @markvdloo edwindjonge@gmail.com @edwindjonge Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package