The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A systematic approach to data cleaning with R
Mark van der Loo
markvanderloo.eu | @markvdloo
Budapest | September 3 2016
Mark van der Loo A systematic approach to data cleaning with R
Demos and other materials
https://github.com/markvanderloo/satRday
Contents
The statistical value chain
From raw data to technically correct data
  Strings and encoding
  Regexp and approximate matching
  Type coercion
From technically correct data to consistent data
  Data validation
  Error localization
  Correction, imputation, adjustment
The statistical value chain
Concepts
Technically correct data
Well-defined format (data structure)
Well-defined types (numbers, date/time, string, categorical, ...)
Statistical units can be identified (persons, transactions, phone calls, ...)
Variables can be identified as properties of statistical units.
Note: tidy data ⊂ technically correct data
Consistent data
Data satisfies demands from domain knowledge
(more on this when we talk about validation)
From raw to technically correct data
Dirty tabular data
Demo
Coercing while reading: /table
Tabular data: long story short
read.table: R’s Swiss army knife
Fairly strict (no sniffing)
Very flexible
Interface could be cleaner (see this talk)
readr::read_csv
Easy to switch between strict/lenient parsing
Compact control over column types
Fast
Clear reports of parsing failure
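A minimal base-R sketch of strict, explicit type coercion while reading. The inline CSV text and its column names are invented for illustration and are not taken from the demo material.

```r
# Hypothetical raw input: one 'age' value is not a number.
raw <- "name,age,height
alice,34,1.68
bob,unknown,1.80
carol,51,1.75"

# Read every column as character first, so nothing is converted silently.
dat <- read.table(text = raw, header = TRUE, sep = ",",
                  colClasses = "character", stringsAsFactors = FALSE)

# Coerce explicitly; values that cannot be parsed become NA.
dat$age    <- suppressWarnings(as.integer(dat$age))
dat$height <- as.numeric(dat$height)

# Record which rows failed coercion so they can be inspected later.
failed <- which(is.na(dat$age))
str(dat)
```

Reading as character and coercing afterwards keeps the parsing failures visible instead of turning a whole column into text or factor.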
Really dirty data
Demo
Output file parsing: /parsing
A few lessons from the demo
(base) R has great text processing tools.
Need to work with regular expressions [1]
Write many small functions extracting single data elements.
Don’t overgeneralize: adapt functions as you meet new input.
Smart use of existing tools (read.table(text=))
[1] Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource
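The "many small functions" lesson can be sketched as follows. The log-like input lines and the function names (get_run, get_elapsed, get_status) are hypothetical and only serve as an illustration in base R.

```r
# Hypothetical output lines of some program (invented for this sketch).
lines <- c("run=1; elapsed: 12.5 s; status OK",
           "run=2; elapsed: 7 s; status FAILED",
           "run=3; elapsed: 103.25 s; status OK")

# One small function per data element; each can be adapted on its own
# when new input variants show up.
get_run <- function(x) {
  as.integer(sub("^run=([0-9]+);.*$", "\\1", x))
}
get_elapsed <- function(x) {
  as.numeric(sub("^.*elapsed: ([0-9.]+) s;.*$", "\\1", x))
}
get_status <- function(x) {
  sub("^.*status ([A-Z]+)$", "\\1", x)
}

# Combine the extracted elements into a rectangular data set.
parsed <- data.frame(run     = get_run(lines),
                     elapsed = get_elapsed(lines),
                     status  = get_status(lines),
                     stringsAsFactors = FALSE)
print(parsed)
```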
Packages for standard format parsing
jsonlite: parse JSON files
yaml: parse yaml files
xml2: parse XML files
rvest: scrape and parse HTML files
Some tips on regular expressions with R
stringr has many useful shorthands for common tasks.
Generate regular expressions with rex
library(rex)
# recognize a number in scientific notation
rex(one_or_more(digit)
, maybe(".",one_or_more(digit))
, "E" %or% "e"
, one_or_more(digit))
## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi
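For comparison, a hand-written base-R pattern for the same kind of number. This is a sketch, assuming it corresponds to the rex call above; the anchors ^ and $ are added here so that whole strings are tested, and like the rex expression it does not accept a sign in the exponent.

```r
# Digits, optional fractional part, E or e, digits.
sci <- "^[0-9]+(\\.[0-9]+)?[Ee][0-9]+$"

x <- c("1.5e10", "3E8", "12", "1e-5", "abc")
is_sci <- grepl(sci, x)

# Show each input next to the outcome of the test.
print(data.frame(x = x, is_sci = is_sci))
```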
Regular expressions
Express a pattern of text, e.g.
"(a|b)c*" = {"a", "ac", "acc", ..., "b", "bc", "bcc", ...}
Task                 stringr function
string detection     str_detect(string, pattern)
string extraction    str_extract(string, pattern)
string replacement   str_replace(string, pattern, replacement)
string splitting     str_split(string, pattern)
Base R: grep grepl | regexpr regmatches | sub gsub | strsplit
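The four tasks can be tried out with the base R functions listed above. A small sketch on made-up strings:

```r
x <- c("ac", "bcc", "dc", "a")
pattern <- "(a|b)c*"

# Detection: does the whole string match the pattern?
detected <- grepl(paste0("^", pattern, "$"), x)

# Extraction: first match per string (strings without a match are dropped).
extracted <- regmatches(x, regexpr(pattern, x))

# Replacement: replace the first match in each string.
replaced <- sub(pattern, "X", x)

# Splitting: split on a comma or a semicolon.
pieces <- strsplit("a,b;c", "[,;]")

print(detected)
print(extracted)
print(replaced)
print(pieces)
```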
String normalization
Bring a text string into a standard format, e.g.
Standardize upper/lower case (casefolding)
stringr: str_to_lower, str_to_upper, str_to_title
base R: tolower, toupper
Remove accents (transliteration)
stringi: stri_trans_general
base R: iconv
Re-encoding
stringi: stri_encode
base R: iconv
Uniformize encoding (unicode normalization)
stringi: stri_trans_nfkc (and more)
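A small base-R sketch of two of these steps, case folding and transliteration, on invented input. Note that the result of iconv with "ASCII//TRANSLIT" depends on the platform's iconv implementation, so no output is claimed for it here.

```r
# Invented input: the same word with different case, padding and accents.
x <- c("  Caf\u00e9 ", "CAFE", "cafe")

# Trim surrounding whitespace and fold to lower case.
y <- tolower(trimws(x))

# Transliterate accented characters to ASCII (platform dependent).
z <- iconv(y, from = "UTF-8", to = "ASCII//TRANSLIT")

print(y)
print(z)
```

Unicode normalization (for example NFKC) has no base R equivalent; the stringi functions named above cover that step.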
Encoding
Encoding in R
Demo
Normalization, re-encoding, transliteration: /strings
A few tips
Detect encoding: stringi::stri_enc_detect
Conversion options: iconvlist(), stringi::stri_enc_list()
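A minimal sketch of re-encoding in base R. The example string is made up: it holds the bytes of "café" as they would appear in a latin1 file.

```r
# Bytes as read from a hypothetical latin1-encoded file.
x <- "caf\xe9"

# Declare what the bytes mean; this changes no bytes.
Encoding(x) <- "latin1"

# Convert the bytes to UTF-8.
y <- iconv(x, from = "latin1", to = "UTF-8")

print(Encoding(y))   # encoding mark of the result
print(charToRaw(x))  # one byte for the accented letter
print(charToRaw(y))  # two bytes for the accented letter in UTF-8
```

The distinction matters: Encoding<- only relabels a string, while iconv rewrites its bytes.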
Approximate text matching
Demo
Approximate matching and normalization: /matching
Approximate text matching: edit-based distances
                       Allowed operation
Distance               substitution  deletion  insertion  transposition
Hamming                yes           no        no         no
LCS                    no            yes       yes        no
Levenshtein            yes           yes       yes        no
OSA                    yes           yes       yes        yes*
Damerau-Levenshtein    yes           yes       yes        yes
* Substrings may be edited only once.
leela → leea → leia
stringdist::stringdist("leela", "leia", method = "dl")
## [1] 2
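Two of these distances can be tried without extra packages. This is a sketch: base R's adist computes a (generalized) Levenshtein distance, not OSA or Damerau-Levenshtein, and the hamming helper and the lookup table are written here for illustration only.

```r
# Levenshtein distance with base R.
d_lv <- adist("leela", "leia")[1, 1]

# Without a transposition operation, swapping two characters costs 2.
d_swap <- adist("ab", "ba")[1, 1]

# Hamming distance: substitutions only, strings must have equal length.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}
d_hm <- hamming("leia", "lela")

# Approximate matching of dirty values against a hypothetical lookup table:
# pick, per dirty value, the lookup entry with the smallest distance.
lookup  <- c("leia", "luke", "han")
dirty   <- c("leela", "luk", "hann")
d       <- adist(dirty, lookup)
matched <- lookup[apply(d, 1, which.min)]

print(c(levenshtein = d_lv, swap = d_swap, hamming = d_hm))
print(data.frame(dirty = dirty, matched = matched))
```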
Some pointers for approximate matching
Normalisation and approximate matching are complementary
See my useR2014 talk or paper on stringdist for more distances
The fuzzyjoin package allows fuzzy joining of datasets
Other good stuff
lubridate: extract dates from strings
lubridate::dmy("17 December 2015")
## [1] "2015-12-17"
tidyr: many data cleaning operations to make your life easier
readr: parse numbers from text strings
readr::parse_number(c("2%", "6%", "0.3%"))
## [1] 2.0 6.0 0.3
From technically correct to consistent data
The mantra of data cleaning
Detection (data conflicts with domain knowledge)
Selection (find the value(s) that cause the violation)
Correction (replace them with better values)
Detection, AKA data validation
Informally:
Data Validation is checking data against (multivariate) expectations
about a data set.
Validation rules
Often these expectations can be expressed as a set of simple
validation rules.
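The idea of simple validation rules can be sketched in base R. This is not the validate package's interface; the data set, the rule names and the rules themselves are invented for illustration.

```r
# Hypothetical data set with a few problems.
dat <- data.frame(age     = c(25, -3, 40, 12),
                  married = c(TRUE, FALSE, TRUE, TRUE),
                  income  = c(3000, 1500, NA, 0))

# Rules stored as unevaluated expressions, so they are data themselves.
rules <- list(
  nonneg_age       = quote(age >= 0),
  adult_if_married = quote(!married | age >= 16),  # multivariate rule
  nonneg_income    = quote(income >= 0)
)

# Evaluate every rule on every record: TRUE, FALSE, or NA (cannot decide).
results <- sapply(rules, function(r) eval(r, dat))

# Summarize per rule.
fails   <- colSums(!results, na.rm = TRUE)
missing <- colSums(is.na(results))
print(rbind(fails = fails, missing = missing))
```

Keeping the NA outcomes separate from failures matters: a rule that cannot be evaluated is not the same as a rule that is violated.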
Data validation
Demo
The validate package /validate
The validate package, in summary
Make data validation rules explicit
Treat them as objects of computation
store to / read from file
manipulate
annotate
Confront data with rules
Analyze/visualize the results
Tracking changes when altering data
Tracking changes in rule violations
Use rules to correct data
Main idea
Rules restrict the data. Sometimes this is enough to derive a correct
value uniquely.
Examples
Correct typos in values under linear restrictions
123 + 45 ≠ 177, but 123 + 54 = 177.
Derive imputations from values under linear restrictions
123 + NA = 177, compute 177 − 123 = 54.
Both can be generalized to systems Ax ≤ b.
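Both examples can be worked through for the single rule x + y = total. The functions below are a sketch written for this one rule, with hypothetical names; they are not the interface of any package. The typo check accepts any reordering of the digits, which is broader than a swap of two adjacent digits.

```r
# Deductive imputation: fill the single missing term of x + y == total.
impute_balance <- function(x, y, total) {
  if (is.na(y) && !is.na(x) && !is.na(total)) y <- total - x
  c(x = x, y = y, total = total)
}

# TRUE if a and b consist of the same digits in a different order.
is_digit_swap <- function(a, b) {
  da <- sort(strsplit(as.character(a), "")[[1]])
  db <- sort(strsplit(as.character(b), "")[[1]])
  a != b && identical(da, db)
}

# Typo correction: if the rule fails and the value of y implied by the
# rule is a digit permutation of the recorded y, accept the implied value.
correct_typo <- function(x, y, total) {
  implied <- total - x
  if (x + y != total && is_digit_swap(y, implied)) y <- implied
  c(x = x, y = y, total = total)
}

print(impute_balance(123, NA, 177))  # y is derived as 177 - 123
print(correct_typo(123, 45, 177))    # 45 is replaced by 54
print(correct_typo(123, 40, 177))    # no plausible typo: left unchanged
```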
Deductive correction and imputation
Demo
The deductive package: /deductive.
Selection, or: error localization
Fellegi and Holt (1976)
Find the least (weighted) number of fields that can be imputed such
that all rules can be satisfied.
Note
Solutions need not be unique.
A random one is chosen in case of degeneracy.
Lowest weight need not guarantee smallest number of altered
variables.
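The principle can be illustrated with a brute-force search on a tiny example. This is a sketch only: the rules, weights and the finite candidate domains are assumptions made here, and exhaustive enumeration does not scale beyond a handful of variables. Ties are resolved by taking the first solution found rather than a random one.

```r
# Hypothetical in-record rules.
rules <- list(
  quote(age >= 0),
  quote(age <= 120),
  quote(!married | age >= 16)
)
satisfies <- function(rec) {
  all(vapply(rules, function(r) isTRUE(eval(r, rec)), logical(1)))
}

# Assumed finite sets of candidate values per variable.
domain <- list(age = 0:120, married = c(TRUE, FALSE))

# Return the cheapest set of variables that can be changed so that
# all rules are satisfied (character(0) if the record is already valid).
locate_errors <- function(rec, weights) {
  if (satisfies(rec)) return(character(0))
  vars <- names(rec)
  subsets <- unlist(lapply(seq_along(vars),
                           function(k) combn(vars, k, simplify = FALSE)),
                    recursive = FALSE)
  cost <- vapply(subsets, function(s) sum(weights[s]), numeric(1))
  for (s in subsets[order(cost)]) {
    grid <- expand.grid(domain[s], stringsAsFactors = FALSE)
    for (i in seq_len(nrow(grid))) {
      cand <- rec
      cand[s] <- as.list(grid[i, , drop = FALSE])
      if (satisfies(cand)) return(s)
    }
  }
  NULL
}

rec <- list(age = 10, married = TRUE)  # violates the third rule
print(locate_errors(rec, c(age = 2, married = 1)))  # trust age more
print(locate_errors(rec, c(age = 1, married = 2)))  # trust married more
```

With equal weights both answers are equally good, which illustrates that solutions need not be unique.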
Error localization
Demo
The errorlocate package: /errorlocate
Notes on errorlocate
For in-record rules
Support for
linear (in)equality rules
Conditionals on categorical variables (if male then not pregnant)
Mixed conditionals (if has job then age >= 15)
Conditionals with linear predicates (if staff > 0 then staff cost > 0)
Optimization is mapped to a mixed-integer programming (MIP) problem.
Missing values
Mechanisms (Rubin):
MCAR: missing completely at random
MAR: P(Y = NA) depends on value of X
MNAR: P(Y = NA) depends on value of Y
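A small simulation can make the three mechanisms concrete. The model (y depends linearly on x) and the missingness probabilities are assumptions chosen for this sketch.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- x + rnorm(n)

# MCAR: missingness ignores both x and y.
y_mcar <- ifelse(runif(n) < 0.3, NA, y)
# MAR: missingness depends on the fully observed x.
y_mar  <- ifelse(runif(n) < plogis(2 * x), NA, y)
# MNAR: missingness depends on the value of y itself.
y_mnar <- ifelse(runif(n) < plogis(2 * y), NA, y)

# Compare the naive mean of the observed values with the true mean.
means <- c(full = mean(y),
           mcar = mean(y_mcar, na.rm = TRUE),
           mar  = mean(y_mar,  na.rm = TRUE),
           mnar = mean(y_mnar, na.rm = TRUE))
print(round(means, 2))
```

Under MCAR the observed mean stays close to the full mean. Under MAR and MNAR it is biased downward here; under MAR the bias can in principle be corrected using x, under MNAR it cannot be corrected from the observed data alone.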
Imputation
Purpose of imputation vs prediction
Prediction: estimate a single value (often for a single use)
Imputation: estimate values such that the completed data set allows for valid inference (this is very difficult!)
Imputation methods
Deductive imputation
Imputation based on predictive models
Donor imputation (knn, pmm, sequential/random hot deck)
Predictive model-based imputation
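$\hat{y} = \hat{f}(x) + \epsilon$
e.g. linear regression
$\hat{y} = \alpha + x^T\hat{\beta} + \epsilon$
Residual:
$\epsilon = 0$: impute expected value
$\epsilon$ drawn from observed residuals $e$
$\epsilon \sim N(0, \sigma)$: parametric residual, $\hat{\sigma}^2 = \mathrm{var}(e)$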
Multiple imputation (Bayesian bootstrap)
Draw β from parametric distribution, impute multiple times.
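The three residual options can be sketched with lm from base R (stats). The simulated data and variable names are assumptions of this example; multiple imputation is not shown.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)

# Make 40 values of y missing (completely at random, for simplicity).
miss <- sample(n, 40)
dat <- data.frame(x = x, y = replace(y, miss, NA))

# Fit on the complete cases; lm drops records with missing y by default.
fit  <- lm(y ~ x, data = dat)
pred <- predict(fit, newdata = dat[miss, , drop = FALSE])
e    <- residuals(fit)

# Residual = 0: impute the expected value.
imp_expected <- pred
# Residual drawn from the observed residuals.
imp_observed <- pred + sample(e, length(pred), replace = TRUE)
# Parametric residual: normal with variance estimated as var(e).
imp_normal   <- pred + rnorm(length(pred), mean = 0, sd = sd(e))

# Imputing the expected value understates the spread of y.
print(c(expected = sd(imp_expected),
        observed = sd(imp_observed),
        normal   = sd(imp_normal)))
```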
Donor imputation (hot deck)
Method variants:
Random hot deck: copy value from random record.
Sequential hot deck: copy value from previous record.
k-nearest neighbours: draw donor from k nearest neighbours
Predictive mean matching: copy value closest to prediction
Donor pool variants:
per variable
per missing data pattern
per record
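The first two variants can be sketched in a few lines of base R for a single variable. The function names and the data are made up for this example, and the donor pool is simply all observed values of the variable.

```r
# Random hot deck: copy the value of a randomly chosen donor record.
random_hotdeck <- function(v) {
  donors <- v[!is.na(v)]
  k <- sum(is.na(v))
  v[is.na(v)] <- donors[sample.int(length(donors), k, replace = TRUE)]
  v
}

# Sequential hot deck: copy the value of the previous record
# (last observation carried forward); a leading NA stays missing.
sequential_hotdeck <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

set.seed(7)
y <- c(10, NA, 12, NA, NA, 15, 11)
print(random_hotdeck(y))
print(sequential_hotdeck(y))
```

The result of the sequential variant depends on the order of the records, so sorting the data on a relevant auxiliary variable first is a common choice.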
Note on multivariate donor imputation
Many multivariate methods seem relatively ad hoc, and more
theoretical and empirical comparisons with alternative approaches
would be of interest.
Andridge and Little (2010) A Review of Hot Deck Imputation for Survey
Non-response. Int. Stat. Rev. 78(1) 40–64
Demo time
Demo
Imputation /imputation
VIM: visualisation, GUI, extensive methodology
simputation: simple, scriptable interface to common methods
Methods supported by simputation
Model based (optionally add [non-]parametric random residual)
linear regression
robust linear regression
CART models
Random forest
Donor imputation (including various donor pool specifications)
k-nearest neighbour (based on Gower’s distance)
sequential hotdeck (LOCF, NOCB)
random hotdeck
Predictive mean matching
Other
(groupwise) median imputation (optional random residual)
Proxy imputation (copy from other variable)
Credits
deductive: Mark van der Loo, Edwin de Jonge
errorlocate: Edwin de Jonge, Mark van der Loo
gower: Mark van der Loo
jsonlite: Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel
magrittr: Stefan Milton Bache, Hadley Wickham
rex: Kevin Ushey, Jim Hester, Robert Krzyzanowski
simputation: Mark van der Loo
stringdist: Mark van der Loo, Jan van der Laan, R Core, Nick Logan
stringi: Marek Gagolewski, Bartek Tartanus
stringr: Hadley Wickham, RStudio
tidyr: Hadley Wickham, RStudio
validate: Mark van der Loo, Edwin de Jonge
VIM: Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner
xml2: Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation
Mark van der Loo A systematic approach to data cleaning with R

Weitere ähnliche Inhalte

Andere mochten auch

Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)
Neha Ahuja
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศ
Tepasoon Songnaa
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)
Kittiya Youngjarean
 
Luyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcLuyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin học
ltgiang87
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)
Kittiya Youngjarean
 
Meaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. SinielMeaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. Siniel
airazygy
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการ
Kittiya Youngjarean
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอล
Tepasoon Songnaa
 

Andere mochten auch (16)

Storyworld Jam '14
Storyworld Jam '14Storyworld Jam '14
Storyworld Jam '14
 
Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศ
 
Tugas 1
Tugas 1Tugas 1
Tugas 1
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)
 
Luyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcLuyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin học
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)
 
Meaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. SinielMeaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. Siniel
 
˹· 1 ǡѻ
˹· 1  ǡѻ˹· 1  ǡѻ
˹· 1 ǡѻ
 
Bridging the gap between theory and practise
Bridging the gap between theory and practiseBridging the gap between theory and practise
Bridging the gap between theory and practise
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการ
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอล
 
IoT Protocols
IoT ProtocolsIoT Protocols
IoT Protocols
 
TV3.0 New TV frontiers.
TV3.0 New TV frontiers.TV3.0 New TV frontiers.
TV3.0 New TV frontiers.
 
Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)
 
Graphene light bulb set for shops
Graphene light bulb set for shopsGraphene light bulb set for shops
Graphene light bulb set for shops
 

Ähnlich wie Sat rday

Ähnlich wie Sat rday (20)

Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsData Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
 
HospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformHospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics Platform
 
Unit 3-2.ppt
Unit 3-2.pptUnit 3-2.ppt
Unit 3-2.ppt
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 
Lec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignLec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic Design
 
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbai
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Understanding R for Epidemiologists
Understanding R for EpidemiologistsUnderstanding R for Epidemiologists
Understanding R for Epidemiologists
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentation
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by Cerved
 
computer architecture
computer architecture computer architecture
computer architecture
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rules
 
Cerved Datascience Milan
Cerved Datascience MilanCerved Datascience Milan
Cerved Datascience Milan
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
 
Software Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsSoftware Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and Metrics
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
 
R programming by ganesh kavhar
R programming by ganesh kavharR programming by ganesh kavhar
R programming by ganesh kavhar
 

Kürzlich hochgeladen

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 

Kürzlich hochgeladen (20)

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...

satRday

  • 1. A systematic approach to data cleaning with R. Mark van der Loo, markvanderloo.eu | @markvdloo. Budapest | September 3 2016. Sections: The statistical value chain; From raw to technically correct data; From technically correct to consistent data.
  • 2. Demos and other materials: https://github.com/markvanderloo/satRday
  • 3. Contents. The statistical value chain. From raw data to technically correct data: strings and encoding; regexp and approximate matching; type coercion. From technically correct data to consistent data: data validation; error localization; correction, imputation, adjustment.
  • 4. The statistical value chain
  • 5. Statistical value chain
  • 6. Concepts. Technically correct data: well-defined format (data structure); well-defined types (numbers, date/time, string, categorical...); statistical units can be identified (persons, transactions, phone calls...); variables can be identified as properties of statistical units. Note: tidy data ⊂ technically correct data. Consistent data: data satisfies demands from domain knowledge (more on this when we talk about validation).
  • 7. From raw to technically correct data
  • 8. Dirty tabular data. Demo: coercing while reading: /table
  • 9. Tabular data: long story short. read.table: R's swiss army knife; fairly strict (no sniffing); very flexible; interface could be cleaner (see this talk). readr::read_csv: easy to switch between strict/lenient parsing; compact control over column types; fast; clear reports of parsing failure.
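
The strictness of read.table can be illustrated with a small base R sketch. The raw text, column names, and missing-value markers below are invented for illustration and are not taken from the talk's demo.

```r
# Hypothetical raw input; column names and values are made up
raw <- "id;age;height
1;25;1.80
2;n/a;1.65
3;41;"

# Strict reading: declaring the types up front makes bad input fail loudly
strict <- tryCatch(
  read.table(text = raw, sep = ";", header = TRUE,
             colClasses = c("integer", "integer", "numeric")),
  error = function(e) conditionMessage(e)
)

# Lenient reading: same types, but the known missing-value markers are declared
dat <- read.table(
  text = raw, sep = ";", header = TRUE,
  colClasses = c("integer", "integer", "numeric"),
  na.strings = c("", "n/a", "NA")
)
str(dat)
```

Here `strict` ends up holding an error message rather than a data frame, because "n/a" cannot be read as an integer; `dat` has the declared column types, with NA where the input was unusable.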
  • 10. Really dirty data. Demo: output file parsing: /parsing
  • 11. A few lessons from the demo. (Base) R has great text processing tools. Need to work with regular expressions [1]. Write many small functions extracting single data elements. Don't overgeneralize: adapt functions as you meet new input. Smart use of existing tools (read.table(text=)). [1] Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource.
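
A minimal sketch of these lessons, assuming an invented output-file layout (the lines, markers, and function names below are hypothetical and not the demo's actual input): one small function per data element, and read.table(text=) for the part that is already tabular.

```r
# Hypothetical program output; the layout is invented for illustration
lines <- c(
  "Run started: 2016-09-03",
  "status = OK",
  "--- results ---",
  "a 1.5 10",
  "b 2.5 20",
  "--- end ---"
)

# One small function per data element
get_date <- function(x) {
  hit <- grep("^Run started:", x, value = TRUE)
  as.Date(sub("^Run started:[[:space:]]*", "", hit))
}

get_status <- function(x) {
  hit <- grep("^status[[:space:]]*=", x, value = TRUE)
  trimws(sub("^status[[:space:]]*=", "", hit))
}

# Reuse read.table for the block between the two marker lines
get_results <- function(x) {
  start <- grep("^--- results ---$", x)
  end   <- grep("^--- end ---$", x)
  read.table(text = x[(start + 1):(end - 1)],
             col.names = c("label", "value", "count"),
             stringsAsFactors = FALSE)
}

run_date <- get_date(lines)
status   <- get_status(lines)
results  <- get_results(lines)
```

Each extractor can be adapted on its own when a new input variant shows up, which is the point of keeping them small.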
  • 12. Packages for standard format parsing. jsonlite: parse JSON files. yaml: parse YAML files. xml2: parse XML files. rvest: scrape and parse HTML files.
  • 13. Some tips on regular expressions with R. stringr has many useful shorthands for common tasks. Generate regular expressions with rex:
      library(rex)
      # recognize a number in scientific notation
      rex(one_or_more(digit)
        , maybe(".", one_or_more(digit))
        , "E" %or% "e"
        , one_or_more(digit))
      ## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi
  • 14. Regular expressions. Express a pattern of text, e.g. "(a|b)c*" = {"a", "ac", "acc", ..., "b", "bc", "bcc", ...}. Tasks and stringr functions:
      string detection:   str_detect(string, pattern)
      string extraction:  str_extract(string, pattern)
      string replacement: str_replace(string, pattern, replacement)
      string splitting:   str_split(string, pattern)
    Base R: grep, grepl | regexpr, regmatches | sub, gsub | strsplit
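
The four tasks can be sketched with the base R functions listed on the slide, using the slide's pattern "(a|b)c*". The input strings are made up for illustration.

```r
x <- c("a", "acc", "bc", "xyz", "bcc and ac")
pattern <- "(a|b)c*"

# Detection: does the pattern occur anywhere in each string?
detected <- grepl(pattern, x)

# Extraction: first match per string (strings without a match are dropped)
extracted <- regmatches(x, regexpr(pattern, x))

# Replacement: replace the first match only (gsub would replace all matches)
replaced <- sub(pattern, "<hit>", x)

# Splitting: split on a pattern, here a comma or a semicolon
pieces <- strsplit("a,b;c", "[,;]")

# Anchoring with ^ and $ tests whether the whole string belongs to the set
whole <- grepl("^(a|b)c*$", x)
```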
  • 15. String normalization. Bring a text string into a standard format, e.g.
      Standardize upper/lower case (case folding). stringr: str_to_lower, str_to_upper, str_to_title; base R: tolower, toupper
      Remove accents (transliteration). stringi: stri_trans_general; base R: iconv
      Re-encoding. stringi: stri_encode; base R: iconv
      Uniformize encoding (Unicode normalization). stringi: stri_trans_nfkc (and more)
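
A small base R sketch of two of these steps, case folding and re-encoding. The example strings are invented, and the whitespace trimming with trimws is an extra step assumed here, not one named on the slide. Transliteration is left out because the result of iconv with a transliteration target depends on the platform.

```r
# Case folding (plus whitespace trimming) makes near-duplicates comparable
x <- c("  Hello World ", "HELLO world")
normalized <- tolower(trimws(x))

# Re-encoding: a string stored as latin1 bytes, converted to UTF-8
latin1_string <- "caf\xe9"          # 0xE9 is e-acute in latin1
utf8_string <- iconv(latin1_string, from = "latin1", to = "UTF-8")

Encoding(utf8_string)               # encoding mark of the converted string
nchar(utf8_string)                  # number of characters, not bytes
```

After normalization the two variants of "hello world" compare equal, so exact matching or deduplication can run on the normalized strings.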
  • 16. Encoding
  • 17. Encoding in R
  • 18. Encoding in R
  • 19. Encoding in R. Demo: normalization, re-encoding, transliteration: /strings
  • 20. A few tips. Detect encoding: stringi::stri_enc_detect. Conversion options: iconvlist(), stringi::stri_enc_list()
  • 21. Approximate text matching
  • 22. Approximate text matching. Demo: approximate matching and normalization: /matching
  • 23. Approximate text matching: edit-based distances. Allowed operations per distance:
      Distance             substitution  deletion  insertion  transposition
      Hamming              yes           no        no         no
      LCS                  no            yes       yes        no
      Levenshtein          yes           yes       yes        no
      OSA                  yes           yes       yes        yes*
      Damerau-Levenshtein  yes           yes       yes        yes
      * Substrings may be edited only once.
    Example: leela → leea → leia
      stringdist::stringdist("leela", "leia", method = "dl")
      ## [1] 2
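
To see how the choice of allowed operations changes the result, the same pair of strings can be compared under each method. This sketch assumes the stringdist package is installed; the second pair ("ca", "abc") is a standard illustration of the OSA restriction and is not from the slide.

```r
library(stringdist)

methods <- c("hamming", "lcs", "lv", "osa", "dl")

# The slide's example under every edit-based distance
leela_leia <- sapply(methods, function(m) stringdist("leela", "leia", method = m))
leela_leia

# OSA versus full Damerau-Levenshtein: OSA may not edit a substring twice,
# so it cannot transpose "ca" to "ac" and then insert "b" in between
osa_dist <- stringdist("ca", "abc", method = "osa")
dl_dist  <- stringdist("ca", "abc", method = "dl")
```

Note that the Hamming distance is only defined for strings of equal length, so it is not a useful choice for the first pair.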
  • 24. Some pointers for approximate matching. Normalisation and approximate matching are complementary. See my useR2014 talk or paper on stringdist for more distances. The fuzzyjoin package allows fuzzy joining of datasets.
  • 25. Other good stuff. lubridate: extract dates from strings:
      lubridate::dmy("17 December 2015")
      ## [1] "2015-12-17"
    tidyr: many data cleaning operations to make your life easier. readr: parse numbers from text strings:
      readr::parse_number(c("2%", "6%", "0.3%"))
      ## [1] 2.0 6.0 0.3
  • 26. From technically correct to consistent data
  • 27. The mantra of data cleaning. Detection (data conflicts with domain knowledge). Selection (find the value(s) that cause the violation). Correction (replace them with better values).
  • 28. Detection, AKA data validation. Informally: data validation is checking data against (multivariate) expectations about a data set. Validation rules: often these expectations can be expressed as a set of simple validation rules.
  • 29. Data validation. Demo: the validate package: /validate
  • 30. The validate package, in summary. Make data validation rules explicit. Treat them as objects of computation: store to / read from file; manipulate; annotate. Confront data with rules. Analyze/visualize the results.
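
A minimal sketch of this workflow, assuming the validate package is installed. The data frame and the two rules are invented for illustration; they are not the rules used in the talk's demo.

```r
library(validate)

# Hypothetical data with one rule violation and one missing value
dat <- data.frame(
  age    = c(25, -1, 40),
  income = c(1000, 500, NA)
)

# Rules are explicit objects that can be stored, inspected and reused
rules <- validator(
  nonneg_age    = age >= 0,
  nonneg_income = income >= 0
)

# Confront the data with the rules and inspect the outcome
cf <- confront(dat, rules)
summary(cf)          # per rule: number of passes, fails and NA results
v <- values(cf)      # per record and rule: TRUE, FALSE or NA
```

A rule evaluated on a missing value yields NA rather than a failure, so the summary distinguishes "violated" from "could not be checked".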
  • 31. Tracking changes when altering data
  • 32. Tracking changes in rule violations
  • 33. Use rules to correct data. Main idea: rules restrict the data. Sometimes this is enough to derive a correct value uniquely. Examples: (1) Correct typos in values under linear restrictions: 123 + 45 ≠ 177, but 123 + 54 = 177 (two digits were swapped). (2) Derive imputations from values under linear restrictions: from 123 + NA = 177, compute 177 − 123 = 54. Both can be generalized to systems Ax ≤ b.
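
The imputation example can be written out for a single linear equality rule. This is a base R sketch under stated assumptions, not the deductive package's implementation; the function name impute_linear and the rule encoding (coefficients a, constant b) are hypothetical.

```r
# Impute one missing value from a linear equality rule sum(a * x) == b.
# If more than one value is missing (or the coefficient is zero), the rule
# does not determine the value uniquely and the record is returned unchanged.
impute_linear <- function(x, a, b) {
  miss <- which(is.na(x))
  if (length(miss) != 1 || a[miss] == 0) return(x)
  x[miss] <- (b - sum(a[-miss] * x[-miss])) / a[miss]
  x
}

# The slide's example, written as the rule x1 + x2 - x3 == 0
record  <- c(x1 = 123, x2 = NA, x3 = 177)
imputed <- impute_linear(record, a = c(1, 1, -1), b = 0)
imputed
```

With two missing values in the same rule the function leaves the record as it is, which mirrors the slide's point that rules only sometimes pin down a value uniquely.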
  • 34. Deductive correction and imputation. Demo: the deductive package: /deductive
  • 35. Selection, or: error localization. Fellegi and Holt (1976): find the least (weighted) number of fields that can be imputed such that all rules can be satisfied. Note: solutions need not be unique; a random one is chosen in case of degeneracy. The lowest weight need not guarantee the smallest number of altered variables.
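
The Fellegi-Holt principle can be sketched by brute force for a tiny record. This is an illustration only: the record, rules, value domains, and function names are hypothetical, and an exhaustive search over field subsets is used here instead of the optimization a real implementation would use.

```r
# Hypothetical record, candidate value domains and in-record rules
record  <- list(age = 3, married = TRUE, income = 0)
domains <- list(age = 0:100, married = c(TRUE, FALSE), income = c(0, 1000))
rules <- list(
  function(r) !r$married || r$age >= 16,   # if married then age >= 16
  function(r) r$income >= 0
)
satisfies <- function(r) all(vapply(rules, function(f) f(r), logical(1)))

# Can all rules be satisfied by changing only the given fields?
can_fix <- function(record, fields) {
  if (length(fields) == 0) return(satisfies(record))
  grid <- expand.grid(domains[fields], KEEP.OUT.ATTRS = FALSE)
  for (i in seq_len(nrow(grid))) {
    candidate <- record
    candidate[fields] <- as.list(grid[i, , drop = FALSE])
    if (satisfies(candidate)) return(TRUE)
  }
  FALSE
}

# Search all field subsets for the feasible one with the lowest total weight
localize <- function(record, weights) {
  vars <- names(record)
  best <- NULL
  best_weight <- Inf
  for (k in 0:length(vars)) {
    subsets <- if (k == 0) list(character(0)) else combn(vars, k, simplify = FALSE)
    for (s in subsets) {
      w <- sum(weights[s])
      if (w < best_weight && can_fix(record, s)) {
        best <- s
        best_weight <- w
      }
    }
  }
  best
}

trust_age <- localize(record, c(age = 2, married = 1, income = 1))
equal_w   <- localize(record, c(age = 1, married = 1, income = 1))
```

With equal weights, changing either age or married repairs the record, so the solution is not unique; this sketch simply keeps the first one it finds, whereas the slide describes a random choice. Giving age a higher weight expresses more trust in that field and moves the blame to married.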
  • 36. Error localization. Demo: the errorlocate package: /errorlocate
  • 37. Notes on errorlocate. For in-record rules. Support for linear (in)equality rules. Conditionals on categorical variables (if male then not pregnant). Mixed conditionals (if has job then age ≥ 15). Conditionals with linear predicates (if staff > 0 then staff cost > 0). Optimization is mapped to a MIP problem.
  • 38. Missing values. Mechanisms (Rubin): MCAR: missing completely at random. MAR: P(Y = NA) depends on value of X. MNAR: P(Y = NA) depends on value of Y.
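
The three mechanisms can be simulated in a few lines of base R. The data-generating model, sample size, and missingness probabilities below are arbitrary choices made for illustration.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)            # y is correlated with the fully observed x

# MCAR: every value of y has the same probability of being missing
mcar <- runif(n) < 0.3

# MAR: the probability that y is missing depends on the observed x
mar <- runif(n) < plogis(x)

# MNAR: the probability that y is missing depends on y itself
mnar <- runif(n) < plogis(y)

# Compare the mean of the observed y values with the full-data mean
c(full = mean(y),
  mcar = mean(y[!mcar]),
  mar  = mean(y[!mar]),
  mnar = mean(y[!mnar]))
```

Under MAR the bias in the observed mean can in principle be corrected using x, since x explains the missingness; under MNAR the information needed for the correction is itself missing.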
  • 39. Imputation. Purpose of imputation vs prediction. Prediction: estimate a single value (often for a single use). Imputation: estimate values such that the completed data set allows for valid inference [a]. Imputation methods: deductive imputation; imputation based on predictive models; donor imputation (knn, pmm, sequential/random hot deck). [a] This is very difficult!
  • 40. Predictive model-based imputation: $\hat{y} = \hat{f}(x) + \epsilon$, e.g. linear regression $\hat{y} = \alpha + x^T\hat{\beta} + \epsilon$. Residual options: $\epsilon = 0$: impute the expected value. $\epsilon$ drawn from the observed residuals $e$. $\epsilon \sim N(0, \sigma)$: parametric residual, $\hat{\sigma}^2 = \mathrm{var}(e)$. Multiple imputation (Bayesian bootstrap): draw $\beta$ from a parametric distribution, impute multiple times.
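
The three residual options can be sketched with a linear model in base R. The simulated data and variable names are invented for illustration; multiple imputation is not shown.

```r
set.seed(42)

# Hypothetical data: y depends linearly on x, and 20 values of y are missing
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
dat$y[sample(100, 20)] <- NA

fit  <- lm(y ~ x, data = dat)        # lm drops incomplete rows by default
miss <- is.na(dat$y)
pred <- predict(fit, newdata = dat[miss, , drop = FALSE])
e    <- residuals(fit)               # observed residuals

# Residual = 0: impute the expected value
imp_expected <- pred

# Residual drawn from the observed residuals
imp_observed <- pred + sample(e, sum(miss), replace = TRUE)

# Parametric residual: normal with standard deviation estimated from e
imp_normal <- pred + rnorm(sum(miss), mean = 0, sd = sd(e))
```

Imputing the expected value puts every imputed point exactly on the regression line, which understates the variance of the completed data; adding a residual is one way to restore that variability.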
  • 41. Donor imputation (hot deck). Method variants: random hot deck: copy value from a random record; sequential hot deck: copy value from the previous record; k-nearest neighbours: draw donor from the k nearest neighbours; predictive mean matching: copy the value closest to the prediction. Donor pool variants: per variable; per missing data pattern; per record.
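
The two simplest variants can be written as short base R functions. This is a sketch for a single variable with the donor pool taken per variable; the function names are hypothetical.

```r
# Random hot deck: copy the value of a randomly chosen observed record
random_hot_deck <- function(v) {
  miss   <- is.na(v)
  donors <- v[!miss]
  v[miss] <- donors[sample.int(length(donors), sum(miss), replace = TRUE)]
  v
}

# Sequential hot deck: copy the value of the previous record
# (a missing value in the first position has no donor and stays missing)
sequential_hot_deck <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

v <- c(1, NA, NA, 4, NA)
set.seed(1)
random_imputed     <- random_hot_deck(v)
sequential_imputed <- sequential_hot_deck(v)
```

The sequential variant depends on the order of the records, so sorting the data on a related variable before imputing changes which donor is used.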
  • 42. Note on multivariate donor imputation. Many multivariate methods seem relatively ad hoc, and more theoretical and empirical comparisons with alternative approaches would be of interest. Andridge and Little (2010) A Review of Hot Deck Imputation for Survey Non-response. Int. Stat. Rev. 78(1) 40–64.
  • 43. Demo time. Demo: imputation: /imputation. VIM: visualisation, GUI, extensive methodology. simputation: simple, scriptable interface to common methods.
  • 44. Methods supported by simputation. Model based (optionally add [non-]parametric random residual): linear regression; robust linear regression; CART models; random forest. Donor imputation (including various donor pool specifications): k-nearest neighbour (based on Gower's distance); sequential hot deck (LOCF, NOCB); random hot deck; predictive mean matching. Other: (groupwise) median imputation (optional random residual); proxy imputation (copy from other variable).
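
A short sketch of the formula interface, assuming the simputation package is installed. The missing values are punched into the built-in iris data purely for illustration; this is not the talk's demo script.

```r
library(simputation)

# Built-in data with some values removed for illustration
dat <- iris
dat[1:3, "Sepal.Length"] <- NA
dat[4:6, "Sepal.Width"]  <- NA

# Model-based: linear regression, imputed variable on the left-hand side
out <- impute_lm(dat, Sepal.Length ~ Petal.Length + Species)

# Groupwise median: the right-hand side names the grouping variable
out <- impute_median(out, Sepal.Width ~ Species)

# Optional random residual for the model-based method
out_resid <- impute_lm(dat, Sepal.Length ~ Petal.Length + Species,
                       add_residual = "normal")
```

Because every function takes a data frame and returns a data frame, several methods can be chained, each filling in what the previous one left missing.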
  • 45. Credits
      deductive: Mark van der Loo, Edwin de Jonge
      errorlocate: Edwin de Jonge, Mark van der Loo
      gower: Mark van der Loo
      jsonlite: Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel
      magrittr: Stefan Milton Bache, Hadley Wickham
      rex: Kevin Ushey, Jim Hester, Robert Krzyzanowski
      simputation: Mark van der Loo
      stringdist: Mark van der Loo, Jan van der Laan, R Core, Nick Logan
      stringi: Marek Gagolewski, Bartek Tartanus
      stringr: Hadley Wickham, RStudio
      tidyr: Hadley Wickham, RStudio
      validate: Mark van der Loo, Edwin de Jonge
      VIM: Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner
      xml2: Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation