SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
DATA CLEANSING
SKY YIN
Photo credit: http://outofmygord.com/2015/04/08/the-messy-part-of-marketing/
DATA QUALITY ISSUES
MISSING DATA
▸ Null, empty string, 0, NA, N/A
▸ Find root cause
▸ Randomly missing or regular missing
▸ Fix missing data
▸ Skip
▸ Fill
DUPLICATED DATA
▸ Detect dups
▸ Unique count
▸ Root cause: bug or process or valid reason?
▸ Dup caused by typo, inconsistent format, spelling, and abbreviations
▸ Be careful on things look like dups but actually different
▸ People with same names
OUTLIERS
▸ Outlier detection
▸ Histogram is your friend
▸ Dealing with outliers
▸ Bug or exception
▸ Corrupted data
▸ Intentional wrong input: age, gender, post code
SUBTLE PROBLEMS
▸ Order in records
▸ Always sort. Don’t assume order
▸ Hidden link across records
▸ Duplicated session end bug
▸ Need rule-based detection
▸ Don’t know what you don’t know
BEYOND ISSUES
▸ Transforming
▸ Encoding
▸ Local time <—> UTC time
▸ Tidy data/normalization
▸ Storage optimization: Parquet, ORC
▸ Flexibility optimization: JSON
TOOLS
TEXT
EXPLORATORY CLEANSING
▸ R: dataframe, data.table, dplyr
▸ Python: pandas, ipython notebook
▸ Open Refine
▸ Trifacta
TEXT
PRODUCTION CLEANSING
▸ ETL
▸ Hadoop-based: Pig, Scalding
▸ Spark (can also be used for exploratory cleansing)
▸ ETL mangement
▸ AWS data pipeline
▸ Airbnb airflow
TEXT
USE MACHINE LEARNING TO CLEANSING DATA
▸ Clustering
▸ Use similarity to find dups
▸ Use similarity to find difference
PRACTICES
TEXT
GENERAL PRACTICES
▸ Data pipeline to automate the process
▸ Sushi principle: prefer raw data
▸ Prefer immutable than mutable
▸ Reproducible: scripts vs tools
TEXT
MINOR DETAILS
▸ Approximate unique: hyperloglog
▸ Avoid incremental update on counts
▸ Save change if space permitting (S3)
▸ Upsert instead of insert: only effective for the first run
TEXT
OPEN QUESTIONS
▸ Data versioning
▸ Data continuous validation
▸ Automated cleansing

Weitere ähnliche Inhalte

Andere mochten auch

Presentation on Data Cleansing
Presentation on Data CleansingPresentation on Data Cleansing
Presentation on Data Cleansingng8
 
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data CleaningBrief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data CleaningJennifer Morrow
 
Salesforce Spring '17 Release Admin Webinar
Salesforce Spring '17 Release Admin WebinarSalesforce Spring '17 Release Admin Webinar
Salesforce Spring '17 Release Admin WebinarSalesforce Admins
 
Salesforce Admin Webinar: Processes Drive Solutions
Salesforce Admin Webinar: Processes Drive SolutionsSalesforce Admin Webinar: Processes Drive Solutions
Salesforce Admin Webinar: Processes Drive SolutionsSalesforce Admins
 
Salesforce Spring 17 Release Overview
Salesforce Spring 17 Release OverviewSalesforce Spring 17 Release Overview
Salesforce Spring 17 Release OverviewRoy Gilad
 
Best practice strategies to clean up and maintain your database with Hether G...
Best practice strategies to clean up and maintain your database with Hether G...Best practice strategies to clean up and maintain your database with Hether G...
Best practice strategies to clean up and maintain your database with Hether G...Blackbaud Pacific
 

Andere mochten auch (8)

Presentation on Data Cleansing
Presentation on Data CleansingPresentation on Data Cleansing
Presentation on Data Cleansing
 
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data CleaningBrief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
 
Salesforce Spring '17 Release Admin Webinar
Salesforce Spring '17 Release Admin WebinarSalesforce Spring '17 Release Admin Webinar
Salesforce Spring '17 Release Admin Webinar
 
Salesforce Admin Webinar: Processes Drive Solutions
Salesforce Admin Webinar: Processes Drive SolutionsSalesforce Admin Webinar: Processes Drive Solutions
Salesforce Admin Webinar: Processes Drive Solutions
 
Salesforce Spring 17 Release Overview
Salesforce Spring 17 Release OverviewSalesforce Spring 17 Release Overview
Salesforce Spring 17 Release Overview
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
Best practice strategies to clean up and maintain your database with Hether G...
Best practice strategies to clean up and maintain your database with Hether G...Best practice strategies to clean up and maintain your database with Hether G...
Best practice strategies to clean up and maintain your database with Hether G...
 

Kürzlich hochgeladen

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 

Kürzlich hochgeladen (20)

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 

Data cleansing

  • 1. DATA CLEANSING SKY YIN Photo credit: http://outofmygord.com/2015/04/08/the-messy-part-of-marketing/
  • 3. MISSING DATA ▸ Null, empty string, 0, NA, N/A ▸ Find root cause ▸ Randomly missing or regular missing ▸ Fix missing data ▸ Skip ▸ Fill
  • 4. DUPLICATED DATA ▸ Detect dups ▸ Unique count ▸ Root cause: bug or process or valid reason? ▸ Dup caused by typo, inconsistent format, spelling, and abbreviations ▸ Be careful on things look like dups but actually different ▸ People with same names
  • 5. OUTLIERS ▸ Outlier detection ▸ Histogram is your friend ▸ Dealing with outliers ▸ Bug or exception ▸ Corrupted data ▸ Intentional wrong input: age, gender, post code
  • 6. SUBTLE PROBLEMS ▸ Order in records ▸ Always sort. Don’t assume order ▸ Hidden link across records ▸ Duplicated session end bug ▸ Need rule-based detection ▸ Don’t know what you don’t know
  • 7. BEYOND ISSUES ▸ Transforming ▸ Encoding ▸ Local time <—> UTC time ▸ Tidy data/normalization ▸ Storage optimization: Parquet, ORC ▸ Flexibility optimization: JSON
  • 9. TEXT EXPLORATORY CLEANSING ▸ R: dataframe, data.table, dplyr ▸ Python: pandas, ipython notebook ▸ Open Refine ▸ Trifacta
  • 10. TEXT PRODUCTION CLEANSING ▸ ETL ▸ Hadoop-based: Pig, Scalding ▸ Spark (can also be used for exploratory cleansing) ▸ ETL mangement ▸ AWS data pipeline ▸ Airbnb airflow
  • 11. TEXT USE MACHINE LEARNING TO CLEANSING DATA ▸ Clustering ▸ Use similarity to find dups ▸ Use similarity to find difference
  • 13. TEXT GENERAL PRACTICES ▸ Data pipeline to automate the process ▸ Sushi principle: prefer raw data ▸ Prefer immutable than mutable ▸ Reproducible: scripts vs tools
  • 14. TEXT MINOR DETAILS ▸ Approximate unique: hyperloglog ▸ Avoid incremental update on counts ▸ Save change if space permitting (S3) ▸ Upsert instead of insert: only effective for the first run
  • 15. TEXT OPEN QUESTIONS ▸ Data versioning ▸ Data continuous validation ▸ Automated cleansing