3. MISSING DATA
▸ Null, empty string, 0, NA, N/A
▸ Find root cause
▸ Missing at random vs. systematically missing
▸ Fix missing data
▸ Skip
▸ Fill
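A minimal sketch of the skip/fill options above, assuming records are dicts; the marker set and field names are illustrative. Note that 0 is ambiguous (it may be a real value), which is exactly why the root cause matters before cleansing:

```python
# Normalize common missing-value markers to None, then skip or fill.
# Whether 0 counts as missing depends on the root cause, so it is
# deliberately left out of the marker set here.
MISSING_MARKERS = {None, "", "NA", "N/A", "null"}

def normalize(value):
    """Map any missing-value marker to None."""
    return None if value in MISSING_MARKERS else value

def skip_missing(records, field):
    """Drop records whose field is missing."""
    return [r for r in records if normalize(r.get(field)) is not None]

def fill_missing(records, field, default):
    """Replace missing values with a default (e.g. a column mean)."""
    return [
        {**r, field: default if normalize(r.get(field)) is None else r[field]}
        for r in records
    ]
```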
4. DUPLICATED DATA
▸ Detect dups
▸ Unique count
▸ Root cause: bug or process or valid reason?
▸ Dups caused by typos, inconsistent formats, misspellings, and abbreviations
▸ Be careful with records that look like dups but are actually different
▸ People with same names
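One way to surface dups hidden by formatting and abbreviations is to canonicalize before counting; the abbreviation table below is a made-up example. The caveat above still applies: two records with the same canonical form (e.g. two people with the same name) may be genuinely different.

```python
from collections import Counter

# Illustrative abbreviation table; a real one comes from the data's domain.
ABBREVIATIONS = {"st.": "street", "ave.": "avenue"}

def canonicalize(name):
    """Lowercase, strip extra whitespace, expand known abbreviations."""
    words = name.lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def find_dups(names):
    """Canonical forms that occur more than once."""
    counts = Counter(canonicalize(n) for n in names)
    return {name: c for name, c in counts.items() if c > 1}

def unique_count(names):
    """Distinct count after canonicalization."""
    return len({canonicalize(n) for n in names})
```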
5. OUTLIERS
▸ Outlier detection
▸ Histogram is your friend
▸ Dealing with outliers
▸ Bug or exception
▸ Corrupted data
▸ Intentionally wrong input: age, gender, postal code
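A quick sketch of both ideas: a coarse histogram makes isolated values visible, and Tukey's IQR rule flags them programmatically. The bin width and fence factor are conventional defaults, not prescriptions:

```python
import statistics
from collections import Counter

def histogram(values, bin_width=10):
    """Bucket values so outliers show up as isolated bars."""
    return Counter((v // bin_width) * bin_width for v in values)

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

Whether a flagged value is a bug, corruption, or a legitimate exception still requires the root-cause check above.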
6. SUBTLE PROBLEMS
▸ Order in records
▸ Always sort. Don’t assume order
▸ Hidden link across records
▸ Duplicated session end bug
▸ Need rule-based detection
▸ Don’t know what you don’t know
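A sketch of a rule-based check for the duplicated-session-end case, assuming event records with `session_id`, `ts`, and `kind` fields (names are illustrative). It sorts first rather than assuming the input is ordered:

```python
def duplicated_session_ends(events):
    """Return session ids that emit more than one 'end' event.

    Sort before scanning -- never assume record order.
    """
    flagged = set()
    seen_end = set()
    for e in sorted(events, key=lambda e: (e["session_id"], e["ts"])):
        if e["kind"] == "end":
            if e["session_id"] in seen_end:
                flagged.add(e["session_id"])
            seen_end.add(e["session_id"])
    return flagged
```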
7. BEYOND ISSUES
▸ Transforming
▸ Encoding
▸ Local time <-> UTC time
▸ Tidy data/normalization
▸ Storage optimization: Parquet, ORC
▸ Flexibility optimization: JSON
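For the local/UTC conversion, a common pattern is to store everything in UTC and convert to local time only at the edges. The fixed UTC+9 offset below stands in for a real time-zone database lookup:

```python
from datetime import datetime, timezone, timedelta

# Illustrative fixed offset; real code would resolve a named zone
# (with DST rules) from a tz database.
LOCAL = timezone(timedelta(hours=9))

def to_utc(local_dt):
    """Convert an aware local datetime to UTC for storage."""
    return local_dt.astimezone(timezone.utc)

def to_local(utc_dt):
    """Convert a stored UTC datetime back to local time for display."""
    return utc_dt.astimezone(LOCAL)
```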
10. PRODUCTION CLEANSING
▸ ETL
▸ Hadoop-based: Pig, Scalding
▸ Spark (can also be used for exploratory cleansing)
▸ ETL management
▸ AWS Data Pipeline
▸ Airbnb's Airflow
11. USE MACHINE LEARNING TO CLEANSE DATA
▸ Clustering
▸ Use similarity to find dups
▸ Use similarity to find differences
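A small stand-in for similarity-based dup finding, using stdlib edit-distance-style matching rather than a real clustering model; the threshold is an assumption to tune per dataset:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similar_pairs(names, threshold=0.85):
    """Pairs of names similar enough to be candidate dups."""
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if similarity(a, b) >= threshold
    ]
```

The same scores, inverted, highlight records that should match but don't (differences). Production systems would typically cluster on embeddings or learned similarity instead of pairwise string ratios.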
13. GENERAL PRACTICES
▸ Data pipeline to automate the process
▸ Sushi principle: prefer raw data
▸ Prefer immutable over mutable data
▸ Reproducible: scripts vs tools
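The sushi and immutability principles together can be sketched as a pipeline of pure steps: the raw data is never mutated, each step derives a new dataset, and rerunning the script reproduces the same output. Step names here are illustrative:

```python
def drop_empty(records):
    """Derive a new dataset without empty records."""
    return tuple(r for r in records if r)

def lowercase(records):
    """Derive a new dataset with normalized case."""
    return tuple(r.lower() for r in records)

def pipeline(raw, steps):
    """Apply each step to produce a new dataset; raw stays untouched."""
    data = raw
    for step in steps:
        data = step(data)
    return data
```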
14. MINOR DETAILS
▸ Approximate unique counts: HyperLogLog
▸ Avoid incremental update on counts
▸ Save changes if space permits (e.g. S3)
▸ Upsert instead of insert: plain inserts are only safe on the first run
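To make the HyperLogLog point concrete, here is a minimal sketch of the estimator (2^b registers, leading-zero ranks, harmonic mean with small-range correction). Illustrative only; production code should use a tested library:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate distinct counts."""

    def __init__(self, b=10):
        self.b = b                  # 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: first b bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) fix
            return int(self.m * math.log(self.m / zeros))
        return int(raw)
```

With b=10 the expected relative error is roughly 1.04/sqrt(1024), about 3%, at a fixed 1024-register memory cost regardless of cardinality.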