Data cleansing

635 Aufrufe

Veröffentlicht am

My talk on data quality issues, tools to cleanse and some practices.

Veröffentlicht in: Daten & Analysen
  • Als Erste(r) kommentieren

Data cleansing

  1. 1. DATA CLEANSING SKY YIN Photo credit: http://outofmygord.com/2015/04/08/the-messy-part-of-marketing/
  2. 2. DATA QUALITY ISSUES
  3. 3. MISSING DATA ▸ Null, empty string, 0, NA, N/A ▸ Find root cause ▸ Randomly missing or regular missing ▸ Fix missing data ▸ Skip ▸ Fill
  4. 4. DUPLICATED DATA ▸ Detect dups ▸ Unique count ▸ Root cause: bug or process or valid reason? ▸ Dup caused by typo, inconsistent format, spelling, and abbreviations ▸ Be careful on things look like dups but actually different ▸ People with same names
  5. 5. OUTLIERS ▸ Outlier detection ▸ Histogram is your friend ▸ Dealing with outliers ▸ Bug or exception ▸ Corrupted data ▸ Intentional wrong input: age, gender, post code
  6. 6. SUBTLE PROBLEMS ▸ Order in records ▸ Always sort. Don’t assume order ▸ Hidden link across records ▸ Duplicated session end bug ▸ Need rule-based detection ▸ Don’t know what you don’t know
  7. 7. BEYOND ISSUES ▸ Transforming ▸ Encoding ▸ Local time <—> UTC time ▸ Tidy data/normalization ▸ Storage optimization: Parquet, ORC ▸ Flexibility optimization: JSON
  8. 8. TOOLS
  9. 9. TEXT EXPLORATORY CLEANSING ▸ R: dataframe, data.table, dplyr ▸ Python: pandas, ipython notebook ▸ Open Refine ▸ Trifacta
  10. 10. TEXT PRODUCTION CLEANSING ▸ ETL ▸ Hadoop-based: Pig, Scalding ▸ Spark (can also be used for exploratory cleansing) ▸ ETL mangement ▸ AWS data pipeline ▸ Airbnb airflow
  11. 11. TEXT USE MACHINE LEARNING TO CLEANSING DATA ▸ Clustering ▸ Use similarity to find dups ▸ Use similarity to find difference
  12. 12. PRACTICES
  13. 13. TEXT GENERAL PRACTICES ▸ Data pipeline to automate the process ▸ Sushi principle: prefer raw data ▸ Prefer immutable than mutable ▸ Reproducible: scripts vs tools
  14. 14. TEXT MINOR DETAILS ▸ Approximate unique: hyperloglog ▸ Avoid incremental update on counts ▸ Save change if space permitting (S3) ▸ Upsert instead of insert: only effective for the first run
  15. 15. TEXT OPEN QUESTIONS ▸ Data versioning ▸ Data continuous validation ▸ Automated cleansing

×