
Data cleansing

My talk on data quality issues, tools for cleansing data, and some practices.

Published in: Data & Analytics


  1. DATA CLEANSING, Sky Yin. Photo credit: http://outofmygord.com/2015/04/08/the-messy-part-of-marketing/
  2. DATA QUALITY ISSUES
  3. MISSING DATA ▸ Null, empty string, 0, NA, N/A ▸ Find the root cause ▸ Randomly missing or regularly missing ▸ Fix missing data ▸ Skip ▸ Fill
  4. DUPLICATED DATA ▸ Detect dups ▸ Unique count ▸ Root cause: bug, process, or valid reason? ▸ Dups caused by typos, inconsistent formats, spelling, and abbreviations ▸ Be careful with things that look like dups but are actually different ▸ People with the same names
  5. OUTLIERS ▸ Outlier detection ▸ The histogram is your friend ▸ Dealing with outliers ▸ Bug or exception ▸ Corrupted data ▸ Intentionally wrong input: age, gender, post code
  6. SUBTLE PROBLEMS ▸ Order of records ▸ Always sort. Don't assume order ▸ Hidden links across records ▸ Duplicated session end bug ▸ Needs rule-based detection ▸ You don't know what you don't know
  7. BEYOND ISSUES ▸ Transforming ▸ Encoding ▸ Local time <-> UTC time ▸ Tidy data/normalization ▸ Storage optimization: Parquet, ORC ▸ Flexibility optimization: JSON
  8. TOOLS
  9. EXPLORATORY CLEANSING ▸ R: data.frame, data.table, dplyr ▸ Python: pandas, IPython Notebook ▸ OpenRefine ▸ Trifacta
  10. PRODUCTION CLEANSING ▸ ETL ▸ Hadoop-based: Pig, Scalding ▸ Spark (can also be used for exploratory cleansing) ▸ ETL management ▸ AWS Data Pipeline ▸ Airbnb Airflow
  11. USE MACHINE LEARNING TO CLEANSE DATA ▸ Clustering ▸ Use similarity to find dups ▸ Use similarity to find differences
  12. PRACTICES
  13. GENERAL PRACTICES ▸ Data pipeline to automate the process ▸ Sushi principle: prefer raw data ▸ Prefer immutable over mutable ▸ Reproducible: scripts vs tools
  14. MINOR DETAILS ▸ Approximate unique counts: HyperLogLog ▸ Avoid incremental updates on counts ▸ Save changes if space permits (S3) ▸ Upsert instead of insert: only effective for the first run
  15. OPEN QUESTIONS ▸ Data versioning ▸ Continuous data validation ▸ Automated cleansing
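
The three data quality issues on slides 3 to 5 (missing data, duplicated data, outliers) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not code from the talk: the function names, the list of missing-value tokens, the sample records, and the choice of the 1.5 x IQR rule for outliers are all assumptions made for the example.

```python
import statistics

# Assumed set of tokens that mean "missing"; adjust per dataset.
# 0 is deliberately excluded because it can be a legitimate value.
MISSING_TOKENS = {"", "null", "na", "n/a", "none"}


def is_missing(value):
    """Return True when a raw field should be treated as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def normalize_text(text):
    """Lower-case and collapse whitespace so formatting variants compare equal."""
    return " ".join(str(text).lower().split())


def find_duplicates(records, key_fields):
    """Group records on normalized key fields; return groups with 2+ records."""
    groups = {}
    for rec in records:
        key = tuple(normalize_text(rec[f]) for f in key_fields)
        groups.setdefault(key, []).append(rec)
    return [group for group in groups.values() if len(group) > 1]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (one common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample data, made up for this sketch.
records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee", "email": "ANN@example.com"},        # same person, messy format
    {"name": "Ann Lee", "email": "ann.lee@other.example"},   # same name, different person
    {"name": "Bob Roy", "email": "N/A"},                     # missing email
]
ages = [34, 29, 31, 27, 45, 38, 33, 30, 41, 140]

missing_email = [r["name"] for r in records if is_missing(r["email"])]
dup_groups = find_duplicates(records, ["name", "email"])
suspicious_ages = iqr_outliers(ages)
```

Keying duplicates on both name and email is one way to respect the "people with the same names" caveat: the third record shares a name with the first two but is not grouped with them. Flagged outliers still need a human decision about whether they are bugs, corrupted data, or intentionally wrong input.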
