3. MISSING DATA
▸ Null, empty string, 0, NA, N/A
▸ Find root cause
▸ Missing at random vs. systematically missing
▸ Fix missing data
▸ Skip
▸ Fill
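A minimal sketch of the skip/fill options above, assuming records are dicts; the marker set and field names are illustrative. Note that 0 is ambiguous (it may be a real value), which is exactly why the root cause matters before cleansing:

```python
# Normalize common missing-value markers to None, then skip or fill.
# Whether 0 counts as missing depends on the root cause, so it is
# deliberately left out of the marker set here.
MISSING_MARKERS = {None, "", "NA", "N/A", "null"}

def normalize(value):
    """Map any missing-value marker to None."""
    return None if value in MISSING_MARKERS else value

def skip_missing(records, field):
    """Drop records whose field is missing."""
    return [r for r in records if normalize(r.get(field)) is not None]

def fill_missing(records, field, default):
    """Replace missing values with a default (e.g. a column mean)."""
    return [
        {**r, field: default if normalize(r.get(field)) is None else r[field]}
        for r in records
    ]
```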
4. DUPLICATED DATA
▸ Detect dups
▸ Unique count
▸ Root cause: bug or process or valid reason?
▸ Dups caused by typos, inconsistent formats, misspellings, and abbreviations
▸ Be careful with records that look like dups but are actually different
▸ People with same names
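One way to surface dups hidden by formatting and abbreviations is to canonicalize before counting; the abbreviation table below is a made-up example. The caveat above still applies: two records with the same canonical form (e.g. two people with the same name) may be genuinely different.

```python
from collections import Counter

# Illustrative abbreviation table; a real one comes from the data's domain.
ABBREVIATIONS = {"st.": "street", "ave.": "avenue"}

def canonicalize(name):
    """Lowercase, strip extra whitespace, expand known abbreviations."""
    words = name.lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def find_dups(names):
    """Canonical forms that occur more than once."""
    counts = Counter(canonicalize(n) for n in names)
    return {name: c for name, c in counts.items() if c > 1}

def unique_count(names):
    """Distinct count after canonicalization."""
    return len({canonicalize(n) for n in names})
```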
5. OUTLIERS
▸ Outlier detection
▸ Histogram is your friend
▸ Dealing with outliers
▸ Bug or exception
▸ Corrupted data
▸ Intentionally wrong input: age, gender, postal code
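A quick sketch of both ideas: a coarse histogram makes isolated values visible, and Tukey's IQR rule flags them programmatically. The bin width and fence factor are conventional defaults, not prescriptions:

```python
import statistics
from collections import Counter

def histogram(values, bin_width=10):
    """Bucket values so outliers show up as isolated bars."""
    return Counter((v // bin_width) * bin_width for v in values)

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

Whether a flagged value is a bug, corruption, or a legitimate exception still requires the root-cause check above.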
6. SUBTLE PROBLEMS
▸ Order in records
▸ Always sort. Don’t assume order
▸ Hidden link across records
▸ Duplicated session end bug
▸ Need rule-based detection
▸ Don’t know what you don’t know
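A sketch of a rule-based check for the duplicated-session-end case, assuming event records with `session_id`, `ts`, and `kind` fields (names are illustrative). It sorts first rather than assuming the input is ordered:

```python
def duplicated_session_ends(events):
    """Return session ids that emit more than one 'end' event.

    Sort before scanning -- never assume record order.
    """
    flagged = set()
    seen_end = set()
    for e in sorted(events, key=lambda e: (e["session_id"], e["ts"])):
        if e["kind"] == "end":
            if e["session_id"] in seen_end:
                flagged.add(e["session_id"])
            seen_end.add(e["session_id"])
    return flagged
```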
7. BEYOND ISSUES
▸ Transforming
▸ Encoding
▸ Local time <-> UTC time
▸ Tidy data/normalization
▸ Storage optimization: Parquet, ORC
▸ Flexibility optimization: JSON
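For the local/UTC conversion, a common pattern is to store everything in UTC and convert to local time only at the edges. The fixed UTC+9 offset below stands in for a real time-zone database lookup:

```python
from datetime import datetime, timezone, timedelta

# Illustrative fixed offset; real code would resolve a named zone
# (with DST rules) from a tz database.
LOCAL = timezone(timedelta(hours=9))

def to_utc(local_dt):
    """Convert an aware local datetime to UTC for storage."""
    return local_dt.astimezone(timezone.utc)

def to_local(utc_dt):
    """Convert a stored UTC datetime back to local time for display."""
    return utc_dt.astimezone(LOCAL)
```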
10. PRODUCTION CLEANSING
▸ ETL
▸ Hadoop-based: Pig, Scalding
▸ Spark (can also be used for exploratory cleansing)
▸ ETL management
▸ AWS Data Pipeline
▸ Airbnb's Airflow
11. USE MACHINE LEARNING TO CLEANSE DATA
▸ Clustering
▸ Use similarity to find dups
▸ Use similarity to find differences
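A small stand-in for similarity-based dup finding, using stdlib edit-distance-style matching rather than a real clustering model; the threshold is an assumption to tune per dataset:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similar_pairs(names, threshold=0.85):
    """Pairs of names similar enough to be candidate dups."""
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if similarity(a, b) >= threshold
    ]
```

The same scores, inverted, highlight records that should match but don't (differences). Production systems would typically cluster on embeddings or learned similarity instead of pairwise string ratios.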
13. GENERAL PRACTICES
▸ Data pipeline to automate the process
▸ Sushi principle: prefer raw data
▸ Prefer immutable over mutable data
▸ Reproducible: scripts vs tools
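The sushi and immutability principles together can be sketched as a pipeline of pure steps: the raw data is never mutated, each step derives a new dataset, and rerunning the script reproduces the same output. Step names here are illustrative:

```python
def drop_empty(records):
    """Derive a new dataset without empty records."""
    return tuple(r for r in records if r)

def lowercase(records):
    """Derive a new dataset with normalized case."""
    return tuple(r.lower() for r in records)

def pipeline(raw, steps):
    """Apply each step to produce a new dataset; raw stays untouched."""
    data = raw
    for step in steps:
        data = step(data)
    return data
```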
14. MINOR DETAILS
▸ Approximate unique counts: HyperLogLog
▸ Avoid incremental update on counts
▸ Save changes if space permits (e.g. S3)
▸ Upsert instead of insert: plain inserts are only safe on the first run
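To make the HyperLogLog point concrete, here is a minimal sketch of the estimator (2^b registers, leading-zero ranks, harmonic mean with small-range correction). Illustrative only; production code should use a tested library:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate distinct counts."""

    def __init__(self, b=10):
        self.b = b                  # 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: first b bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) fix
            return int(self.m * math.log(self.m / zeros))
        return int(raw)
```

With b=10 the expected relative error is roughly 1.04/sqrt(1024), about 3%, at a fixed 1024-register memory cost regardless of cardinality.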