
Machine learning on non curated data

Industry surveys [1] reveal that the number-one hassle of data scientists is cleaning the data in order to analyze it. Textbook statistical modeling is sufficient for noisy signals, but errors of a discrete nature break standard machine-learning tools. I will discuss how to easily run machine learning on data tables with two common dirty-data problems: missing values and non-normalized entries. For both problems, I will show how to run standard machine-learning tools such as scikit-learn in the presence of such errors. The talk will be didactic and will discuss simple software solutions. It will build on the latest improvements to scikit-learn for preprocessing and missing values, and on the DirtyCat package [2] for non-normalized entries. I will also summarize theoretical analyses from recent machine-learning publications.

This talk targets data practitioners. Its goals are to help data scientists analyze data with such errors more efficiently and to understand the impact of those errors.

For missing values, I will use simple arguments and examples to outline how to obtain asymptotically good predictions [3]. Two components are key: imputation and adding an indicator of missingness. I will explain theoretical guidelines for both, and show how to implement them in practice with scikit-learn, either inside a learner or as a preprocessing step.
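The two components can be sketched in a few lines of numpy (a minimal illustration; the function names are my own, and scikit-learn's `SimpleImputer(strategy='mean', add_indicator=True)` provides the same behavior):

```python
import numpy as np

def fit_mean_imputer(X_train):
    """Learn per-column means on the train set, ignoring NaNs."""
    return np.nanmean(X_train, axis=0)

def impute_with_indicator(X, means):
    """Replace NaNs by the learned train means and append indicator columns."""
    mask = np.isnan(X)                    # the missingness indicator
    X_imputed = np.where(mask, means, X)  # fill with train-set means
    return np.hstack([X_imputed, mask.astype(float)])

X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0]])
means = fit_mean_imputer(X_train)  # array([2., 4.])
X_test = impute_with_indicator(np.array([[np.nan, 5.0]]), means)
# → array([[2., 5., 1., 0.]]): train mean filled in, indicator appended
```

Crucially, the means are learned on the train set and reused on the test set, so the encoding does not change between the two.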

For non-normalized categories, I will show that "vectorizing" entries based on their string representations gives a simple but powerful solution that can be plugged into standard statistical-analysis tools [4].
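As a hedged sketch of this idea, each entry can be represented by its string similarity to every category seen at train time (here using the stdlib's `difflib` as a stand-in similarity; the DirtyCat SimilarityEncoder uses n-gram string similarity instead):

```python
from difflib import SequenceMatcher

def similarity_encode(entries, prototypes):
    """One row per entry, one column per train-set category (prototype)."""
    return [[SequenceMatcher(None, e, p).ratio() for p in prototypes]
            for e in entries]

prototypes = ["police officer", "social worker"]  # categories seen at train time
rows = similarity_encode(["master police officer"], prototypes)
# the unseen entry gets a high similarity to "police officer" and a low
# one to "social worker", even without an exact match
```

Unlike one-hot encoding, a new category at test time still gets a meaningful, non-zero representation.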

[1] Kaggle, The State of ML and Data Science 2017. https://www.kaggle.com/surveys/2017
[2] DirtyCat. https://dirty-cat.github.io/stable/
[3] Julie Josse, Nicolas Prost, Erwan Scornet, and Gaël Varoquaux (2019). "On the consistency of supervised learning with missing values". https://arxiv.org/abs/1902.06931
[4] Patricio Cerda, Gaël Varoquaux, and Balázs Kégl (2018). "Similarity encoding for learning with dirty categorical variables". Machine Learning 107(8-10): 1477. https://arxiv.org/abs/1806.00979

Published in: Engineering

Machine learning on non curated data

  1. Machine learning on non curated data. Dirty data made easy (in Python). Gaël Varoquaux
  2. Machine learning on non curated data. Dirty data made easy (in Python). Gaël Varoquaux
  3. With scikit-learn, machine learning is easy and fun. The problem is getting the data into the learner.
  4. With scikit-learn, machine learning is easy and fun. The problem is getting the data into the learner. www.kaggle.com/ash316/novice-to-grandmaster
  5. Machine learning: let X ∈ R^(n×p), or a numpy array.
  6. Machine learning: let X ∈ R^(n×p), or a numpy array. Real life often comes as a pandas dataframe: Gender | Date Hired | Employee Position Title: M | 09/12/1988 | Master Police Officer; F | NA | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I
  7. (Same dataframe.) sklearn.compose.ColumnTransformer: apply different preprocessing per column.
  8. (Same dataframe.) The "Employee Position Title" column holds dirty categories.
  9. (Same dataframe.) The "Date Hired" column has missing values.
  10. Talk outline: 1. Column transforming; 2. Encoding dirty categories; 3. Learning with missing values. Python + scikit-learn; data-mining research; statistics research. G Varoquaux 4
  11. 1. Column transforming: pandas in, numpy out (preprocessing).
  12. 1. Dataframes to numbers. df = pd.read_csv('employee_salary.csv'): Gender | Date Hired | Employee Position Title. Goal: convert all values to numerical.
  13. 1. Dataframes to numbers; Gender: one-hot encode. one_hot_enc = sklearn.preprocessing.OneHotEncoder(); one_hot_enc.fit_transform(df[['Gender']]) yields columns Gender (M), Gender (F), ...
  14. 1. Dataframes to numbers; Date: use pandas' datetime support. dates = pd.to_datetime(df['Date First Hired']); dates.values.astype('int64')  # the underlying values are integer timestamps (nanoseconds)
  15. 1. Transformers: fit & transform. Separating fitting from transforming avoids data leakage and can be used in a Pipeline with cross_val_score. One-hot encoder: one_hot_enc.fit(df[['Gender']]); X = one_hot_enc.transform(df[['Gender']]). 1) Store which categories are present; 2) encode the data accordingly. Better than pd.get_dummies, because the columns are defined from the train set and do not change with the test set.
  16. 1. Transformers: fit & transform. For dates, use a FunctionTransformer: def date2num(date_str): out = pd.to_datetime(date_str).values.astype('int64'); return out.reshape((-1, 1))  # 2D output. date_trans = preprocessing.FunctionTransformer(func=date2num, validate=False); X = date_trans.transform(df['Date First Hired'])
  17. 1. ColumnTransformer: assembling. Applies different transformers to columns; these can be complex pipelines. column_trans = compose.make_column_transformer((one_hot_enc, ['Gender', 'Employee Position Title']), (date_trans, 'Date First Hired')); X = column_trans.fit_transform(df). From DataFrame to array, with heterogeneous preprocessing & feature engineering.
  18. 1. ColumnTransformer: assembling (continued). Benefit: model selection directly on the dataframe. model = make_pipeline(column_trans, HistGradientBoostingClassifier()); scores = cross_val_score(model, df, y)
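The pattern on these slides can be run end to end as follows (a sketch on a tiny synthetic dataframe of my own making; I use LogisticRegression as the final learner to keep the example light, where the slides use HistGradientBoostingClassifier):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Tiny synthetic dataframe standing in for employee_salary.csv
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"] * 5,
    "Date First Hired": ["09/12/1988", "06/26/2006",
                         "07/16/2007", "01/26/2000"] * 5,
})
y = [0, 1, 0, 1] * 5

def date2num(dates):
    # datetime64 values as integer timestamps, reshaped to a 2D column
    return pd.to_datetime(dates).values.astype("int64").reshape(-1, 1)

column_trans = make_column_transformer(
    (OneHotEncoder(), ["Gender"]),
    (FunctionTransformer(date2num), "Date First Hired"),
)
model = make_pipeline(column_trans, LogisticRegression())
scores = cross_val_score(model, df, y, cv=2)  # model selection on the dataframe
```

Because preprocessing lives inside the pipeline, cross-validation refits the encoders on each train fold, avoiding leakage.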
  19. 2. Encoding dirty categories. PhD work of Patricio Cerda [Cerda... 2018]. Employee Position Title: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I.
  20. 2. The problem of dirty categories (same column as above). They break OneHotEncoder: overlapping categories ("Master Police Officer", "Police Officer III", "Police Officer II", ...); high cardinality (400 unique entries in 10 000 rows); rare categories (only one "Architect III"); new categories in the test set.
  21. 2. Data curation: database normalization, feature engineering. Employee Position Title (Master Police Officer, Social Worker III, Police Officer II, ...) ⇒ Position | Rank: Police Officer | Master; Social Worker | III; Police Officer | II; Social Worker | II; Police Officer | III.
  22. 2. Data curation (continued): merging entities, deduplication & record linkage; output a "clean" database. Company name: Pfizer Inc., Pfizer Pharmaceuticals LLC, Pfizer International LLC, Pfizer Limited, Pfizer Corporation Hong Kong Limited, Pfizer Pharmaceuticals Korea Limited, ... Difficult without supervision, and potentially suboptimal: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea.
  23. 2. Data curation (continued): hard to make automatic and turn-key; harder than supervised learning.
  24. Our goal: supervised learning on dirty categories. The statistical question should inform curation. Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea.
  25. 2. Adding similarities to one-hot encoding. One-hot encoding (columns: London, Londres, Paris): Londres → 0 1 0; London → 1 0 0; Paris → 0 0 1; X ∈ R^(n×p). But what about new categories, and linked categories? Similarity encoding [Cerda... 2018]: Londres → 0.3 1.0 0.0; London → 1.0 0.3 0.0; Paris → 0.0 0.0 1.0, using string distance(Londres, London).
  26. 2. Some string similarities. Levenshtein: the number of edits on one string to match the other. Jaro-Winkler: d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m−t)/(3m), where m is the number of matching characters and t the number of character transpositions. n-gram similarity: an n-gram is a group of n consecutive characters (e.g. "Lon", "ond", "ndo", ... for "London"); similarity = (# n-grams in common) / (# n-grams in total).
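The n-gram similarity on this slide fits in a few lines of Python (a minimal sketch; real implementations often also pad the string ends and count duplicated n-grams):

```python
def ngram_similarity(s1, s2, n=3):
    """Shared n-grams over total distinct n-grams of both strings."""
    g1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    g2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    return len(g1 & g2) / len(g1 | g2)

sim = ngram_similarity("londres", "london")
# "lon" and "ond" are shared, out of 7 distinct 3-grams in total:
# 2/7 ≈ 0.29, close to the 0.3 London/Londres entry a slide earlier
```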
  27. 2. Python implementation: DirtyCat, dirty-category software: http://dirty-cat.github.io. from dirty_cat import SimilarityEncoder; similarity_encoder = SimilarityEncoder(similarity='ngram'); transformed_values = similarity_encoder.fit_transform(df)
  28. 2. Other approach: TargetEncoder [Micci-Barreca 2001]. Represent each category by the average target y; for example, Police Officer III → average salary of Police Officer III. [Figure: employee salary (y, 40 000 to 140 000) per position title, from Crossing Guard and Liquor Store Clerk I up to Manager III.]
  29. 2. TargetEncoder (continued). Embedding close-by categories with the same y can help build a simple decision function.
  30. 2. TargetEncoder in DirtyCat (http://dirty-cat.github.io): from dirty_cat import TargetEncoder; target_encoder = TargetEncoder(); transformed_values = target_encoder.fit_transform(df)
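A bare-bones version of the idea (my own sketch; DirtyCat's TargetEncoder and [Micci-Barreca 2001] additionally shrink rare categories toward the global mean):

```python
from collections import defaultdict

def fit_target_encoder(categories, y):
    """Mean of y per category, plus the global mean for unseen categories."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return means, sum(y) / len(y)

def target_encode(categories, means, global_mean):
    return [means.get(c, global_mean) for c in categories]

means, global_mean = fit_target_encoder(
    ["Police Officer III", "Police Officer III", "Library Aide"],
    [60000, 70000, 30000])
encoded = target_encode(["Police Officer III", "Manager I"], means, global_mean)
# → [65000.0, 53333.33...]: unseen "Manager I" falls back to the global mean
```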
  31. 2. Experimental results: prediction performance, as average rank on 7 datasets (linear model / gradient-boosted trees): one-hot encoding 4.7 / 6.0; target encoding 5.3 / 4.3; similarity encoding with Jaro-Winkler 3.4 / 3.6, with Levenshtein 3.1 / 3.0, with 3-gram 1.1 / 1.9. Best: similarity encoding with 3-gram similarity [Cerda... 2018]. Also, gradient-boosted trees work much better.
  32. 2. Dirty categories blow up dimension. Wow, lots of datasets!
  33. 2. Dirty categories blow up dimension, like new words in natural language.
  34. 2. Dirty categories blow up dimension: X ∈ R^(n×p) with p large brings statistical problems and computational problems.
  35. 2. Tackling the high cardinality. Similarity encoding and one-hot encoding are prototype methods: how to choose a small number of prototypes?
  36. 2. Tackling the high cardinality (continued). Using the whole training set ⇒ huge dimensionality. The most frequent entries? Maybe the right prototypes are not in the training set ("big cat", "fat cat", "big dog", "fat dog"). Better: estimate prototypes.
  37. 2. n-grams grow, but there is redundancy, as in natural language.
  38. 2. Substring information. Drug Name: alcohol, ethyl alcohol, isopropyl alcohol, polyvinyl alcohol, isopropyl alcohol swab, 62% ethyl alcohol, alcohol 68%, alcohol denat, benzyl alcohol, dehydrated alcohol. Employee Position Title: Police Aide, Master Police Officer, Mechanic Technician II, Police Officer III, Senior Architect, Senior Engineer Technician, Social Worker III.
  39. 2. Latent category model: a topic model on sub-strings (GaP: Gamma-Poisson factorization). Models strings as a linear combination of substrings. [Illustration: binary count matrix of entries (police officer, pol, off, polis, policeman, policier) × 3-grams (er_, cer, fic, off, _of, ce_, ice, lic, pol).]
  40. 2. Latent category model (continued). The n-gram count matrix factorizes into (documents × topics) and (topics × n-grams): which latent categories are in an entry, and which substrings are in a latent category.
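The factorization can be approximated with scikit-learn building blocks (a sketch: NMF with a Kullback-Leibler loss is closely related to Gamma-Poisson factorization, though not identical to the encoder DirtyCat ships):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = ["police officer", "master police officer",
           "social worker", "senior social worker"]

# Count character 3-grams, then factorize the count matrix into a small
# number of latent "topics": each entry becomes a mixture of topics, and
# each topic a mixture of substrings.
counts = CountVectorizer(analyzer="char",
                         ngram_range=(3, 3)).fit_transform(entries)
gap_like = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
               init="nndsvda", max_iter=500)
activations = gap_like.fit_transform(counts)  # (entries × latent categories)
```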
  41. 2. String models of latent categories: encodings that extract latent categories. [Figure: activations of latent categories (library, operator, specialist, warehouse, manager, community, rescue, officer) for position titles such as Legislative Analyst II, Bus Operator, Senior Architect, Master Police Officer, Police Sergeant.]
  42. 2. String models of latent categories: inferring plausible feature names. [Figure: inferred feature names such as "accountant, assistant, library", "coordinator, equipment, operator", "firefighter, rescuer, rescue", "correctional, correction, officer" for the same position titles.]
  43. 2. Data science with dirty categories. [Figure: permutation importances of inferred feature names, e.g. "Information, Technology, Technologist"; "Officer, Office, Police"; "Liquor, Clerk, Store"; "Manager, Management, Property".]
  44. 3. Learning with missing values [Josse... 2019]. (Same dataframe as before, with NA entries in the "Date Hired" column.)
  45. Why doesn't the #$@! machine learning toolkit work?! Machine-learning models need entries in a vector space (or at least a metric space), and NA ∉ R. This is more than an implementation problem.
  46. (Continued.) Categorical variables are discrete anyhow: for missing values in categorical variables, create a special category "missing". The rest of the talk is about NA in numerical variables.
  47. 3. Classic statistics point of view. Model a) a complete data-generating process; model b) a random process occluding entries. Missing At Random (MAR): for non-observed values, the probability of missingness does not depend on the non-observed value (proper definition in [Josse... 2019]). Theorem [Rubin 1976]: under MAR, maximizing the likelihood of the observed data while ignoring (marginalizing) the unobserved values gives the maximum likelihood of model a).
  48. (Continued.) Missing Completely At Random (MCAR): missingness is independent of the data. Missing Not At Random (MNAR): missingness is not ignorable.
  49. (Continued.) [Scatter plots: complete data, MCAR, and MNAR missingness patterns.]
  50. (Continued.) But there isn't always an unobserved value (age of the spouse of singles?), and we are not trying to maximize likelihoods.
  51. The #$@! machine learning toolkit still doesn't work?!
  52. 3. Imputation: fill in information (the NA dates become imputed years such as 2000, 2012, 2014). There is a large statistical literature, but its procedures and results focus on in-sample settings. How about completing the test set with the train set? What to do with the prediction target y?
  53. 3. Imputation procedures that work out of sample. Mean imputation, a special case of univariate imputation: replace NA by the mean of the feature (sklearn.impute.SimpleImputer).
  54. (Continued.) Conditional imputation: model one feature as a function of the others, for example by iteratively predicting each feature from the others. Classic implementations in R: MICE, missForest; sklearn.impute.IterativeImputer, new in 0.21!
  55. (Continued.) Classic statistics point of view: mean imputation is disastrous, because it distorts the distribution. "Congeniality" conditions: a good imputation must preserve the data properties used by later analysis steps.
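Both imputers plug into the usual fit/transform API; note that IterativeImputer still requires an explicit experimental import (minimal usage sketch on a toy array):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Univariate: replace NA by the per-feature mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Conditional: iteratively predict each feature from the others
X_iter = IterativeImputer(random_state=0).fit_transform(X)
```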
  56. 3. Imputation for supervised learning. Theorem [Josse... 2019]: for a powerful learner (universally consistent), imputing both train and test with the mean of the train set is consistent, i.e. it converges to the best possible prediction. Intuition: the learner "recognizes" imputed entries and compensates at test time.
  57. (Continued.) Simulation: MCAR + gradient boosting. [Plot: r² score vs. sample size (10² to 10⁴) for mean vs. iterative imputation; both converge, with iterative imputation ahead at small sample sizes.] Notebook: github, @nprost / supervised missing. Conclusion: IterativeImputer is useful for small sample sizes.
  58. 3. Imputation is not enough. Pathological case [Josse... 2019]: y depends only on whether data is missing or not (e.g. tax-fraud detection); in theory, MNAR ("Missing Not At Random"). Imputing makes prediction impossible. Solution: add a missingness indicator as an extra feature to predict from: SimpleImputer(add_indicator=True), IterativeImputer(add_indicator=True).
  59. (Continued.) Simulation: y depends indirectly on missingness (censoring in the data). [Plot: r² score vs. sample size for mean and iterative imputation, each with and without a missingness indicator.] Notebook: github, @nprost / supervised missing. Adding a mask is crucial; iterative imputation can be detrimental.
  60. @GaelVaroquaux. Learning on dirty data: prepare data via ColumnTransformer; use HistGradientBoosting.
  61. (Continued.) Dirty categories: statistical modeling of non-curated categorical data. Give us your dirty data! Similarity encoding is a robust solution that enables statistical models. Dirty-category software: http://dirty-cat.github.io
  62. (Continued.) Supervised learning with missing data: mean imputation + missing indicator; many more results in [Josse... 2019]. Ongoing research: http://project.inria.fr/dirtydata
  63. Acknowledgements. Dirty categories: Patricio Cerda and Balázs Kégl. Missing data: Julie Josse, Erwan Scornet, Nicolas Prost. Implementation in scikit-learn thanks to the scikit-learn consortium partners.
  64. References. P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018. J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019. D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1): 27–32, 2001. D. B. Rubin. Inference and missing data. Biometrika, 63(3): 581–592, 1976.
