Democratizing machine learning: perspective from scikit-learn

Once an obscure branch of applied mathematics, machine learning is now the darling of tech. This talk (https://www.youtube.com/watch?v=E6KpRwoH18M) discusses lessons learned in democratizing machine learning: how scikit-learn was designed by a community of developers to empower users, and how the Python data ecosystem was built from scientific computing tools, which shows the importance of good numerics. It also covers the remaining challenges and the progress we are making. Scaling out brings bottlenecks that differ from those of numerics. Integrating data into statistical models, a hurdle in data-science practice, requires rethinking data-cleaning pipelines. This talk draws on my experience as a scikit-learn developer, and also as a researcher in machine learning and its applications.

1. Building a toolkit for all
2. Tackling scalability
3. Bridging to data engineering

Published in: Technology

Democratizing machine learning: perspective from scikit-learn

  1. 1. Democratizing machine learning: perspective from scikit-learn Gaël Varoquaux, scikit machine learning in Python
  2. 2. scikit-learn From nerds
  3. 3. scikit-learn From nerds to an industry standard [Figure: number of monthly users, 2010 to 2018, y-axis from 200,000 to 800,000]
  4. 4. scikit-learn We were not aiming for the enterprise but rather for ourselves: scientists, students
  5. 5. scikit-learn Data-science for the many, not only the mighty The news:
  6. 6. scikit-learn Data-science for the many, not only the mighty Data scientists: largest data processed (poll by KDnuggets) [Figure: histogram of respondents, 0% to 20%, by dataset size in bins from less than 1 MB to over 100 PB] huge = 10 to 100 GB
  7. 7. scikit-learn Data-science for the many, not only the mighty Data scientists: [Figure: the same histogram for 2013, 2014, 2015, 2016 and 2018] no increase with time
  8. 8. Challenges to data science www.kaggle.com/ash316/novice-to-grandmaster 1. Dirty data 2. Talent 3. Money
  9. 9. Challenges to data science www.kaggle.com/ash316/novice-to-grandmaster 1. Dirty data 2. Talent 3. Money Databricks survey: 90% of organizations invest in AI, few succeed Challenges reported: 98%: preparation and aggregation of large datasets 96%: data exploration and iterative model training https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding
  10. 10. Democratizing machine learning 1 Building a toolkit for all 2 Tackling scalability 3 Bridging to data engineering
  11. 11. 1 Building a toolkit for all The scikit-learn story
  12. 12. 1 Embracing the Python stack Python Interactive, easy General-purpose Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory)
  13. 13. 1 Embracing the Python stack Python Interactive, easy General-purpose Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory) numpy [Figure: array of numbers] Numerical operations Contiguous-memory model (float*)
  14. 14. 1 Embracing the Python stack Python Interactive, easy General-purpose Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory) numpy [Figure: array of numbers] Numerical operations Contiguous-memory model (float*) Enables bridging across languages (e.g. for LAPACK), Cython
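The memory model mentioned on this slide can be illustrated with a small sketch. It assumes only numpy and the standard-library ctypes module, and the array contents are arbitrary: a C-contiguous float32 array exposes a raw pointer that compiled code could read as a plain `float*`.

```python
import ctypes

import numpy as np

# A 2x3 float32 array stored in one contiguous, row-major block of memory.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
is_contiguous = a.flags.c_contiguous

# The transpose is only a view with swapped strides, so it is not C-contiguous.
transposed_is_contiguous = a.T.flags.c_contiguous

# Expose the buffer as a C `float*`, the form a C, Fortran or Cython routine expects.
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
element = ptr[4]  # flat index 4 is a[1, 1] in row-major order

# Make a contiguous copy before handing a non-contiguous view to foreign code.
b = np.ascontiguousarray(a.T)
```

The same pointer-plus-shape description is what lets one buffer be shared across languages without copying.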
  15. 15. 1 Focus on usability API design Grey box: all models interchangeable, but still inspectable Documentation & examples Good documentation required to add a feature Easy-to-understand examples guide API design Teach statistical learning, rather than code Models, solvers, hyperparameters Choices that do not require tinkering Lots of use-case-driven empirical testing
  16. 16. 1 Community-driven development Our DNA: distributed development & decision making Gave manpower [Figure: number of monthly contributors, 2010 to 2018, y-axis 0 to 50] and the right focus People fix & improve what’s important to them
  17. 17. 1 Community-driven development Our DNA: distributed development & decision making Gave manpower [Figure: number of monthly contributors, 2010 to 2018, y-axis 0 to 50] and the right focus People fix & improve what’s important to them Open source has won But it needs sustainability and investment
  18. 18. 1 Community-driven development Our DNA: distributed development & decision making Gave manpower [Figure: number of monthly contributors, 2010 to 2018, y-axis 0 to 50] and the right focus People fix & improve what’s important to them Open source has won But it needs sustainability and investment mid-2018: A foundation for scikit-learn + the community
  19. 19. 2 Tackling scalability A new challenge
  20. 20. 2 Algorithmic improvements PCA Cost: n p min(n, p) Randomized PCA (simplified intuitions): 1. loop: take a random fraction of the data; 2. small PCA on that fraction; 3. aggregate results via PCA across results. svd_solver='auto' Up to ×10 speedup
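As a rough illustration of the solver choice on this slide, the sketch below compares scikit-learn's exact and randomized PCA solvers. The synthetic low-rank data and the sizes are assumptions made for the example. The randomized solver is requested explicitly here, whereas `svd_solver='auto'` lets scikit-learn choose a solver from the data shape.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data (an assumption for this sketch): 5 latent factors plus small noise.
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + 0.01 * rng.normal(size=(2000, 100))

# Exact solver: full SVD of the centered data.
full = PCA(n_components=5, svd_solver="full").fit(X)

# Randomized solver: approximate SVD computed from random projections.
randomized = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

full_ratio = full.explained_variance_ratio_
randomized_ratio = randomized.explained_variance_ratio_
```

On strongly low-rank data like this, both solvers should report nearly the same explained variance. The speed difference only becomes visible on much larger matrices.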
  21. 21. 2 Algorithmic improvements PCA aggregate across sub-samples ⇒ ×10 Logistic regression Gradient descent on error measure: w_{i+1} = w_i + α ∇_w f Large n = costly gradient computation Full gradient: costly Sub-sampling in gradient: finicky Sub-sampling + noise reduction, solver='saga': fast & easy
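A minimal sketch of the `solver='saga'` option named on this slide follows. The synthetic dataset, the scaling step and `max_iter` are assumptions for the example. Scaling is included because stochastic solvers tend to converge faster on standardized features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated two-class problem (an assumption for this sketch).
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)

# SAGA: stochastic (sub-sampled) gradients with variance reduction.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000, random_state=0),
)
model.fit(X, y)
accuracy = model.score(X, y)  # training accuracy, only a sanity check
```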
  22. 22. 2 Algorithmic improvements PCA aggregate across sub-samples ⇒ ×10 Logistic regression sub-sampling + noise reduction Gradient-boosted trees fit on sufficient summary Succession of decision trees that enrich each other [Figure: iterations 1, 2, 3] Speedup: bin data and compute histograms HistGradientBoostingRegressor (v0.21): catching up with XGBoost & LightGBM
  23. 23. 2 Algorithmic improvements PCA aggregate across sub-samples ⇒ ×10 Logistic regression sub-sampling + noise reduction Gradient-boosted trees fit on sufficient summary Fit on several subsamples / chunks + aggregation or variance reduction Fit on summary statistics
  24. 24. 2 Scaling out: parallel computing Simple parallel computing schemes limiting data transfer Data parallel [Figure: numeric data array] mostly in inner loops Model parallel [Figure: numeric data array, repeated] for model selection, e.g. GridSearchCV
  25. 25. 2 Scaling out: parallel computing Simple parallel computing schemes limiting data transfer Data parallel [Figure: numeric data array] mostly in inner loops Model parallel [Figure: numeric data array, repeated] for model selection, e.g. GridSearchCV Real-life machine-learning [Figure: numeric data array]
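The model-parallel scheme for model selection can be sketched with `GridSearchCV` and its `n_jobs` parameter. The dataset, the estimator and the parameter grid are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Every (parameter setting, CV fold) pair is an independent fit, so the
# 4 x 5 = 20 fits can be dispatched to separate workers.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    n_jobs=2,
)
search.fit(X, y)
best_C = search.best_params_["C"]
n_candidates = len(search.cv_results_["params"])
```

Each worker needs its own copy of the data, which is why this scheme suits many small fits better than one very large one.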
  26. 26. 2 Scaling out: parallel computing Implementations (used by scikit-learn) Inner loops (fast): OS threads, OpenMP (from GCC, ICC, clang), all in same process Large-scale parallelism: across Python VMs, across computers; transfer, synchronization
  27. 27. 2 Scaling out: parallel computing Implementations (used by scikit-learn) Inner loops (fast): OS threads, OpenMP (from GCC, ICC, clang), all in same process Large-scale parallelism: across Python VMs, across computers; transfer, synchronization Real life = a merry mess: oversubscription, inefficient transfer A scheduling problem But: need simple API to focus on algorithmics scikit-learn is a library: doesn’t own the “main”
  28. 28. 2 Our abstraction: joblib’s parallel for joblib.Parallel()(joblib.delayed(f)(i) for i in ...) lazy evaluation Multiprocessing / loky backend manages a pool of Python VMs segfault resilient lazy loop consumption to limit memory usage auto-batching dispatch to lower overhead limits # threads in sub-process (threadpoolctl)
  29. 29. 2 Our abstraction: joblib’s parallel for joblib.Parallel()(joblib.delayed(f)(i) for i in ...) lazy evaluation Multiprocessing / loky backend manages a pool of Python VMs segfault resilient lazy loop consumption to limit memory usage auto-batching dispatch to lower overhead limits # threads in sub-process (threadpoolctl) Extendable backend API (e.g. dask) delegates scheduling (e.g. to a framework) still a dispatch / receive queue overflows the memory of greedy schedulers
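The parallel-for pattern quoted on this slide looks as follows in a complete script. The function (`sqrt`) and the loop range are placeholders chosen for the example, and `n_jobs=2` is an arbitrary worker count.

```python
from math import sqrt

from joblib import Parallel, delayed

# delayed(sqrt)(x) records the call without running it. Parallel consumes the
# generator lazily and dispatches the recorded calls to a pool of workers.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
```

Results come back in input order, so the call behaves like a list comprehension whose body runs on the workers.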
  30. 30. 2 Better serialization for better scaling Serializing arbitrary Python objects cloudpickle e.g. dispatch estimators across the network Python 3.8 improvements Subclassable C persister ⇒ Much faster Out of band serialization ⇒ no memory copies when serializing numpy & arrow PEP 574
  31. 31. 2 Better serialization for better scaling Serializing arbitrary Python objects cloudpickle e.g. dispatch estimators across the network Python 3.8 improvements Subclassable C persister ⇒ Much faster Out of band serialization ⇒ no memory copies when serializing numpy & arrow PEP 574 Language-agnostic predictor representation ONNX sklearn-onnx can convert trained models to other runtimes Working on guaranteeing compliance Useful for deployment to production
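Out-of-band serialization (PEP 574) can be sketched with the standard pickle module and a numpy array. The sketch assumes Python 3.8 or later and a numpy version that supports pickle protocol 5. The array size is arbitrary.

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# With protocol 5, large contiguous buffers are passed to `buffer_callback`
# instead of being copied into the pickle byte stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The payload holds only metadata (dtype, shape, ...); the data travels
# separately and is supplied again at load time.
restored = pickle.loads(payload, buffers=buffers)

payload_size = len(payload)
data_size = arr.nbytes
```

A transport layer can then send the small payload and the raw buffers separately, which avoids intermediate copies of the array data.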
  32. 32. Scaling Algorithmic improvement is top priority External infrastructure helps scaling out Tension with our mission of generic, reusable library ⇒ work on an impedance-matching layer
  33. 33. 3 Bridging to data engineering
  34. 34. 3 Data assembly for statistics “Dirty data” is a central problem Merging data sources Input errors
  35. 35. 3 Machine learning versus data in the wild numbers (in arrays) arrays (of numbers) arrays strings databases schemas A gap between statistics & data engineering
  36. 36. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array
  37. 37. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe: Gender | Date Hired | Employee Position Title; M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I
  38. 38. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe: Gender | Date Hired | Employee Position Title; M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I Non-numerical, heterogeneous data
  39. 39. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe: Gender | Date Hired | Employee Position Title; M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I Non-numerical, heterogeneous data Missing values
  40. 40. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe: Gender | Date Hired | Employee Position Title; M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I Non-numerical, heterogeneous data Missing values Non-normalized entries
  41. 41. 3 Ingesting heterogeneous data: the column transformer Applies different transformers to columns These can be complex pipelines column_trans = compose.make_column_transformer((one_hot_enc, ['Gender', 'Employee Position Title']), (date_trans, 'Date First Hired'),) X = column_trans.fit_transform(df) Dataframe in, array out with heterogeneous preprocessing & feature engineering
  42. 42. 3 Ingesting heterogeneous data: the column transformer Applies different transformers to columns These can be complex pipelines column_trans = compose.make_column_transformer((one_hot_enc, ['Gender', 'Employee Position Title']), (date_trans, 'Date First Hired'),) X = column_trans.fit_transform(df) Dataframe in, array out with heterogeneous preprocessing & feature engineering Separating fitting from transforming Can be applied to new data Avoids data leakage Model selection on dataframes model = make_pipeline(column_trans, HistGradientBoostingClassifier()) scores = cross_val_score(model, df, y) Choose data-engineering operations to maximize prediction
  43. 43. 3 Machine learning with missing data Imputation: replace NA by plausible values Constant imputation sklearn.impute.SimpleImputer Replace by mean of feature Conditional imputation (v0.21) sklearn.impute.IterativeImputer Feature as functions of others
  44. 44. 3 Machine learning with missing data Imputation: replace NA by plausible values Constant imputation sklearn.impute.SimpleImputer Replace by mean of feature Conditional imputation (v0.21) sklearn.impute.IterativeImputer Feature as functions of others For prediction: if y depends on missingness, perfect imputation breaks prediction ⇒ add a missing indicator: IterativeImputer(add_indicator=True) With constant imputation a powerful learner can model missing values On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019 NA in HistGradientBoosting (v0.22)
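The two imputers and the `add_indicator` option can be compared on a tiny array. The data values are arbitrary assumptions. `IterativeImputer` is still marked experimental in scikit-learn, hence the explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data with one missing entry per column.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])

# Mean imputation: each NaN becomes its column mean (3.0 in both columns).
# add_indicator=True appends one 0/1 column per feature that had missing values.
mean_imputed = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: each feature is regressed on the others, iteratively.
cond_imputed = IterativeImputer(add_indicator=True, random_state=0).fit_transform(X)

expected_mean_imputed = np.array(
    [[1.0, 2.0, 0.0, 0.0], [3.0, 4.0, 1.0, 0.0], [5.0, 3.0, 0.0, 1.0]]
)
```

The indicator columns keep the information that a value was missing, which is what lets a downstream learner use missingness itself as a signal.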
  45. 45. 3 Encoding dirty categories Digression: not in scikit-learn One-hot encoding, columns (Police Officer I | Police Oficer II | Police Officer II): Police Officer II → 0 0 1; Police Oficer II → 0 1 0; Police Officer I → 1 0 0 X ∈ R^{n×p}: p grows fast; new categories? link categories? Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, Bus Operator, Bus Opperator, Electrician, Library Assistant I, Social Work IV, Library Manager
  46. 46. 3 Encoding dirty categories Digression: not in scikit-learn Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ... Traditional view: data cleaning, feature engineering Position | Rank: Police Officer | Master; Social Worker | III; Police Officer | II; Social Worker | II; Police Officer | III; ...
  47. 47. 3 Encoding dirty categories https://project.inria.fr/dirtydata Digression: not in scikit-learn Similarity encoding, columns (Police Officer I | Police Oficer II | Police Officer II): Police Officer II → 0.9 0.8 1; Police Oficer II → 0.8 1 0.9; Police Officer I → 1 0.9 0.8 string_distance(Police Officer II, Police Oficer II) https://dirty-cat.github.io Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
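Similarity encoding can be illustrated without any library. In this sketch the string similarity is an assumption: a Jaccard overlap of character 3-grams, chosen for simplicity and not necessarily the measure used by dirty-cat. The function names and the reference categories are hypothetical.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0 to 1)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes):
    """Encode each value as its similarities to a fixed list of reference categories."""
    return [[ngram_similarity(v, p) for p in prototypes] for v in values]


prototypes = ["Police Officer I", "Police Officer II", "Bus Operator"]
encoded = similarity_encode(["Police Oficer II", "Bus Opperator"], prototypes)
```

Unlike one-hot encoding, a misspelled entry lands close to its intended category and does not open a new, unrelated column.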
  48. 48. 3 Encoding dirty categories https://project.inria.fr/dirtydata Digression: not in scikit-learn Modeling substrings [Figure: categories (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) against feature names built from substrings; labels truncated in the source: "ssistant, library", "uipment, operator", "ation, specialist", "worker, warehouse", "program, manager", "chanic, community", ", rescuer, rescue", "rrection, officer"] Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
  49. 49. @GaelVaroquaux Democratizing machine learning Machine learning for everyone – from beginner to expert Agile development, good numerics, collaboration & user focus Scalability via light coupling to infrastructure and ecosystem Ongoing research on machine learning with dirty data Sustainability: community + sponsors
