<p>Once an obscure branch of applied mathematics, machine learning is now the darling of tech. I will talk about lessons learned democratizing machine learning: how libraries like scikit-learn were designed to empower users, simplifying but avoiding ambiguous behaviors; how the Python data ecosystem was built from scientific computing tools, and the importance of good numerics; how some machine-learning patterns easily provide value in real-world situations. I will also discuss the remaining challenges and the progress that we are making. Scaling up brings different bottlenecks to numerics. Integrating data into statistical models, a hurdle to data-science practice, requires rethinking data-cleaning pipelines.</p><p>This talk draws from my experience as a scikit-learn developer, but also as a researcher in machine learning and applications.</p>
6. scikit-learn
Data-science for the many, not only the mighty
Data scientists: largest data processed (poll by KDnuggets)
[Figure: histogram of responses by dataset size, in bins from less than 1 MB to over 100 PB; y-axis from 0% to 20%]
huge = 10 to 100GB
7. scikit-learn
Data-science for the many, not only the mighty
Data scientists:
[Figure: histogram of responses by dataset size, in bins from less than 1 MB to over 100 PB; y-axis from 0% to 20%; one series per year: 2013, 2014, 2015, 2016, 2018]
no increase with time
8. Challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
1. Dirty data
2. Talent
3. Money
9. Challenges to data science
Databricks survey: nearly 90% of organizations invest in AI, very few succeed
Challenges reported:
98%: preparation and aggregation of large datasets
96%: data exploration and iterative model training
https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding
11. 1 Building a toolkit for all
The scikit-learn story
12. 1 Embracing the Python stack
Python
Interactive, easy
General-purpose
Crucially, Python was made for embedding
⇒ simple virtual machine
(ref-counting garbage collection, no transactional memory)
13. 1 Embracing the Python stack
numpy
[Figure: an array of numbers]
Numerical operations
Contiguous-memory model (float*)
14. 1 Embracing the Python stack
Enables bridging across languages (e.g. to LAPACK, or via Cython)
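The memory model named on this slide can be inspected from Python. The following is a minimal sketch, assuming only that numpy is installed; the array contents are arbitrary. It shows that an array is a typed block of memory described by strides, which is what lets compiled code (C, Fortran, Cython) work on the same buffer without a copy.

```python
import numpy as np

# A 3x4 array of doubles stored row by row in one block of memory.
a = np.arange(12, dtype=np.float64).reshape(3, 4)

is_c_contiguous = a.flags["C_CONTIGUOUS"]  # row-major layout
strides = a.strides                        # bytes to step along each axis

# Selecting a column gives a strided view on the same memory, not a copy.
col = a[:, 1]
col_is_contiguous = col.flags["C_CONTIGUOUS"]

# Compiled routines usually expect a specific layout; these helpers copy
# only when the layout does not already match.
col_c = np.ascontiguousarray(col)  # C order (row-major)
a_f = np.asfortranarray(a)         # Fortran order (column-major)

# Integer address of the first element, usable as a float64 pointer
# by foreign code.
address = a.ctypes.data
```

Fortran-ordered copies matter because LAPACK routines are written for column-major storage.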
15. 1 Focus on usability
API design
Grey box: all models interchangeable,
but still inspectable
Documentation & examples
Good documentation required to add a feature
Easy-to-understand examples guide API design
Teach statistical learning, rather than code
Models, solvers, hyperparameters
Choices that do not require tinkering
Lots of usecase-driven empirical testing
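The "grey box" idea can be illustrated with a short sketch, assuming scikit-learn is installed; the synthetic dataset and the choice of the two models are illustrative only. Both estimators are driven through the same fit / score calls, yet each still exposes its fitted parameters for inspection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Interchangeable: the same calls work for every estimator.
models = [LogisticRegression(), DecisionTreeClassifier(random_state=0)]
for model in models:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))

# Inspectable: fitted attributes end with an underscore.
coef = models[0].coef_                         # linear weights
importances = models[1].feature_importances_   # tree-based importances
```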
16. 1 Community-driven development
Our DNA: distributed development & decision making
Gave manpower
[Figure: number of monthly contributors, 2010 to 2018; y-axis from 0 to 50]
and the right focus:
people fix & improve what’s important to them
17. 1 Community-driven development
Open source has won
But it needs sustainability and investment
18. 1 Community-driven development
mid-2018: A foundation for scikit-learn
20. 2 Algorithmic improvements
PCA
Cost: $O(n \, p \, \min(n, p))$
Randomized PCA (simplified intuitions)
1. Loop: take a random fraction of the data
2. Small PCA on that fraction
3. Aggregate results via PCA across results
svd_solver='auto': up to ×10 speedup
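As a rough illustration, the sketch below compares the exact and randomized solvers that scikit-learn's PCA exposes through `svd_solver`. It assumes numpy and scikit-learn are installed; the low-rank synthetic data is an assumption chosen so that both solvers should agree closely. It does not reproduce the simplified three-step intuition above, it only shows how the solver is selected.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data: a rank-10 signal plus a little noise (illustration only).
X = rng.randn(2000, 10) @ rng.randn(10, 300) + 0.01 * rng.randn(2000, 300)

# Exact solver: full SVD of the centered data.
full = PCA(n_components=5, svd_solver="full").fit(X)

# Randomized solver: approximates only the leading components.
rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

print(full.explained_variance_ratio_)
print(rand.explained_variance_ratio_)
```

With `svd_solver='auto'`, scikit-learn picks a solver from the data shape and the number of requested components.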
21. 2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression
Gradient descent on error measure: $w_{i+1} = w_i - \alpha \nabla_w f$
Large n = costly gradient computation
Full gradient: costly
Sub-sampling in gradient: finicky
Sub-sampling + noise reduction, solver='saga': fast & easy
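A minimal sketch of selecting this solver, assuming scikit-learn is installed. The synthetic dataset and its parameters are illustrative, and the scaling step is added because stochastic solvers tend to converge faster on standardized features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated classification problem (illustration only).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# SAGA: stochastic (sub-sampled) gradients with variance reduction.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000, random_state=0),
)
model.fit(X, y)
acc = model.score(X, y)  # training accuracy
```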
22. 2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression sub-sampling + noise reduction
Gradient-boosted trees fit on sufficient summary
Succession of decision trees that enrich each other
[Figure: successive fits at iterations 1, 2, and 3]
Speedup: bin data and compute histograms
HistGradientBoostingRegressor (v0.21)
catches up with XGBoost & LightGBM
23. 2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression sub-sampling + noise reduction
Gradient-boosted trees fit on sufficient summary
Fit on several subsamples / chunks
+ aggregation or variance reduction
Fit on summary statistics
26. 2 Scaling out: parallel computing
Implementations (used by scikit-learn)
Inner loops (fast)
OS threads
OpenMP (from GCC, ICC, clang)
all in same process
Large-scale parallelism
Across Python VMs,
Across computers
Transfer
Synchronization
27. 2 Scaling out: parallel computing
Real life = a merry mess: oversubscription, inefficient transfers
A scheduling problem. But: we need a simple API to focus on algorithmics
scikit-learn is a library: doesn’t own the “main”
28. 2 Our abstraction: joblib’s parallel for
joblib.Parallel()(joblib.delayed(f)(i) for i in ...)
lazy evaluation
Multiprocessing / loky backend
manages a pool of Python VMs; segfault-resilient
lazy loop consumption to limit memory usage
auto-bunching dispatch to lower overhead
limits # threads in sub-process (threadpoolctl)
29. 2 Our abstraction: joblib’s parallel for
Extensible backend API (e.g. dask)
delegates scheduling (eg to a framework)
still a dispatch / receive queue
overflows the memory of greedy schedulers
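The parallel-for pattern above can be sketched as follows, assuming joblib is installed. The function `sqrt` and the choice of `n_jobs=2` are illustrative. The generator expression is what makes evaluation lazy: tasks are created only as the pool consumes them.

```python
from math import sqrt

from joblib import Parallel, delayed

# Process-based parallelism (loky is joblib's default backend).
# delayed(sqrt)(x) records the call without running it.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

# Same loop, hinting that threads are preferred (useful when the function
# releases the GIL, as many compiled numerical routines do).
results_threads = Parallel(n_jobs=2, prefer="threads")(
    delayed(sqrt)(i ** 2) for i in range(10)
)
```

Results come back in the order of the input iterable, regardless of which worker finished first.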
30. 2 Better serialization for better scaling
Serializing arbitrary Python objects: cloudpickle
e.g. dispatch estimators across the network
Python 3.8 improvements:
Subclassable C persister ⇒ much faster
Out-of-band serialization (PEP 574) ⇒ no memory copies when serializing numpy & arrow
31. 2 Better serialization for better scaling
Language-agnostic predictor representation: ONNX
sklearn-onnx can convert trained models to other runtimes
Working on guaranteeing compliance
Useful for deployment to production
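Out-of-band serialization from PEP 574 can be sketched with the standard `pickle` module. This assumes Python 3.8 or later and a numpy version that supports pickle protocol 5; the array size is arbitrary. The large data buffer is handed to a callback instead of being copied into the pickle stream.

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# With protocol 5, large buffers are passed to buffer_callback rather than
# being embedded in the byte stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The stream now holds only metadata; the data travels separately and can be
# sent without an intermediate copy.
print(len(payload), "bytes of metadata,", len(buffers), "out-of-band buffer(s)")

# The receiver supplies the buffers back when loading.
restored = pickle.loads(payload, buffers=buffers)
```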
32. Scaling
Algorithmic improvement is top priority
External infrastructure helps scaling out
Tension with our mission of a generic, reusable library
⇒ work on an impedance-matching layer
34. 3 Data assembly for statistics
“Dirty data” is a central problem
Merging data sources
Input errors
35. 3 Machine learning versus data in the wild
numbers (in arrays)
arrays (of numbers)
arrays
strings
databases
schemas
A gap between statistics & data engineering
36. 3 Machine learning versus data in the wild
Machine learning
Let $X \in \mathbb{R}^{n \times p}$
or a numpy array
37. 3 Machine learning versus data in the wild
Real life: often a pandas dataframe
Gender | Date Hired | Employee Position Title
M      | NA         | Master Police Officer
F      | 09/12/1988 | Social Worker IV
M      | 07/16/2007 | Police Officer III
F      | 02/05/2007 | Police Aide
M      | 01/13/2014 | Electrician I
M      | 04/28/2002 | Bus Operator
M      | NA         | Bus Operator
F      | 06/26/2006 | Social Worker III
F      | 01/26/2000 | Library Assistant I
M      | NA         | Library Assistant I
38. 3 Machine learning versus data in the wild
Non-numerical, heterogeneous data
39. 3 Machine learning versus data in the wild
Missing values
40. 3 Machine learning versus data in the wild
Non-normalized entries
41. 3 Ingesting heterogeneous data: the column transformer
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
Dataframe in, array out
with heterogeneous preprocessing & feature engineering
42. 3 Ingesting heterogeneous data: the column transformer
Separating fitting from transforming
Can be applied to new data
Avoids data leakage
Model selection on dataframes
model = make_pipeline(column_trans,
HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
Choose data-engineering operations to maximize prediction
43. 3 Machine learning with missing data
Imputation: replace NA by plausible values
Constant imputation
sklearn.impute.SimpleImputer
Replace by mean of feature
Conditional imputation (v0.21)
sklearn.impute.IterativeImputer
Features as functions of others
44. 3 Machine learning with missing data
For prediction
If y depends on missingness, perfect imputation breaks prediction
⇒ add a missing indicator: IterativeImputer(add_indicator=True)
With constant imputation a powerful learner can model missing values
On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019
NA in HistGradientBoosting v0.22
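Both imputers and the missing indicator can be sketched as below, assuming numpy and scikit-learn are installed. The small array is made up for illustration. `IterativeImputer` is still flagged experimental in scikit-learn, hence the explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data with one missing value per column (illustration only).
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

# Constant imputation by the feature mean, plus one binary indicator column
# per feature that had missing values at fit time.
mean_imp = SimpleImputer(strategy="mean", add_indicator=True)
X_mean = mean_imp.fit_transform(X)

# Conditional imputation: each feature is modeled as a function of the
# others; the indicator keeps the missingness pattern visible to the learner.
iter_imp = IterativeImputer(add_indicator=True, random_state=0)
X_iter = iter_imp.fit_transform(X)
```

In both outputs the first two columns are the imputed features and the last two are the missingness indicators.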
45. 3 Encoding dirty categories
Digression: not in scikit-learn
One-hot encoding
                  | ... | Police Officer I | Police Oficer II | Police Officer II
Police Officer II | ... | 0                | 0                | 1
Police Oficer II  | ... | 0                | 1                | 0
Police Officer I  | ... | 1                | 0                | 0
$X \in \mathbb{R}^{n \times p}$: p grows fast
new categories?
link categories?
Employee Position Title
master police officer
social worker III
Police Officer III
Social Worker II
Police Oficer II
Bus Operator
Bus Opperator
Electrician
Library Assistant I
Social Work IV
Library Manager
46. 3 Encoding dirty categories
Digression: not in scikit-learn
Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
Traditional view: data cleaning, feature engineering
Position       | Rank
Police Officer | Master
Social Worker  | III
Police Officer | II
Social Worker  | II
Police Officer | III
...
47. 3 Encoding dirty categories https://project.inria.fr/dirtydata
Digression: not in scikit-learn
Similarity encoding
                  | ... | Police Officer I | Police Oficer II | Police Officer II
Police Officer II | ... | 0.9              | 0.8              | 1
Police Oficer II  | ... | 0.8              | 1                | 0.9
Police Officer I  | ... | 1                | 0.9              | 0.8
string_distance(Police Officer II, Police Oficer II)
https://dirty-cat.github.io
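The idea of similarity encoding can be sketched in plain Python. This is not the dirty-cat implementation: the helper names, the character 3-gram Jaccard similarity used as `string_distance`, and the choice of prototypes are all assumptions made for illustration.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = " " + text.lower() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between n-gram sets: 1.0 for identical strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:  # both strings too short to produce any n-gram
        return 1.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, prototypes, n=3):
    """One row per value, one column per prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


prototypes = ["Police Officer I", "Police Oficer II", "Police Officer II"]
values = ["Police Officer II", "Police Oficer II", "Police Officer I", "Bus Operator"]
encoded = similarity_encode(values, prototypes)
for value, row in zip(values, encoded):
    print(value, [round(s, 2) for s in row])
```

Unlike one-hot encoding, a misspelled or unseen category still gets a meaningful row, since it is described by how close it is to each known category.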
48. 3 Encoding dirty categories https://project.inria.fr/dirtydata
Digression: not in scikit-learn
Modeling substrings
[Figure: feature activations for categories of Employee Position Title (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant). Feature names are built from substrings, partly truncated in extraction: "…ssistant, library", "…uipment, operator", "…ation, specialist", "worker, warehouse", "program, manager", "…chanic, community", "…, rescuer, rescue", "…rrection, officer"]
49. @GaelVaroquaux
Democratizing machine learning
Machine learning for everyone
– from beginner to expert
Agile development, good numerics, collaboration & user focus
Scalability via light coupling to infrastructure and ecosystem
Ongoing research on machine learning with dirty data
Sustainability via sponsors