Democratizing machine learning:
perspective from a scikit-learn creator
Gaël Varoquaux,
scikit-learn: machine learning in Python
scikit-learn
From nerds to an industry standard
[Figure: number of monthly users, 2010 to 2018; y-axis from 200,000 to 800,000]
scikit-learn
We were not aiming for the enterprise
but rather: ourselves, scientists, students
scikit-learn
Data-science for the many, not only the mighty
The news:
Data scientists: largest data processed (poll by KDnuggets)
[Figure: share of respondents per size of the largest dataset processed, in bins from less than 1 MB to over 100 PB; y-axis from 0% to 20%]
huge = 10 to 100 GB
[Figure: the same poll, one curve per year: 2013, 2014, 2015, 2016, 2018]
no increase with time
Challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
1. Dirty data
2. Talent
3. Money
Databricks survey: nearly 90% of organizations invest in AI, very few succeed
Challenges reported:
98%: preparation and aggregation of large datasets
96%: data exploration and iterative model training
https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding
Democratizing machine learning
1 Building a toolkit for all
2 Tackling scalability
3 Bridging to data engineering
1 Building a toolkit for all
The scikit-learn story
1 Embracing the Python stack
Python
Interactive, easy General-purpose
Crucially, Python was made for embedding
⇒ simple virtual machine
(ref-counting garbage collection, no transactional memory)
numpy
[Figure: a grid of digits standing for an array of numbers]
Numerical operations
Contiguous-memory model (float*)
Enables bridging across languages (e.g. for LAPACK), Cython
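The contiguous-memory model can be checked from Python. This is a minimal sketch, assuming only NumPy is installed; the array contents are arbitrary and chosen for illustration.

```python
import numpy as np

# A 2D array of doubles is one contiguous block of memory,
# which is what compiled code sees as a plain pointer.
a = np.arange(12, dtype=np.float64).reshape(3, 4)
is_contiguous = a.flags["C_CONTIGUOUS"]  # row-major, no gaps between elements
strides = a.strides                      # bytes to step along each axis

# Taking every other column gives a view on the same memory: nothing is
# copied, but the view is no longer contiguous.
view = a[:, ::2]
shares = np.shares_memory(a, view)

# Compiled routines (LAPACK, Cython) usually want a contiguous buffer,
# so a packed copy is made when needed.
packed = np.ascontiguousarray(view)
```

Passing the packed buffer to compiled code is what makes the bridge to other languages cheap: no per-element conversion is needed.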
1 Focus on usability
API design
Grey box: all models interchangeable,
but still inspectable
Documentation & examples
Good documentation required to add a feature
Easy-to-understand examples guide API design
Teach statistical learning, rather than code
Models, solvers, hyperparameters
Choices that do not require tinkering
Lots of use-case-driven empirical testing
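The "grey box" idea can be illustrated with two unrelated models used through the same interface. This is a sketch on synthetic data; the choice of models and parameters is an assumption made for the example, not something taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Every estimator exposes the same fit / predict / score interface...
models = [LogisticRegression(), DecisionTreeClassifier(max_depth=3, random_state=0)]
accuracies = {}
for model in models:
    model.fit(X, y)
    accuracies[type(model).__name__] = model.score(X, y)

# ...yet fitted attributes (trailing underscore) stay open to inspection.
linear_coefs = models[0].coef_                      # one weight per feature
tree_importances = models[1].feature_importances_   # one importance per feature
```

Swapping one model for another changes a single line, while the fitted attributes remain model-specific.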
1 Community-driven development
Our DNA: distributed development & decision making
Gave manpower
[Figure: number of monthly contributors, 2010 to 2018; y-axis from 0 to 50]
and the right focus
People fix & improve what's important to them
Open source has won
But it needs sustainability and investment
mid-2018: A foundation for scikit-learn
2 Tackling scalability
A new challenge
2 Algorithmic improvements
PCA
Cost: n p min(n, p)
Randomized PCA (simplified intuitions)
1. loop: take a random fraction of the data
2. small PCA on that fraction
3. aggregate results via PCA across results
svd_solver='auto': up to ×10 speedup
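As a rough check of the speed/accuracy trade-off, the exact and randomized solvers can be compared on synthetic low-rank data. This sketch uses scikit-learn's `svd_solver="randomized"` explicitly (with `"auto"`, the solver is picked from the data shape and `n_components`); the data sizes are arbitrary assumptions, and the library's randomized solver is not claimed to follow the simplified three-step intuition literally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data: 2000 samples, 300 features, rank 5 plus a little noise.
X = rng.randn(2000, 5) @ rng.randn(5, 300) + 0.01 * rng.randn(2000, 300)

exact = PCA(n_components=5, svd_solver="full").fit(X)
approx = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

# Largest disagreement between the two solvers on explained variance.
max_gap = np.max(np.abs(exact.explained_variance_ratio_
                        - approx.explained_variance_ratio_))
```

On data like this the gap should be tiny; timing both fits with `time.perf_counter` on larger arrays is the natural next step.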
2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression
Gradient descent on error measure: w_{i+1} = w_i - α ∇_w f
Large n = costly gradient computation
Full gradient: costly
Sub-sampling in gradient: finicky
Sub-sampling + noise reduction (solver='saga'): fast & easy
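Selecting the SAGA solver is a one-argument change. A minimal sketch on synthetic data follows; the sample size, `max_iter`, and the scaling step are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
# Stochastic solvers converge faster when features share a similar scale.
X = StandardScaler().fit_transform(X)

# solver="saga": sub-sampled gradients with variance (noise) reduction.
clf = LogisticRegression(solver="saga", max_iter=200, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```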
2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression sub-sampling + noise reduction
Gradient-boosted trees fit on sufficient summary
Succession of decision trees that enrich each other
[Figure panels: Iteration 1, Iteration 2, Iteration 3]
Speedup: bin data and compute histograms
HistGradientBoostingRegressor v0.21
catch up with XGBoost & lightgbm
2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression sub-sampling + noise reduction
Gradient-boosted trees fit on sufficient summary
Fit on several subsamples / chunks
+ aggregation or variance reduction
Fit on summary statistics
2 Scaling out: parallel computing
Simple parallel computing schemes limiting data transfer
Data parallel: mostly in inner loops
Model parallel: for model selection, e.g. GridSearchCV
[Figure: digit arrays standing for the data handled under each scheme]
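Model-parallel selection is exposed through the `n_jobs` parameter. A minimal sketch on synthetic data; the grid, the number of folds, and the number of workers are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each (parameter, fold) pair is an independent fit, so fits can run in parallel.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    n_jobs=2,  # number of parallel workers, an arbitrary choice here
)
search.fit(X, y)
best_C = search.best_params_["C"]
```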
Real-life machine-learning
2 Scaling out: parallel computing
Implementations (used by scikit-learn)
Inner loops (fast)
OS threads
OpenMP (from GCC, ICC, clang)
all in same process
Large-scale parallelism
Across Python VMs,
Across computers
Transfer
Synchronization
Real life = a merry mess: oversubscription, inefficient transfer
A scheduling problem. But: need a simple API to focus on algorithmics
scikit-learn is a library: it doesn’t own the “main”
2 Our abstraction: joblib’s parallel for
joblib.Parallel()(joblib.delayed(f)(i) for i in ...)
lazy evaluation
Multiprocessing / loky backend
manages a pool of Python VMs, segfault-resilient
lazy loop consumption to limit memory usage
auto-bunching dispatch to lower overhead
limits # threads in sub-process (threadpoolctl)
Extendable backend API (eg dask)
delegates scheduling (eg to a framework)
still a dispatch / receive queue
overflows the memory of greedy schedulers
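The parallel-for pattern and the backend switch look as follows in practice. A small sketch using `math.sqrt` as a stand-in workload; the worker count and the choice of the threading backend are assumptions for the example.

```python
from math import sqrt
from joblib import Parallel, delayed, parallel_backend

# delayed(sqrt)(x) records the call without running it; Parallel consumes the
# generator lazily and dispatches the calls to worker processes.
roots = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

# The same loop can be handed to another backend without touching the loop.
with parallel_backend("threading", n_jobs=2):
    roots_threads = Parallel()(delayed(sqrt)(i ** 2) for i in range(10))
```

Because the loop body never names a backend, a library can write it once and let the caller decide where it runs.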
2 Better serialization for better scaling
Serializing arbitrary Python objects: cloudpickle
e.g. dispatch estimators across the network
Python 3.8 improvements
Subclassable C persister ⇒ much faster
Out-of-band serialization (PEP 574) ⇒ no memory copies when serializing numpy & arrow
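Out-of-band serialization can be tried directly with the standard `pickle` module. This sketch assumes Python 3.8 or later and a NumPy version that supports pickle protocol 5; the array size is arbitrary.

```python
import pickle
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)  # about 8 MB of data

# In-band: the array bytes are copied into the pickle stream.
in_band = pickle.dumps(a, protocol=5)

# Out-of-band: the stream keeps only metadata; the data is handed over as
# PickleBuffer objects that a transport layer can send without copying.
buffers = []
out_of_band = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)

# The receiver supplies the buffers back when loading.
restored = pickle.loads(out_of_band, buffers=buffers)
```

Comparing `len(in_band)` with `len(out_of_band)` shows how little remains in the stream once the data travels separately.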
Language-agnostic predictor representation: ONNX
sklearn-onnx can convert trained models to other runtimes
Working on guaranteeing compliance
Useful for deployment to production
Scaling
Algorithmic improvement is top priority
External infrastructure helps scaling out
Tension with our mission of generic, reusable library
⇒ work on an impedance-matching layer
3 Bridging to data engineering
3 Data assembly for statistics
“Dirty data” is a central problem
Merging data sources
Input errors
3 Machine learning versus data in the wild
numbers (in arrays)
arrays (of numbers)
arrays
strings
databases
schemas
A gap between statistics & data engineering
3 Machine learning versus data in the wild
Machine learning
Let X ∈ R^{n×p}
or a numpy array
Real life: often a pandas dataframe
Gender  Date Hired  Employee Position Title
M       NA          Master Police Officer
F       09/12/1988  Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I
Non-numerical, heterogeneous data
Missing values
Non-normalized entries
3 Ingesting heterogeneous data: the column transformer
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
Dataframe in, array out
with heterogeneous preprocessing & feature engineering
Separating fitting from transforming
Can be applied to new data
Avoids data leakage
Model selection on dataframes
model = make_pipeline(column_trans,
HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
Choose data-engineering operations to maximize prediction
3 Machine learning with missing data
Imputation: replace NA by plausible values
Constant imputation: sklearn.impute.SimpleImputer
Replace by mean of feature
Conditional imputation (v0.21): sklearn.impute.IterativeImputer
Feature as functions of others
For prediction
If y depends on missingness, perfect imputation breaks prediction
⇒ add a missing indicator: IterativeImputer(add_indicator=True)
With constant imputation a powerful learner can model missing values
On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019
NA support in HistGradientBoosting (v0.22)
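Both imputers, with the missing indicator switched on, can be compared on a toy matrix. A minimal sketch; the numbers are invented, and `IterativeImputer` is assumed to still require its experimental import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data, invented for the example: the second column is twice the first.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [np.nan, 10.0],
              [6.0, 12.0]])

# Mean imputation, plus one binary "was missing" column per incomplete feature.
mean_imputed = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: each feature is modeled as a function of the others.
cond_imputed = IterativeImputer(add_indicator=True, random_state=0).fit_transform(X)
```

The indicator columns let a downstream learner use the missingness itself as a signal, which matters when y depends on it.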
3 Encoding dirty categories
Digression: not in scikit-learn
One-hot encoding
                   ...  Police Officer I  Police Oficer II  Police Officer II
Police Officer II  ...  0                 0                 1
Police Oficer II   ...  0                 1                 0
Police Officer I   ...  1                 0                 0
X ∈ R^{n×p}: p grows fast
new categories?
link categories?
Employee Position Title
master police officer
social worker III
Police Officer III
Social Worker II
Police Oficer II
Bus Operator
Bus Opperator
Electrician
Library Assistant I
Social Work IV
Library Manager
3 Encoding dirty categories
Digression: not in scikit-learn
Employee Position Title
master police officer
social worker III
Police Officer III
Social Worker II
Police Oficer II
...
Traditional view: data cleaning, feature engineering
Position        Rank
Police Officer  Master
Social Worker   III
Police Officer  II
Social Worker   II
Police Officer  III
...
3 Encoding dirty categories https://project.inria.fr/dirtydata
Digression: not in scikit-learn
Similarity encoding
                   ...  Police Officer I  Police Oficer II  Police Officer II
Police Officer II  ...  0.9               0.8               1
Police Oficer II   ...  0.8               1                 0.9
Police Officer I   ...  1                 0.9               0.8
string_distance(Police Officer II, Police Oficer II)
https://dirty-cat.github.io
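One way to obtain such similarity columns is to compare character n-grams. The sketch below uses a Jaccard similarity on 3-grams; this particular measure and the helper names (`ngrams`, `ngram_similarity`, `similarity_encode`) are assumptions for illustration and are not claimed to match what dirty-cat implements.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes):
    """One row per value, one column per prototype category."""
    return [[ngram_similarity(v, p) for p in prototypes] for v in values]


prototypes = ["Police Officer I", "Police Oficer II", "Police Officer II"]
encoded = similarity_encode(["Police Officer II", "Bus Operator"], prototypes)
```

Unlike one-hot encoding, a misspelled or unseen category still lands close to its neighbours, and the number of columns is fixed by the prototypes rather than by every distinct string.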
3 Encoding dirty categories https://project.inria.fr/dirtydata
Digression: not in scikit-learn
Modeling substrings
[Figure: categories against feature names built from substrings]
Categories: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant
Feature names (truncated in the source): "ssistant, library"; "uipment, operator"; "ation, specialist"; "worker, warehouse"; "program, manager"; "chanic, community"; ", rescuer, rescue"; "rrection, officer"
@GaelVaroquaux
Democratizing machine learning
Machine learning for everyone
– from beginner to expert
Agile development, good numerics, collaboration & user focus
Scalability via light coupling to infrastructure and ecosystem
Ongoing research on machine learning with dirty data
Sustainability via sponsors

Weitere ähnliche Inhalte

Was ist angesagt?

Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Databricks
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesDatabricks
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesLynn Langit
 
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Data Con LA
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at WalgreensDataWorks Summit
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsBlue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
 
The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi Databricks
 
Seeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing ForeverSeeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing ForeverInside Analysis
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Big Data Spain
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancementsAccelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancementsStamatis Zampetakis
 
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoCodecamp Romania
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemDataWorks Summit
 

Was ist angesagt? (20)

Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data Pipelines
 
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsBlue Pill/Red Pill: The Matrix of Thousands of Data Streams
Blue Pill/Red Pill: The Matrix of Thousands of Data Streams
 
The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi
 
Seeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing ForeverSeeing Redshift: How Amazon Changed Data Warehousing Forever
Seeing Redshift: How Amazon Changed Data Warehousing Forever
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancementsAccelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
 
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
Intuit Analytics Cloud 101
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystem
 

Ähnlich wie Democratizing ML: scikit-learn creator perspective

How to make your data scientists happy
How to make your data scientists happy How to make your data scientists happy
How to make your data scientists happy Hussain Sultan
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsJason Riedy
 
Borys Pratsiuk "How to be NVidia partner"
Borys Pratsiuk "How to be NVidia partner"Borys Pratsiuk "How to be NVidia partner"
Borys Pratsiuk "How to be NVidia partner"Lviv Startup Club
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User StudyEnrico Daga
 
Fighting Financial Crime with Artificial Intelligence
Fighting Financial Crime with Artificial IntelligenceFighting Financial Crime with Artificial Intelligence
Fighting Financial Crime with Artificial IntelligenceDataWorks Summit
 
DNA - Einstein - Data science ja bigdata
DNA - Einstein - Data science ja bigdataDNA - Einstein - Data science ja bigdata
DNA - Einstein - Data science ja bigdataRolf Koski
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Big Data Spain
 
Streaming and Visual Data Discovery for the Internet of Things
Streaming and Visual Data Discovery for the Internet of ThingsStreaming and Visual Data Discovery for the Internet of Things
Streaming and Visual Data Discovery for the Internet of ThingsDatawatchCorporation
 
Big Data & Analytics (Conceptual and Practical Introduction)
Big Data & Analytics (Conceptual and Practical Introduction)Big Data & Analytics (Conceptual and Practical Introduction)
Big Data & Analytics (Conceptual and Practical Introduction)Yaman Hajja, Ph.D.
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsNeo4j
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"MDS ap
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life RevolutionCapgemini
 
Predictive Analytics: Why (I)IoT Is Different
Predictive Analytics: Why (I)IoT Is DifferentPredictive Analytics: Why (I)IoT Is Different
Predictive Analytics: Why (I)IoT Is DifferentAltoros
 
BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics Dell Statisti...
BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics  Dell Statisti...BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics  Dell Statisti...
BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics Dell Statisti...Big Data Week
 

Democratizing ML: scikit-learn creator perspective

  • 1. Democratizing machine learning: perspective from a scikit-learn creator. Gaël Varoquaux. scikit-learn: machine learning in Python
  • 3. scikit-learn: From nerds to an industry standard. [Figure: number of monthly users, 2010 to 2018, y-axis from 200,000 to 800,000]
  • 4. scikit-learn: We were not aiming for the enterprise, but rather for ourselves: scientists, students
  • 5. scikit-learn Data-science for the many, not only the mighty The news:
  • 6. scikit-learn: Data science for the many, not only the mighty. Data scientists: largest data processed (poll by KDnuggets). [Figure: histogram of answers per dataset-size bin, from less than 1 MB to over 100 PB, y-axis 0% to 20%] Huge = 10 to 100 GB
  • 7. scikit-learn: Data science for the many, not only the mighty. Data scientists: [Figure: the same histogram, from less than 1 MB to over 100 PB, y-axis 0% to 20%, one curve per year: 2013, 2014, 2015, 2016, 2018] No increase with time
  • 9. Challenges to data science (www.kaggle.com/ash316/novice-to-grandmaster): 1. Dirty data 2. Talent 3. Money. Databricks survey: 90% of organizations invest in AI, few succeed. Challenges reported: 98%: preparation and aggregation of large datasets; 96%: data exploration and iterative model training. https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding
  • 10. Democratizing machine learning: 1 Building a toolkit for all; 2 Tackling scalability; 3 Bridging to data engineering
  • 11. 1 Building a toolkit for all: the scikit-learn story
  • 14. 1 Embracing the Python stack. Python: interactive, easy, general-purpose. Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory). numpy: numerical operations on a contiguous-memory model (float*) [Figure: a block of numbers laid out as an array]. Enables bridging across languages (e.g. for LAPACK) and Cython
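As a rough illustration of the contiguous-memory model on slide 14, the sketch below checks that a numpy array is one flat block of floats, then reads and writes it through a raw C `float*` pointer. This is the property that makes handing data to C, Fortran (e.g. LAPACK) or Cython cheap. The array contents and variable names are illustrative assumptions, not taken from the talk.

```python
import ctypes

import numpy as np

# A small 2x3 float32 array; the values are arbitrary (illustrative only).
a = np.arange(6, dtype=np.float32).reshape(2, 3)

# numpy stores freshly created arrays as one contiguous, row-major block.
is_contiguous = bool(a.flags["C_CONTIGUOUS"])

# View the same memory as a C `float*`, without copying anything.
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

# Reading through the pointer walks the flat buffer in row-major order.
flat_values = [ptr[i] for i in range(a.size)]

# Writing through the pointer is visible from numpy: it is the same memory.
ptr[0] = 42.0
```

Because no copy is made, a compiled routine receiving `ptr` would operate directly on the data that Python sees.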
  • 15. 1 Focus on usability. API design: grey box, all models interchangeable but still inspectable. Documentation & examples: good documentation is required to add a feature; easy-to-understand examples guide API design; teach statistical learning, rather than code. Models, solvers, hyperparameters: choices that do not require tinkering; lots of use-case-driven empirical testing
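The "grey box" idea on slide 15 can be sketched as follows: two unrelated models are fitted and scored through the same `fit` / `score` calls, yet each still exposes its learned parameters for inspection. The synthetic dataset and the choice of these two estimators are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Interchangeable: the loop does not care which model it is handling.
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]
scores = {}
for model in models:
    model.fit(X, y)
    scores[type(model).__name__] = model.score(X, y)

# Inspectable: fitted attributes (trailing underscore) expose what was learned.
coef = models[0].coef_                        # one weight per feature
importances = models[1].feature_importances_  # one importance per feature
```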
  • 18. 1 Community-driven development. Our DNA: distributed development & decision making. Gave manpower [Figure: number of monthly contributors, 2010 to 2018, y-axis 0 to 50] and the right focus: people fix & improve what’s important to them. Open source has won, but it needs sustainability and investment. Mid-2018: a foundation for scikit-learn
  • 19. 2 Tackling scalability: a new challenge
  • 20. 2 Algorithmic improvements. PCA: cost n · p · min(n, p). Randomized PCA (simplified intuitions): 1. loop: take a random fraction of the data; 2. small PCA on that fraction; 3. aggregate results via PCA across results. svd_solver='auto': up to ×10 speedup
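A minimal sketch of the trade-off on slide 20, assuming synthetic low-rank data: it fits `PCA` once with the exact solver and once with the randomized solver, then compares the explained variance. Note that scikit-learn's randomized solver relies on random projections rather than literal sub-sampling; the slide itself calls its description a simplified intuition. Sizes and names here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data: rank-5 signal plus a little noise (assumption for the demo).
n, p, rank = 2000, 200, 5
X = rng.randn(n, rank) @ rng.randn(rank, p) + 0.01 * rng.randn(n, p)

# Exact solver: cost grows like n * p * min(n, p).
full = PCA(n_components=rank, svd_solver="full").fit(X)

# Randomized solver: approximates only the leading components.
randomized = PCA(
    n_components=rank, svd_solver="randomized", random_state=0
).fit(X)

# Largest disagreement between the two on explained variance ratios.
max_gap = np.max(
    np.abs(full.explained_variance_ratio_ - randomized.explained_variance_ratio_)
)
```

With `svd_solver='auto'`, scikit-learn picks a solver based on the data shape and the number of requested components.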
  • 21. 2 Algorithmic improvements. PCA: aggregate across sub-samples ⇒ ×10. Logistic regression: gradient descent on an error measure f: w_{i+1} = w_i − α ∇_w f. Large n = costly gradient computation. Full gradient: costly. Sub-sampling in gradient: finicky. Sub-sampling + noise reduction, solver='saga': fast & easy
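The `solver='saga'` option mentioned on slide 21 can be tried as below. This is a small sketch on synthetic data, not a benchmark: the dataset, the scaling step and `max_iter` are assumptions chosen so that the example converges quickly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic classification problem (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Stochastic solvers such as SAGA converge faster on scaled features.
X = StandardScaler().fit_transform(X)

# SAGA: gradients on sub-samples, with a variance-reduction correction.
clf = LogisticRegression(solver="saga", max_iter=1000, random_state=0)
clf.fit(X, y)

train_accuracy = clf.score(X, y)
```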
  • 22. 2 Algorithmic improvements. PCA: aggregate across sub-samples ⇒ ×10. Logistic regression: sub-sampling + noise reduction. Gradient-boosted trees: fit on sufficient summary. A succession of decision trees that enrich each other [Figure: iterations 1, 2 and 3]. Speedup: bin data and compute histograms. HistGradientBoostingRegressor (v0.21): catches up with XGBoost & LightGBM
  • 23. 2 Algorithmic improvements. PCA: aggregate across sub-samples ⇒ ×10. Logistic regression: sub-sampling + noise reduction. Gradient-boosted trees: fit on sufficient summary. Fit on several subsamples / chunks + aggregation or variance reduction. Fit on summary statistics
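To make "fit on summary statistics" from slide 23 concrete, here is a sketch using ordinary least squares, an example chosen for simplicity and not one of the models discussed in the talk. The data is visited chunk by chunk, only the small matrices XᵀX and Xᵀy are accumulated, and the final solve matches a fit on the full data.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic linear problem (illustrative only).
n, p = 10_000, 3
X = rng.randn(n, p)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.randn(n)

# Accumulate sufficient statistics over chunks of 1000 rows.
xtx = np.zeros((p, p))
xty = np.zeros(p)
for start in range(0, n, 1000):
    X_chunk = X[start:start + 1000]
    y_chunk = y[start:start + 1000]
    xtx += X_chunk.T @ X_chunk
    xty += X_chunk.T @ y_chunk

# Solve the normal equations from the p x p summary only.
w_chunks = np.linalg.solve(xtx, xty)

# Reference: least squares on the whole dataset at once.
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The memory needed for the fit depends on p, not on n, so chunks could just as well be streamed from disk.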
  • 25. 2 Scaling out: parallel computing. Simple parallel computing schemes limiting data transfer. Data parallel: mostly in inner loops [Figure: one data array split across workers]. Model parallel: for model selection, e.g. GridSearchCV [Figure: the same data array replicated for several models]. Real-life machine learning [Figure: data array]
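The model-parallel scheme on slide 25 is what `GridSearchCV` does when given `n_jobs`: each (hyperparameter, fold) pair is an independent fit on the same data. A minimal sketch, with a synthetic dataset and an arbitrary grid as assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Four candidate values, three folds each: twelve independent fits.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,
    n_jobs=2,  # run the independent fits in two worker processes
)
search.fit(X, y)

best_c = search.best_params_["C"]
n_fits = len(search.cv_results_["params"]) * search.n_splits_
```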
  • 27. 2 Scaling out: parallel computing. Implementations (used by scikit-learn). Inner loops (fast): OS threads, OpenMP (from GCC, ICC, clang), all in the same process. Large-scale parallelism: across Python VMs, across computers; costs: transfer, synchronization. Real life = a merry mess: oversubscription, inefficient transfer. A scheduling problem. But: we need a simple API to focus on algorithmics; scikit-learn is a library: it doesn’t own the “main”
  • 29. 2 Our abstraction: joblib’s parallel for. joblib.Parallel()(joblib.delayed(f)(i) for i in ...): lazy evaluation. Multiprocessing / loky backend: manages a pool of Python VMs; segfault resilient; lazy loop consumption to limit memory usage; auto-bunching dispatch to lower overhead; limits # threads in sub-process (threadpoolctl). Extendable backend API (e.g. dask): delegates scheduling (e.g. to a framework); still a dispatch / receive queue; overflows the memory of greedy schedulers
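The parallel-for pattern quoted on slide 29 looks like this in practice. The sketch uses `math.sqrt` as a stand-in for real work and two workers; both choices are assumptions for the example.

```python
from math import sqrt

from joblib import Parallel, delayed

# `delayed(sqrt)(x)` does not call sqrt: it records (function, args) as a task.
# The generator is consumed lazily by Parallel as workers become available.
tasks = (delayed(sqrt)(i ** 2) for i in range(10))

# Dispatch the tasks to a pool of two worker processes (loky backend).
results = Parallel(n_jobs=2, backend="loky")(tasks)
```

Results come back in submission order, so the call reads like a list comprehension even though the work ran in separate processes.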
  • 31. 2 Better serialization for better scaling. Serializing arbitrary Python objects: cloudpickle, e.g. to dispatch estimators across the network. Python 3.8 improvements: subclassable C persister ⇒ much faster; out-of-band serialization (PEP 574) ⇒ no memory copies when serializing numpy & arrow. Language-agnostic predictor representation: ONNX; sklearn-onnx can convert trained models to other runtimes; working on guaranteeing compliance; useful for deployment to production
  • 32. Scaling. Algorithmic improvement is the top priority. External infrastructure helps scaling out. Tension with our mission of a generic, reusable library ⇒ work on an impedance-matching layer
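The out-of-band serialization from PEP 574 mentioned on slide 31 can be sketched with the standard `pickle` module, assuming Python 3.8 or later and a numpy version that supports pickle protocol 5. The array size is arbitrary.

```python
import pickle

import numpy as np

a = np.arange(100_000, dtype=np.float64)

# Out-of-band: large buffers are handed to the callback instead of being
# copied into the pickle stream, which then holds only metadata.
buffers = []
payload = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)

# The receiver rebuilds the array from the metadata plus the buffers.
restored = pickle.loads(payload, buffers=buffers)

# For comparison: in-band pickling copies the data into the stream.
in_band = pickle.dumps(a, protocol=5)
```

A transport layer is then free to send the buffers however it likes, for instance through shared memory, without an intermediate copy.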
  • 33. 3 Bridging to data engineering
  • 34. 3 Data assembly for statistics “Dirty data” is a central problem Merging data sources Input errors
  • 35. 3 Machine learning versus data in the wild numbers (in arrays) arrays (of numbers) arrays strings databases schemas A gap between statistics & data engineering
  • 40. 3 Machine learning versus data in the wild. Machine learning: let X ∈ R^(n×p), or a numpy array. Real life: often a pandas dataframe, with columns (Gender | Date Hired | Employee Position Title): M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I. Issues: non-numerical, heterogeneous data; missing values; non-normalized entries
  • 42. 3 Ingesting heterogeneous data: the column transformer. Applies different transformers to columns. These can be complex pipelines. column_trans = compose.make_column_transformer((one_hot_enc, ['Gender', 'Employee Position Title']), (date_trans, 'Date First Hired')); X = column_trans.fit_transform(df). Dataframe in, array out, with heterogeneous preprocessing & feature engineering. Separating fitting from transforming: can be applied to new data; avoids data leakage. Model selection on dataframes: model = make_pipeline(column_trans, HistGradientBoostingClassifier()); scores = cross_val_score(model, df, y). Choose data-engineering operations to maximize prediction
  • 44. 3 Machine learning with missing data. Imputation: replace NA by plausible values. Constant imputation (sklearn.impute.SimpleImputer): replace by mean of feature. Conditional imputation (v0.21, sklearn.impute.IterativeImputer): feature as functions of others. For prediction: if y depends on missingness, perfect imputation breaks prediction ⇒ add a missing indicator: IterativeImputer(add_indicator=True). With constant imputation, a powerful learner can model missing values (On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019). NA in HistGradientBoosting v0.22
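The two imputers named on this slide can be compared on a toy array, as in this sketch. The data values and `random_state=0` are made up for the example; note that `IterativeImputer` still requires the explicit experimental import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data: one missing value in each of the two features.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# Constant imputation by the feature mean, plus indicator columns that
# keep track of which entries were missing.
mean_imputed = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: each feature is modeled from the others.
iter_imputed = IterativeImputer(add_indicator=True, random_state=0).fit_transform(X)

print(mean_imputed)
print(iter_imputed)
```

In both outputs the first two columns are the imputed features and the last two are the missingness indicators, which a downstream learner can use when the target depends on missingness.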
  • 45. 3 Encoding dirty categories (digression: not in scikit-learn). One-hot encoding, with columns (Police Officer I | Police Oficer II | Police Officer II): Police Officer II → 0 0 1; Police Oficer II → 0 1 0; Police Officer I → 1 0 0. X ∈ R^(n×p): p grows fast; new categories? link categories? Example column "Employee Position Title" (typos are in the data): master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, Bus Operator, Bus Opperator, Electrician, Library Assistant I, Social Work IV, Library Manager
  • 46. 3 Encoding dirty categories (digression: not in scikit-learn). Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ... Traditional view: data cleaning, feature engineering, giving columns (Position | Rank): Police Officer | Master; Social Worker | III; Police Officer | II; Social Worker | II; Police Officer | III; ...
  • 47. 3 Encoding dirty categories (digression: not in scikit-learn). https://project.inria.fr/dirtydata Similarity encoding, with columns (Police Officer I | Police Oficer II | Police Officer II): Police Officer II → 0.9 0.8 1; Police Oficer II → 0.8 1 0.9; Police Officer I → 1 0.9 0.8. Each entry is string_distance(Police Officer II, Police Oficer II). https://dirty-cat.github.io Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
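The idea of similarity encoding can be sketched in plain Python. This is not the dirty-cat implementation: the `ngrams`, `ngram_similarity`, and `similarity_encode` helpers are hypothetical, and the character 3-gram Jaccard similarity used here is one possible string similarity chosen for illustration, so the numbers differ from those on the slide.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(entries, categories):
    """One row per entry, one column per reference category."""
    return [[ngram_similarity(e, c) for c in categories] for e in entries]


categories = ["Police Officer I", "Police Oficer II", "Police Officer II"]
entries = ["Police Officer II", "Police Oficer II", "Bus Operator"]

for entry, row in zip(entries, similarity_encode(entries, categories)):
    print(entry, [round(value, 2) for value in row])
```

Unlike one-hot encoding, a typo such as "Oficer" yields a row close to that of the correctly spelled category, and an unseen category still gets a meaningful vector.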
  • 48. 3 Encoding dirty categories (digression: not in scikit-learn). https://project.inria.fr/dirtydata Modeling substrings. [Figure: categories versus feature names. Feature names (truncated in the source): "ssistant, library"; "uipment, operator"; "ation, specialist"; "worker, warehouse"; "program, manager"; "chanic, community"; ", rescuer, rescue"; "rrection, officer". Categories: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant.] Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
  • 49. @GaelVaroquaux Democratizing machine learning Machine learning for everyone – from beginner to expert Agile development, good numerics, collaboration & user focus Scalability via light coupling to infrastructure and ecosystem Ongoing research on machine learning with dirty data Sustainability via sponsors