Building machine learning systems remains something of an art, from gathering and transforming the right data to selecting and fine-tuning the most fitting modeling techniques. If we want to make machine learning more accessible and foster skilful use, we need novel ways to share and reuse findings and to streamline online collaboration. OpenML is an open science platform for machine learning that lets anyone easily share datasets, code, and experiments, and collaborate with people all over the world to build better models. It shows, for any known dataset, which models perform best, who built them, and how to reproduce and reuse them in different ways. It integrates readily into several machine learning environments, so you can share results with the touch of a button or a line of code. As such, it enables large-scale, real-time collaboration, allowing anyone to explore, build on, and contribute to the combined knowledge of the field. Ultimately, this provides a wealth of information for a novel, data-driven approach to machine learning, in which we learn from millions of previous experiments either to assist people while analyzing data (e.g., which modeling techniques will likely work well and why) or to automate the process altogether.
1. OpenML: Democratizing and Automating Machine Learning
Joaquin Vanschoren, TU Eindhoven, 2017
2. OpenML
Joaquin Vanschoren, TU Eindhoven, 2016
@open_ml
Be a part of this talk:
- Log in / create an account on www.openml.org
- You also need a GitHub account
- Click the or icon
- Click ‘Launch Demo’
- Launches notebook hosted by YSDA
3. Democratization
Empower anybody to do data science well
Provide easy access to reproducible data/code/models
Build easily on prior work
Allow frictionless collaboration
Crowdsource hard problems, share results easily
Organized, reproducible, open results
Simplify, speed up data science process
Automate drudge work (e.g. hyperparameter optimization)
Learn from past, build ever better algorithms, models
4. WHAT IF WE CAN ANALYSE DATA COLLABORATIVELY
9. WHAT IF WE CAN ANALYSE DATA COLLABORATIVELY, ON WEB SCALE
18. WHAT IF WE CAN ANALYSE DATA COLLABORATIVELY, ON WEB SCALE, IN REAL TIME
20. COLLABORATIVE MACHINE LEARNING
Easy to use: Integrated in many ML tools/environments
Easy to contribute: Automated sharing of data, code, results
Organized data: Meta-data, reproducible models, link to people
Reward structure: Track your impact, build reputation
Self-learning: Learn from many experiments to help people
OpenML
21. Data (ARFF) uploaded or referenced (URL); auto-versioned and analysed, with meta-data extracted and organised online
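OpenML datasets use the ARFF format mentioned above. As an illustration of what that format looks like, here is a minimal, hand-rolled sketch of reading attribute names and data rows from an ARFF string. This is not OpenML's own parser: it ignores quoting, sparse rows, and escapes, and exists only to show the shape of the format.

```python
def parse_arff(text):
    """Parse a minimal ARFF document into (attribute names, data rows).

    Illustrative only: handles simple attributes and dense,
    comma-separated data rows; no quoting, sparse format, or escapes.
    """
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):    # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])  # second token is the name
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append(line.split(','))
    return attributes, rows

example = """@relation iris
@attribute sepallength numeric
@attribute class {setosa,versicolor}
@data
5.1,setosa
4.9,versicolor
"""
attrs, rows = parse_arff(example)
# attrs -> ['sepallength', 'class']; rows -> two comma-split records
```

In practice you would let the platform's client libraries handle parsing; the point here is only that an ARFF file is a header of typed attributes followed by a data block.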
37. Collaboration tools (in progress)
Studies (e-papers): online counterpart of a paper, with chat/comments; linked to GitHub repos and Jupyter notebooks
Circles: create collaborations with trusted researchers
Code submissions: share versioned code and Docker images, with archiving and GitHub integration
Collaborative challenges: rapid analytics and model prototyping; code submission, streamlined collaboration
38. Classroom challenges
Results appear in real-time, full details available immediately
Students can learn from each other, but also have to think to do better
Simple resubmissions do not count
Hidden test set is also possible
40. Hyperparameter optimization challenge
Every iteration can be uploaded, prior ones downloaded
Realtime challenges (coming up): predict energy usage in smart energy
API streams data, predictions uploaded hourly
43. Predicting workflows with meta-learning
Problem space P: 2764 QSAR datasets/targets; 16584 (6*2764) QSAR datasets after feature generation and preprocessing
Feature generation: FCFP4 fingerprints (1024 features) or molecular properties (1447 features)
Preprocessing: MVI, MVI+FS, or none (FS = feature selection, MVI = missing value imputation)
Algorithm space A: 18 regression algorithms (linear, forests, kNN, SVM, …), giving 108 (6*18) workflows (feature generation + preprocessing + learner)
Meta-feature space F (via meta-feature extraction): basic properties (43 features), 46 QSAR dataset properties, 452 target (protein) properties, 1024 aggregated fingerprints, 936 target family distances
Meta-learner (RandomForest): maps meta-features -> RMSE(workflow) in performance space Y; train, select workflow, predict, evaluate
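The meta-learner above is a RandomForest mapping meta-features to workflow RMSE. To make the idea concrete, here is a much simpler stand-in: a 1-nearest-neighbour meta-learner that recommends, for a new dataset, the workflow that achieved the lowest RMSE on the most similar prior dataset. Dataset names, workflow names, and feature values are all hypothetical.

```python
import math

def recommend_workflow(meta_features, prior):
    """Recommend a workflow for a new dataset via 1-NN meta-learning.

    `prior` maps dataset name -> (meta_feature_vector, {workflow: RMSE}).
    Returns the workflow with the lowest RMSE on the nearest prior
    dataset. A toy stand-in for the RandomForest meta-learner.
    """
    nearest = min(prior,
                  key=lambda name: math.dist(meta_features, prior[name][0]))
    scores = prior[nearest][1]
    return min(scores, key=scores.get)   # lowest RMSE wins

# Hypothetical prior experiments: two datasets, two candidate workflows.
prior = {
    "qsar_a": ([0.1, 2.0], {"MVI+FS+RF": 0.8, "MVI+SVM": 1.1}),
    "qsar_b": ([5.0, 9.0], {"MVI+FS+RF": 1.4, "MVI+SVM": 0.7}),
}
best = recommend_workflow([4.8, 8.5], prior)   # nearest dataset is qsar_b
```

The real system replaces the 1-NN lookup with a trained regression meta-model, but the flow is the same: meta-features in, predicted workflow performance out.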
44. OpenML in drug discovery
I. Olier et al. (2017), Machine Learning (forthcoming)
Random forest (black) vs meta-learner (grey); meta-learners: mRF (random forest), k-NN
45. OpenML in drug discovery
We just scratched the surface. All data is available on OpenML.
Multi-target learning: if few drugs are tested on a given target, include data on 'related' targets:
STL: standard (Random Forests)
Setting 1: distance is taxonomy (ChEMBL)
Setting 2: pairwise sequence alignment
46. Data-driven modelling
Learn how to build models based on many prior experiments
AI-human interaction
Bots/services to simplify work
Find data, construct pipelines, optimize hyperparameters,…
Automating machine learning
52. Automating machine learning: a human-robot symbiosis
(new) data -> Extract meta-data:
- (domain-specific) meta-features
- detect outliers, missing values, …
- recommend primitives
Construct pipelines:
- from (learned) templates
- generative methods
- genetic programming
- by humans
Optimize pipelines:
- different strategies: random, model-based, bandits, …
- learn from priors (meta-learning)
- upload all results
Meta-learning:
- find similar datasets
- predict runtime, accuracy
- predict hyperparameters: good ranges, defaults
- trained meta-models
- recommend pipeline components
53. The OpenML loop: (new) data -> extract meta-data -> construct pipelines -> optimize pipelines; datasets, pipelines, and meta-data flow into OpenML, where meta-learning builds meta-models that feed back into every step (same steps as slide 52)
54. OPENML API
REST (XML/JSON), Python, R, Java, …
• Get data, flows, models, evaluations
• Download data, meta-data and run files
• Upload new data, flows, runs
• Upload/download optimization traces
• Upload/download meta-models
• Build AutoML bots that run against OpenML
• Compare AutoML approaches across datasets
Springboard for AutoML services
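The REST flavour of the API is plain URL-addressable resources. As a rough sketch (the base URL and path layout are my assumption of the public JSON endpoints, not taken from this deck), a tiny helper can build the endpoint URLs that the language bindings wrap:

```python
BASE = "https://www.openml.org/api/v1/json"   # assumed public JSON API base

def endpoint(resource, obj_id=None):
    """Build an OpenML REST endpoint URL.

    `resource` is e.g. 'data', 'flow', 'task', or 'run'; `obj_id` is the
    numeric id of one object, or None for a listing of that resource.
    """
    if obj_id is None:
        return f"{BASE}/{resource}/list"
    return f"{BASE}/{resource}/{obj_id}"

# e.g. endpoint('data', 61) addresses one dataset,
#      endpoint('flow') addresses the listing of all flows.
```

In practice you would use the Python/R/Java bindings rather than hand-built URLs; the sketch only shows that every dataset, flow, and run is an addressable, downloadable object.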
56-59. Robot assistants: data handling
Data similarity bot: finds datasets similar to yours (meta-features)
Label imbalance bot: detects/reduces class imbalance (e.g. SMOTE)
Encoding bot: converts to numeric data depending on the ML algorithm (SVM, kNN, NN)
Data leak bot: detects if test data leaks into the training set
62-64. Robot assistants: preprocessing
Feature selection bot: recommends/runs feature selection techniques
Runtime prediction bot: predicts how long an ML algorithm will run on your data
Imputation bot: recommends/runs missing value imputation techniques
Outlier detection bot: recommends/runs outlier detection techniques
67-69. Robot assistants: model selection
Random Bot: runs random search given a hyperparameter space
Greedy Bot: learns key algorithms, hyperparameters, and ranges; tries those first
Optimization bots: run advanced hyperparameter optimization
Workflow bot: builds ML workflows in collaboration with other bots
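The Random Bot is the simplest of these to picture. A minimal sketch, with a toy objective standing in for a real cross-validated evaluation (the space and scoring function here are made up for illustration):

```python
import random

def random_bot(space, objective, budget=100, seed=0):
    """Random-search bot: sample `budget` configurations from `space`
    (a dict of hyperparameter -> list of candidate values) and return
    the best (config, score) under `objective` (higher is better).
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space and toy objective:
# prefers C close to 1 and the 'rbf' kernel (peak score 1).
space = {"C": [0.01, 0.1, 1, 10], "kernel": ["linear", "rbf"]}
def objective(cfg):
    return (cfg["kernel"] == "rbf") - abs(cfg["C"] - 1)

cfg, score = random_bot(space, objective)
```

A real bot would replace `objective` with a cross-validated run on an OpenML task and upload every evaluated configuration, so that later bots (and people) can learn from the full search trace.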
71. Optimization + metalearning
Include meta-data from prior datasets:
• Warm start: initialize the search with promising configurations (AutoML challenge winner)
• Surrogate models with priors (focus on the best parameters and ranges)
• Acquisition functions based on meta-models (predict performance, trained on prior datasets)
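The warm-start idea is simple to sketch: evaluate the configurations that did well on similar prior datasets first, then spend the remaining budget on ordinary search. Everything below (the space, the prior configuration, the toy objective) is hypothetical.

```python
import random

def warm_start_search(space, objective, prior_best, budget=30, seed=0):
    """Evaluate promising configurations from prior datasets first,
    then fill the remaining budget with random draws.
    Returns the best (config, score); higher score is better.
    """
    rng = random.Random(seed)
    candidates = list(prior_best)                      # warm-start configs
    while len(candidates) < budget:                    # fill with random draws
        candidates.append({k: rng.choice(v) for k, v in space.items()})
    return max(((cfg, objective(cfg)) for cfg in candidates),
               key=lambda pair: pair[1])

space = {"lr": [0.001, 0.01, 0.1], "depth": [2, 4, 8]}
prior_best = [{"lr": 0.01, "depth": 4}]                # from similar datasets
def objective(cfg):                                    # toy: peak at the prior
    return -abs(cfg["lr"] - 0.01) - abs(cfg["depth"] - 4)

cfg, score = warm_start_search(space, objective, prior_best)
```

When the prior configurations really do transfer, the search starts at or near the optimum instead of from scratch, which is the effect the slide attributes to the AutoML challenge winner.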
73. Frugal learning
• Can we run ML on wearable devices instead of transmitting data?
• Which algorithms are very fast and highly accurate?
• Best algorithms implemented on smartwatch for activity recognition
74. Meta-learning on streams
Stream data in OpenML: the 'best' algorithm changes over time (concept drift)
• Use meta-learning to select the best models at each point in time
75. Meta-learning on streams
• Streaming ensembles: train multiple models, use a meta-learner to weight the votes of all learners for the next window
• BLast (Best-Last): choose the models that performed best in the previous window; equivalent to the state of the art, but much faster
J. van Rijn et al., ICDM 2015
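The BLast selection rule fits in a few lines. This pure-Python toy is my own sketch of the idea (not the paper's implementation): for each window, predict with whichever model scored best on the previous window.

```python
def blast_select(window_scores):
    """Given per-window accuracy for each model
    ({model: [acc_w0, acc_w1, ...]}), return the model BLast would use
    for each window: the one that scored best on the *previous* window
    (the first window defaults to the first model listed).
    """
    models = list(window_scores)
    n_windows = len(window_scores[models[0]])
    chosen = [models[0]]                     # no history yet for window 0
    for w in range(1, n_windows):
        chosen.append(max(models, key=lambda m: window_scores[m][w - 1]))
    return chosen

# Hypothetical per-window accuracies for two stream learners.
scores = {"hoeffding": [0.9, 0.6, 0.7],
          "naive_bayes": [0.7, 0.8, 0.9]}
chosen = blast_select(scores)
# window 1 uses the best of window 0, window 2 the best of window 1
```

The appeal is that no meta-model needs training: the previous window's scores are already being computed, so selection is essentially free, which is why it matches heavier ensembles at much lower cost.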
76. Fast algorithm selection (ranking)
J. van Rijn et al., IDA 2015
Learning curves in OpenML:
• For a new dataset, build partial learning curves up to T (e.g. 256 instances)
• Use the learning curves to compute dataset similarity: identify the k nearest prior datasets by the distance between partial curves
• Start with a strong learner a_best
• Choose and evaluate a competitor algorithm a_competitor: the one that beats a_best on the most similar prior datasets
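The similarity step above can be sketched directly: treat each dataset's partial learning curve as a vector of accuracies at matching sample sizes, and rank prior datasets by curve distance. Curve values and dataset names below are made up for illustration.

```python
def curve_distance(curve_a, curve_b):
    """Mean absolute difference between two partial learning curves
    (accuracy at matching sample sizes, e.g. 16, 64, 256 instances)."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)

def nearest_datasets(new_curve, prior_curves, k=2):
    """Return the k prior datasets whose partial curves are closest."""
    return sorted(prior_curves,
                  key=lambda d: curve_distance(new_curve, prior_curves[d]))[:k]

# Hypothetical partial curves for three prior datasets.
prior = {"d1": [0.50, 0.60, 0.70],
         "d2": [0.80, 0.85, 0.90],
         "d3": [0.55, 0.62, 0.71]}
neighbours = nearest_datasets([0.52, 0.60, 0.69], prior, k=2)
```

Once the nearest prior datasets are known, the competitor that beat the current best learner on most of them becomes the next algorithm to evaluate, which is what keeps the selection cheap.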
77. Fast algorithm selection
[Plot: accuracy loss vs. time (seconds, log scale, 1 to 65536) for Best on Sample, Average Rank, PCC, and PCC (A3R, r = 1)]
Pairwise Curve Comparison (PCC), with multi-objective measure A3R: trade off high expected performance against fast training time
Results for 53 classifiers on 39 datasets
J. van Rijn et al., IDA 2015
79. Help us :)
We are always looking for:
- Code contributions (open source)
- New tool/platform integrations
- E.g. Keras/TensorFlow
- New bots
- Your own ideas
- Interesting datasets
- Computing resources (!)
- Funding ideas