OpenML
Democratizing and Automating Machine Learning
Joaquin Vanschoren, TU Eindhoven, 2017
OpenML
Joaquin Vanschoren, TU Eindhoven, 2016
@open_ml
Be a part of this talk:
- Log in / create an account on www.openml.org
- You also need a GitHub account
- Click the or icon
- Click ‘Launch Demo’
- Launches notebook hosted by YSDA
Democratization
Empower anybody to do data science well
- Provide easy access to reproducible data/code/models
- Build easily on prior work
Allow frictionless collaboration
- Crowdsource hard problems, share results easily
- Organized, reproducible, open results
Simplify, speed up data science process
- Automate drudge work (e.g. hyperparameter optimization)
- Learn from past, build ever better algorithms, models
What if we can analyse data collaboratively, on web scale, in real time?
Collaborative machine learning
Easy to use: Integrated in many ML tools/environments
Easy to contribute: Automated sharing of data, code, results
Organized data: Meta-data, reproducible models, link to people
Reward structure: Track your impact, build reputation
Self-learning: Learn from many experiments to help people
OpenML
Data (ARFF) uploaded or referenced (URL)
Auto-versioned, analysed, meta-data extracted, organised online
Tasks contain data, goals, procedures.
Readable by tools, automates experimentation
Train-test splits
Classify target X
All results organized online: realtime overview
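As a rough illustration of why a task bundles data, goal, and procedure, the sketch below models a task as a small record and derives fixed train-test splits from it, so that any tool reading the same task evaluates on identical folds. The `Task` class, its field names, and `train_test_splits` are hypothetical names chosen for this example; they are not OpenML's actual schema or API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: which data, which goal, which procedure."""
    dataset_id: int      # which dataset to use (illustrative id)
    target: str          # goal: the column to classify
    n_folds: int = 10    # procedure: k-fold cross-validation
    seed: int = 0        # fixed seed so every tool gets identical splits


def train_test_splits(task, n_rows):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    indices = list(range(n_rows))
    random.Random(task.seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds disjoint folds
    folds = [indices[i::task.n_folds] for i in range(task.n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(test)))
    return splits


task = Task(dataset_id=1, target="X", n_folds=5)
splits = train_test_splits(task, n_rows=20)
```

Because the splits depend only on the task record, results uploaded by different tools remain directly comparable.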
Flows (workflows, scripts) can run anywhere (locally)
Tool integrations + APIs (REST, R, Python, Java,…)
Experiments auto-uploaded, evaluated online
Reproducible, linked to data, flows, authors and all other experiments
Demo
OpenML Community
Nov-Dec 2016
2500+ registered,
2500 30-day active users
Studies (e-papers)
Online counterpart of a paper, chat/comments
Linked to GitHub repo, Jupyter notebooks
Circles
Create collaborations with trusted researchers
Code submissions
Sharing versioned code, docker images, archiving
GitHub integration
Collaboration tools (in progress)
Collaborative challenges
Rapid Analytics and Model Prototyping
Code submission, streamlined collaboration
Classroom challenges
Results appear in real-time, full details available immediately
Students can learn from each other, but also have to think to do better
Simple resubmissions do not count
Hidden test set is also possible
Hyperparameter optimization challenge
Every iteration can be uploaded, prior ones downloaded
Realtime challenges (coming up)
Predict energy usage in smart energy
API streams data, predictions uploaded hourly
OpenML in drug discovery
ChEMBL database
SMILES
Molecular properties (e.g. MW, LogP)
Fingerprints (e.g. FP2, FP3, FP4, MACCS)

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  …
377.435  3.883   77.85   1   1   0   0   0   0   0   0   0
341.361  3.411   74.73   1   1   0   1   0   0   0   0   0
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813  4.705   50.70   1   0   0   1   0   0   0   0   0
…
16,000+ QSAR datasets
1.4M compounds, 10k proteins, 12.8M activities
Predict which drugs will inhibit certain proteins (and hence viruses, parasites,…)
2750 targets (proteins), x 6 feature representations
Many algorithms, many feature representations
OpenML in drug discovery
Predicting workflows with meta-learning
Problem space P: 2764 QSAR datasets / targets
Feature generation: 16584 (6*2764) QSAR datasets, with representations including
- basic prop. (43 features)
- molecular prop. (1447 features)
- FCFP4 fingerprint (1024 features)
Preprocessing: MVI, MVI+FS, none (MVI = missing value imputation, FS = feature selection)
Algorithm space A: 18 regression algorithms (linear, forests, kNN, SVM,…)
108 (6*18) workflows (feature gen + preproc + learner)
Meta-feature extraction, meta-feature space F:
- 46 QSAR dataset properties
- 452 target (protein) properties
- 1024 aggregated fingerprints
- 936 target family distances
Performance space Y: meta-features -> RMSE(workflow)
Meta-learner (RandomForest): train, predict, select workflow, evaluate workflow
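The mapping from meta-features to RMSE(workflow) and the "select workflow" step can be sketched as follows. To stay dependency-free, this sketch swaps in a k-nearest-neighbour meta-learner rather than the random forest named on the slide; the function names, workflow labels, and toy numbers are all made up for illustration.

```python
import math


def knn_predict_rmse(new_meta, prior_meta, prior_rmse, k=2):
    """Predict RMSE per workflow as the mean over the k nearest prior datasets."""
    # Rank prior datasets by Euclidean distance in meta-feature space F
    order = sorted(range(len(prior_meta)),
                   key=lambda i: math.dist(new_meta, prior_meta[i]))
    nearest = order[:k]
    workflows = prior_rmse[0].keys()
    return {w: sum(prior_rmse[i][w] for i in nearest) / k for w in workflows}


def select_workflow(predicted):
    """Pick the workflow with the lowest predicted RMSE."""
    return min(predicted, key=predicted.get)


# Toy meta-features and observed RMSE per workflow on three prior datasets
prior_meta = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
prior_rmse = [
    {"fcfp4+rf": 0.50, "basic+knn": 0.80},
    {"fcfp4+rf": 0.60, "basic+knn": 0.70},
    {"fcfp4+rf": 0.90, "basic+knn": 0.40},
]
predicted = knn_predict_rmse([0.15, 0.85], prior_meta, prior_rmse, k=2)
best = select_workflow(predicted)
```

The selected workflow would then be evaluated on the new dataset, and its measured RMSE added to the performance space for future predictions.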
OpenML in drug discovery
I. Olier et al. (2017) Machine Learning (forthcoming)
Random forest (black) vs meta-learner (grey)
Meta-learner:
mRF: Random forest
k-NN: kNN
OpenML in drug discovery
We just scratched the surface. All data is available on OpenML.
Multi-target learning: if few drugs are tested on
a given target, include data on ‘related’ targets:
STL: standard (Random Forests)
Setting 1: Distance is taxonomy (ChEMBL)
Setting 2: Pairwise sequence alignment
Data-driven modelling
Learn how to build models based on many prior experiments
AI-human interaction
Bots/services to simplify work
Find data, construct pipelines, optimize hyperparameters,…
Automating machine learning
Automating machine learning: a human-robot symbiosis
(new) data -> initial workflows -> improved workflows
Online collaboration
Meta-learning, optimization
Robot assistants
OpenML: datasets, workflows, meta-data, meta-models
Automating machine learning: a human-robot symbiosis
OpenML: datasets, pipelines, meta-data, meta-models
(new) data
Extract meta-data
- (domain-specific) meta-features
- detect outliers, missing values,…
- recommend primitives
Construct pipelines
- from (learned) templates
- generative methods
- genetic programming
- by humans
Optimize pipelines
- different strategies: random, model-based, bandits,…
- learn from priors (meta-learning)
- upload all results
Meta-learning
- find similar datasets
- predict runtime, accuracy
- predict hyperparameters: good ranges, defaults
- trained meta-models
- recommend pipeline components
OpenML API
REST (XML/JSON), Python, R, Java, …
• Get Data, Flows, Models, Evaluations
• Download data, meta-data and run files
• Upload new data, flows, runs
• Upload/download optimization traces
• Upload/download meta-models
• Build AutoML bots that run against OpenML
• Compare AutoML approaches across datasets
Springboard for AutoML services
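To make the REST option concrete, here is a minimal sketch of reading one entity description as JSON using only the Python standard library. The base path and the `/{kind}/{id}` layout are written from memory of the public OpenML REST API and should be treated as assumptions to check against the current API documentation; `entity_url` and `fetch_entity` are helper names invented for this example.

```python
import json
import urllib.request

# Assumed base path of the JSON flavour of the REST API
BASE_URL = "https://www.openml.org/api/v1/json"


def entity_url(kind, entity_id):
    """Build the URL for one entity, e.g. kind='data', 'task', 'flow', 'run'."""
    allowed = {"data", "task", "flow", "run"}
    if kind not in allowed:
        raise ValueError(f"unknown entity kind: {kind!r}")
    return f"{BASE_URL}/{kind}/{int(entity_id)}"


def fetch_entity(kind, entity_id, timeout=30):
    """Download and decode one entity description (requires network access)."""
    with urllib.request.urlopen(entity_url(kind, entity_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only the URL is built here; call fetch_entity("data", 61) to hit the server
url = entity_url("data", 61)
```

An AutoML bot would typically wrap calls like these in a loop: download a task, run a candidate pipeline locally, and upload the resulting run.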
Robot assistants: data handling
(new) data -> initial workflows
- Data similarity bot: finds datasets similar to yours (meta-features)
- Label imbalance bot: detects/reduces class imbalance (e.g. SMOTE)
- Encoding bot: converts to numeric data depending on ML algo (SVM, kNN, NN)
- Data leak bot: detects if test data leaks into the training set
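Two of the data-handling checks are simple enough to sketch directly. The code below is an assumed, minimal take on what a data leak bot and a label imbalance bot might compute: exact-duplicate rows shared between train and test, and the ratio between the largest and smallest class. It only detects imbalance and does not implement SMOTE; the function names are hypothetical.

```python
from collections import Counter


def leaked_rows(train_rows, test_rows):
    """Return test rows that also appear verbatim in the training set."""
    seen = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in seen]


def imbalance_ratio(labels):
    """Ratio of the most frequent to the least frequent class (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
test = [[3.0, 4.0], [7.0, 8.0]]
leaks = leaked_rows(train, test)           # rows present in both sets
ratio = imbalance_ratio(["a", "a", "a", "b"])
```

A real bot would likely also look for near-duplicates and leaked target information in features, which exact matching does not catch.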
Robot assistants: preprocessing
(new) data -> initial workflows
- Runtime prediction bot: predicts how long an ML algorithm will run on your data
- Feature selection bot: recommends/runs feature selection techniques
- Imputation bot: recommends/runs missing value imputation techniques
- Outlier detection bot: recommends/runs outlier detection techniques
Robot assistants: model selection
(new) data -> initial workflows
- Random Bot: runs random search given a hyperparameter space
- Greedy Bot: learns key algorithms, hyperparameters, ranges. Tries those first.
- Optimization bots: run advanced hyperparameter optimization
- Workflow bot: builds ML workflows, in collaboration with other bots
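The Random Bot idea, random search over a given hyperparameter space, can be sketched in a few lines. This is an illustrative stand-in, not the bot's actual code: `random_search`, the parameter names `C` and `gamma`, and the toy objective are assumptions, and a real bot would evaluate each configuration on an OpenML task and upload the run.

```python
import random


def random_search(space, evaluate, n_iter=20, seed=0):
    """Sample configurations uniformly from `space` and keep the best one.

    `space` maps a hyperparameter name to a (low, high) range;
    `evaluate` maps a configuration dict to a score (higher is better).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    history = []
    for _ in range(n_iter):
        config = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = evaluate(config)
        history.append((config, score))   # every iteration is kept, as for upload
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, history


def toy_score(config):
    """Toy objective standing in for a cross-validated score."""
    return -(config["C"] - 1.0) ** 2 - (config["gamma"] - 0.1) ** 2


space = {"C": (0.0, 10.0), "gamma": (0.0, 1.0)}
best_config, best_score, history = random_search(space, toy_score, n_iter=50)
```

Keeping the full history rather than only the winner is what lets later bots learn good ranges and defaults from prior runs.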
Random Bot running on OpenML
Optimization + metalearning
Include meta-data from prior datasets
• Warm start: initialize search with promising configurations (AutoML challenge winner)
• Surrogate models with prior (focus on best parameters, ranges)
• Acquisition functions based on meta-models (predict performance, trained on prior datasets)
Frugal learning
• Can we run ML on wearable devices instead of transmitting data?
• Which algorithms are very fast and highly accurate?
• Best algorithms implemented on smartwatch for activity recognition
Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time
Concept drift
• Use meta-learning to select the best models at each point in time
Meta-learning on streams
• Streaming ensembles
  • Train multiple models, use meta-learner to weight the votes of all learners for the next window
• BLast (Best-Last)
  • Choose models that performed best in previous window. Equivalent to state-of-the-art, but much faster.
J. van Rijn et al. ICDM 2015
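The Best-Last rule described above can be sketched as a small loop: score every model on each labelled instance, keep a sliding window of recent hits, and let the current leader make the next prediction. This is a simplified reading of the slide, not the published implementation; the models here are fixed functions, whereas a real streaming setup would also update each model as data arrives.

```python
def blast_predict(models, stream, window=3):
    """Best-Last sketch: predict with the model that was most accurate
    over the previous `window` instances.

    `models` maps a name to a function x -> predicted label.
    `stream` is an iterable of (x, y) pairs.
    """
    recent = {name: [] for name in models}   # sliding window of 0/1 hits per model
    active = next(iter(models))              # arbitrary choice before any feedback
    predictions, chosen = [], []
    for x, y in stream:
        chosen.append(active)
        predictions.append(models[active](x))
        # Once the true label is known, score every model on this instance
        for name, model in models.items():
            recent[name].append(int(model(x) == y))
            recent[name] = recent[name][-window:]
        # The model with the most hits in the window handles the next instance
        active = max(models, key=lambda name: sum(recent[name]))
    return predictions, chosen


# Toy stream with concept drift: the label switches from 0 to 1
models = {"always_0": lambda x: 0, "always_1": lambda x: 1}
stream = [(None, 0), (None, 0), (None, 1), (None, 1), (None, 1), (None, 1)]
predictions, chosen = blast_predict(models, stream, window=2)
```

In the toy stream the selection switches to the second model shortly after the drift, once it has more hits in the window.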
Fast algorithm selection (ranking)
J. van Rijn et al. IDA 2015
Learning curves in OpenML
• For new dataset: build partial learning curves up to T (e.g. 256 instances)
• Use learning curves to compute dataset similarity
• Identify k nearest prior datasets by distance between partial curves
• Start with strong learner a_best
• Choose and evaluate competitor algorithm a_competitor
• Choose a_competitor that beats a_best on most similar prior datasets
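A simplified sketch of the learning-curve idea follows. It assumes one partial curve per dataset and a plain squared-difference distance, which may differ from the exact distance used in the cited paper; all names and numbers are illustrative.

```python
def curve_distance(curve_a, curve_b):
    """Squared-difference distance between two partial learning curves.

    Each curve lists accuracies at increasing sample sizes (e.g. 64, 128, 256).
    """
    return sum((a - b) ** 2 for a, b in zip(curve_a, curve_b))


def nearest_datasets(new_curve, prior_curves, k=2):
    """Return the names of the k prior datasets with the most similar curves."""
    ranked = sorted(prior_curves,
                    key=lambda name: curve_distance(new_curve, prior_curves[name]))
    return ranked[:k]


def competitor_wins(nearest, full_scores, a_best, a_competitor):
    """Count on how many of the nearest datasets a_competitor beat a_best."""
    return sum(full_scores[d][a_competitor] > full_scores[d][a_best] for d in nearest)


# Partial curves of a reference learner on three prior datasets
prior_curves = {
    "d1": [0.60, 0.70, 0.75],
    "d2": [0.62, 0.71, 0.77],
    "d3": [0.30, 0.35, 0.40],
}
# Final accuracies of two algorithms on the full prior datasets
full_scores = {
    "d1": {"rf": 0.85, "svm": 0.90},
    "d2": {"rf": 0.88, "svm": 0.91},
    "d3": {"rf": 0.70, "svm": 0.50},
}
nearest = nearest_datasets([0.61, 0.70, 0.76], prior_curves, k=2)
wins = competitor_wins(nearest, full_scores, a_best="rf", a_competitor="svm")
```

If the competitor wins on most of the nearest datasets, it becomes the next candidate to evaluate on the new dataset and, if it does better there, replaces a_best.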
Fast algorithm selection
[Figure: accuracy loss vs. time in seconds (log scale, 1 to 65536) for Best on Sample, Average Rank, PCC, and PCC (A3R, r = 1)]
Pairwise Curve Comparison (PCC), with multi-objective measure A3R
- Trade-off high expected performance and fast training time
Results for 53 classifiers on 39 datasets
J. van Rijn et al. IDA 2015
Join Us!
www.openml.org
Join our hackathons
- June, Heidelberg
- Oct 9, Leiden
Help us :)
We are always looking for:
- Code contributions (open source)
- New tool/platform integrations
- E.g. Keras/TensorFlow
- New bots
- Your own ideas
- Interesting datasets
- Computing resources (!)
- Funding ideas
Thank You
@open_ml
OpenML

OpenML DALI

  • 1. OpenML D E M O C R AT I Z I N G A N D A U T O M AT I N G M A C H I N E L E A R N I N G J O A Q U I N VA N S C H O R E N , T U E I N D H O V E N , 2 0 1 7
  • 2. OpenML J O A Q U I N VA N S C H O R E N , T U E I N D H O V E N , 2 0 1 6 @open_ml Be a part of this talk: - Log in / create an account on www.openml.org - You also need a GitHub account - Click the or icon - Click ‘Launch Demo’ - Launches notebook hosted by YSDA
  • 3. Democratization Empower anybody to do data science well Provide easy access to reproducible data/code/models Build easily on prior work Allow frictionless collaboration Crowdsource hard problems, share results easily Organized, reproducible, open results Simplify, speed up data science process Automate drudge work (e.g. hyperparameter optimization) Learn from past, build ever better algorithms, models
  • 4. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY
  • 5. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY
  • 6. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY
  • 7. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY
  • 8. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY
  • 9. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY O N W E B S C A L E
  • 18. What if we can analyse data collaboratively, on web scale, in real time
  • 20. Collaborative machine learning with OpenML. Easy to use: integrated in many ML tools/environments. Easy to contribute: automated sharing of data, code, results. Organized data: meta-data, reproducible models, link to people. Reward structure: track your impact, build reputation. Self-learning: learn from many experiments to help people.
  • 21. Data (ARFF): uploaded or referenced (URL), auto-versioned, analysed, meta-data extracted, organised online
  • 24. Tasks contain data, goals, procedures. Readable by tools, automates experimentation. Example: classify target X, with train-test splits. All results organized online: realtime overview.
  • 26. Flows (workflows, scripts) can run anywhere (locally). Tool integrations + APIs (REST, R, Python, Java,…)
  • 27. Integrations + APIs (REST, R, Python, Java,…)
  • 32. Experiments auto-uploaded, evaluated online: reproducible, linked to data, flows, authors and all other experiments
  • 35. Demo
  • 36. OpenML Community (Nov-Dec 2016): 2500+ registered, 2500 30-day active users
  • 37. Collaboration tools (in progress). Studies (e-papers): online counterpart of a paper, chat/comments; linked to GitHub repo, Jupyter notebooks. Circles: create collaborations with trusted researchers. Code submissions: sharing versioned code, docker images, archiving; GitHub integration. Collaborative challenges: Rapid Analytics and Model Prototyping; code submission, streamlined collaboration.
  • 38. Classroom challenges. Results appear in real-time, full details available immediately. Students can learn from each other, but also have to think to do better. Simple resubmissions do not count. Hidden test set is also possible.
  • 40. Hyperparameter optimization challenge: every iteration can be uploaded, prior ones downloaded. Realtime challenges (coming up): predict energy usage in smart energy; API streams data, predictions uploaded hourly.
  • 41. OpenML in drug discovery. ChEMBL database: 1.4M compounds, 10k proteins, 12.8M activities. Representations: SMILES, molecular properties (e.g. MW, LogP), fingerprints (e.g. FP2, FP3, FP4, MACCS). Example rows:
        MW        LogP     TPSA     b1  b2  b3  b4  b5  b6  b7  b8  b9
        377.435    3.883    77.85   1   1   0   0   0   0   0   0   0
        341.361    3.411    74.73   1   1   0   1   0   0   0   0   0
        197.188   -2.089   103.78   1   1   0   1   0   0   0   1   0
        346.813    4.705    50.70   1   0   0   1   0   0   0   0   0
        …
    16,000+ QSAR datasets: 2750 targets (proteins) x 6 feature representations. Predict which drugs will inhibit certain proteins (and hence viruses, parasites,…).
  • 43. Predicting workflows with meta-learning. Problem space P: 2764 QSAR datasets / targets; 16584 (6*2764) QSAR datasets after feature generation. Feature generation: FCFP4 fingerprint (1024 features), molecular properties (1447 features), basic properties (43 features). Preprocessing: MVI, MVI+FS, none (MVI = missing value imputation, FS = feature selection). Algorithm space A: 18 regression algorithms (linear, forests, kNN, SVM,…); 108 (6*18) workflows (feature generation + preprocessing + learner). Meta-feature extraction into meta-feature space F: 46 QSAR dataset properties, 452 target (protein) properties, 1024 aggregated fingerprints, 936 target family distances. Performance space Y: RMSE(workflow). Meta-learner (RandomForest): train on meta-features -> RMSE(workflow), predict, select workflow, evaluate workflow.
  • 44. OpenML in drug discovery. I. Olier et al. (2017) Machine Learning (forthcoming). Random forest (black) vs meta-learner (grey). Meta-learners: mRF (random forest), k-NN (kNN).
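The meta-learning setup on slides 43-44 (meta-features -> RMSE(workflow), then select a workflow) can be illustrated with a k-NN meta-learner, which slide 44 lists next to the random forest. This is only a minimal sketch: the function names, the two-dimensional meta-features and the RMSE values are hypothetical and are not taken from the study.

```python
import math

def predict_rmse(meta_x, train_meta, train_rmse, k=2):
    """Predict RMSE per workflow as the mean over the k nearest prior datasets."""
    nearest = sorted(range(len(train_meta)),
                     key=lambda i: math.dist(meta_x, train_meta[i]))[:k]
    workflows = train_rmse[0].keys()
    return {w: sum(train_rmse[i][w] for i in nearest) / len(nearest)
            for w in workflows}

def select_workflow(meta_x, train_meta, train_rmse, k=2):
    """Select the workflow with the lowest predicted RMSE."""
    predicted = predict_rmse(meta_x, train_meta, train_rmse, k)
    return min(predicted, key=predicted.get)

# Hypothetical meta-features of four prior datasets and observed RMSE per workflow.
train_meta = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
train_rmse = [
    {"rf": 0.5, "svm": 0.9},
    {"rf": 0.6, "svm": 0.8},
    {"rf": 0.9, "svm": 0.4},
    {"rf": 1.0, "svm": 0.5},
]

choice_a = select_workflow([0.0, 0.5], train_meta, train_rmse)
choice_b = select_workflow([10.0, 10.4], train_meta, train_rmse)
```

In the slides the meta-learner is trained on hundreds of dataset and target properties; the sketch only shows the shape of the predict-then-select step.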
  • 45. OpenML in drug discovery. We just scratched the surface. All data is available on OpenML. Multi-target learning: if few drugs are tested on a given target, include data on ‘related’ targets. STL: standard (Random Forests). Setting 1: distance is taxonomy (ChEMBL). Setting 2: pairwise sequence alignment.
  • 46. Automating machine learning. Data-driven modelling: learn how to build models based on many prior experiments. AI-human interaction: bots/services to simplify work. Find data, construct pipelines, optimize hyperparameters,…
  • 47. Automating machine learning: a human-robot symbiosis. [Diagram: (new) data, initial workflows, improved workflows; online collaboration; meta-learning, optimization; robot assistants]
  • 51. Automating machine learning: a human-robot symbiosis. [Diagram: (new) data, initial workflows, improved workflows; online collaboration; meta-learning, optimization; robot assistants; OpenML: datasets, workflows, meta-data, meta-models]
  • 53. Automating machine learning: a human-robot symbiosis. OpenML: datasets, pipelines, meta-data, meta-models. Starting from (new) data. Extract meta-data: (domain-specific) meta-features; detect outliers, missing values,…; recommend primitives. Construct pipelines: from (learned) templates; generative methods; genetic programming; by humans. Optimize pipelines: different strategies (random, model-based, bandits,…); learn from priors (meta-learning); upload all results. Meta-learning: find similar datasets; predict runtime, accuracy; predict hyperparameters (good ranges, defaults); trained meta-models; recommend pipeline components.
  • 54. OpenML API: REST (XML/JSON), Python, R, Java, … Springboard for AutoML services. • Get Data, Flows, Models, Evaluations • Download data, meta-data and run files • Upload new data, flows, runs • Upload/download optimization traces • Upload/download meta-models • Build AutoML bots that run against OpenML • Compare AutoML approaches across datasets
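As a rough illustration of the "Get Data, Flows, …" calls on the REST (JSON) side, the sketch below builds request URLs and fetches one entity. The base URL and the `/<entity>/<id>` path layout are assumptions based on the public OpenML v1 JSON API and should be checked against the current API documentation; `entity_url` and `fetch_entity` are hypothetical helper names, not part of any OpenML client.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the OpenML v1 JSON API; verify before relying on it.
BASE_URL = "https://www.openml.org/api/v1/json"
ENTITY_KINDS = {"data", "task", "flow", "run"}

def entity_url(kind, entity_id):
    """Build the URL for one dataset, task, flow or run description."""
    if kind not in ENTITY_KINDS:
        raise ValueError(f"unknown entity kind: {kind!r}")
    return f"{BASE_URL}/{kind}/{int(entity_id)}"

def fetch_entity(kind, entity_id, timeout=30):
    """Download and parse the JSON description (requires network access)."""
    with urlopen(entity_url(kind, entity_id), timeout=timeout) as response:
        return json.load(response)

# Only the URL is built here; call fetch_entity("data", 61) to actually download.
example_url = entity_url("data", 61)
```

Uploads (new data, flows, runs) need an API key and are easier through the Python, R or Java clients than through raw HTTP calls.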
  • 59. Robot assistants: data handling ((new) data -> initial workflows). Data similarity bot: finds datasets similar to yours (meta-features). Label imbalance bot: detects/reduces class imbalance (e.g. SMOTE). Encoding bot: converts to numeric data depending on ML algo (SVM, kNN, NN). Data leak bot: detects if test data leaks into the training set.
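Two of the data-handling bots above can be sketched in a few lines. The example assumes that similarity means Euclidean distance between standardized meta-feature vectors and that a leak means an exact duplicate row; both are simplifying assumptions, and the dataset names and meta-feature values are made up for illustration.

```python
import math

def zscore_columns(rows):
    """Standardize each meta-feature column so no single scale dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def most_similar(query, known, k=2):
    """Return the names of the k known datasets closest to the query."""
    names = list(known)
    scaled = zscore_columns([known[n] for n in names] + [query])
    q = scaled[-1]
    ranked = sorted((math.dist(q, v), n) for v, n in zip(scaled[:-1], names))
    return [n for _, n in ranked[:k]]

def leaked_rows(train_rows, test_rows):
    """Indices of test rows that also appear verbatim in the training set."""
    seen = {tuple(r) for r in train_rows}
    return [i for i, r in enumerate(test_rows) if tuple(r) in seen]

# Hypothetical meta-features: instances, features, classes, fraction missing.
known = {
    "A": [150, 4, 3, 0.0],
    "B": [160, 5, 3, 0.0],
    "C": [50000, 300, 2, 0.2],
    "D": [60000, 280, 2, 0.25],
}
neighbours = most_similar([155, 4, 3, 0.01], known, k=2)
leaks = leaked_rows([[1, 2], [3, 4]], [[3, 4], [5, 6]])
```

A real leak detector would also need to catch near-duplicates and target leakage through features, which exact row matching does not cover.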
  • 64. Robot assistants: preprocessing ((new) data -> initial workflows). Runtime prediction bot: predicts how long an ML algorithm will run on your data. Feature selection bot: recommends/runs feature selection techniques. Imputation bot: recommends/runs missing value imputation techniques. Outlier detection bot: recommends/runs outlier detection techniques.
  • 69. Robot assistants: model selection ((new) data -> initial workflows). Random Bot: runs random search given a hyperparameter space. Greedy Bot: learns key algorithms, hyperparameters, ranges; tries those first. Optimization bots: run advanced hyperparameter optimization. Workflow bot: builds ML workflows, in collaboration with other bots.
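The Random Bot's core loop, random search over a given hyperparameter space, might look like the sketch below. The search space, the parameter names and `toy_objective` (a stand-in for a cross-validated score) are assumptions made for illustration; an actual bot would train a model on an OpenML task and upload each run.

```python
import random

def random_search(objective, space, n_iter=50, seed=0):
    """Sample n_iter configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = {name: sample(rng) for name, sample in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space: each entry draws one value from its range.
space = {
    "log10_C": lambda rng: rng.uniform(-3, 3),
    "kernel": lambda rng: rng.choice(["linear", "rbf"]),
}

def toy_objective(cfg):
    """Stand-in for a validation score; highest at log10_C = 1 with rbf."""
    bonus = 0.05 if cfg["kernel"] == "rbf" else 0.0
    return 0.9 - 0.02 * (cfg["log10_C"] - 1.0) ** 2 + bonus

best_cfg, best_score = random_search(toy_objective, space, n_iter=200, seed=0)
```

Fixing the seed makes a run repeatable, which matters when every evaluated configuration is shared as a run.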
  • 70. Random Bot running on OpenML
  • 71. Optimization + metalearning. Include meta-data from prior datasets: • Warm start: initialize search with promising configurations (AutoML challenge winner) • Surrogate models with prior (focus on best parameters, ranges) • Acquisition functions based on meta-models (predict performance, trained on prior datasets)
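Warm starting can be sketched as evaluating the configurations that worked well on prior datasets before anything else. In this minimal sketch plain random sampling stands in for the model-based optimizer that would normally take over afterwards; `warm_started_search` and the toy objective are hypothetical.

```python
import random

def warm_started_search(objective, sample_config, warm_configs, n_iter=20, seed=0):
    """Evaluate warm-start configurations first, then fall back to sampling."""
    rng = random.Random(seed)
    history = []
    for i in range(n_iter):
        if i < len(warm_configs):
            cfg = warm_configs[i]      # promising configuration from prior datasets
        else:
            cfg = sample_config(rng)   # stand-in for the optimizer's proposal
        history.append((objective(cfg), cfg))
    best = max(history, key=lambda item: item[0])
    return best, history

def toy_objective(cfg):
    """Hypothetical score with its optimum at x = 2."""
    return -(cfg["x"] - 2.0) ** 2

def sample_config(rng):
    return {"x": rng.uniform(-10, 10)}

# Assume meta-learning suggested x = 2.0 as a good starting point.
best, history = warm_started_search(toy_objective, sample_config, [{"x": 2.0}])
```

The benefit shows up early in the search: a good warm-start configuration gives a strong incumbent after the first evaluation.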
  • 73. Frugal learning • Can we run ML on wearable devices instead of transmitting data? • Which algorithms are very fast and highly accurate? • Best algorithms implemented on smartwatch for activity recognition
  • 74. Meta-learning on streams Stream data in OpenML: ‘best’ algorithm changes over time Concept drift • Use meta-learning to select the best models at each point in time
  • 75. Meta-learning on streams • Streaming ensembles • Train multiple models, use meta-learner to weight the votes of all learners for the next window • BLast (Best-Last) • Choose models that performed best in previous window. Equivalent to state-of-the-art, but much faster. J. van Rijn et al. ICDM 2015
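The BLast idea (use the model that did best in the previous window) can be sketched with two toy stream learners. This is a simplified illustration and not the implementation from the cited paper: the base learners, the window size and the synthetic stream with one drift point are all assumptions.

```python
from collections import Counter, deque

class MajorityClass:
    """Predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

class NoChange:
    """Predicts the previous label."""
    def __init__(self):
        self.last = 0
    def predict(self, x):
        return self.last
    def learn(self, x, y):
        self.last = y

def blast(stream, models, window=10):
    """Prequential loop: predict with the model that was best on the last window."""
    recent = [deque(maxlen=window) for _ in models]  # 1 = correct, 0 = wrong
    chosen, correct = [], 0
    for x, y in stream:
        preds = [m.predict(x) for m in models]
        active = max(range(len(models)), key=lambda i: sum(recent[i]))
        chosen.append(active)
        correct += int(preds[active] == y)
        for i, m in enumerate(models):
            recent[i].append(int(preds[i] == y))
            m.learn(x, y)
    return correct / len(chosen), chosen

# Synthetic labels: a noisy majority-1 phase, then long runs (concept drift).
labels = [1, 1, 1, 0] * 15 + [0] * 40 + [1] * 40
stream = [(None, y) for y in labels]
accuracy, chosen = blast(stream, [MajorityClass(), NoChange()], window=10)
```

Only a per-model window of recent hits is kept and no meta-model is trained, which is why this strategy is cheap.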
  • 76. Fast algorithm selection (ranking). Learning curves in OpenML: use learning curves to compute dataset similarity. For new dataset: build partial learning curves up to T (e.g. 256 instances). • Identify k nearest prior datasets by distance between partial curves • Start with strong learner a_best • Choose and evaluate competitor algorithm a_competitor • Choose a_competitor that beats a_best on most similar prior datasets. J. van Rijn et al. IDA 2015
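The selection loop on slide 76 can be sketched as follows. This is a simplified reading of the bullets and not the published method: it assumes squared differences between partial curves as the distance, a majority vote among the k nearest prior datasets, and partial curves that have already been built for every candidate. All curves and scores are made up.

```python
def curve_distance(new_curves, prior_curves, algos):
    """Sum of squared differences between partial learning curves."""
    return sum((new_curves[a][t] - prior_curves[a][t]) ** 2
               for a in algos for t in range(len(new_curves[a])))

def select_algorithm(new_curves, priors, algos, k=3):
    """Start with algos[0]; switch when a competitor wins on most neighbours."""
    best = algos[0]
    for competitor in algos[1:]:
        pair = (best, competitor)
        nearest = sorted(
            priors,
            key=lambda d: curve_distance(new_curves, d["curves"], pair))[:k]
        wins = sum(d["final"][competitor] > d["final"][best] for d in nearest)
        if wins > len(nearest) / 2:
            best = competitor
    return best

# Hypothetical prior datasets: partial curves (accuracy at two sample sizes)
# and final accuracy on the full data.
priors = [
    {"curves": {"tree": [0.60, 0.70], "knn": [0.70, 0.80]},
     "final": {"tree": 0.80, "knn": 0.90}},
    {"curves": {"tree": [0.61, 0.69], "knn": [0.72, 0.79]},
     "final": {"tree": 0.78, "knn": 0.88}},
    {"curves": {"tree": [0.90, 0.95], "knn": [0.50, 0.55]},
     "final": {"tree": 0.97, "knn": 0.60}},
]
new_curves = {"tree": [0.60, 0.70], "knn": [0.71, 0.80]}
picked = select_algorithm(new_curves, priors, ["tree", "knn"], k=2)
```

Because only partial curves on small samples are needed for the new dataset, the comparison is far cheaper than training every candidate on the full data.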
  • 77. Fast algorithm selection. Pairwise Curve Comparison (PCC), with multi-objective measure A3R: trade off high expected performance and fast training time. Results for 53 classifiers on 39 datasets. [Plot: accuracy loss (0 to 0.1) vs. time in seconds (1 to 65536, log scale); methods: Best on Sample, Average Rank, PCC, PCC (A3R, r = 1)] J. van Rijn et al. IDA 2015
  • 78. Join Us! www.openml.org Join our hackathons - June, Heidelberg - Oct 9, Leiden
  • 79. Help us :) We are always looking for: - Code contributions (open source) - New tool/platform integrations - E.g. Keras/TensorFlow - New bots - Your own ideas - Interesting datasets - Computing resources (!) - Funding ideas