LEVERAGING OPEN SOURCE AUTOMATED DATA SCIENCE TOOLS
EDUARDO ARIÑO DE LA RUBIA
CHIEF DATA SCIENTIST, DOMINO DATA LAB
EDUARDO@DOMINODATALAB.COM
TWITTER: @EARINO
CONTENTS
WELCOME TO MY DATA POPUP TALK
1. Introduction
2. Some background
3. Tools Available
A nice self-serving
way to eat up at least
a few minutes of this
talk.
INTRODUCTION
PICTURE SLIDE
DATA SCIENTIST
A BIT ABOUT ME
A QUICK TIMELINE
Manufacturing &
Logistics
Let’s discuss what ML is,
what data science is, and
make sure we’re all using the
same words to mean the
same things.
SOME BACKGROUND
WHAT IS MACHINE LEARNING?
"Field of study that gives computers the ability to learn without being explicitly programmed"
FIND A CATEGORY
Detect defects, classify workloads, categorize vendors
KNN, NEURAL NET, ETC.
FIND A NUMBER
Predict yields, decide optimal run rates, predict tolerances
GLM, RIDGE, ETC…
FIND STRUCTURE
Competitive intelligence, understand vendor processes, market segments
KMEANS, KOHONEN SOM
Biology is not the study of microscopes. Though they
sure make biology a whole lot easier, they are a tool.
ML plays a part in the data science process, but data
science is not just applied ML. ML makes it a whole lot
easier, but it is a tool.
ML IS NOT
DATA SCIENCE
SO WHAT CAN WE AUTOMATE?
(C) SZILARD PAFKA
So now that we’ve spent some
time together, what are some
good open source tools we can
use?
TOOLS AVAILABLE
ANGRY OLD MAN
RANT
Data Science tools are incredibly automated!
We’re in a golden age of data science automation.
It really wasn’t very long ago that in order to train a
model you had to go out to some professor’s FTP
server and figure out how to get some library to
even compile.
Here are some things we just take for granted that
are now automated…
1. CROSS VALIDATION
The original sample is randomly partitioned into k equal-sized subsamples.
2. PRE PROCESSING
Scaling? Centering? Box-Cox? These were things that you had to do by hand, and doing them wrong was bad.
3. GRID SEARCH
Hyperparameter sweeps are something that you simply had to code by hand.
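To make the cross-validation and grid-search items above concrete, here is a minimal pure-Python sketch of what libraries now do for us: partition the sample into k folds, then score every hyperparameter combination on the held-out folds. The helper names (`k_fold_indices`, `cross_val_mse`, `grid_search`), the toy 1-D nearest-neighbour regressor, and the data are all assumptions made up for illustration, not any library's API.

```python
import random
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k (nearly) equal-sized folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Toy 1-D k-nearest-neighbours regressor (illustrative model only)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return mean(train_y[i] for i in nearest[:n_neighbors])


def cross_val_mse(xs, ys, k, n_neighbors):
    """Mean squared error averaged over the k held-out folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        errors = [(knn_predict(train_x, train_y, xs[i], n_neighbors) - ys[i]) ** 2
                  for i in fold]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


def grid_search(xs, ys, grid, k=5):
    """Try every combination in the grid; return (best_params, best_score)."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cross_val_mse(xs, ys, k, **params), params))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_params, best_score


# Toy data: y = 2x plus a little noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + rng.gauss(0, 0.1) for x in xs]

best_params, best_score = grid_search(xs, ys, {"n_neighbors": [1, 3, 5, 15]})
print(best_params, round(best_score, 4))
```

Every candidate is scored on the same folds (the seed is fixed), so the comparison between hyperparameter values is fair. This is the loop you used to write by hand for every project.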
4. VISUALIZATION
Have you ever used a plotting library which allowed you to facet? That used to be a thing you just had to make by hand.
5. FEATURE SELECTION
Both R and Python now provide multiple feature selection strategies, from RFE to threshold approaches.
6. ENSEMBLING
This one blows my mind. With tools like h2o’s ensembling, you can literally just build ensembles of learners with 1 line of code.
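As one small illustration of the "threshold approaches" to feature selection mentioned above, the sketch below drops columns whose variance falls under a cutoff. It is a hand-rolled toy under assumed names (`variance_threshold`, `select_columns`) and made-up data; scikit-learn ships a `VarianceThreshold` transformer that does this job properly.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def select_columns(rows, keep):
    """Keep only the chosen column indices in every row."""
    return [[row[j] for j in keep] for row in rows]


rows = [
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 10.1],
    [3.0, 0.0, 9.9],
    [4.0, 0.0, 10.0],
]
# Column 1 is constant and column 2 barely varies, so only column 0 survives.
keep = variance_threshold(rows, threshold=0.05)
print(keep)
print(select_columns(rows, keep))
```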
All the interesting problems
are unbalanced class
problems.
balance_classes=TRUE???
CLASS BALANCES
8
This space intentionally left
empty for future
developments
ETC…
3
Oh for goodness’ sake, Google’s
Automatic Machine Learning
freaking designs entire new deep
learning architectures???
DEEP ARCHITECTURES
9
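For the class-balance item above, here is a minimal sketch of one common approach, random oversampling of the minority class. It is an assumption-laden toy (the `oversample` helper and the data are invented here) and is not a claim about what a `balance_classes` flag does internally in any particular library.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra copies (with replacement) only for classes below the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rows = [[i] for i in range(10)]
labels = ["ok"] * 8 + ["defect"] * 2
balanced_rows, balanced_labels = oversample(rows, labels)
print(Counter(balanced_labels))
```

If you do this by hand, resample only the training split; duplicating rows before splitting leaks copies of the same row into the validation data.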
BUT DON’T FORGET HOW LUCKY WE ARE
Between the massive hardware that is available to us, and the
incredible libraries that have been created by the community,
we’re infinitely more productive than we were just a few years
ago.
But we want even more automation… so let’s talk about some
cool tools :)
WE’RE SPOILED
AUTOMATED
DATA SCIENCE
IS HUNGRY FOR RESOURCES
FEATURE
ENGINEERING
Feature engineering is often considered the dark art of data science. Like
when your differential equations professor told you that you should “stare at
it” until it made sense.
scikit-feature is an open-source feature selection repository in Python developed by the Data Mining and Machine Learning Lab at Arizona State University. It is built upon the widely used machine learning package scikit-learn and two scientific computing packages, NumPy and SciPy. scikit-feature contains around 40 popular feature selection algorithms, including traditional feature selection algorithms and some structural and streaming feature selection algorithms.
SCIKIT FEATURE
SO COOL RIGHT
SADLY IT SEEMS
TO BE MOSTLY
ABANDONED
HELPS MAKE THE SAUSAGE
A 'data.frame' processor/conditioner that prepares real-world data for
predictive modeling in a statistically sound manner. 'vtreat' prepares
variables so that data has fewer exceptional cases, making it easier
to safely use models in production. Common problems 'vtreat'
defends against: 'Inf', 'NA', too many categorical levels, rare
categorical levels, and new categorical levels (levels seen during
application, but not during training).
VTREAT
THERE’S A TON MORE
SO MANY PROBLEMS…
1. Bad numerical values (NA, NaN, sentinels)
2. Categorical values (missing levels, novel levels in production)
3. Categorical values with too many levels
4. Weird skew
Vtreat provides “y-aware” processing
1. Treatment of missing values through safe replacement plus indicator column (a simple but very powerful method when combined with downstream machine learning algorithms).
2. Treatment of novel levels (new values of categorical variable seen during test or application, but not seen during training) through sub-models (or impact/effects coding of pooled rare events).
3. Explicit coding of categorical variable levels as new indicator variables (with optional suppression of non-significant indicators).
4. User specified significance pruning on levels coded into effects/impact sub-models.
5. Treatment of categorical variables with very large numbers of levels through sub-models.
6. Collaring/Winsorizing of unexpected out of range numeric inputs (clipping).
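A few of the treatments listed above are easy to picture in code. The sketch below is a pure-Python toy, not vtreat's API: the function names, the plan dictionary, and the data are all hypothetical, and unlike vtreat it is not "y-aware". It shows safe replacement plus an indicator flag for bad numeric values, clipping of out-of-range inputs, and indicator coding that tolerates a level never seen in training.

```python
import math
from statistics import mean


def fit_numeric_treatment(values):
    """Learn a replacement value and a clipping range from training data."""
    clean = [v for v in values if v is not None and math.isfinite(v)]
    return {"fill": mean(clean), "low": min(clean), "high": max(clean)}


def treat_numeric(value, plan):
    """Return (treated_value, is_bad): fill bad values, clip out-of-range ones."""
    is_bad = value is None or not math.isfinite(value)
    if is_bad:
        return plan["fill"], 1
    return min(max(value, plan["low"]), plan["high"]), 0


def fit_levels(values):
    """Remember the categorical levels seen during training."""
    return sorted(set(values))


def treat_categorical(value, levels):
    """Indicator-code a level; a novel level becomes all zeros instead of an error."""
    return [1 if value == level else 0 for level in levels]


# Training data with the usual mess: None, NaN and Inf.
train_temps = [20.0, 22.5, None, float("nan"), 21.0, float("inf")]
plan = fit_numeric_treatment(train_temps)
levels = fit_levels(["acme", "globex", "acme"])

print(treat_numeric(None, plan))             # filled with the training mean, flagged as bad
print(treat_numeric(95.0, plan))             # clipped to the training maximum
print(treat_categorical("initech", levels))  # vendor never seen during training
```

The key design point is that the plan is learned once on training data and then applied unchanged in production, which is what keeps novel or broken inputs from crashing the model.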
WARNING
Your data had better be pretty clean!
These automated ML tools are amazing,
but your data needs to be in pretty good
shape. Nice, numerical, no weird missing
values…
So chain them together and use vtreat!
AND…
auto-sklearn is an automated machine learning toolkit and a drop-in
replacement for a scikit-learn estimator.
auto-sklearn frees a machine learning user from algorithm selection and
hyperparameter tuning. It leverages recent advances in Bayesian
optimization, meta-learning and ensemble construction. Learn more about the
technology behind auto-sklearn by reading the paper published at NIPS
2015.
AUTO-SKLEARN
AWARDS
Of additional note, Auto-sklearn won both
the auto and the tweakathon tracks of the
ChaLearn AutoML challenge.
RANDAL
OLSON
TPOT will automate the most tedious part of
machine learning by intelligently exploring
thousands of possible pipelines to find the
best one for your data.
Once TPOT is finished searching (or you get
tired of waiting), it provides you with the
Python code for the best pipeline it found so
you can tinker with the pipeline from there.
TPOT CREATOR
Though both projects are open source, written in Python, and aimed at simplifying a machine learning process by way of AutoML, in contrast to Auto-sklearn using Bayesian optimization, TPOT's approach is based on genetic programming.
One of the real benefits of TPOT is that it produces ready-to-run, standalone Python code for the best-performing model, in the form of a scikit-learn pipeline. This code, representing the best performing of all candidate models, can then be modified or inspected for additional insight, effectively being able to serve as a starting point as opposed to solely as an end product.
GENETIC
PROGRAMMING
- MATTHEW MAYO, KDNUGGETS.
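To give a feel for the evolutionary idea behind that quote, here is a deliberately tiny sketch: a genetic algorithm over a fixed three-stage pipeline configuration. It is not TPOT's code or API, and it is simpler than true tree-based genetic programming. The search space, the `score` function (a stand-in for a cross-validated score) and all other names are assumptions made up for illustration.

```python
import random

# Hypothetical search space: one choice per pipeline stage.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "variance", "rfe"],
    "model": ["knn", "glm", "forest"],
}
STAGES = list(SEARCH_SPACE)


def score(pipeline):
    """Stand-in for a cross-validated score; a real tool would fit the pipeline."""
    preferred = {"scaler": "standard", "selector": "rfe", "model": "forest"}
    return sum(pipeline[stage] == preferred[stage] for stage in STAGES)


def random_pipeline(rng):
    return {stage: rng.choice(SEARCH_SPACE[stage]) for stage in STAGES}


def crossover(a, b, rng):
    """Child takes each stage from one parent or the other."""
    return {stage: rng.choice([a[stage], b[stage]]) for stage in STAGES}


def mutate(pipeline, rng, rate=0.2):
    """Occasionally swap a stage for a random alternative."""
    return {stage: (rng.choice(SEARCH_SPACE[stage]) if rng.random() < rate else choice)
            for stage, choice in pipeline.items()}


def evolve(generations=10, population_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        # Keep the fitter half unchanged, then refill with mutated offspring.
        parents = population[: population_size // 2]
        children = [mutate(crossover(*rng.sample(parents, 2), rng), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    best = max(population, key=score)
    return best, history


best, history = evolve()
print(best, history)
```

Because the fitter half survives each generation untouched, the best score recorded in `history` never goes down; the offspring are where new combinations get tried.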
COMING SOON?
Supposedly is going to take advantage of
a lot of the existing infrastructure in h2o,
with ensembles in the back end, hyper
parameter search, etc…
VERY excited to see what happens next!
AUTOML
The current version of AutoML trains and cross-validates a Random Forest, an
Extremely-Randomized Forest, a random grid of Gradient Boosting Machines
(GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble of all
the models.
http://tiny.cc/automl
THANK YOU FOR COMING TO MY TALK
REACH OUT AT
EDUARDO@DOMINODATALAB.COM
@EARINO
WE ARE HIRING!
HTTPS://WWW.DOMINODATALAB.COM/CAREERS/

Leveraging Open Source Automated Data Science Tools

  • 1. LEVERAGING OPEN SOURCE E D U A R D O A R I Ñ O D E L A R U B I A C H I E F D A T A S C I E N T I S T , D O M I N O D A T A L A B E D U A R D O @ D O M I N O D A T A L A B . C O M T W I T T E R : @ E A R I N O A U T O M A T E D D A T A S C I E N C E T O O L S
  • 2. CONTENTS Introduction 1 WELCOME TO MY DATA POPUP TALK Some background 2 Tools Available 3
  • 3. A nice self-serving way to eat up at least a few minutes of this talk. INTRODUCTION
  • 6.
  • 7.
  • 8. SOME BACKGROUND: Let’s discuss what ML is, what data science is, and make sure we’re all using the same words to mean the same things.
  • 9. WHAT IS MACHINE LEARNING? "Field of study that gives computers the ability to learn without being explicitly programmed" FIND A CATEGORY: Detect defects, classify workloads, categorize vendors (KNN, neural net, etc.). FIND A NUMBER: Predict yields, decide optimal run rates, predict tolerances (GLM, ridge, etc.). FIND STRUCTURE: Competitive intelligence, understand vendor processes, market segments (k-means, Kohonen SOM).
  • 10. ML IS NOT DATA SCIENCE Biology is not the study of microscopes. Though they sure make biology a whole lot easier, they are a tool. ML plays a part in the data science process, but data science is not just applied ML. ML makes data science a whole lot easier, but it is a tool. SO WHAT CAN WE AUTOMATE?
  • 15.
  • 16. TOOLS AVAILABLE: So now that we’ve spent some time together, what are some good open source tools we can use?
  • 17. ANGRY OLD MAN RANT Data Science tools are incredibly automated! We’re in a golden age of data science automation. It’s really not very long ago that in order to train a model you had to go out into some professor’s FTP server and figure out how to get some library to even compile. Here are some things we just take for granted that are now automated…
  • 18. (1) CROSS VALIDATION: The original sample is randomly partitioned into k equal-sized subsamples. (2) PRE PROCESSING: Scaling? Centering? Box-Cox? These were things that you had to do by hand, and doing them wrong was bad. (3) GRID SEARCH: Hyperparameter sweeps are something that you simply had to code by hand.
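As a rough illustration of how little code these three steps take today, here is a minimal sketch using scikit-learn. The dataset, the logistic regression model, and the grid of `C` values are assumptions chosen only for the example; they do not come from the talk.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data only; any numeric classification dataset would do.
X, y = load_breast_cancer(return_X_y=True)

# Pre-processing: centering and scaling live inside the pipeline, so they are
# refit on the training part of every fold and never see the held-out part.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search: the hyperparameter sweep that used to be hand-written loops.
# The candidate values are illustrative.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

# Cross validation: cv=5 splits the sample into 5 folds (stratified by class
# for classifiers) and scores each candidate on every held-out fold.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline is the design choice that matters here: it is one common way to avoid the "doing them wrong was bad" failure of leaking held-out data into pre-processing.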
  • 19. (4) VISUALIZATION: Have you ever used a plotting library that allowed you to facet? That used to be a thing you just had to make by hand. (5) FEATURE SELECTION: Both R and Python now provide multiple feature selection strategies, from RFE to threshold approaches. (6) ENSEMBLING: This one blows my mind. With tools like h2o’s ensembling, you can literally just build ensembles of learners with 1 line of code.
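To make the feature selection point concrete, the sketch below shows one RFE (recursive feature elimination) call and one threshold-based selector in scikit-learn. The synthetic data, the appended constant column, and the choice of three features to keep are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of them informative (illustrative choice).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
# Append a constant column so the threshold selector has something to drop.
X = np.hstack([X, np.ones((X.shape[0], 1))])

# RFE: repeatedly fit the model and discard the weakest feature
# until only the requested number remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)

# Threshold approach: drop every feature with zero variance.
variance_filter = VarianceThreshold(threshold=0.0)
X_filtered = variance_filter.fit_transform(X)

print(X.shape, X_rfe.shape, X_filtered.shape)
```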
  • 20. (8) CLASS BALANCES: All the interesting problems are unbalanced class problems. balance_classes=TRUE??? (9) DEEP ARCHITECTURES: Oh for goodness sakes, Google’s Automatic Machine Learning freaking designs entire new deep learning architectures??? ETC…: This space intentionally left empty for future developments.
  • 21. WE’RE SPOILED, BUT DON’T FORGET HOW LUCKY WE ARE Between the massive hardware that is available to us, and the incredible libraries that have been created by the community, we’re infinitely more productive than we were just a few years ago. But we want even more automation… so let’s talk about some cool tools :)
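The slide's `balance_classes=TRUE` is a one-flag fix for class imbalance. As a loose analogue in Python, scikit-learn estimators accept `class_weight="balanced"`, which reweights classes rather than resampling them, so it is a related technique and not the same mechanism. The synthetic 95/5 class split below is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 5% positives (illustrative).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Same model twice; the only difference is the one-argument class reweighting.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# Count how often each model predicts the rare class on held-out data.
positives_plain = int(plain.predict(X_test).sum())
positives_weighted = int(weighted.predict(X_test).sum())
print(positives_plain, positives_weighted)
```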
  • 23. FEATURE ENGINEERING Feature engineering is often considered the dark art of data science. Like when your differential equations professor told you that you should “stare at it” until it made sense.
  • 24. SCIKIT FEATURE: SO COOL, RIGHT? scikit-feature is an open-source feature selection repository in Python developed by the Data Mining and Machine Learning Lab at Arizona State University. It is built upon the widely used machine learning package scikit-learn and two scientific computing packages, NumPy and SciPy. scikit-feature contains around 40 popular feature selection algorithms, including traditional feature selection algorithms and some structural and streaming feature selection algorithms.
  • 25. SADLY IT SEEMS TO BE MOSTLY ABANDONED
  • 26. VTREAT: HELPS MAKE THE SAUSAGE A 'data.frame' processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. 'vtreat' prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems 'vtreat' defends against: 'Inf', 'NA', too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training).
  • 27. SO MANY PROBLEMS… (1) Bad numerical values (NA, NaN, sentinels) (2) Categorical values (missing levels, novel levels in production) (3) Categorical values with too many levels (4) Weird skew. THERE’S A TON MORE. vtreat provides “y-aware” processing.
  • 28. (1) Treatment of missing values through safe replacement plus indicator column (a simple but very powerful method when combined with downstream machine learning algorithms). (2) Treatment of novel levels (new values of categorical variable seen during test or application, but not seen during training) through sub-models (or impact/effects coding of pooled rare events). (3) Explicit coding of categorical variable levels as new indicator variables (with optional suppression of non-significant indicators).
  • 29. (4) User specified significance pruning on levels coded into effects/impact sub-models. (5) Treatment of categorical variables with very large numbers of levels through sub-models. (6) Collaring/Winsorizing of unexpected out of range numeric inputs (clipping).
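The first and third treatments can be sketched by hand in pandas to show the idea. This is not vtreat's implementation: the function names and column suffixes below are hypothetical, and a novel level is simply encoded as all-zero indicators instead of going through the sub-models vtreat uses.

```python
import numpy as np
import pandas as pd


def fit_treatment(train, num_col, cat_col):
    """Learn the replacement value and known levels from training data only."""
    return {
        "fill": train[num_col].mean(),  # mean() skips missing values
        "levels": sorted(train[cat_col].dropna().unique()),
    }


def apply_treatment(df, plan, num_col, cat_col):
    """Apply a fitted plan to new data without learning anything from it."""
    out = pd.DataFrame(index=df.index)
    # Indicator column records where the value was missing...
    out[num_col + "_isBAD"] = df[num_col].isna().astype(int)
    # ...and the missing value itself is replaced by the training mean.
    out[num_col + "_clean"] = df[num_col].fillna(plan["fill"])
    # One indicator per level seen in training; a level never seen in
    # training matches none of them and becomes an all-zero row.
    for level in plan["levels"]:
        out[f"{cat_col}_lev_{level}"] = (df[cat_col] == level).astype(int)
    return out


train = pd.DataFrame({"x": [1.0, np.nan, 3.0], "c": ["a", "b", "a"]})
test = pd.DataFrame({"x": [np.nan, 10.0], "c": ["b", "z"]})  # "z" is novel

plan = fit_treatment(train, "x", "c")
treated = apply_treatment(test, plan, "x", "c")
print(treated)
```

Fitting the plan on training data and only applying it to new data is the part that carries over to production use; everything else here is a simplification.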
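Collaring (Winsorizing) can be illustrated in a few lines of NumPy: learn bounds from the training data, then clip anything outside them at application time. The 1st and 99th percentile bounds and the function names are assumptions for the sketch, not vtreat's defaults.

```python
import numpy as np


def fit_collar(train_values, lower_q=0.01, upper_q=0.99):
    """Learn clipping bounds from training data (quantiles are illustrative)."""
    lo, hi = np.quantile(train_values, [lower_q, upper_q])
    return float(lo), float(hi)


def apply_collar(values, bounds):
    """Clip values into the range observed during training."""
    lo, hi = bounds
    return np.clip(values, lo, hi)


train = np.arange(0, 101, dtype=float)  # training values 0, 1, ..., 100
bounds = fit_collar(train)

# New data containing values far outside anything seen in training.
new = np.array([-50.0, 42.0, 1000.0])
collared = apply_collar(new, bounds)
print(bounds, collared)
```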
  • 30.
  • 31. WARNING Your data had better be pretty clean! These automated ML tools are amazing, but your data needs to be in pretty good shape. Nice, numerical, no weird missing values… So chain them together and use vtreat!
  • 32. AND… AUTO-SKLEARN auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator: auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading the paper published at NIPS 2015.
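"Drop-in replacement" means the surrounding fit/predict code does not change. The sketch below assumes the `autosklearn` package may or may not be installed: if it is, it uses `AutoSklearnClassifier`; if not, it falls back to a plain scikit-learn baseline so the rest of the script still runs. The time budgets and the dataset are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_classifier():
    """Return an auto-sklearn classifier if available, else a plain baseline."""
    try:
        from autosklearn.classification import AutoSklearnClassifier
    except ImportError:
        # Fallback so the example runs without auto-sklearn installed.
        return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Time budgets in seconds; the values are illustrative, not recommendations.
    return AutoSklearnClassifier(time_left_for_this_task=60, per_run_time_limit=15)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Everything below is identical no matter which estimator was returned.
clf = make_classifier()
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(round(accuracy, 3))
```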
  • 33. AWARDS Of additional note, Auto-sklearn won both the auto and the tweakathon tracks of the ChaLearn AutoML challenge.
  • 34.
  • 35. RANDAL OLSON, TPOT CREATOR TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
  • 36. GENETIC PROGRAMMING Though both projects are open source, written in Python, and aimed at simplifying a machine learning process by way of AutoML, in contrast to Auto-sklearn using Bayesian optimization, TPOT's approach is based on genetic programming. One of the real benefits of TPOT is that it produces ready-to-run, standalone Python code for the best-performing model, in the form of a scikit-learn pipeline. This code, representing the best performing of all candidate models, can then be modified or inspected for additional insight, effectively being able to serve as a starting point as opposed to solely as an end product. (Matthew Mayo, KDnuggets)
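To give a feel for what "standalone Python code in the form of a scikit-learn pipeline" looks like, here is a hand-written sketch in that style. It was not produced by TPOT: the variable name `exported_pipeline`, the pipeline steps, the hyperparameters, and the dataset are all hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Example data only.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A pipeline of the kind a search might settle on: a feature selector feeding
# a tuned model. Steps and settings are made up for illustration.
exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=80),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

exported_pipeline.fit(X_train, y_train)
accuracy = exported_pipeline.score(X_test, y_test)
print(round(accuracy, 3))
```

Because the result is ordinary scikit-learn code, each step can be swapped, re-tuned, or inspected by hand, which is the "starting point" benefit the quote describes.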
  • 37.
  • 38.
  • 39. AUTOML: COMING SOON? Supposedly it is going to take advantage of a lot of the existing infrastructure in h2o, with ensembles in the back end, hyperparameter search, etc… VERY excited to see what happens next!
  • 41.
  • 42. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble of all the models. http://tiny.cc/automl
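The recipe described here (several tree ensembles plus a stacked ensemble on top, all cross-validated) can be approximated in scikit-learn for illustration. This sketch is not H2O AutoML: it leaves out the deep neural nets and the random hyperparameter grids, and the dataset, estimator settings, and meta-learner are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example data only.
X, y = load_breast_cancer(return_X_y=True)

# Base learners mirroring the list on the slide, minus the neural nets.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("xrt", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Stacked ensemble: a meta-learner is trained on the base learners'
# cross-validated predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Cross-validate the whole stack.
scores = cross_val_score(stack, X, y, cv=3)
print(scores.round(3), round(scores.mean(), 3))
```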
  • 43.
  • 44. THANK YOU FOR COMING TO MY TALK Reach out at eduardo@dominodatalab.com, @earino. We are hiring! https://www.dominodatalab.com/careers/