H2O.ai

Machine Intelligence
Top 10 Data Science Practitioner Pitfalls
Erin LeDell and Mark Landry
Silicon Valley Big Data Science
September 2015
H2O.ai
H2O Company
• Team: ~35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
H2O Software
• Open Source Software (Apache 2.0 License)
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
Scientific Advisory Council
Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
What is Data Science?
Problem Formulation
• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units
Collect & Process Data
• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning (X, Y)
Machine Learning
• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis
Insights & Action
• Translate results into action items
• Feed results into research pipeline
Machine Learning Task Overview
Regression
• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2
Classification
• Multi-class or Binary classification
• Ranking
• Accuracy and AUC
Clustering
• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
Machine Learning Workflow
Source: NLTK
Example of a supervised machine learning workflow.
1. Train vs Test
Training Set vs. Test Set
• Partition the original data (randomly or stratified) into a training set and a test set (e.g. 70/30).
Training Error vs. Test Error
• It can be useful to evaluate the training error, but you should not look at training error alone.
• Training error is not an estimate of generalization error (estimated on a test set or by cross-validation), which is what you should care more about.
• Training error vs test error over time is a useful thing to calculate. It can tell you when you start to overfit your model, so it is a useful metric in supervised machine learning.
Data Leakage
• Be careful of data leakage between the training set and the test set.
• If you are using pooled repeated measures data (vs iid data), you must ensure that all rows associated with a cluster/individual are either in train or test, but not in both.
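The repeated-measures rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a method from the talk: the function name `group_train_test_split`, the `id` and `visit` fields, and the 30% test fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def group_train_test_split(rows, group_of, test_fraction=0.3, seed=0):
    """Split rows so every group lands entirely in train or entirely in test.

    rows: list of records; group_of: function mapping a record to its group id
    (hypothetical, e.g. a patient or customer id). test_fraction applies to
    groups, not rows, so the row-level ratio is only approximate.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)

    group_ids = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test


# Made-up repeated-measures data: 10 individuals, 3 visits each.
rows = [{"id": i, "visit": v} for i in range(10) for v in range(3)]
train, test = group_train_test_split(rows, group_of=lambda r: r["id"])

train_ids = {r["id"] for r in train}
test_ids = {r["id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting on group ids rather than on rows is what keeps one individual's measurements from appearing on both sides of the split.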
1. Train vs Test Error
Source: Elements of Statistical Learning
2. Train vs Test vs Valid
Training Set vs. Validation Set vs. Test Set
• If you have “enough” data and plan to do some model tuning, you should partition your data into three parts: Training, Validation and Test sets.
• There is no general rule for how you should partition the data, and it will depend on how strong the signal in your data is, but an example could be: 50% Train, 25% Validation and 25% Test.
Validation is for Model Tuning
• The validation set is used strictly for model tuning (via validation of models with different parameters) and the test set is used to make a final estimate of the generalization error.
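A three-way partition like the 50/25/25 example above might look as follows. This is a sketch under simple assumptions (iid rows, purely random assignment); the helper name `three_way_split` is made up for illustration.

```python
import random


def three_way_split(rows, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly partition rows into train, validation and test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, so no row is dropped
    return train, valid, test


train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

Candidate models would then be compared on `valid` only, with `test` scored once at the end for the final estimate.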
3. Model Performance
Test Error
• Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
• Train a model using the training set and evaluate performance (a single time) on the test set.
K-fold Cross-validation
• Train & test K models, each one holding out a different fold (figure not reproduced).
• Average the model performance over the K test sets.
• Report cross-validated metrics.
Performance Metrics
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure, Log-loss
• Ranking (Binary Outcome): AUC, Partial AUC
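The K-fold procedure above can be sketched without any library. The "model" here is deliberately trivial (predict the training mean) and the data is made up; only the fold bookkeeping and the averaging step are the point. The function names are illustrative assumptions.

```python
import math


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once.

    Folds are interleaved (row i goes to fold i % k). Shuffle the rows first
    if their order carries information.
    """
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validated_mse(y, k=5):
    """Cross-validate a trivial model that predicts the training-fold mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        mean_train = sum(y[j] for j in train_idx) / len(train_idx)
        y_test = [y[j] for j in test_idx]
        scores.append(mse(y_test, [mean_train] * len(y_test)))
    return sum(scores) / k, scores


y = [float(v % 7) for v in range(50)]  # made-up response values
avg_mse, fold_mse = cross_validated_mse(y, k=5)
print(round(avg_mse, 3), round(math.sqrt(avg_mse), 3))
```

The reported number is the average over the K held-out folds, never the error on the rows a model was trained on.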
4. Class Imbalance
Imbalanced Response Variable
• A dataset is said to be imbalanced when the binomial or multinomial response variable has one or more classes that are underrepresented in the training data, with respect to the other classes.
Very common
• This is incredibly common in real-world datasets.
• In practice, balanced datasets are the rarity, unless they have been artificially created.
• There is no precise definition of what defines an imbalanced vs balanced dataset; the term is vague.
• My rule of thumb for binary response: If the minority class makes up <10% of the data, this can cause issues.
Industries
• Advertising: Probability that someone clicks on an ad is very low… very very low.
• Healthcare & Medicine: Certain diseases or adverse medical conditions are rare.
• Fraud Detection: Insurance or credit fraud is rare.
4. Simple Remedies
Artificial Balance
• You can balance the training set using sampling.
Potential Pitfalls
• Notice that we don’t say to balance the test set. The test set represents the true data distribution. The only way to get “honest” model performance on your test set is to use the original, unbalanced, test set.
• The same goes for the hold-out sets in cross-validation. For this, you may end up having to write custom code, depending on what software you use.
Solutions
• H2O has a “balance_classes” argument that can be used to do this properly & automatically.
• You can manually upsample (or downsample) your minority (or majority) class(es) either by duplicating (or sub-sampling) rows, or by using row weights.
• The SMOTE (Synthetic Minority Oversampling Technique) algorithm generates simulated training examples from the minority class instead of duplicating existing rows.
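Manual upsampling by row duplication, as described above, could be sketched like this. It is a plain-Python illustration with made-up data and a hypothetical helper name; it is not how H2O's balance_classes works internally, and it is meant for the training set only.

```python
import random
from collections import Counter


def upsample_minority(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every class
    matches the size of the largest one. Apply to the training set only."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    rng.shuffle(balanced)
    return balanced


# Made-up training data with a 5% positive class.
train = [{"x": i, "y": 1 if i % 20 == 0 else 0} for i in range(200)]
balanced_train = upsample_minority(train, label_of=lambda r: r["y"])
print(Counter(r["y"] for r in train))
print(Counter(r["y"] for r in balanced_train))
```

The test set, and each hold-out fold in cross-validation, would be left at its original class ratio.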
4. Advanced Remedies
AUC-Maximizing Algorithms
• There are ways to tackle this issue more directly, by using algorithms that optimize a metric that is insensitive to prior class probabilities, for example Area Under the ROC Curve (AUC).
• Many algorithms work by optimizing a metric equivalent or similar to accuracy. If your data is imbalanced, this may not produce a good model, since you can have excellent accuracy and poor AUC.
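To make the "excellent accuracy and poor AUC" point concrete, here is a small sketch on made-up data with a 5% positive class. The AUC helper uses the pairwise-ranking definition (probability that a random positive is scored above a random negative, ties counting one half); the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half). O(n_pos * n_neg), fine for a demo."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 5 positives out of 100 rows.
y_true = [1] * 5 + [0] * 95

# A "model" that gives every row the same low score, so it always predicts 0.
constant_scores = [0.05] * 100
constant_preds = [0] * 100

print(accuracy(y_true, constant_preds))  # high, because negatives dominate
print(auc(y_true, constant_scores))      # no ranking ability at all
```

The constant predictor is right on every negative row, yet it cannot rank a single positive above a negative.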
Cost-Sensitive Training
• Use a cost function to more harshly penalize the types of errors you care about most.
• Cost Matrix: (figure not reproduced)
5. Categorical Data
Real Data
• Most real world datasets contain categorical data.
• Problems can arise if you have too many categories.
Too Many Categories
• A lot of ML software will place limits on the number of categories allowed in a single column (e.g. 1024) so you may be forced to deal with this whether you like it or not.
• When there are high-cardinality categorical columns, often there will be many categories that only occur a small number of times (not very useful).
Solutions
• If you have some hierarchical knowledge about the data, then you may be able to reduce the number of categories by using some sensible higher-level mapping of the categories.
• Example: ICD-9 codes, with thousands of unique diagnostic and procedure codes. You can map each category to a higher level super-category to reduce the cardinality.
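A sketch of reducing cardinality by mapping detailed codes to a coarser parent. The mapping rule (keep the part before the dot) and the code strings are assumptions made for illustration, not an official ICD-9 grouping. Lumping rare levels into "Other" is an extra option that the slides do not mention.

```python
from collections import Counter


def to_super_category(code):
    """Map a detailed code to a coarser parent. Keeping the part before the
    dot is an illustrative assumption, not an official mapping."""
    return code.split(".")[0]


def collapse_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with one level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Made-up code strings in a dotted format.
codes = ["250.00", "250.01", "250.02", "401.9", "401.1", "786.50"]
parents = [to_super_category(c) for c in codes]

print(len(set(codes)), len(set(parents)))  # distinct levels before and after
print(collapse_rare(parents))
```

In real use the mapping table would come from domain knowledge about the hierarchy rather than from string manipulation.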
5. Missing Categories
Missing Data
• There are many approaches to imputing categorical data. The simplest approach is to impute all missing values with the mode (the category that occurs most).
Training vs. Test Categories
• When your data is split into training and test sets, there may be categories that are represented in the training set but not in the test set and vice versa.
New Categories in Test Set
• If you have expanded your categorical variable into a group of binary indicator columns, one per category, then new categories in the test set should not cause any problems. Example: If you expand a categorical (cat, dog) into “cat” and “dog” indicator columns and your test set has a “rat” in it, then the value in each of those columns will be 0: neither cat nor dog.
• If the algorithm you are using (e.g. Random Forest) implicitly uses the categories then you may want to add an “Other” column that all new categories will be grouped into.
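The cat/dog/rat example above, written out as a small sketch. The helper names and the `add_other` flag are made up for illustration; the key point is that the set of levels is learned from the training data only.

```python
def fit_levels(train_values):
    """Remember the categories seen in the training set, in a stable order."""
    return sorted(set(train_values))


def one_hot(value, levels, add_other=False):
    """Indicator columns for `value`. An unseen category becomes all zeros,
    or sets a trailing "Other" column when add_other is True."""
    row = [1 if value == level else 0 for level in levels]
    if add_other:
        row.append(0 if value in levels else 1)
    return row


levels = fit_levels(["cat", "dog", "dog", "cat"])
print(levels)
print(one_hot("dog", levels))
print(one_hot("rat", levels))                  # unseen: neither cat nor dog
print(one_hot("rat", levels, add_other=True))  # unseen: routed to "Other"
```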
6. Missing Data
Types of Missing Data
• Unavailable: Valid for the observation, but not available in the data set.
• Removed: Observation quality threshold may not have been reached, and the data was removed.
• Not applicable: Measurement does not apply to the particular observation (e.g. number of tires on a boat observation).
What to Do
• It depends! Some options:
• Ignore entire observation.
• Create a binary variable for each predictor to indicate whether the data was missing or not.
• Segment model based on data availability.
• Use alternative algorithm: decision trees accept missing values; linear models typically do not.
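The binary "was it missing" variable mentioned above can be built alongside a simple fill value. This is a sketch: filling with the observed mean is an assumption made here (the slides do not prescribe a fill value), and the field name `age` is made up.

```python
def add_missing_indicator(values, fill=None):
    """Return (filled_values, was_missing) for one numeric predictor.

    Missing entries are None. If fill is not given, the mean of the observed
    values is used; that choice is an assumption of this sketch.
    """
    observed = [v for v in values if v is not None]
    if fill is None:
        fill = sum(observed) / len(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


age = [34, None, 51, 29, None]
age_filled, age_missing = add_missing_indicator(age)
print(age_filled)
print(age_missing)
```

Keeping the indicator column lets a model learn from the fact that a value was absent, which the filled column alone would hide. In a train/test setting the fill value would be computed on the training rows and passed in via `fill` for the test rows.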
7. Outliers
Types of Outliers
• Outliers can exist in response or predictors.
• Valid outliers: rare, extreme events.
• Invalid outliers: erroneous measurements.
What Can Happen
• Outlier values can have a disproportionate weight on the model.
• Because errors are squared, MSE puts extra weight on fitting outlier observations.
• Boosting will spend considerable modeling effort fitting these observations.
What to Do
• Remove observations.
• Apply a transformation to reduce impact: e.g. log or bins.
• Choose a loss function that is more robust: e.g. MAE vs MSE.
• Impose a constraint on data range (cap values).
• Ask questions: Understand whether the values are valid or invalid, to make the most appropriate choice.
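A small numeric sketch of two of the options above, a more robust loss and capping, on made-up data with one extreme value. It relies on two standard facts: among constant predictions, the mean minimizes squared error and the median minimizes absolute error. The cap bounds are arbitrary choices for the example.

```python
import statistics


def cap(values, lower, upper):
    """Clamp each value into [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def mae(values, prediction):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


# Made-up response with one extreme value.
y = [10, 12, 11, 9, 13, 500]

mean_y = statistics.mean(y)      # best constant under squared error
median_y = statistics.median(y)  # best constant under absolute error
print(mean_y, median_y)
print(mae(y, mean_y), mae(y, median_y))

capped = cap(y, lower=0, upper=50)
print(capped, statistics.mean(capped))
```

The squared-error fit is dragged far from the bulk of the data by a single row, while the absolute-error fit and the capped data stay close to it.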
8. Data Leakage
What Is It
• Leakage is allowing your model to use information that will not be available in a production setting.
• Obvious example: using the Dow Jones daily gain/loss as part of a model to predict individual stock performance (even if that symbol is not part of the Dow).
What Happens
• Model is overfit.
• Will make predictions inconsistent with those you scored when fitting the model (even with a validation set).
• Insights derived from the model will be incorrect.
What to Do
• Understand the nature of your problem and data.
• Scrutinize model feedback, such as relative influence or linear coefficients.
9. Useless Models
What is a “Useless” Model?
• Solving the wrong problem.
• Not collecting appropriate data.
• Not structuring data correctly to solve the problem.
• Choosing a target/loss measure that does not optimize the end use case: using accuracy to prioritize resources.
• Having a model that is not actionable.
• Using a complicated model that is less accurate than a simple model.
What To Do
• Understand the problem statement.
• Solving the wrong problem is an issue in all problem-solving domains, but arguably easier to do with black box techniques common to ML.
• Utilize post-processing measures.
• Create simple baseline models to understand lift of more complex models.
• Plan on an iterative approach: start quickly, even if on imperfect data.
• Question your models and attempt to understand them.
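The "simple baseline" advice above can be sketched with a majority-class predictor. The labels and the "complex model" predictions below are made up; the point is only the comparison.

```python
from collections import Counter


def majority_class_baseline(y_train):
    """Return the most frequent training label; always predicting it is the
    baseline."""
    return Counter(y_train).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels, and made-up predictions from some more complex model.
y_train = [0] * 80 + [1] * 20
y_test = [0] * 8 + [1] * 2
model_preds = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

baseline_label = majority_class_baseline(y_train)
baseline_acc = accuracy(y_test, [baseline_label] * len(y_test))
model_acc = accuracy(y_test, model_preds)
print(baseline_acc, model_acc, model_acc - baseline_acc)
```

In this toy case the complex model shows no lift in accuracy over always predicting the majority class, which is the kind of result a baseline is there to expose.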
10. No Free Lunch
No Such Thing as a Free Lunch
• No general purpose algorithm to solve all problems.
• No right answer on optimal data preparation.
• General heuristics are not always true:
• Tree models solve problems equivalently with any order-preserving transformation.
• Decision trees and neural networks will automatically find interactions.
• High number of predictors may be handled, but lead to a less optimal result than fewer key predictors.
• Models cannot find relative information that spans multiple observations.
• Model feedback can be misleading: relative influence, linear coefficients.
What To Do
• Understand how the underlying algorithms operate.
• Try several algorithms and observe relative performance and the characteristics of your data.
• Feature engineering & feature selection.
• Interpret and react to model feedback.
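One reading of the point that models cannot find information spanning multiple observations is that such information has to be engineered into each row. The sketch below adds a change-since-previous-row feature per entity; the field names and data are made up, and this is only one possible kind of cross-row feature.

```python
def add_delta_feature(rows, key, time, value):
    """Add `delta`: change in `value` since the same entity's previous row.

    Rows are dicts; key, time and value are field names. The first row of
    each entity gets None because there is nothing to compare against.
    """
    last_seen = {}
    out = []
    for row in sorted(rows, key=lambda r: (r[key], r[time])):
        prev = last_seen.get(row[key])
        delta = None if prev is None else row[value] - prev
        out.append({**row, "delta": delta})
        last_seen[row[key]] = row[value]
    return out


# Made-up account balances observed over time.
rows = [
    {"account": "a", "day": 1, "balance": 100},
    {"account": "a", "day": 2, "balance": 130},
    {"account": "b", "day": 1, "balance": 50},
    {"account": "a", "day": 3, "balance": 90},
]
for r in add_delta_feature(rows, "account", "day", "balance"):
    print(r)
```

Each output row now carries information from its predecessor, so a row-at-a-time model can use it.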
Where to learn practical tips?
• WinVector Blog (Nina Zumel & John Mount): http://win-vector.com/blog
• Practical Data Science With R (book by Nina Zumel & John Mount): https://www.manning.com/books/practical-data-science-with-r
• Elements of Statistical Learning (book by Trevor Hastie, Robert Tibshirani & Jerome Friedman): http://statweb.stanford.edu/~tibs/ElemStatLearn
• Machine Learning Mastery Blog (Jason Brownlee): http://machinelearningmastery.com
Where to learn more about H2O?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com
Customers, Community, Evangelists
November 9, 10, 11
Computer History Museum
h2oworld.h2o.ai
20% off registration using code: h2ocommunity
@ledell on Twitter, GitHub
erin@h2o.ai
http://www.stat.berkeley.edu/~ledell
@Mark_a_Landry on Twitter
mark@h2o.ai

Weitere ähnliche Inhalte

Was ist angesagt?

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Simplilearn
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process Omid Vahdaty
 
Python Developer Roadmap 2023
Python Developer Roadmap 2023Python Developer Roadmap 2023
Python Developer Roadmap 2023Simplilearn
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISBabasID2
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyRTTS
 
Data Visualization in Data Science
Data Visualization in Data ScienceData Visualization in Data Science
Data Visualization in Data ScienceMaloy Manna, PMP®
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science TeamsGanes Kesari
 
Data Analyst Interview Questions & Answers
Data Analyst Interview Questions & AnswersData Analyst Interview Questions & Answers
Data Analyst Interview Questions & AnswersSatyam Jaiswal
 
Data Observability.pptx
Data Observability.pptxData Observability.pptx
Data Observability.pptxSonaSamad1
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data ScienceArc & Codementor
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Simplilearn
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data PreprocessingT Kavitha
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisEva Durall
 

Was ist angesagt? (20)

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
Pandas
PandasPandas
Pandas
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process
 
Python Developer Roadmap 2023
Python Developer Roadmap 2023Python Developer Roadmap 2023
Python Developer Roadmap 2023
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSIS
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
 
Data Visualization in Data Science
Data Visualization in Data ScienceData Visualization in Data Science
Data Visualization in Data Science
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science Teams
 
Tableau Presentation
Tableau PresentationTableau Presentation
Tableau Presentation
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Data Analyst Interview Questions & Answers
Data Analyst Interview Questions & AnswersData Analyst Interview Questions & Answers
Data Analyst Interview Questions & Answers
 
Data Observability.pptx
Data Observability.pptxData Observability.pptx
Data Observability.pptx
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data modelling 101
Data modelling 101Data modelling 101
Data modelling 101
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 

Andere mochten auch

Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityGeoffrey Fox
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityPier Luca Lanzi
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Krishna Petrochemicals
 
Big Data Jujitsu Walkthru Client x Client
Big Data Jujitsu Walkthru Client x ClientBig Data Jujitsu Walkthru Client x Client
Big Data Jujitsu Walkthru Client x ClientClient X Client
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreTuri, Inc.
 
Sparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public SafetySparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public SafetySri Ambati
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupSri Ambati
 
Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Sri Ambati
 
H2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno CandelH2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno CandelSri Ambati
 
Machine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of ThingsMachine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of ThingsSri Ambati
 
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFH2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFSri Ambati
 
H2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellH2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellSri Ambati
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramH2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramSri Ambati
 
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP ChainsExploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP ChainsAjin Abraham
 
RapidMiner: Performance Validation And Visualization
RapidMiner:  Performance Validation And VisualizationRapidMiner:  Performance Validation And Visualization
RapidMiner: Performance Validation And VisualizationRapidmining Content
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellSri Ambati
 
Hacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - WhitepaperHacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - WhitepaperAjin Abraham
 

Andere mochten auch (20)

Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana University
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and Credibility
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 
Big Data Jujitsu Walkthru Client x Client
Big Data Jujitsu Walkthru Client x ClientBig Data Jujitsu Walkthru Client x Client
Big Data Jujitsu Walkthru Client x Client
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
Sparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public SafetySparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public Safety
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015
 
H2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno CandelH2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno Candel
 
Machine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of ThingsMachine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of Things
 
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFH2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
 
H2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellH2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine Udell
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramH2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
 
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP ChainsExploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
 
RapidMiner: Performance Validation And Visualization
RapidMiner:  Performance Validation And VisualizationRapidMiner:  Performance Validation And Visualization
RapidMiner: Performance Validation And Visualization
 
BSidesTO 2016 - Incident Tracking
BSidesTO 2016 - Incident TrackingBSidesTO 2016 - Incident Tracking
BSidesTO 2016 - Incident Tracking
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
Hacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - WhitepaperHacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - Whitepaper
 

Ähnlich wie Top 10 Data Science Practitioner Pitfalls

Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandrySri Ambati
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10Roger Barga
 
Statistical Learning and Model Selection (1).pptx
Top 10 Data Science Practitioner Pitfalls

  • 1. H2O.ai Machine Intelligence
 Top 10 Data Science Practitioner Pitfalls
 Erin LeDell and Mark Landry
 Silicon Valley Big Data Science, September 2015
  • 2. H2O.ai
 H2O Company
 • Team: ~35. Founded in 2012, Mountain View, CA
 • Stanford Math & Systems Engineers
 • Open Source Software (Apache 2.0 License)
 H2O Software
 • Ease of Use via Web Interface
 • R, Python, Scala, Spark & Hadoop Interfaces
 • Distributed Algorithms Scale to Big Data
  • 3. Scientific Advisory Council
 Dr. Trevor Hastie
 • John A. Overdeck Professor of Mathematics, Stanford University
 • PhD in Statistics, Stanford University
 • Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
 • Co-author with John Chambers, Statistical Models in S
 • Co-author, Generalized Additive Models
 • 108,404 citations (via Google Scholar)
 Dr. Rob Tibshirani
 • Professor of Statistics and Health Research and Policy, Stanford University
 • PhD in Statistics, Stanford University
 • COPSS Presidents’ Award recipient
 • Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
 • Author, Regression Shrinkage and Selection via the Lasso
 • Co-author, An Introduction to the Bootstrap
 Dr. Stephen Boyd
 • Professor of Electrical Engineering and Computer Science, Stanford University
 • PhD in Electrical Engineering and Computer Science, UC Berkeley
 • Co-author, Convex Optimization
 • Co-author, Linear Matrix Inequalities in System and Control Theory
 • Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
  • 4. What is Data Science?
 Problem Formulation
 • Identify an outcome of interest and the type of task: classification / regression / clustering
 • Identify the potential predictor variables
 • Identify the independent sampling units
 Collect & Process Data
 • Conduct research experiment (e.g. Clinical Trial)
 • Collect examples / randomly sample the population
 • Transform, clean, impute, filter, aggregate data
 • Prepare the data for machine learning (X, Y)
 Machine Learning
 • Modeling using a machine learning algorithm (training)
 • Model evaluation and comparison
 • Sensitivity & Cost Analysis
 Insights & Action
 • Translate results into action items
 • Feed results into research pipeline
  • 5. Machine Learning Task Overview
 Regression
 • Predict a real-valued response (viral load, weight)
 • Gaussian, Gamma, Poisson and Tweedie
 • MSE and R^2
 Classification
 • Multi-class or Binary classification
 • Ranking
 • Accuracy and AUC
 Clustering
 • Unsupervised learning (no training labels)
 • Partition the data / identify clusters
 • AIC and BIC
  • 6. Machine Learning Workflow
 Example of a supervised machine learning workflow. Source: NLTK
  • 7. Train vs Test (1 of 10, Top 10 Data Science Practitioner Pitfalls)
  • 8. 1. Train vs Test
 Training Set vs. Test Set
 • Partition the original data (randomly or stratified) into a training set and a test set (e.g. 70/30).
 • It can be useful to evaluate the training error, but you should not look at training error alone.
 Training Error vs. Test Error
 • Training error is not an estimate of generalization error. The error measured on a test set or by cross-validation is, and that is what you should care about more.
 • Tracking training error against test error over the course of training is useful in supervised machine learning: it shows when the model starts to overfit.
 Data Leakage
 • Be careful of data leakage between the training set and the test set.
 • If you are using pooled repeated measures data (vs iid data), you must ensure that all rows associated with a cluster/individual are either in train or in test, but not in both.
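The repeated-measures rule on this slide can be sketched in plain Python. This is a minimal illustration, not the presenters' code: the function name `group_train_test_split`, the `patient`/`visit` fields and the 30% test fraction are all assumptions made for the example. Note that the fraction is applied to groups, not to rows.

```python
import random

def group_train_test_split(rows, group_of, test_fraction=0.3, seed=0):
    """Split rows so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(r) for r in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # The fraction is taken over groups, so row counts are only approximate.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if group_of(r) not in test_groups]
    test = [r for r in rows if group_of(r) in test_groups]
    return train, test

# Hypothetical repeated-measures data: 10 patients with 3 visits each.
rows = [{"patient": p, "visit": v} for p in range(10) for v in range(3)]
train, test = group_train_test_split(rows, group_of=lambda r: r["patient"])

train_ids = {r["patient"] for r in train}
test_ids = {r["patient"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

The printed intersection is empty, which is the property that prevents one individual's rows from appearing on both sides of the split.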
  • 9. 1. Train vs Test Error
 Source: Elements of Statistical Learning
  • 10. Validation Set (2 of 10, Top 10 Data Science Practitioner Pitfalls)
  • 11. 2. Train vs Test vs Valid
 Training Set vs. Validation Set vs. Test Set
 • If you have “enough” data and plan to do some model tuning, you should really partition your data into three parts: Training, Validation and Test sets.
 • There is no general rule for how you should partition the data and it will depend on how strong the signal in your data is, but an example could be: 50% Train, 25% Validation and 25% Test.
 Validation is for Model Tuning
 • The validation set is used strictly for model tuning (via validation of models with different parameters) and the test set is used to make a final estimate of the generalization error.
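A small sketch of the three-way workflow, under stated assumptions: the data are synthetic, the "model" is a toy constant predictor (`shrink * training mean`), and the candidate grid is invented for illustration. The point is only the division of labour: tune on validation, touch the test set once.

```python
import random

def three_way_split(n_rows, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return shuffled row indices for the train, validation and test sets."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = int(n_rows * fractions[0])
    n_valid = int(n_rows * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical response values; the toy model predicts shrink * mean(training y).
rng = random.Random(1)
y = [rng.gauss(5.0, 1.0) for _ in range(1000)]
train_idx, valid_idx, test_idx = three_way_split(len(y))
train_mean = sum(y[i] for i in train_idx) / len(train_idx)

def validation_error(shrink):
    return mse([y[i] for i in valid_idx], [shrink * train_mean] * len(valid_idx))

candidates = [0.5, 0.8, 1.0]                  # assumed hyperparameter grid
best = min(candidates, key=validation_error)  # tuning looks at validation only
# The test set is used a single time, for the final generalization estimate.
test_error = mse([y[i] for i in test_idx], [best * train_mean] * len(test_idx))
print(best, round(test_error, 3))
```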
  • 12. Model Performance (3 of 10, Top 10 Data Science Practitioner Pitfalls)
  • 13. 3. Model Performance
 Test Error
 • Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
 • Train a model using the training set and evaluate performance (a single time) on the test set.
 K-fold Cross-validation
 • Train & test K models as shown.
 • Average the model performance over the K test sets.
 • Report cross-validated metrics.
 Performance Metrics
 • Regression: R^2, MSE, RMSE
 • Classification: Accuracy, F1, H-measure, Log-loss
 • Ranking (Binary Outcome): AUC, Partial AUC
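K-fold cross-validation as described on this slide can be sketched as follows. The helper `k_fold_indices`, the toy data and the mean-predictor "model" are assumptions for illustration. The folds here are interleaved by row position, so real data should be shuffled first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy response values; the stand-in model predicts the mean of the training fold.
y = [float(v) for v in range(10)]

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    prediction = sum(y[i] for i in train_idx) / len(train_idx)   # "train"
    fold_errors.append(mse([y[i] for i in test_idx], [prediction] * len(test_idx)))

# The cross-validated metric is the average over the K held-out folds.
cv_mse = sum(fold_errors) / len(fold_errors)
print(round(cv_mse, 3))
```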
  • 14. Class Imbalance (4 of 10, Top 10 Data Science Practitioner Pitfalls)
  • 15. 4. Class Imbalance
 Imbalanced Response Variable
 • A dataset is said to be imbalanced when the binomial or multinomial response variable has one or more classes that are underrepresented in the training data, with respect to the other classes.
 • There is no precise definition of an imbalanced vs. a balanced dataset; the term is vague.
 • My rule of thumb for a binary response: if the minority class makes up <10% of the data, this can cause issues.
 Very common
 • This is incredibly common in real-world datasets.
 • In practice, balanced datasets are the rarity, unless they have been artificially created.
 Industries
 • Advertising: the probability that someone clicks on an ad is very low… very, very low.
 • Healthcare & Medicine: certain diseases or adverse medical conditions are rare.
 • Fraud Detection: insurance or credit fraud is rare.
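The 10% rule of thumb is easy to check before modeling. A minimal sketch, assuming hypothetical click labels; the helper names and the default threshold are illustrative, not part of any library:

```python
from collections import Counter

def class_fractions(labels):
    """Fraction of rows belonging to each class."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}

def is_imbalanced(labels, threshold=0.10):
    """True if the rarest class falls below the rule-of-thumb threshold."""
    return min(class_fractions(labels).values()) < threshold

# Hypothetical ad-click data: 30 clicks in 1000 impressions.
labels = ["no_click"] * 970 + ["click"] * 30
fractions = class_fractions(labels)
print(fractions["click"], is_imbalanced(labels))   # 0.03 True
```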
  • 16. 4. Simple Remedies
 Artificial Balance
 • You can balance the training set using sampling.
 Potential Pitfalls
 • Notice that we don’t say to balance the test set. The test set represents the true data distribution. The only way to get “honest” model performance on your test set is to use the original, unbalanced test set.
 • The same goes for the hold-out sets in cross-validation. For this, you may end up having to write custom code, depending on what software you use.
 Solutions
 • H2O has a “balance_classes” argument that can be used to do this properly & automatically.
 • You can manually upsample (or downsample) your minority (or majority) class(es) either by duplicating (or sub-sampling) rows, or by using row weights.
 • The SMOTE (Synthetic Minority Oversampling Technique) algorithm generates simulated training examples from the minority class instead of duplicating existing rows.
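Manual upsampling by row duplication can be sketched as below. This is a plain-Python illustration, not H2O's `balance_classes` implementation; the function name and the toy rows are assumptions. Only the training rows are rebalanced, and the test rows are left exactly as they are.

```python
import random
from collections import Counter

def upsample_minority(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    by_class = {}
    for r in rows:
        by_class.setdefault(label_of(r), []).append(r)
    target = max(len(v) for v in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for cls_rows in by_class.values():
        extra = [rng.choice(cls_rows) for _ in range(target - len(cls_rows))]
        balanced.extend(cls_rows + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical (id, label) rows: 5% positives in train.
train = [("x%d" % i, 1 if i < 5 else 0) for i in range(100)]
test = [("t%d" % i, 1 if i < 2 else 0) for i in range(40)]   # never rebalanced

balanced_train = upsample_minority(train, label_of=lambda r: r[1])
train_counts = Counter(label for _, label in balanced_train)
test_counts = Counter(label for _, label in test)
print(train_counts[0], train_counts[1], test_counts[0], test_counts[1])
```

Inside cross-validation the same call would be applied to each training fold only, which is the custom code the slide warns about.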
 • 17. 4. Advanced Remedies
   AUC-Maximizing Algorithms
   • There are ways to tackle this issue more directly.
   • One is to use algorithms that optimize a metric that is insensitive to prior class probabilities, for example Area Under the ROC Curve (AUC).
   • Many algorithms work by optimizing a metric equivalent or similar to accuracy. If your data is imbalanced, this will not produce a good model, since you can have excellent accuracy and poor AUC.
   Cost-Sensitive Training
   • Use a cost function that penalizes more harshly the types of errors you care about most.
   • Cost Matrix: [figure not reproduced in this transcript]
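A small worked example shows how accuracy can look excellent while AUC is poor, and how a cost matrix changes the ranking of two classifiers. The data, the two classifiers, and the cost values (a missed positive costing 20 times a false alarm) are all hypothetical; AUC is computed here by the pairwise-comparison definition, which is fine for small inputs.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def total_cost(y_true, y_pred, cost):
    """Sum the cost of each (actual, predicted) pair."""
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred))

y_true = [1] * 5 + [0] * 95              # 5% positives

# Classifier A: always predicts the majority class.
always_negative = [0] * 100
constant_scores = [0.0] * 100

# Classifier B: catches all 5 positives at the price of 10 false alarms.
flag_more = [1] * 5 + [1] * 10 + [0] * 85

# Hypothetical cost matrix keyed by (actual, predicted).
cost = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): 20}

print("A accuracy:", accuracy(y_true, always_negative))
print("A AUC:     ", auc(y_true, constant_scores))
print("A cost:    ", total_cost(y_true, always_negative, cost))
print("B accuracy:", accuracy(y_true, flag_more))
print("B cost:    ", total_cost(y_true, flag_more, cost))
```

Classifier A wins on accuracy, yet its scores carry no ranking information, and under this cost matrix it is ten times more expensive than classifier B.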
 • 18. Categorical Data: 5 of 10, Top 10 Data Science Practitioner Pitfalls
 • 19. 5. Categorical Data
   Real Data
   • Most real-world datasets contain categorical data.
   • Problems can arise if you have too many categories.
   Too Many Categories
   • A lot of ML software will place limits on the number of categories allowed in a single column (e.g. 1024), so you may be forced to deal with this whether you like it or not.
   • When there are high-cardinality categorical columns, there will often be many categories that occur only a small number of times (not very useful).
   Solutions
   • If you have some hierarchical knowledge about the data, then you may be able to reduce the number of categories by using some sensible higher-level mapping of the categories.
   • Example: ICD-9 codes, which comprise thousands of unique diagnostic and procedure codes. You can map each category to a higher-level super-category to reduce the cardinality.
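Reducing cardinality with a higher-level mapping can be sketched as follows. The code strings are illustrative, and grouping by the part before the decimal point is only one assumed super-category mapping, not a clinical recommendation. Collapsing the remaining rare levels into "Other" is an additional, commonly used step that the slide does not prescribe.

```python
from collections import Counter

def to_super_category(code):
    """Assumed mapping: keep only the part of the code before the decimal point."""
    return code.split(".")[0]

def collapse_rare(values, min_count, other="Other"):
    """Replace levels seen fewer than min_count times with a single catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

codes = ["250.00", "250.01", "250.02", "401.9", "401.1", "786.50"]

grouped = [to_super_category(c) for c in codes]
collapsed = collapse_rare(grouped, min_count=2)

print(len(set(codes)), "levels ->", len(set(grouped)), "levels")
print(collapsed)
```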
 • 20. 5. Missing Categories
   Missing Data
   • There are many approaches to imputing categorical data. The simplest approach is to impute all missing values with the mode (the category that occurs most).
   Training vs. Test Categories
   • When your data is split into training and test sets, there may be categories that are represented in the training set but not in the test set, and vice versa.
   New Categories in Test Set
   • If you have expanded your categorical variable into a group of binary indicator columns, one per category, then new categories in the test set should not cause any problems. Example: if you expand a categorical (cat, dog) into “cat” and “dog” indicator columns and your test set has a “rat” in it, then the value in each of those columns will be 0: neither cat nor dog.
   • If the algorithm you are using (e.g. Random Forest) implicitly uses the categories, then you may want to add an “Other” column that all new categories will be grouped into.
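The cat/dog/rat example, together with mode imputation and an “Other” column, can be written out directly. This is a plain-Python sketch with hypothetical helper names; the indicator columns are learned from the training data only.

```python
from collections import Counter

def mode_impute(values, missing=None):
    """Fill missing entries with the most frequent observed category."""
    mode = Counter(v for v in values if v is not missing).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

def fit_indicator_columns(train_values):
    """Indicator columns are defined by the categories seen in training."""
    return sorted(set(train_values))

def expand(value, columns):
    """One 0/1 indicator per training category; unseen values give all zeros."""
    return [1 if value == c else 0 for c in columns]

def expand_with_other(value, columns):
    """Same as expand, plus a trailing 'Other' flag for unseen categories."""
    return expand(value, columns) + [0 if value in columns else 1]

train = ["cat", "dog", None, "dog"]
train_filled = mode_impute(train)
columns = fit_indicator_columns(train_filled)

print(columns)
print(expand("dog", columns))
print(expand("rat", columns))             # new category in the test set
print(expand_with_other("rat", columns))
```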
 • 21. Missing Data: 6 of 10, Top 10 Data Science Practitioner Pitfalls
 • 22. 6. Missing Data
   Types of Missing Data
   • Unavailable: Valid for the observation, but not available in the data set.
   • Removed: The observation quality threshold may not have been reached, and the data was removed.
   • Not applicable: The measurement does not apply to the particular observation (e.g. number of tires on a boat observation).
   What to Do
   • It depends! Some options:
   • Ignore the entire observation.
   • Create a binary variable for each predictor to indicate whether the data was missing or not.
   • Segment the model based on data availability.
   • Use an alternative algorithm: decision trees accept missing values; linear models typically do not.
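The binary-indicator option can be sketched as below. Filling the gap with the median is an assumption made here so the column stays numeric; the slide only asks for the indicator, and the `income` values are invented.

```python
from statistics import median

def add_missing_indicator(values):
    """Return (filled_values, was_missing_flags) for one numeric predictor.
    Missing entries (None) are filled with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator

income = [52.0, None, 48.0, 61.0, None]   # hypothetical predictor
filled, was_missing = add_missing_indicator(income)

print(filled)
print(was_missing)
```

Keeping the indicator alongside the filled column lets the model learn whether missingness itself carries signal.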
 • 23. Outliers: 7 of 10, Top 10 Data Science Practitioner Pitfalls
 • 24. 7. Outliers
   Types of Outliers
   • Outliers can exist in the response or the predictors.
   • Valid outliers: rare, extreme events.
   • Invalid outliers: erroneous measurements.
   What Can Happen
   • Outlier values can have a disproportionate weight on the model.
   • A model fit with MSE will focus more on the outlier observations, because they dominate the squared error.
   • Boosting will spend considerable modeling effort fitting these observations.
   What to Do
   • Remove observations.
   • Apply a transformation to reduce impact: e.g. log or bins.
   • Choose a loss function that is more robust: e.g. MAE vs MSE.
   • Impose a constraint on data range (cap values).
   • Ask questions: Understand whether the values are valid or invalid, to make the most appropriate choice.
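The MAE vs MSE point can be seen with the simplest possible model, a single constant prediction: the constant that minimizes MSE is the mean, while the constant that minimizes MAE is the median. The sketch below uses made-up values and a hypothetical `cap` helper for the range constraint.

```python
from statistics import mean, median

def cap(values, lower, upper):
    """Constrain every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

values = [10.0, 11.0, 9.0, 10.0, 1000.0]   # one extreme observation

mse_optimal = mean(values)      # best constant under squared error
mae_optimal = median(values)    # best constant under absolute error
capped = cap(values, 0.0, 50.0)

print("MSE-optimal constant:", mse_optimal)
print("MAE-optimal constant:", mae_optimal)
print("mean after capping:  ", mean(capped))
```

One observation drags the squared-error fit far from the bulk of the data, while the absolute-error fit barely moves.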
 • 25. Data Leakage: 8 of 10, Top 10 Data Science Practitioner Pitfalls
 • 26. 8. Data Leakage
   What Is It
   • Leakage is allowing your model to use information that will not be available in a production setting.
   • Obvious example: using the Dow Jones daily gain/loss as part of a model to predict individual stock performance (even if that symbol is not part of the Dow).
   What Happens
   • The model is overfit.
   • Production predictions will be inconsistent with the scores you saw when fitting the model (even with a validation set).
   • Insights derived from the model will be incorrect.
   What to Do
   • Understand the nature of your problem and data.
   • Scrutinize model feedback, such as relative influence or linear coefficients.
 • 27. Useless Models: 9 of 10, Top 10 Data Science Practitioner Pitfalls
 • 28. 9. Useless Models
   What is a “Useless” Model?
   • Solving the wrong problem.
   • Not collecting appropriate data.
   • Not structuring data correctly to solve the problem.
   • Choosing a target/loss measure that does not optimize the end use case, e.g. using accuracy to prioritize resources.
   • Having a model that is not actionable.
   • Using a complicated model that is less accurate than a simple model.
   What To Do
   • Understand the problem statement.
   • Solving the wrong problem is an issue in all problem-solving domains, but it is arguably easier to do with the black-box techniques common in ML.
   • Utilize post-processing measures.
   • Create simple baseline models to understand the lift of more complex models.
   • Plan on an iterative approach: start quickly, even if on imperfect data.
   • Question your models and attempt to understand them.
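Comparing against a simple baseline can be as small as the sketch below. The numbers are invented: `model_pred` stands in for the output of some more complex model, and the baseline always predicts the training-set mean.

```python
from statistics import mean

def rmse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred)) ** 0.5

y_train = [2.0, 4.0, 6.0, 8.0, 10.0]
y_test = [3.0, 5.0, 7.0, 9.0]

# Baseline: always predict the training-set mean.
baseline_pred = [mean(y_train)] * len(y_test)

# Hypothetical predictions from a more complex model.
model_pred = [3.5, 4.5, 7.5, 8.5]

baseline_rmse = rmse(y_test, baseline_pred)
model_rmse = rmse(y_test, model_pred)
improvement = 1 - model_rmse / baseline_rmse

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"model RMSE:    {model_rmse:.3f}")
print(f"relative improvement over baseline: {improvement:.1%}")
```

If the relative improvement is small or negative, the added complexity is not paying for itself.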
 • 29. No Free Lunch: 10 of 10, Top 10 Data Science Practitioner Pitfalls
 • 30. 10. No Free Lunch
   No Such Thing as a Free Lunch
   • No general-purpose algorithm to solve all problems.
   • No right answer on optimal data preparation.
   • General heuristics are not always true:
     • Tree models solve problems equivalently with any order-preserving transformation.
     • Decision trees and neural networks will automatically find interactions.
     • A high number of predictors may be handled, but lead to a less optimal result than fewer key predictors.
   • Models cannot find relative information that spans multiple observations.
   • Model feedback can be misleading: relative influence, linear coefficients.
   What To Do
   • Understand how the underlying algorithms operate.
   • Try several algorithms and observe relative performance and the characteristics of your data.
   • Feature engineering & feature selection.
   • Interpret and react to model feedback.
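The first heuristic is easiest to see for a single split. The sketch below finds the best one-variable split by squared error and shows that it partitions the rows identically before and after a log transform. It illustrates only this idealized case, with toy data and hypothetical helper names; it says nothing about where the heuristic breaks down in real tree implementations.

```python
import math

def sum_sq(y, rows):
    """Sum of squared deviations from the mean over the given rows."""
    m = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - m) ** 2 for i in rows)

def best_split_partition(x, y):
    """Return the set of row indices sent left by the single split
    that minimizes total squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        if x[left[-1]] == x[right[0]]:
            continue  # cannot split between equal predictor values
        sse = sum_sq(y, left) + sum_sq(y, right)
        if best is None or sse < best[0]:
            best = (sse, frozenset(left))
    return best[1]

x = [1.0, 3.0, 10.0, 200.0, 5000.0, 90000.0]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]

raw = best_split_partition(x, y)
logged = best_split_partition([math.log(v) for v in x], y)

print(sorted(raw), sorted(logged), raw == logged)
```

The split depends only on the ordering of the predictor values, which a strictly increasing transformation such as the log leaves unchanged.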
 • 31. Where to learn practical tips?
   • WinVector Blog (Nina Zumel & John Mount): http://win-vector.com/blog
   • Practical Data Science With R (book by Nina Zumel & John Mount): https://www.manning.com/books/practical-data-science-with-r
   • Elements of Statistical Learning (book by Trevor Hastie, Robert Tibshirani & Jerome Friedman): http://statweb.stanford.edu/~tibs/ElemStatLearn
   • Machine Learning Mastery Blog (Jason Brownlee): http://machinelearningmastery.com
 • 32. Where to learn more about H2O?
   • H2O Online Training (free): http://learn.h2o.ai
   • H2O Slidedecks: http://www.slideshare.net/0xdata
   • H2O Video Presentations: https://www.youtube.com/user/0xdata
   • H2O Community Events & Meetups: http://h2o.ai/events
   • Machine Learning & Data Science courses: http://coursebuffet.com
 • 33. H2O World: Customers, Community, Evangelists
   • November 9, 10, 11 at the Computer History Museum
   • H2OWORLD.H2O.AI
   • 20% off registration using code: h2ocommunity
 • 34. Contact
   • Erin LeDell: @ledell on Twitter, GitHub; erin@h2o.ai; http://www.stat.berkeley.edu/~ledell
   • Mark Landry: @Mark_a_Landry on Twitter; mark@h2o.ai