SIMPLE RULES FOR BUILDING ROBUST MACHINE LEARNING MODELS
WITH EXAMPLES IN R
Kyriakos Chatzidimitriou
Research Fellow, Aristotle University of Thessaloniki
PhD in Electrical and Computer Engineering
kyrcha@issel.ee.auth.gr
AMA Call
Early Career and Engagement IG
ABOUT ME
• Born in 1978
• 1997-2003, Diploma in Electrical and Computer Engineering, AUTH, GREECE
• 2003-2004, Worked as a developer
• 2004-2006, MSc in Computer Science, CSU, USA
• 2006-2007, Greek Army
• 2007-2012, PhD, AUTH, GREECE
• Reinforcement learning and evolutionary computing mechanisms for autonomous agents
• 2013-Now, Research Fellow, ECE, AUTH
• 2017-Now, co-founder, manager and full stack developer of Cyclopt P.C.
• Spin-off company of AUTH focusing on software analytics
GENERAL CAREER ADVICE
Life is hard and full of problems, so there is no point in meaningless suffering :)
To be happy, for the problems you can choose, choose those you like solving.
By working on your 10K hours or more, you will become too good to be ignored,
and you will get there by focusing on deep work and on difficult problems.
That is a positive feedback loop, where good things happen.
WHAT IS ML (SUPERVISED)
WHAT AM I WORKING ON ML WISE
Deep website aesthetics
AutoML
Continuous Implicit Authentication
Formatting or linting errors
SIMPLE RULE 1: SPLIT YOUR DATA IN THREE
GREEK SONG
• «Δεν υπάρχει ευτυχία, που να κόβεται στα τρία…» – “There is no happiness split in
three…”
• Not true for ML
THE THREE SETS
• Training set — data on which the learning algorithm runs
• Validation set — used for making decisions: tuning parameters, selecting features, model complexity
• Test set — used only for evaluating performance
  • Anything else is data snooping
DO NOT TAKE ANY DECISIONS ON THE TEST SET
- Do not use it for selecting ANYTHING!
- Rely on the validation score only
VISUAL EXAMPLES OF SPLITS
(Diagram: three ways of splitting the whole dataset)
- 60-20-20 split: Training | Validation | Test
- 10-CV: ten folds (1F–10F) rotating as the validation fold, plus a held-out Test set
- LOOCV: each example serves as the validation set once, plus a held-out Test set
R EXAMPLE
library(e1071)      # naiveBayes
library(MLmetrics)  # Accuracy

k <- 5
results <- numeric(k)
# Hold out ~20% of the data as the test set
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainval <- iris[ind == 1, ]
test <- iris[ind != 1, ]
# Assign each remaining row to one of k cross-validation folds
cv <- sample(rep(1:k, length.out = nrow(trainval)))
for (i in 1:k) {
  trainData <- trainval[cv != i, ]
  valData <- trainval[cv == i, ]
  model <- naiveBayes(Species ~ ., data = trainData, laplace = 0)
  pred <- predict(model, valData)
  results[i] <- Accuracy(pred, valData$Species)
}
print(mean(results))
# After finding the best laplace value on the validation folds,
# retrain on all training+validation data and evaluate once on the test set
finalmodel <- naiveBayes(Species ~ ., data = trainval, laplace = 0)
pred <- predict(finalmodel, test)
print(Accuracy(pred, test$Species))
Validation accuracy 0.95 vs. test accuracy 0.91
SIMPLE RULE 2: SPLIT YOUR DATA IN THREE,
CORRECTLY
1948 US ELECTIONS
1948, Truman vs. Dewey
A newspaper ran a phone poll the day before the election.
Most Dewey supporters had phones in those days, so the sample did not reflect the voting population.
RULE
• Choose validation and test sets to reflect the data you expect to see in the future
• Ideally performance in validation and test sets should be the same
• Example: let's say validation set performance is great and test set performance is so-so
• If they come from the same distribution:
  • You have overfitted the validation set
• If they come from different distributions:
  • You have overfitted the validation set
  • The test set is harder
  • The test set is different
EXAMPLE OF STRATIFIED CV IN R
library(caret)  # createFolds
library(plyr)   # ddply

iris$Species <- as.numeric(iris$Species)
# Stratified folds: caret's createFolds stratifies on the outcome
folds <- createFolds(iris$Species, list = FALSE)
iris$folds <- folds
# Class proportions per fold should be roughly equal
ddply(iris, 'folds', summarise, prop = mean(Species))
# Non-stratified folds for comparison
non_strat_folds <- sample(rep(1:10, length.out = nrow(iris)))
iris$non_strat_folds <- non_strat_folds
ddply(iris, 'non_strat_folds', summarise, prop = mean(Species))
Things will be (much) worse if the distribution is more skewed
SIMPLE RULE 3: DATASET SIZE IS IMPORTANT
SIZE HEURISTICS
• #1 Good validation set sizes are between 1,000 and 10,000 examples
• #2 For the training set, have at least 10x the VC-dimension
  • For a NN this is roughly equal to the number of weights
• #3 A popular heuristic is a 30% test set, less for large problems
• #4 If you have more data, put it in the validation set to reduce overfitting
• #5 The validation set should be large enough to detect differences between algorithms
  • To distinguish classifier A at 90% accuracy from classifier B at 90.1% accuracy, 100 validation examples will not do it (see the sketch below)
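One quick way to gauge the validation size needed for heuristic #5 is a two-proportion power calculation with base R's power.prop.test. This is only a rough sketch: it treats the two accuracies as independent proportions, which is an approximation.

# How many validation examples to tell 90% from 90.1% accuracy
# at 5% significance and 80% power?
power.prop.test(p1 = 0.900, p2 = 0.901, sig.level = 0.05, power = 0.8)
# n comes out on the order of a million examples per classifier,
# far beyond the 100 examples mentioned above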
FINANCIAL IMPACT OF 0.1%

                      Before        After
Searches              10,000,000    10,000,000
CTR                   1%            1.1%
Visitors              100,000       110,000
Conversion            1%            1.1%
Purchases             1,000         1,210
Price per purchase    $100          $100
Revenue               $100,000      $121,000 (+$21,000)

Use cases: Bid Prediction, Adaptive Content
LEARNING CURVES AS A TOOL
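A minimal learning-curve sketch in R, assuming the e1071 and MLmetrics packages used earlier: train on increasing subsets of the data and plot validation accuracy against training size. A curve that is still rising suggests more data would help.

library(e1071)      # naiveBayes
library(MLmetrics)  # Accuracy

set.seed(1)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train <- iris[ind == 1, ]
val <- iris[ind == 2, ]
# Start at 30 examples so every class is almost surely present
sizes <- seq(30, nrow(train), by = 10)
acc <- sapply(sizes, function(n) {
  sub <- train[sample(nrow(train), n), ]
  model <- naiveBayes(Species ~ ., data = sub)
  Accuracy(predict(model, val), val$Species)
})
plot(sizes, acc, type = "b", xlab = "Training examples",
     ylab = "Validation accuracy", main = "Learning curve")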
SIMPLE RULE 4: CHOOSE ONE METRIC
DIFFERENT METRIC FOR DIFFERENT NEEDS
• A single metric allows faster iterations and focus
• Are your classes balanced? Use accuracy
• Are your classes imbalanced? Use the F1-score
• Are you doing multilabel classification? Use for example macro-averaged accuracy
• B_macro = (1/q) · Σ_{λ=1}^{q} B(TP_λ, FP_λ, TN_λ, FN_λ), where q is the number of labels (see the R sketch below)
• B is a binary evaluation metric, e.g. Accuracy = (TP + TN) / (TP + FP + TN + FN)
• The application dictates the metric
• Continuous Implicit Authentication: Equal Error Rate
• Combines two metrics: False Acceptance Rate and False Rejection Rate
• Interested both in preventing impostors but also allowing legitimate users
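A minimal sketch of the macro-averaged formula above in R, assuming pred and truth are factors over the same labels:

macro_accuracy <- function(pred, truth) {
  # One-vs-rest binary accuracy per label, then the unweighted mean
  per_label <- sapply(levels(truth), function(cl) {
    mean((pred == cl) == (truth == cl))  # (TP + TN) / all examples for label cl
  })
  mean(per_label)
}
# e.g. macro_accuracy(pred, valData$Species)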
SIMPLE RULE 5: ALWAYS DO YOUR EDA
THE QUESTION TO ASK IN EXPLORATORY DATA
ANALYSIS
• Definition: Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
• Do you see what you expect to see?
R COMMANDS
data <- read.csv(file = "winequality-red.csv", header = TRUE, sep = ";")
head(data)          # did I read it OK?
str(data)           # am I satisfied with the datatypes?
dim(data)           # dataset size
summary(data)       # summary statistics, missing values?
table(data$quality) # distribution of the class variable
CORRPLOT
install.packages("corrplot")
library(corrplot)
W <- cor(data)  # assumption: W is the correlation matrix (not shown on the slide)
corrplot(W, method = "circle", tl.col = "black", tl.srt = 45)
- Check if the correlations make sense.
- Decide on dropping variables uncorrelated with the class.
BOX-PLOTS
library(reshape2)   # melt
library(ggplot2)
library(gridExtra)  # grid.arrange

g <- list()
j <- 1
long <- melt(data)  # wide to long: one (variable, value) pair per row
for (i in names(data)) {
  subdata <- long[long$variable == i, ]
  g[[j]] <- ggplot(data = subdata, aes(x = variable, y = value)) +
    geom_boxplot()
  j <- j + 1
}
grid.arrange(grobs = g, nrow = 2)
- Check for outliers
DENSITY PLOTS
library(ggplot2)
library(gridExtra)  # grid.arrange

g <- list()
j <- 1
for (i in names(data)) {
  print(i)
  p <- ggplot(data = data, aes_string(x = i)) +
    geom_density()
  g[[j]] <- p
  j <- j + 1
}
grid.arrange(grobs = g, nrow = 2)
- Check whether variables are normally distributed or right/positively skewed
SIMPLE RULE 6: BE CAREFUL WITH DATA
PREPROCESSING AS WELL
IMPUTING
• Imputation: the process of replacing missing data with substitute values
• Calculate statistics on the training data, e.g. the mean
• Use this mean to replace missing values in both the validation and the test sets (see the sketch below)
• Do the same for normalization or standardization
• Normalization is sensitive to outliers
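A minimal sketch of train-only imputation; the numeric column x with missing values is hypothetical:

# Statistic computed from the training data only
train_mean <- mean(trainData$x, na.rm = TRUE)
trainData$x[is.na(trainData$x)] <- train_mean
valData$x[is.na(valData$x)]     <- train_mean  # reuse the training mean
testData$x[is.na(testData$x)]   <- train_mean  # never recompute on val/test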
PREPROCESSING EXAMPLES IN R
ind <- sample(3, nrow(data), replace = TRUE, prob = c(0.6, 0.2, 0.2))
trainData <- data[ind == 1, ]
valData <- data[ind == 2, ]
testData <- data[ind == 3, ]
# Min-max statistics computed on the training data only
trainMaxs <- apply(trainData[, 1:11], 2, max)
trainMins <- apply(trainData[, 1:11], 2, min)
# Min-max normalization of the training data
normTrainData <- sweep(sweep(trainData[, 1:11], 2, trainMins, "-"),
                       2, (trainMaxs - trainMins), "/")
summary(normTrainData)
PREPROCESSING EXAMPLES IN R
# Normalize the validation data with the TRAINING min/max
normValData <- sweep(sweep(valData[, 1:11], 2, trainMins, "-"),
                     2, (trainMaxs - trainMins), "/")
Not much of an issue if the data is big and correct sampling is kept.
SIMPLE RULE 7: DON’T BE UNLUCKY
BE KNOWLEDGEABLE
• Aka: how randomness affects results…
• If you don't want to be unlucky, repeat the 10-fold cross-validation 10 times and average the averages to get precise estimates (see the sketch below)
EXAMPLE IN R
library(e1071)      # naiveBayes
library(MLmetrics)  # Accuracy

results <- numeric(100)
for (i in 1:100) {
  # A fresh random 90-10 split each iteration
  ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.9, 0.1))
  trainData <- iris[ind == 1, ]
  valData <- iris[ind == 2, ]
  model <- naiveBayes(Species ~ ., data = trainData)
  pred <- predict(model, valData)
  results[i] <- Accuracy(pred, valData$Species)
}
• Even in this simple dataset and scenario, 55/100 splits gave a perfect score in one run.
• With simple 10-fold cross-validation I could have gotten 100% validation accuracy.
• In one run I got 70%: a 30% difference based on luck.
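A sketch of the 10-times-repeated 10-fold cross-validation suggested above, with the same packages as before:

reps <- 10
k <- 10
rep_means <- numeric(reps)
for (r in 1:reps) {
  cv <- sample(rep(1:k, length.out = nrow(iris)))  # new fold assignment per repetition
  accs <- numeric(k)
  for (i in 1:k) {
    model <- naiveBayes(Species ~ ., data = iris[cv != i, ])
    pred <- predict(model, iris[cv == i, ])
    accs[i] <- Accuracy(pred, iris$Species[cv == i])
  }
  rep_means[r] <- mean(accs)
}
mean(rep_means)  # average of the averages: a more precise estimate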
SIMPLE RULE 8: TESTING
TEST WHICH MODEL IS SUPERIOR
• Depends on what you are doing:
  • If you work on a single dataset and you are in industry, you probably go with the model that has the best metric on the validation data, backed by the test-data metric
  • If you are doing research, you can add statistical testing
  • If you are building ML algorithms and comparing different algorithms on a whole lot of datasets, check J. Demsar's 2006 JMLR paper (more than 7K citations)
CHOOSING BETWEEN TWO
• X and Y models, 10-fold CV
• For a given confidence level, we will check whether the actual difference exceeds
the confidence limit
• Decide on a confidence level: 5% or 1%
• Use Wilcoxon test
• Other tests require more assumptions, which are valid only with large samples
R EXAMPLE
library(e1071)      # naiveBayes
library(rpart)      # rpart
library(MLmetrics)  # Accuracy

resultsMA <- numeric(10)
resultsMB <- numeric(10)
cv <- sample(rep(1:10, nrow(iris) / 10))
for (i in 1:10) {
  trainData <- iris[cv != i, ]  # train on nine folds
  valData <- iris[cv == i, ]    # validate on the held-out fold
  model <- naiveBayes(Species ~ ., data = trainData)
  pred <- predict(model, valData)
  resultsMA[i] <- Accuracy(pred, valData$Species)
  ctree <- rpart(Species ~ ., data = trainData, method = "class",
                 minsplit = 1, minbucket = 1, cp = -1)
  pred <- predict(ctree, valData, type = "class")
  resultsMB[i] <- Accuracy(pred, valData$Species)
}
# The R function is wilcox.test (not wilcoxon.test); paired, since both
# models are evaluated on the same folds
wilcox.test(resultsMA, resultsMB, paired = TRUE)
If the p-value is less than the significance level, the difference is statistically significant.
SIMPLE RULE 9: TIME IS MONEY
TIME IS MONEY
• Before doing the whole experimentation, play with a small(er) dataset
• What should this data be? Representative!!!
• Check the whole pipeline, end-to-end
• Especially when paying for GPU instances
SIMPLE RULE 10: KNOW THY DATA
DO I HAVE ENOUGH DATA?
• Learning curves….
• Else augment
• How can one augment?
IMAGE AUGMENTATION
https://github.com/aleju/imgaug
SMOTE TO OVERSAMPLE MINORITY CLASS
https://www.researchgate.net/figure/Graphical-representation-of-the-SMOTE-algorithm-a-SMOTE-starts-from-a-set-of-positive_fig2_317489171
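A quick SMOTE sketch in R, assuming the smotefamily package; the imbalanced binary dataset below is made up for illustration:

library(smotefamily)

# Hypothetical imbalanced problem: 2 numeric features, 5% positives
set.seed(42)
X <- data.frame(x1 = rnorm(400), x2 = rnorm(400))
y <- c(rep("neg", 380), rep("pos", 20))
balanced <- SMOTE(X, y, K = 5)  # synthesize new minority examples
table(balanced$data$class)     # the minority class is now oversampled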
MORE ANNOTATIONS
SIMPLE RULE 11: DECIDE ON YOUR GOAL
IS IT INTERPRETABILITY OR PERFORMANCE?
• Decide what you are striving for.
• (Multi-)collinearity
  • X1 = a * X2 + b
  • Many different combinations of feature values could predict Y equally well
• Variance Inflation Factor (VIF) test
  • 1: no collinearity
  • >10: indication of collinearity
• Discussed in: http://kyrcha.info/2019/03/22/on-collinearity-and-feature-selection
R EXAMPLE
Miles per gallon prediction on the autompg dataset.
Workflow: compute the VIF, remove displacement, compute the VIF again, compare models, then remove all variables with VIF > 10 (sketched below).
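A sketch of that workflow with the car package's vif function; the column names below assume the usual auto-mpg schema and may differ in your copy of the dataset:

library(car)  # vif

# autompg assumed loaded as a data frame with these columns
fit <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration,
          data = autompg)
vif(fit)    # displacement typically shows a VIF well above 10
fit2 <- update(fit, . ~ . - displacement)
vif(fit2)   # re-check after dropping the worst offender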
RIDGE REGRESSION
Regularization gives preference to one solution over the others.
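A minimal ridge sketch with glmnet, where alpha = 0 selects the ridge penalty; the autompg columns are assumed as above:

library(glmnet)

x <- as.matrix(autompg[, c("cylinders", "displacement", "horsepower",
                           "weight", "acceleration")])
y <- autompg$mpg
cvfit <- cv.glmnet(x, y, alpha = 0)  # alpha = 0 -> ridge; lambda picked by CV
coef(cvfit, s = "lambda.min")        # shrunken, more stable coefficients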
SIMPLE RULE 12: START BY CHOOSING THE
CORRECT MODEL FOR YOUR PROBLEM
RANDOM FORESTS
• Nice algorithm, works well on a lot of datasets (Fernandez-Delgado et al., JMLR, 2014)
• Few important parameters to tune
• Handles multiclass problems (unlike for example SVMs)
• Can handle a mixture of features and scales
SVM
• Nice algorithm, works well on a lot of datasets (Fernandez-Delgado et al., JMLR, 2014)
• Robust theory behind it
• Good for binary classification and 1-class classification
  • We use it in Continuous Implicit Authentication
• Can handle sparse data
GRADIENT BOOSTING MACHINES
• Focuses on difficult samples that are hard to learn
  • If you have outliers, it will boost them into the most important points
  • So make sure your outliers are important, not errors
• More of a black box, even though it is tree-based
• Needs more tuning
• Easy to overfit
• Mostly better results than RF
DEEP LEARNING
• Choose if you have lots of data and computational resources
• Don’t have to throw away anything. Solves the problem end-to-end.
SIDENOTE: BIAS - VARIANCE
BIAS VS. VARIANCE
Bias: algorithm’s error rate on the training set. Erroneous assumptions in the learning algorithm.
Variance: difference in error rate between training set and validation set. It is caused by
overfitting to the training data and accounting for small fluctuations.
Learning from Data slides:
http://work.caltech.edu/telecourse.html
SIMPLE RULE 13: BECOME A
KNOWLEDGEABLE TRADER
BIAS VARIANCE TRADE-OFF HEURISTICS
• #1 High bias => Increase model size (usually with regularization to mitigate high
variance)
• #2 High variance => add training data (usually with a big model to handle them)
TRADE FOR BIAS
• Will reduce (avoidable) bias:
  • Increase model size (more neurons/layers/trees/depth etc.)
  • Add more helpful features
  • Reduce/remove regularization (L2/L1/dropout)
• Indifferent:
  • Add more training data
TRADE FOR VARIANCE
• Will reduce variance:
  • Add more training data
  • Add regularization
  • Early stopping (NN)
  • Remove features
  • Decrease model size (prefer regularization instead)
    • Usually: a big model to fit the training data, then add regularization
  • Add more helpful features
SIMPLE RULE 14: FINISH OFF WITH AN
ENSEMBLE
ENSEMBLE TECHNIQUES
- By now you've built a ton of models
- Bagging: RF
- Boosting: AdaBoost, GBT
- Voting/Averaging (see the sketch below)
- Stacking
(Diagram: training data feeds several classifiers; their predictions are combined into a final prediction)
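A small majority-vote sketch in R, combining the models used earlier (e1071, rpart); the voting helper is an illustration, not from the slides:

library(e1071)  # naiveBayes, svm
library(rpart)  # rpart

ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train <- iris[ind == 1, ]
test <- iris[ind == 2, ]

m1 <- naiveBayes(Species ~ ., data = train)
m2 <- rpart(Species ~ ., data = train, method = "class")
m3 <- svm(Species ~ ., data = train)

votes <- data.frame(p1 = predict(m1, test),
                    p2 = predict(m2, test, type = "class"),
                    p3 = predict(m3, test))
# Majority vote per row (ties broken by the first most frequent level)
final <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(final == test$Species)  # ensemble accuracy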
SIMPLE RULE 15: TUNE
HYPERPARAMETERS…BUT TO A POINT
TUNE THE MOST INFLUENTIAL PARAMETERS
• There is performance to be gained by parameter tuning (Bagnall and Cawley, 2017)
• Tons of parameters, we can’t tune them all
• Understand how they influence training + read relevant papers/walkthroughs
• Random forests (Fernandez-Delgado et al., JMLR, 2014)
• mtry: Number of variables randomly sampled as candidates at each split.
• SVM (Fernandez-Delgado et al., JMLR, 2014)
• tuning the regularization and kernel spread
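A sketch of tuning mtry with caret (randomForest is used under the hood; the grid values are illustrative):

library(caret)

ctrl <- trainControl(method = "cv", number = 10)
tuned <- train(Species ~ ., data = iris, method = "rf",
               tuneGrid = expand.grid(mtry = 1:4),  # illustrative grid
               trControl = ctrl)
tuned$bestTune  # the mtry value with the best CV accuracy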
SIMPLE RULE 16: START A WATERFALL LIKE
PROCESS
THE PROCESS
Study the problem → EDA → Define optimization strategy (validation, test sets and metric) → Feature Engineering → Modelling → Ensembling, with Error Analysis feeding back into modelling
GENERAL RULE
THE BASIC RECIPE (BY ANDREW NG)
http://t.co/1Rn6q35Qf2
THANK YOU
For further AMA questions open an issue at:
https://github.com/kyrcha/ama
FURTHER READING
• Personal Experiences
• Various resources over the internet and the years
• ML Yearning: https://www.mlyearning.org/
• Learning from Data course: http://work.caltech.edu/telecourse.html
• Practical Machine Learning with H2O book