How to Win Machine Learning Competitions
By Marios Michailidis
It's not the destination… it's the journey!
What is Kaggle?
• The world's biggest predictive-modelling competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from an image.
– Predict store/product/area sales.
– Predict marketing response.
Inspired by horse races…!
• At the University of Southampton, an entrepreneur talked to us about how he was able to predict horse races with regression!
Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• Became more passionate!
• Picked up programming skills
Built KazAnova
• Developed a couple of algorithms and data techniques and decided to make them public so that others could benefit.
• Released it at http://www.kazanovaforanalytics.com/
• Named it after ANOVA (statistics) and KAZANI, my mom's last name.
Joined Kaggle!
• Was curious about Kaggle.
• Joined a few contests and learned a lot.
• The community was very open to sharing and collaboration.
Other interesting tasks!
• Predict the correct answer in a science test using Wikipedia.
• Predict which virus has infected a file.
• Predict the NCAA tournament!
• Predict the Higgs boson.
• Out of many different numerical series, predict which one is the cause.
3 years of modelling competitions
• Over 90 competitions
• Participated with 45 different teams
• 22 top-10 finishes
• 12-time prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
What's next
• PhD (UCL) on recommender systems.
• More kaggling (but less intense)!
So… what wins competitions?
In short:
• Understanding the problem
• Discipline
• Trying problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Ensembling
More specifically…
● Understand the problem and the function to optimize
● Choose what seems to be a good algorithm for the problem and iterate:
• Clean the data
• Scale the data
• Make feature transformations
• Make feature derivations
• Use cross-validation to
– Make feature selections
– Tune the hyper-parameters of the algorithm
– Do this for many other algorithms, always exploiting their benefits
– Find the best way to combine or ensemble the different algorithms
● Know the different types of models and when to use them
Understand the metric to optimize
● For example, in one competition the metric was AUC (Area Under the ROC Curve).
● This is a ranking metric.
● It shows how consistently your good cases score higher than your bad cases (see the sketch after this list).
● Not all algorithms optimize the metric you want.
● Common metrics:
– AUC
– Classification accuracy
– Precision
– NDCG
– RMSE
– MAE
– Deviance
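To make the ranking interpretation concrete, here is a minimal sketch (not from the original deck; the labels and scores are made up) using scikit-learn:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = good case) and model scores, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

# AUC is the probability that a randomly chosen good case
# scores higher than a randomly chosen bad one.
print(roc_auc_score(y_true, y_score))  # 8 of 9 good/bad pairs ranked correctly ~ 0.889
```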
Choose the algorithm
● In most cases many algorithms are experimented with before finding the right one(s).
● Those who try more models/parameters have a higher chance of winning contests than others.
● Any algorithm that makes sense to use should be used.
In the algorithm: Clean the data
● The data-cleaning step is not independent of the chosen algorithm. For different algorithms, different cleaning filters should be applied.
● Treating missing values is really important; certain algorithms are more forgiving than others.
● On other occasions it may make more sense to carefully replace a missing value with a sensible one (like the feature's average), treat it as a separate category, or even remove the whole observation (see the sketch below).
● Similarly, we search for outliers; again, some models are more forgiving, while in others the impact of outliers is detrimental.
● How to decide the best method?
– Try them all, or
– Experience and literature, but mostly the first (bold)
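A minimal sketch of the replacement options above, using pandas (the toy frame and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical columns).
df = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                   "city": ["NY", None, "LA"]})

# Replace a numeric missing value with the feature average...
df["age"] = df["age"].fillna(df["age"].mean())

# ...treat a missing categorical value as a separate category...
df["city"] = df["city"].fillna("MISSING")

# ...or remove the whole observation instead:
# df = df.dropna()
```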
In the algorithm: Scale the data
● Certain algorithms cannot deal with unscaled data.
● Scaling techniques (sketched below):
– Max scaler: divide each feature by its highest absolute value
– Normalization: subtract the mean and divide by the standard deviation
– Conditional scaling: scale only under certain conditions (e.g. in medicine we tend to scale per subject to make their features comparable)
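A quick sketch of the first two techniques in plain NumPy (the feature matrix is made up; conditional scaling depends on the domain, so it is omitted):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])  # made-up feature matrix

# Max scaler: divide each feature by its highest absolute value.
X_max = X / np.abs(X).max(axis=0)

# Normalization: subtract the mean and divide by the standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```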
In the algorithm: Feature transformations
● For certain algorithms there is a benefit in changing the features, because it helps them converge faster and better.
● Common transformations include (a couple are sketched below):
1. LOG, SQRT(variable): smoothens variables
2. Dummies for categorical variables
3. Sparse matrices, to be able to compress the data
4. 1st derivatives, to smoothen data
5. Weights of evidence (transforming variables while using information from the target variable)
6. Unsupervised methods that reduce dimensionality (SVD, PCA, ISOMAP, KD-tree, clustering)
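A short sketch of a few of these transformations on toy data (the column names and values are made up), mixing pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({"income": [10.0, 100.0, 1000.0, 50.0],
                   "colour": ["red", "blue", "red", "blue"]})

# 1. LOG transform to smoothen a skewed variable.
df["log_income"] = np.log1p(df["income"])

# 2. Dummies for a categorical variable.
df = pd.get_dummies(df, columns=["colour"])

# 6. Unsupervised dimensionality reduction (SVD) on the resulting features.
svd = TruncatedSVD(n_components=1, random_state=0)
components = svd.fit_transform(df.astype(float).values)
```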
In the algorithm: Feature derivations
● In many problems this is the most important thing. For example:
– Text classification: generate the corpus of words and compute TF-IDF (see the sketch below)
– Sounds: convert sounds to frequencies through Fourier transformations
– Images: apply convolutions, e.g. break an image down to pixels and extract different parts of the image
– Interactions: really important for some models (for our algorithms too!), e.g. variables that show whether an item is popular AND the customer likes it
– Anything else that makes sense: similarity features, dimensionality-reduction features, or even predictions from other models as features
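For the text-classification example, a minimal TF-IDF sketch with scikit-learn (the two documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat"]  # hypothetical corpus

# Build the corpus vocabulary and derive TF-IDF features from the raw text.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())
```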
In the algorithm: Cross-validation
● This basically means that from my main set I RANDOMLY create 2 sets. I build (train) my algorithm on the first one (let's call it the training set) and score the other (let's call it the validation set). I repeat this process multiple times and always check how my model performs on the validation set in respect to the metric I want to optimize.
● The process may look like this (see the sketch below):
1. For 10 times (you choose how many, X):
1. Split the set into training (50%-90% of the original data)
2. and validation (50%-10% of the original data)
3. Then fit the algorithm on the training set
4. Score the validation set
5. Save the result of that scoring in respect to the chosen metric
2. Calculate the average of these 10 (X) scores. That is how much you can expect to score in real life, and it is generally a good estimate.
● Remember to use a SEED to be able to replicate these X splits.
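A minimal sketch of this loop (synthetic data; 10 random 70/30 splits with a fixed seed so they can be replicated):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=500, random_state=1)  # toy data

# X (here 10) random splits; random_state is the SEED that makes them replicable.
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

scores = []
for train_idx, valid_idx in splitter.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # fit on the training set
    preds = model.predict_proba(X[valid_idx])[:, 1]    # score the validation set
    scores.append(roc_auc_score(y[valid_idx], preds))  # save the metric result

print(np.mean(scores))  # the average is the real-life estimate
```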
In cross-validation: Do feature selection
● It is unlikely that all features are useful, and some of them may be damaging your model, because…
– They do not provide new information (collinearity)
– They are just not predictive (just noise)
● There are many ways to do feature selection:
– Run the algorithm and use an internal measure to retrieve the most important features (no cross-validation)
– Forward selection, with or without cross-validation
– Backward selection, with or without cross-validation
– Noise injection
– A hybrid of all methods
● Normally a forward method is chosen, with CV. That is, we add a feature, then split the data into the exact same X folds as before and check whether our metric improved (see the sketch below):
– If yes, the feature remains
– Else, Wiedersehen!
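A hedged sketch of greedy forward selection with fixed CV folds (synthetic data; the acceptance rule is the simple "keep if the metric improved" from the slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

# The same X folds every time, thanks to the fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def cv_auc(cols):
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=cv, scoring="roc_auc").mean()

selected, best = [], 0.0
for f in range(X.shape[1]):         # try adding one candidate feature at a time
    score = cv_auc(selected + [f])
    if score > best:                # metric improved: the feature remains
        selected, best = selected + [f], score
    # else: Wiedersehen!

print(selected, best)
```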
In cross-validation: Hyper-parameter optimization
● This generally takes a lot of time, depending on the algorithm. For example, in a random forest the best model would need to be determined based on parameters such as the number of trees, max depth, minimum cases in a split, features to consider at each split, etc.
● One way to find the best parameters is to manually change one (e.g. max_depth=10 instead of 9) while keeping everything else constant. I found that this method helps you understand more about what would work in a specific dataset.
● Another way is to try many possible combinations of hyper-parameters. We normally do that with grid search, where we provide an array of all possible values to be trialled with cross-validation (e.g. try max_depth in {8, 9, 10, 11} and number of trees in {90, 120, 200, 500}; see the sketch below).
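A minimal grid-search sketch mirroring the example values above (synthetic data; scikit-learn's GridSearchCV handles the cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=1)  # toy data

# All combinations of these values are trialled with cross-validation.
param_grid = {"max_depth": [8, 9, 10, 11],
              "n_estimators": [90, 120, 200, 500]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```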
Train many algorithms
● Try to exploit their strengths.
● For example, focus on using linear features with linear regression (e.g. the higher the age, the higher the income) and non-linear ones with random forests.
● Make it so that each model tries to capture something new, or even focuses on a different part of the data.
Ensemble
● A key part (in winning competitions at least) is to combine the various models you have made.
● Remember, even a crappy model can be useful to some small extent.
● Possible ways to ensemble (a few are sketched below):
– Simple average: (model1 prediction + model2 prediction)/2
– Average ranks for AUC (simple average after converting predictions to ranks)
– Manually tune weights with cross-validation
– Use a weighted geometric mean
– Use meta-modelling (also called stacked generalization, or stacking)
● Check GitHub for a complete example of these methods using the Amazon competition hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
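A small sketch of the simpler combination rules (the two prediction vectors and the 0.7/0.3 weights are made up; stacking needs a full pipeline, see the linked repo):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical predictions from two models on the same validation rows.
p1 = np.array([0.20, 0.70, 0.10, 0.90])
p2 = np.array([0.30, 0.60, 0.20, 0.80])

simple_avg = (p1 + p2) / 2                    # simple average
rank_avg = (rankdata(p1) + rankdata(p2)) / 2  # average of ranks (fine for AUC)
geo_weighted = p1 ** 0.7 * p2 ** 0.3          # weighted geometric mean
```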
Different models I have experimented with, vol. 1
● Logistic/linear/discriminant regression: fast, scalable, comprehensible, solid under high dimensionality, can be memory-light. Best when relationships are linear or all features are categorical. Good for text classification too.
● Random forests: probably the best one-off overall algorithm out there (in my experience). Fast, scalable, memory-medium. Best when all features are numeric-continuous and there are strong non-linear relationships. Does not cope well with high dimensionality.
● Gradient boosting (trees): less memory-intense than forests (as the individual predictors tend to be weaker). Fast, semi-scalable, memory-medium. Good when forests are good.
● Neural nets (AKA deep learning): good for tasks humans are good at: image recognition, sound recognition. Good with categorical variables too (as they replicate on-and-off signals). Medium-speed, scalable, memory-light. Generally good for linear and non-linear tasks. May take a long time to train depending on the structure. Many parameters to tune. Very prone to over- and under-fitting.
Different models I have experimented with, vol. 2
● Support vector machines (SVMs): medium-speed, not scalable, memory-intense. Still good at capturing linear and non-linear relationships. Holding the kernel matrix takes too much memory; not advisable for data sets bigger than 20k rows.
● K nearest neighbours: slow (depending on the size), not easily scalable, memory-heavy. Good when deciding whether a case is good or bad really comes down to how much it resembles specific individuals. Also good when there are many target variables, as the similarity measures remain the same across different observations. Good for text classification too.
● Naïve Bayes: quick, scalable, memory-OK. Good for quick classifications on big datasets. Not particularly predictive.
● Factorization machines: good gateways between linear and non-linear problems. They stand between regressions, KNNs and neural networks. Memory-medium, semi-scalable, medium-speed. Good for predicting the rating a customer will assign to a product.
Tools, vol. 1
● Languages: Python, R, Java
● Liblinear, for linear models: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
● LibSVM, for support vector machines: www.csie.ntu.edu.tw/~cjlin/libsvm/
● The scikit-learn package in Python, for text classification, random forests and gradient boosting machines: scikit-learn.org/stable/
● XGBoost, for fast, scalable gradient boosting: https://github.com/tqchen/xgboost
● LightGBM: https://github.com/Microsoft/LightGBM
● Vowpal Wabbit, for fast, memory-efficient linear models: hunch.net/~vw/
● Encog, for neural nets: http://www.heatonresearch.com/encog
● H2O in R, for many models
Tools, vol. 2
● LibFM: www.libfm.org
● LibFFM: https://www.csie.ntu.edu.tw/~cjlin/libffm/
● Weka in Java (has everything): http://www.cs.waikato.ac.nz/ml/weka/
● GraphChi, for factorizations: https://github.com/GraphChi
● GraphLab, for lots of stuff: https://dato.com/products/create/open_source.html
● Cxxnet: one of the best implementations of convolutional neural nets out there. Difficult to install and requires a GPU with an NVIDIA graphics card. https://github.com/antinucleon/cxxnet
● RankLib: the best library out there, written in Java, suited for ranking algorithms (e.g. rank products for customers) and supporting optimization functions like NDCG: people.cs.umass.edu/~vdang/ranklib.html
● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne) for nets. These assume you have Theano (http://deeplearning.net/software/theano/) or TensorFlow (https://www.tensorflow.org/).
Where to go next to prepare for competitions
● Coursera: Andrew Ng's class, https://www.coursera.org/course/ml
● Kaggle.com: many competitions for learning. For instance: http://www.kaggle.com/c/titanic-gettingStarted. Look for the "knowledge" flag.
● Very good slides from the University of Utah: www.cs.utah.edu/~piyush/teaching/cs5350.html
● clopinet.com/challenges/: many past predictive-modelling competitions, with tutorials.
● Wikipedia. Not to be underestimated; still the best (collective) source of information out there.