“I want to die on Mars but not on impact” 
— Elon Musk, interview with Chris Anderson 
"There are no facts, only interpretations." - Friedrich Nietzsche 
“The shrewd guess, the fertile hypothesis, the courageous leap to a 
tentative conclusion – these are the most valuable coin of the thinker at 
work” -- Jerome Seymour Bruner 
R, Data Wrangling & Data Science Competitions 
@ksankar // doubleclix.wordpress.com 
http://www.bigdatatechcon.com/tutorials.html#RDataWranglingandDataScienceCompetitions 
October 28, 2014
etude 
http://en.wikipedia.org/wiki/%C3%89tude, http://www.etudesdemarche.net/articles/etudes-sectorielles.htm 
We will focus on “short”, “acquiring skill” & “having fun” ! 
http://upload.wikimedia.org/wikipedia/commons/2/26/La_Cour_du_Palais_des_%C3%A9tudes_de_l%E2%80%99%C3%89cole_des_beaux-arts.jpg
Goals & Non-goals 
Goals 
¤ Improve ML skills by participating in Kaggle Competitions 
¤ Catalyst for you to widen the ML experience 
¤ Give you a focused time to work thru submissions 
§ Work with me. I will wait if you want to catch up 
§ Submit a few solutions 
¤ Less theory, more usage - let us see if this works 
¤ As straightforward as possible 
§ The programs can be optimized 
Non-goals 
¡ Go deep into the algorithms 
• We don't have sufficient time. The topic can easily be a 5-day tutorial ! 
¡ Dive into R internals 
• That is for another day 
¡ A passive talk 
• Nope. Interactive & hands-on
Activities & Results 
o Activities: 
• Get familiar with the mechanics of Data Science Competitions 
• Explore the intersection of Algorithms, Data, Intelligence, Inference & Results 
• Discuss Data Science Horse Sense ;o) 
o Results: 
• Submitted entries for 3 competitions 
• Applied Algorithms on Kaggle Data 
• CART, RF 
• Linear Regression 
• SVM 
• Knowledge of Model Evaluation 
• Cross Validation, ROC Curves
About Me 
o Chief Data Scientist at BlackArrow.tv 
o Have been speaking at OSCON, PyCon, PyData et al 
[Photo : The Nuthead band !] 
o Reviewing the Packt book "Machine Learning with Spark" 
o Picked up co-authorship of the Second Edition of "Fast Data Processing with Spark" 
o Have done lots of things: 
• Big Data (Retail, Bioinformatics, Financial, AdTech) 
• Written Books (Web 2.0, Wireless, Java, …) 
• Standards, some work in AI 
• Guest Lecturer at Naval PG School, … 
• Planning MS - CFinance or Statistics 
• Volunteer as Robotics Judge at FIRST Lego League World Competitions 
o @ksankar, doubleclix.wordpress.com
Setup & Data 
R & IDE 
o Install R 
o Install RStudio 
Tutorial Materials 
o GitHub : https://github.com/xsankar/cloaked-ironman 
o Clone or download zip 
Data 
Setup an account in Kaggle (www.kaggle.com) 
We will be using the data from 3 Kaggle competitions 
① Titanic: Machine Learning from Disaster 
Download data from http://www.kaggle.com/c/titanic-gettingStarted 
Directory ~/cloaked-ironman/titanic-r 
② Predicting Bike Sharing @ Washington DC 
Download data from http://www.kaggle.com/c/bike-sharing-demand/data 
Directory ~/cloaked-ironman/bike 
③ Walmart Store Forecasting 
Download data from http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data 
Directory ~/cloaked-ironman/walmart
Agenda [1 of 2] 
o Oct 28 : 8:00-11:30 3 hrs 
o Intro, Goals, Logistics, Setup [10] [8:00-8:10) 
o Anatomy of a Kaggle Competition [40] [8:10-8:50) 
• Competition Mechanics 
• Register, download data, create sub directories 
• Trial Run : Submit Titanic 
o Algorithms for the Amateur Data Scientist [20] [8:50-9:10) 
• Algorithms, Tools & frameworks in perspective 
• “Folk Wisdom” 
o Model Evaluation & Interpretation [20] [9:10 - 9:30) 
• Confusion Matrix, ROC Graph
Agenda [2 of 2] 
o Break (30 Min) [9:30-10:00) 
o Session 1 : The Art of a Competition – Bike Sharing [45 min] [10:00 – 10:45) 
• Dataset Organization 
• Analytics Walkthrough 
• Algorithms – ctree, SVM 
• Feature Extraction 
• Hands-on Analytics programming of the challenge 
• Submit entry 
o Session 2 : The Art of a Competition – Walmart [30 min] [10:45 – 11:15) 
• Dataset Organization 
• Analytics Walkthrough, Transformations 
o Questions : [11:15 – 11:30)
Overload Warning … There is enough material for a week's training … which is good & bad ! 
Read thru at your pace, refer, ponder & internalize
Close Encounters 
— 1st 
◦ This Tutorial 
— 2nd 
◦ Do More Hands-on Walkthrough 
— 3rd 
◦ Listen To Lectures 
◦ More competitions …
Anatomy Of a Kaggle 
Competition
Kaggle Data Science Competitions 
o Hosts Data Science Competitions 
o Competition Attributes: 
• Dataset 
• Train 
• Test (Submission) 
• Final Evaluation Data Set (We don’t 
see) 
• Rules 
• Time boxed 
• Leaderboard 
• Evaluation function 
• Discussion Forum 
• Private or Public
Titanic Passenger Metadata 
• Small 
• 3 Predictors 
• Class 
• Sex 
• Age 
• Survived? 
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic 
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html 
City Bike Sharing Prediction (Washington DC) 
Walmart Store Forecasting
Train.csv 
Taken from the Titanic Passenger Manifest 

Variable | Description 
Survived | 0 = No, 1 = Yes 
Pclass   | Passenger Class (1st, 2nd, 3rd) 
Sibsp    | Number of Siblings/Spouses Aboard 
Parch    | Number of Parents/Children Aboard 
Embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Test.csv 
Submission 
o 418 lines; 1st column should have 0 or 1 in each line 
o Evaluation: 
• % correctly predicted
Approach 
o This is a classification problem - 0 or 1 
o Comb the forums ! 
o Opportunity for us to try different algorithms & compare them 
• Simple Model 
• CART [Classification & Regression Tree] 
• Greedy, top-down, binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions 
• RandomForest 
• Different parameters 
• SVM 
• Multiple kernels 
• Table the results 
o Use cross validation to predict our model performance & correlate with what Kaggle says 
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Simple Model – Our First Submission 
o #1 : Simple Model (M=survived) 
o #2 : Simple Model (F=survived) 
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ 
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
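For reference, a minimal R sketch of this baseline (a hedged reconstruction, not the deck's 1-Intro_to_Kaggle.R; file and column names are those of the Kaggle Titanic CSVs): 

```r
# Simple Model #2 (F=survived): predict 1 for female passengers, 0 otherwise
train <- read.csv("train.csv")
test  <- read.csv("test.csv")
prediction <- ifelse(test$Sex == "female", 1, 0)
submission <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submission, "simple_model.csv", row.names = FALSE)  # upload this to Kaggle
```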
#3 : Simple CART Model 
o CART (Classification & Regression Tree) 
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ 
May be better, because we have improved on the survival of men ! 
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
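A minimal CART sketch with rpart, in the spirit of the deck's script (predictor list is an illustrative choice): 

```r
library(rpart)
# rpart grows the greedy, top-down binary tree described above;
# surrogate splits let it cope with the missing Age values
fit <- rpart(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data = train, method = "class")
plot(fit); text(fit)                                # eyeball the splits
pred <- predict(fit, newdata = test, type = "class")
```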
#4 : Random Forest Model 
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ 
o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience 
• Chris Clark : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ 
o https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests 
o https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
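A sketch with the randomForest package (again an illustration, not the deck's file; the mean-age imputation is a quick stand-in): 

```r
library(randomForest)
set.seed(415)  # RF is randomized; fix the seed for reproducibility
# randomForest() rejects NAs, so impute Age crudely first (see the
# feature-engineering homework for a better title-based imputation)
train$Age[is.na(train$Age)] <- mean(train$Age, na.rm = TRUE)
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch,
                    data = train, ntree = 500, importance = TRUE)
varImpPlot(fit)  # which predictors carry the signal?
```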
#5 : SVM 
o Multiple Kernels 
o kernel = 'radial' # Radial Basis Function 
o kernel = 'sigmoid' 
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ 
o agconti's blog - Ultimate Titanic ! 
o http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
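A sketch of the two kernels with e1071 (illustrative predictor set; svm() needs complete cases): 

```r
library(e1071)
cc <- train[complete.cases(train), ]            # drop rows with NAs for svm()
fit_rbf <- svm(as.factor(Survived) ~ Pclass + Sex + Age, data = cc, kernel = "radial")
fit_sig <- svm(as.factor(Survived) ~ Pclass + Sex + Age, data = cc, kernel = "sigmoid")
table(actual = cc$Survived, predicted = predict(fit_rbf, cc))  # training confusion matrix
```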
Feature Engineering - Homework 
o Add attribute : Age 
• In train 714/891 have age; in test 332/418 have age 
• Missing values can be just the mean age of all passengers 
• We could be more precise and calculate the mean age based on Title (Ms, Mrs, Master et al) - see the sketch below 
• Box plot age 
o Add attributes : Mother, Family size et al 
o Feature engineering ideas 
• http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python 
o More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/ 
o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
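One way the title-based imputation could look in R (a sketch; the Title column and the regex are ours, not part of the Kaggle data): 

```r
# Pull the title out of "Braund, Mr. Owen Harris" -> "Mr", then fill
# missing ages with the mean age for that title
train$Title <- gsub("(.*, )|(\\..*)", "", train$Name)
title_means <- tapply(train$Age, train$Title, mean, na.rm = TRUE)
missing <- is.na(train$Age)
train$Age[missing] <- title_means[train$Title[missing]]
boxplot(Age ~ Title, data = train)   # the "Box plot age" item above
```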
What does it mean ? Let us ponder …. 
o We have a training data set representing a domain 
• We reason over the dataset & develop a model to predict outcomes 
o How good is our prediction when it comes to real life scenarios ? 
o The assumption is that the dataset is taken at random 
• Or is it ? Is there a Sampling Bias ? 
• i.i.d ? Independent ? Identically Distributed ? 
• What about homoscedasticity ? Do they have the same finite variance ? 
o Can we assure that another dataset (from the same domain) will give us the same result ? 
o Will our model & its parameters remain the same if we get another data set ? 
o How can we evaluate our model ? 
o How can we select the right parameters for a selected model ?
8:50 
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have … 
Algorithms for the 
Amateur Data Scientist 
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any 
man who can hitch the length and breadth of the Galaxy, rough it … win through, and still 
know where his towel is, is clearly a man to be reckoned with.” 
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Data Scientists apply different techniques 
Ref: Anthony’s Kaggle Presentation 
• Support Vector Machine 
• adaBoost 
• Bayesian Networks 
• Decision Trees 
• Ensemble Methods 
• Random Forest 
• Logistic Regression 
• Genetic Algorithms 
• Monte Carlo Methods 
• Principal Component Analysis 
• Kalman Filter 
• Evolutionary Fuzzy Modelling 
• Neural Networks 
Quora 
• http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Algorithm spectrum 
o Regression 
o Logit 
o CART 
o Ensemble : Random Forest 
o Clustering 
o KNN 
o Genetic Alg 
o Simulated Annealing 
o Collab Filtering 
o SVM 
o Kernels 
o SVD 
o NNet 
o Boltzman Machine 
o Feature Learning 
[Spectrum axis : Machine Learning → Cute Math → Artificial Intelligence]
Classifying Classifiers 
• Statistical : Regression, Logistic Regression¹, Naïve Bayes, Bayesian Networks 
• Structural : 
• Rule-based : Production Rules, Decision Trees 
• Distance-based : 
• Functional : Linear, Spectral, Wavelet 
• Nearest Neighbor : kNN, Learning Vector Quantization 
• Neural Networks : Multi-layer Perceptron 
• Ensemble : Random Forests, Boosting, SVM 
¹ Max Entropy Classifier 
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
[Concept map : Classifiers work on categorical variables, Regression on continuous variables; Decision Trees, k-NN (Nearest Neighbors), CART; Bias vs. Variance & Model Complexity drive Over-fitting; Bagging & Boosting]
Data Science 
“folk knowledge”
Data Science “folk knowledge” (1 of A) 
o If you torture the data long enough, it will confess to anything. – Hal Varian, Computer 
Mediated Transactions 
o Learning = Representation + Evaluation + Optimization 
o It’s Generalization that counts 
• The fundamental goal of machine learning is to generalize beyond the 
examples in the training set 
o Data alone is not enough 
• Induction not deduction - Every learner should embody some knowledge 
or assumptions beyond the data it is given in order to generalize beyond 
it 
o Machine Learning is not magic – one cannot get something from nothing 
• In order to infer, one needs the knobs & the dials 
• One also needs a rich expressive dataset 
A few useful things to know about machine learning - by Pedro Domingos 
http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (2 of A) 
o Over fitting has many faces 
• Bias – Model not strong enough. So the learner has the tendency to learn the same wrong things 
• Variance – Learning too much from one dataset; model will fall apart (ie much less accurate) on a different dataset 
• Sampling Bias 
o Intuition Fails in high Dimensions – Bellman 
• Blessing of non-uniformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold eg. the space of digits is much smaller than the space of images 
o Theoretical Guarantees are not What they seem 
• One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. 
o Feature engineering is the Key 
A few useful things to know about machine learning - by Pedro Domingos 
http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (3 of A) 
o More Data Beats a Cleverer Algorithm 
• Or conversely select algorithms that improve with data 
• Don’t optimize prematurely without getting more data 
o Learn many models, not Just One 
• Ensembles ! – Change the hypothesis space 
• Netflix prize 
• E.g. Bagging, Boosting, Stacking 
o Simplicity Does not necessarily imply Accuracy 
o Representable Does not imply Learnable 
• Just because a function can be represented does not mean 
it can be learned 
o Correlation Does not imply Causation 
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ 
o A few useful things to know about machine learning - by Pedro Domingos 
§ http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (4 of A) 
o The simplest hypothesis that fits the data is also the most plausible 
• Occam's Razor 
• Don't go for a 4-layer Neural Network unless you have that complex data 
• But that doesn't also mean that one should choose the simplest hypothesis 
• Match the impedance of the domain, data & the algorithms 
o Think of over fitting as memorizing as opposed to learning. 
o Data leakage has many forms 
o Sometimes the Absence of Something is Everything 
o [Corollary] Absence of Evidence is not the Evidence of Absence 
§ Simple Model : High error line that cannot be compensated with more data; gets to a lower error rate with less data points 
§ Complex Model : Lower error line, but needs more data points to reach decent error 
New to Machine Learning? Avoid these three mistakes, James Faghmous 
https://medium.com/about-data/73258b3848a4 
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Importance of feature selection & weak models 
o “Good features allow a simple model to beat a complex model”-Ben Lorica1 
o “… using many weak predictors will always be more accurate than using a few 
strong ones …” –Vladimir Vapnik2 
o “A good decision rule is not a simple one, it cannot be described by a very few 
parameters” 2 
o “Machine learning science is not only about computers, but about humans, and 
the unity of logic, emotion, and culture.” 2 
o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, 
but it can’t surprise you” – Hadley Wickham3 
http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html 
http://nautil.us/issue/6/secret-codes/teaching-me-softly 
http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/ 
Updated Slide
Check your assumptions 
o The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data 
o For example, for regression one should check that: 
① Variables are normally distributed 
• Test for normality via visual inspection, skew & kurtosis, outlier inspections via plots, z-scores et al 
② There is a linear relationship between the dependent & independent variables 
• Inspect residual plots, try quadratic relationships, try log plots et al 
③ Variables are measured without error 
④ Assumption of Homoscedasticity 
§ Homoscedasticity assumes constant or near constant error variance 
§ Check the standard residual plots and look for heteroscedasticity 
§ For example in the figure, the left box has the errors scattered randomly around zero, while the right two diagrams have the errors unevenly distributed 
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, 
http://pareonline.net/getvn.asp?v=8&n=2
Data Science “folk knowledge” (5 of A) 
Donald Rumsfeld is an armchair Data Scientist ! 
http://smartorg.com/2013/07/valuepoint19/ 
[2x2 matrix — The World (Knowns / Unknowns) vs. You (Known / UnKnown)] 
o Known Knowns : What we do 
o Known Unknowns : Potential facts, outcomes we are aware of, but not with certainty; Stochastic processes, Probabilities 
o UnKnown Knowns : Others know, you don't 
o UnKnown Unknowns : Facts, outcomes or scenarios we have not encountered, nor considered; “Black swans”, outliers, long tails of probability distributions; Lack of experience, imagination 
o Known Knowns 
o There are things we know that we know 
o Known Unknowns 
o That is to say, there are things that we now know we don't know 
o But there are also Unknown Unknowns 
o There are things we do not know we don't know
Data Science “folk knowledge” (6 of A) - Pipeline 
¤ Bytes to Business a.k.a. Build the full stack 
¤ Find Relevant Data For Business 
¤ Connect the Dots 
Data Management : Collect → Store → Transform 
• Collect : Volume; Velocity; Streaming Data; Access to multiple sources of data 
• Store : Canonical form; Data catalog; Data Fabric across the organization; Think Hybrid – Big Data Apps, Appliances & Infrastructure; Metadata; Monitor counters & Metrics; Structured vs. Multi-structured 
• Transform : Flexible & Selectable – Data Subsets, Attribute sets 
Data Science : Reason → Model → Deploy | Explore, Predict, Recommend, Visualize 
• Reason/Model : Refine model with Extended Data subsets & Engineered Attribute sets; Validation run across a larger data set 
• Deploy : Scalable Model Deployment; Big Data automation & purpose-built appliances (soft/hard); Manage SLAs & response times 
• Explore : Dynamic Data Sets; 2-way key-value tagging of datasets; Extended attribute sets 
• Predict/Recommend : Advanced Analytics; Performance; Scalability; Refresh Latency; In-memory Analytics 
• Visualize : Advanced Visualization; Interactive Dashboards; Map Overlay; Infographics
Data Science “folk knowledge” (7 of A) 
o Volume, Velocity, Variety : “Data of unusual size” that can't be brute forced 
o Context, Connectedness, Interface, Inference, Intelligence 
o Three Amigos 
o Interface = Cognition 
o Intelligence = Compute (CPU) & Computational (GPU) 
o Infer Significance & Causality
Data Science “folk knowledge” (8 of A) 
Jeremy's Axioms 
o Iteratively explore data 
o Tools 
• Excel Format, Perl, Perl Book 
o Get your head around data 
• Pivot Table 
o Don't over-complicate 
o If people give you data, don't assume that you need to use all of it 
o Look at pictures ! 
o History of your submissions – keep a tab 
o Don't be afraid to submit simple solutions 
• We will do this during this workshop 
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Data Science “folk knowledge” (9 of A) 
① Common Sense (some features make more sense than others) 
② Carefully read the forums to get a peek at other people's mindset 
③ Visualizations 
④ Train a classifier (e.g. logistic regression) and look at the feature weights 
⑤ Train a decision tree and visualize it 
⑥ Cluster the data and look at what clusters you get out 
⑦ Just look at the raw data 
⑧ Train a simple classifier, see what mistakes it makes 
⑨ Write a classifier using handwritten rules 
⑩ Pick a fancy method that you want to apply (Deep Learning/Nnet) 
-- Maarten Bosma 
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
Data Science “folk knowledge” (A of A) 
Lessons from Kaggle Winners 
① Don’t over-fit 
② All predictors are not needed 
• All data rows are not needed, either 
③ Tuning the algorithms will give different results 
④ Reduce the dataset (Average, select transition data,…) 
⑤ Test set  training set can differ 
⑥ Iteratively explore  get your head around data 
⑦ Don’t be afraid to submit simple solutions 
⑧ Keep a tab & history of your submissions
The curious case of the Data Scientist 
o Data Scientist is multi-faceted & Contextual 
o Data Scientist should be building Data Products 
o Data Scientist should tell a story 
Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician 
– Josh Wills (Cloudera) 
Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer 
– Will Cukierski (Kaggle) 
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/ 
Large is hard; Infinite is much easier ! – Titus Brown
Essential Reading List 
o A few useful things to know about machine learning - by Pedro Domingos 
• http://dl.acm.org/citation.cfm?id=2347755 
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert 
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf 
o http://www.no-free-lunch.org/ 
o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y. 
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf 
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe 
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ 
o Avoid these three mistakes, James Faghmous 
• https://medium.com/about-data/73258b3848a4 
o Leakage in Data Mining: Formulation, Detection, and Avoidance 
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List 
① An Introduction to Statistical Learning 
• http://www-bcf.usc.edu/~gareth/ISL/ 
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning 
• http://online.stanford.edu/course/statistical-learning-winter-2014 
③ Prof. Pedro Domingos 
• https://class.coursera.org/machlearning-001/lecture/preview 
④ Prof. Andrew Ng 
• https://class.coursera.org/ml-003/lecture/preview 
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data 
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 
⑥ Mathematicalmonk @ YouTube 
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA 
⑦ The Elements Of Statistical Learning 
• http://statweb.stanford.edu/~tibs/ElemStatLearn/ 
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
Of Models, Performance, Evaluation & Interpretation 
9:10
Bias/Variance (1 of 2) 
o Model Complexity 
• Complex Model increases the training data fit 
• But then it overfits & doesn't perform as well with real data 
o Bias vs. Variance 
o Classical diagram 
o From ELSII, by Hastie, Tibshirani & Friedman 
o Bias – Model learns wrong things; not complex enough; error gap small; more data by itself won't help 
o Variance – Different dataset will give different error rate; over-fitted model; larger error gap; more data could help 
[Learning Curve : Prediction Error vs. Training Error] 
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Bias/Variance (2 of 2) 
o High Bias 
• Due to Underfitting 
• Add more features 
• More sophisticated model 
• Quadratic Terms, complex equations, … 
• Decrease regularization 
o High Variance 
• Due to Overfitting 
• Use fewer features 
• Use more training samples 
• Increase Regularization 
[Learning Curves : high bias – need more features or a more complex model to improve; high variance – need more data to improve] 
Ref: Strata 2013 Tutorial by Olivier Grisel 
“Bias is a learner's tendency to consistently learn the same wrong thing.” -- Pedro Domingos
Data Partition & Cross-Validation 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
Partition Data ! 
• Training (60%) 
• Validation (20%) & 
• “Vault” Test (20%) data sets 
k-fold Cross-Validation 
• Split data into k equal parts 
• Fit model to k-1 parts & calculate prediction error on the kth part 
• Non-overlapping dataset 
[K-fold CV (k=5) diagram : folds #1–#5 each take one turn as the Validate set while the remaining four are the Train set]
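A hand-rolled k-fold CV sketch in R (assumes the Titanic train frame from earlier; rpart's surrogate splits tolerate the missing ages): 

```r
library(rpart)
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))   # random fold assignment
cv_err <- sapply(1:k, function(i) {
  fit  <- rpart(as.factor(Survived) ~ Pclass + Sex + Age,
                data = train[folds != i, ], method = "class")
  pred <- predict(fit, newdata = train[folds == i, ], type = "class")
  mean(pred != train$Survived[folds == i])            # error on the held-out fold
})
mean(cv_err)   # k-fold CV estimate of the prediction error
```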
Bootstrap & Bagging 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
Bootstrap 
• Draw datasets (with replacement) and fit model for each dataset 
• Remember : Data Partitioning (#1) & Cross Validation (#2) are without replacement 
Bagging (Bootstrap aggregation) 
◦ Average prediction over a collection of bootstrapped samples, thus reducing variance
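A minimal bagging-by-hand sketch (Titanic frames assumed; 25 trees chosen purely for illustration): 

```r
library(rpart)
B <- 25
# B bootstrap samples -> B trees -> majority vote over their predictions
votes <- replicate(B, {
  idx <- sample(nrow(train), replace = TRUE)          # draw WITH replacement
  fit <- rpart(as.factor(Survived) ~ Pclass + Sex + Age,
               data = train[idx, ], method = "class")
  as.integer(as.character(predict(fit, newdata = test, type = "class")))
})
bagged <- ifelse(rowMeans(votes) > 0.5, 1, 0)         # averaged (voted) prediction
```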
Boosting 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
◦ “Output of weak classifiers into a powerful committee” 
◦ Final Prediction = weighted majority vote 
◦ Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them 
◦ AdaBoost (Adaptive Boosting) 
◦ Boosting vs. Bagging 
– Bagging – independent trees 
– Boosting – successively weighted
Random Forests+ 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
◦ Builds a large collection of de-correlated trees & averages them 
◦ Improves Bagging by selecting i.i.d* random variables for splitting 
◦ Simpler to train & tune 
◦ “Do remarkably well, with very little tuning required” – ESLII 
◦ Less susceptible to over fitting (than boosting) 
◦ Many RF implementations 
– Original version - Fortran-77 ! By Breiman/Cutler 
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab 
* i.i.d – independent identically distributed 
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Ensemble Methods 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
◦ Two Step 
– Develop a set of learners 
– Combine the results to develop a composite predictor 
◦ Ensemble methods can take the form of: 
– Using different algorithms, 
– Using the same algorithm with different settings 
– Assigning different parts of the dataset to different classifiers 
◦ Bagging & Random Forests are examples of ensemble methods 
Ref: Machine Learning In Action
Random Forests 
o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables 
o Simpler because it requires two variables – no. of Predictors (typically √k) & no. of trees (500 for large dataset, 150 for smaller) 
o Error prediction 
• For each iteration, predict for dataset that is not in the sample (OOB data) 
• Aggregate OOB predictions 
• Calculate Prediction Error for the aggregate, which is basically the OOB estimate of error rate 
• Can use this to search for optimal # of predictors 
• We will see how close this is to the actual error in the Heritage Health Prize 
o Assumes equal cost for mis-prediction. Can add a cost function 
o Proximity matrix & applications like adding missing data, dropping outliers 
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective : Berk; A Brief Overview of RF by Dan Steinberg
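In R the OOB estimate comes for free from the randomForest package (a sketch, not from the deck's scripts; assumes the Titanic train frame): 

```r
library(randomForest)
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp,
                    data = train, ntree = 500, na.action = na.omit)
fit$err.rate[fit$ntree, "OOB"]   # OOB error estimate after the final tree
plot(fit)                        # OOB error as trees are added
```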
[Concept map recap : Classifiers work on categorical variables, Regression on continuous variables; Decision Trees, k-NN (Nearest Neighbors), CART; Bias vs. Variance & Model Complexity drive Over-fitting; Bagging & Boosting]
Model Evaluation & Interpretation 
Relevant Digression 9:20
Cross Validation 
o Reference: 
• https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience 
• Chris Clark's blog : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ 
• Predictive Modeling in Python with scikit-learn, Olivier Grisel, Strata 2013 
• titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
Model Evaluation - Accuracy 
              Predicted=1            Predicted=0 
Actual=1      True+ (tp)             False- (fn) - Type II 
Actual=0      False+ (fp) - Type I   True- (tn) 
o Accuracy = (tp + tn) / (tp + fp + fn + tn) 
o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate ! 
o Hence the F-measure is a better reflection of the model strength
Model Evaluation – Precision & Recall 
o Precision = tp / (tp+fp) : How many items we identified are relevant 
• a.k.a. Accuracy, Relevancy 
o Recall = tp / (tp+fn) : How many relevant items did we identify 
• a.k.a. True +ve Rate, Coverage, Sensitivity, Hit Rate 
o False +ve Rate = fp / (fp+tn) 
• a.k.a. Type 1 Error Rate, False Alarm Rate; Specificity = 1 – fp rate 
• Type 1 Error = fp, Type 2 Error = fn 
o Inverse relationship between precision & recall – the tradeoff depends on the situation 
• Legal – Coverage is more important than correctness 
• Search – Accuracy is more important 
• Fraud – Support cost (high fp) vs. wrath of credit card co. (high fn) 
[Confusion matrix : Actual=1 → True+ (tp), False- (fn) - Type II ; Actual=0 → False+ (fp) - Type I, True- (tn)] 
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
Confusion Matrix 
             Predicted 
Actual    C1   C2   C3   C4 
C1        10    5    9    3 
C2         4   20    3    7 
C3         6    4   13    3 
C4         2    1    4   15 
Correct ones are on the diagonal (cii) 
Precision(i) = cii / Σj cji  (sum down column i) 
Recall(i)    = cii / Σj cij  (sum across row i)
Model Evaluation : F-Measure 
[Confusion matrix as before : Actual=1 → True+ (tp), False- (fn) - Type II ; Actual=0 → False+ (fp) - Type I, True- (tn)] 
Precision = tp / (tp+fp) : Recall = tp / (tp+fn) 
F-Measure : Balanced, Combined, Weighted Harmonic Mean, measures effectiveness 
F = 1 / (α·(1/P) + (1-α)·(1/R)) = (β² + 1)·P·R / (β²·P + R) 
Common Form (Balanced F1) : β=1 (α = ½) ; F1 = 2PR / (P+R)
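A sketch of these measures in R (actual and predicted are assumed 0/1 vectors, e.g. from the Titanic model): 

```r
cm <- table(actual, predicted)                 # 2x2 confusion matrix
tp <- cm["1", "1"]; fp <- cm["0", "1"]; fn <- cm["1", "0"]; tn <- cm["0", "0"]
accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)  # balanced F, beta = 1
```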
Hands-on Walkthru - Model Evaluation 
891 rows → Train : 712 (80%), Test : 179 
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval; the Kappa measure is interesting 
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/cloaked-ironman/
ROC Analysis 
o “How good is my model?” 
o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf 
o “A receiver operating characteristics (ROC) graph is a technique for visualizing, 
organizing and selecting classifiers based on their performance” 
o Much better than evaluating a model based on simple classification accuracy 
o Plots tp rate vs. fp rate
ROC Graph - Discussion 
o E = Conservative, Everything NO 
o H = Liberal, Everything YES 
o Am not making any political statement ! 
o F = Ideal 
o G = Worst 
o The diagonal is the chance 
o North-West Corner is good 
o South-East is bad 
• For example E 
• Believe it or Not - I have actually seen a graph with the curve in this region ! 
[Confusion matrix : Actual=1 → True+ (tp), False- (fn) ; Actual=0 → False+ (fp), True- (tn)]
ROC Graph – Clinical Example 
Ref: IFCC - Measures of diagnostic accuracy: basic definitions
ROC Graph Walk thru 
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/cloaked-ironman/
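For reference, a sketch with the ROCR package (an assumption on our part; the deck's script may use a different package; probs and labels are assumed vectors): 

```r
library(ROCR)
# probs : predicted P(Survived=1), e.g. predict(fit, test, type = "prob")[, 2]
# labels : the true 0/1 outcomes
pred_obj <- prediction(probs, labels)
perf     <- performance(pred_obj, "tpr", "fpr")   # tp rate vs. fp rate
plot(perf); abline(0, 1, lty = 2)                 # dashed diagonal = chance line
performance(pred_obj, "auc")@y.values[[1]]        # area under the ROC curve
```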
Break [9:30 – 10:00)
10:00 
The Art of a Competition – Session I : 
Bike Sharing at Washington DC
Few interesting Links - Comb the forums 
o Quick First prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing 
• Solution by Brandon Harris 
o Random forest http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language 
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction 
o GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm 
o Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare 
o Ggplot http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r 
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances 
o Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour 
o Casual & Registered Users : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count 
o RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please 
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r 
o Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data 
o Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402 
o Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
Data Organization – train, test  submission 
• datetime - hourly date + timestamp 
• Season 
• 1 = spring, 2 = summer, 3 = fall, 4 = winter 
• holiday - whether the day is considered a holiday 
• workingday - whether the day is neither a weekend nor holiday 
• Weather 
• 1: Clear, Few clouds, Partly cloudy, Partly cloudy 
• 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
• 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, 
Light Rain + Scattered clouds 
• 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog 
• temp - temperature in Celsius 
• atemp - feels like temperature in Celsius 
• humidity - relative humidity 
• windspeed - wind speed 
• casual - number of non-registered user rentals initiated 
• registered - number of registered user rentals initiated 
• count - number of total rentals
Approach 
o Convert to factors 
o Engineer new features from date 
o Explore other synthetic features
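A sketch of these transformations (column names are from the competition's data page above; the datetime format is assumed): 

```r
bike <- read.csv("train.csv")   # the bike-sharing training file
# categorical codes -> factors
for (col in c("season", "holiday", "workingday", "weather"))
  bike[[col]] <- factor(bike[[col]])
# engineer new features out of the datetime string
ts <- strptime(bike$datetime, format = "%Y-%m-%d %H:%M:%S")
bike$hour  <- factor(ts$hour)
bike$month <- factor(ts$mon + 1)     # POSIXlt months are 0-based
bike$year  <- factor(ts$year + 1900)
```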
#1 : ctree 
Refer to 3-Session-I-Bikes.R at https://github.com/xsankar/cloaked-ironman/
#2 : Add Month + year
#3 : RandomForest
The Art of a Competition - 
Session II : Walmart Store 
Prediction 
10:45
Data Organization & Approach 
o TBD
Comb the forums 
o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8125/first-place-entry/ 
o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8023/thank-you-and-2-rank-model 
o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8095/r-codes-for-arima-naive-exponential-smoothing-forecasting/ 
o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8055/6-bad-models-make-1-good-model-power-of-ensemble-learning/
The Beginning As The 
End 11:15
The Beginning As the End 
— We started with a set of goals 
— Homework 
◦ For you 
– Go through the slides 
– Do the walkthrough 
– Work on more competitions 
– Submit entries to Kaggle
References: 
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas 
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning 
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel 
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn 
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner 
• http://strataconf.com/strata2013/public/schedule/detail/27291 
o The Problem of Multiple Testing 
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 

R, Data Wrangling & Kaggle Data Science Competitions

  • 1. “I want to die on Mars but not on impact” — Elon Musk, interview with Chris Anderson "There are no facts, only interpretations." - Friedrich Nietzsche “The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” -- Jerome Seymour Bruner R, Data Wrangling & Data Science Competitions @ksankar // doubleclix.wordpress.com http://www.bigdatatechcon.com/tutorials.html#RDataWranglingandDataScienceCompetitions October 28, 2014
  • 2. etude http://en.wikipedia.org/wiki/%C3%89tude, http://www.etudesdemarche.net/articles/etudes-sectorielles.htm, We will focus on “short”, “acquiring skill” & “having fun” ! http://upload.wikimedia.org/wikipedia/commons/2/26/La_Cour_du_Palais_des_%C3%A9tudes_de_l%E2%80%99%C3%89cole_des_beaux-arts.jpg
  • 3. Goals & non-goals Goals ¤ Improve ML skills by participating in Kaggle Competitions ¤ Catalyst for you to widen the ML experience ¤ Give you a focused time to work thru submissions § Work with me. I will wait if you want to catch-up § Submit a few solutions ¤ Less theory, more usage - let us see if this works ¤ As straightforward as possible § The programs can be optimized Non-goals ¡ Go deep into the algorithms • We don’t have sufficient time. The topic can easily be a 5 day tutorial ! ¡ Dive into R internals • That is for another day ¡ A passive talk • Nope. Interactive & hands-on
  • 4. Activities & Results o Activities: • Get familiar with the mechanics of Data Science Competitions • Explore the intersection of Algorithms, Data, Intelligence, Inference & Results • Discuss Data Science Horse Sense ;o) o Results : • Submitted entries for 3 competitions • Applied Algorithms on Kaggle Data • CART, RF • Linear Regression • SVM • Knowledge of Model Evaluation • Cross Validation, ROC Curves
  • 5. About Me o Chief Data Scientist at BlackArrow.tv o Have been speaking at OSCON, PyCon, PyData et al The Nuthead band ! o Reviewing Packt Book “Machine Learning with Spark” o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark” o Have done lots of things: • Big Data (Retail, Bioinformatics, Financial, AdTech), • Written Books (Web 2.0, Wireless, Java,…) • Standards, some work in AI, • Guest Lecturer at Naval PG School,… • Planning MS-CFinance or Statistics • Volunteer as Robotics Judge at First Lego League World Competitions o @ksankar, doubleclix.wordpress.com
  • 6. Setup & Data R & IDE o Install R o Install RStudio Tutorial Materials o Github : https://github.com/xsankar/cloaked-ironman o Clone or download zip Data Setup an account in Kaggle (www.kaggle.com) We will be using the data from 3 Kaggle competitions ① Titanic: Machine Learning from Disaster Download data from http://www.kaggle.com/c/titanic-gettingStarted Directory ~/cloaked-ironman/titanic-r ② Predicting Bike Sharing @ Washington DC Download data from http://www.kaggle.com/c/bike-sharing-demand/data Directory ~/cloaked-ironman/bike ③ Walmart Store Forecasting Download data from http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data Directory ~/cloaked-ironman/walmart
  • 7. Agenda [1 of 2] o Oct 28 : 8:00-11:30 3 hrs o Intro, Goals, Logistics, Setup [10] [8:00-8:10) o Anatomy of a Kaggle Competition [40] [8:10-8:50) • Competition Mechanics • Register, download data, create sub directories • Trial Run : Submit Titanic o Algorithms for the Amateur Data Scientist [20] [8:50-9:10) • Algorithms, Tools & frameworks in perspective • “Folk Wisdom” o Model Evaluation & Interpretation [20] [9:10 - 9:30) • Confusion Matrix, ROC Graph
  • 8. Agenda [2 of 2] o Break (30 Min) [9:30-10:00) o Session 1 : The Art of a Competition – Bike Sharing [45 min] [10:00 – 10:45) • Dataset Organization • Analytics Walkthrough • Algorithms – ctree, SVM • Feature Extraction • Hands-on Analytics programming of the challenge • Submit entry o Session 2 : The Art of a Competition – Walmart [30 min] [10:45 – 11:15) • Dataset Organization • Analytics Walkthrough, Transformations o Questions : [11:15 – 11:30)
  • 9. Overload Warning … There is enough material for a week’s training … which is good & bad ! Read thru at your pace, refer, ponder & internalize
  • 10. Close Encounters — 1st ◦ This Tutorial — 2nd ◦ Do More Hands-on Walkthrough — 3rd ◦ Listen To Lectures ◦ More competitions …
  • 11. Anatomy Of a Kaggle Competition
  • 12. Kaggle Data Science Competitions o Hosts Data Science Competitions o Competition Attributes: • Dataset • Train • Test (Submission) • Final Evaluation Data Set (We don’t see) • Rules • Time boxed • Leaderboard • Evaluation function • Discussion Forum • Private or Public
  • 13. Titanic Passenger Metadata • Small • 3 Predictors • Class • Sex • Age • Survived? http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html City Bike Sharing Prediction (Washington DC) Walmart Store Forecasting
  • 14. Train.csv Taken from Titanic Passenger Manifest Variable Description Survived 0=No, 1=Yes Pclass Passenger Class ( 1st, 2nd, 3rd ) Sibsp Number of Siblings/Spouses Aboard Parch Number of Parents/Children Aboard Embarked Port of Embarkation o C = Cherbourg o Q = Queenstown o S = Southampton Titanic Passenger Metadata • Small • 3 Predictors • Class • Sex • Age • Survived?
  • 15. Test.csv & Submission o 418 lines; 1st column should have 0 or 1 in each line o Evaluation: • % correctly predicted
  • 16. Approach o This is a classification problem - 0 or 1 o Comb the forums ! o Opportunity for us to try different algorithms & compare them • Simple Model • CART [Classification & Regression Tree] • Greedy, top-down binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions • RandomForest • Different parameters • SVM • Multiple kernels • Table the results o Use cross validation to predict our model performance & correlate with what Kaggle says http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
  • 17. Simple Model – Our First Submission o #1 : Simple Model (M=survived) o #2 : Simple Model (F=survived) Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
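A minimal sketch of the gender baseline, assuming the Kaggle Titanic train.csv/test.csv are in the working directory (the repo's 1-Intro_to_Kaggle.R is the authoritative version; file and column names are from the competition data):

  # gender model #2 (F=survived): every female survives, every male does not
  train <- read.csv("train.csv", stringsAsFactors = FALSE)
  test  <- read.csv("test.csv",  stringsAsFactors = FALSE)
  table(train$Sex, train$Survived)              # sanity check: survival by sex
  pred <- ifelse(test$Sex == "female", 1, 0)
  # per slide 15, the submission is 418 rows of 0/1; check the competition page
  # in case the current format also wants a PassengerId column
  write.table(pred, "gender_model.csv", sep = ",", row.names = FALSE, col.names = FALSE)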
  • 18. #3 : Simple CART Model o CART (Classification & Regression Tree) Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ May be better, because we have improved on the survival of men ! http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
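A hedged sketch of the CART step with rpart (the formula and predictor list are illustrative; the workshop script may choose differently):

  library(rpart)
  fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data = train, method = "class")   # "class" => classification tree
  plot(fit); text(fit)                           # eyeball the splits
  pred <- predict(fit, newdata = test, type = "class")
  # write out as in the gender model sketch above

rpart handles missing Age via surrogate splits, which is one reason it makes a convenient second submission.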
  • 19. #4 : Random Forest Model Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience • Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ o https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests o https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
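A sketch with the randomForest package; unlike rpart it refuses NAs, so the median imputation below is an assumption of this sketch, not necessarily the repo's exact recipe:

  library(randomForest)
  train$Sex <- as.factor(train$Sex); test$Sex <- as.factor(test$Sex)
  train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)   # crude imputation
  test$Age[is.na(test$Age)]   <- median(train$Age, na.rm = TRUE)
  test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm = TRUE)
  fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                      data = train, ntree = 500, importance = TRUE)
  varImpPlot(fit)                        # which predictors carry the signal?
  pred <- predict(fit, newdata = test)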
  • 20. #5 : SVM o Multiple Kernels o kernel = ‘radial’ # Radial Basis Function o kernel = ‘sigmoid’ Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/ o agconti's blog - Ultimate Titanic ! o http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
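A sketch of the two kernels with e1071, reusing the imputed train/test from the random-forest sketch above (svm's default na.action silently drops rows, so imputation matters here):

  library(e1071)
  fit.rbf <- svm(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                 data = train, kernel = "radial")
  fit.sig <- svm(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                 data = train, kernel = "sigmoid")
  pred.rbf <- predict(fit.rbf, newdata = test)   # table() the two predictions against
  pred.sig <- predict(fit.sig, newdata = test)   # each other before deciding what to submit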
  • 21. Feature Engineering - Homework o Add attribute : Age • In train 714/891 have age; in test 332/418 have age • Missing values can be just Mean Age of all passengers • We could be more precise and calculate Mean Age based on Title (Ms, Mrs, Master et al) • Box plot age o Add attribute : Mother, Family size et al o Feature engineering ideas • http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python o More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/ o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
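One way to attempt the Age homework, sketched under the assumption that titles always sit between the comma and the first period in Name (true for this dataset; column names FamilySize and Title are our own):

  train$Title <- gsub("(.*, )|(\\..*)", "", train$Name)    # "Braund, Mr. Owen" -> "Mr"
  mean.age <- tapply(train$Age, train$Title, mean, na.rm = TRUE)
  missing  <- is.na(train$Age)
  train$Age[missing] <- mean.age[train$Title[missing]]     # title-wise mean imputation
  boxplot(Age ~ Survived, data = train)                    # the box plot the slide asks for
  train$FamilySize <- train$SibSp + train$Parch + 1        # a family-size attribute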
  • 22. What does it mean ? Let us ponder …. o We have a training data set representing a domain • We reason over the dataset & develop a model to predict outcomes o How good is our prediction when it comes to real life scenarios ? o The assumption is that the dataset is taken at random • Or Is it ? Is there a Sampling Bias ? • i.i.d ? Independent ? Identically Distributed ? • What about homoscedasticity ? Do they have the same finite variance ? o Can we assure that another dataset (from the same domain) will give us the same result ? o Will our model & its parameters remain the same if we get another data set ? o How can we evaluate our model ? o How can we select the right parameters for a selected model ?
  • 23. 8:50 Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have … Algorithms for the Amateur Data Scientist “A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.” - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
  • 24. Data Scientists apply different techniques Ref: Anthony’s Kaggle Presentation • Support Vector Machine • adaBoost • Bayesian Networks • Decision Trees • Ensemble Methods • Random Forest • Logistic Regression • Genetic Algorithms • Monte Carlo Methods • Principal Component Analysis • Kalman Filter • Evolutionary Fuzzy Modelling • Neural Networks Quora • http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
  • 25. Algorithm spectrum o Regression o Logit o CART o Ensemble : Random Forest o Clustering o KNN o Genetic Alg o Simulated Annealing o Collab Filtering o SVM o Kernels o SVD o NNet o Boltzman Machine o Feature Learning Machine Learning Cute Math Artificial Intelligence
  • 26. Classifying Classifiers Statistical Structural Regression Naïve Bayes Bayesian Networks Rule-based Distance-based Neural Networks Production Rules Decision Trees Multi-layer Perceptron Functional Ensemble Nearest Neighbor Linear Spectral Wavelet kNN Random Forests Learning vector Quantization Logistic Regression1 Boosting SVM 1Max Entropy Classifier Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
  • 27. Regression Classifiers Continuous Variables Categorical Variables Decision Trees k-NN (Nearest Neighbors) Bias Variance Model Complexity Over-fitting Bagging Boosting CART
  • 28. Data Science “folk knowledge”
  • 29. Data Science “folk knowledge” (1 of A) o If you torture the data long enough, it will confess to anything. – Hal Varian, Computer Mediated Transactions o Learning = Representation + Evaluation + Optimization o It’s Generalization that counts • The fundamental goal of machine learning is to generalize beyond the examples in the training set o Data alone is not enough • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it o Machine Learning is not magic – one cannot get something from nothing • In order to infer, one needs the knobs & the dials • One also needs a rich & expressive dataset A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 30. Data Science “folk knowledge” (2 of A) o Over fitting has many faces • Bias – Model not strong enough. So the learner has the tendency to learn the same wrong things • Variance – Learning too much from one dataset; model will fall apart (ie much less accurate) on a different dataset • Sampling Bias o Intuition Fails in high Dimensions – Bellman • Blessing of non-uniformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold eg. Space of digits is much smaller than the space of images o Theoretical Guarantees are not What they seem • One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. o Feature engineering is the Key A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 31. Data Science “folk knowledge” (3 of A) o More Data Beats a Cleverer Algorithm • Or conversely select algorithms that improve with data • Don’t optimize prematurely without getting more data o Learn many models, not Just One • Ensembles ! – Change the hypothesis space • Netflix prize • E.g. Bagging, Boosting, Stacking o Simplicity Does not necessarily imply Accuracy o Representable Does not imply Learnable • Just because a function can be represented does not mean it can be learned o Correlation Does not imply Causation o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o A few useful things to know about machine learning - by Pedro Domingos § http://dl.acm.org/citation.cfm?id=2347755
  • 32. Data Science “folk knowledge” (4 of A) o The simplest hypothesis that fits the data is also the most plausible • Occam’s Razor • Don’t go for a 4 layer Neural Network unless you have that complex data • But that doesn’t also mean that one should choose the simplest hypothesis • Match the impedance of the domain, data & the algorithms o Think of over fitting as memorizing as opposed to learning. o Data leakage has many forms o Sometimes the Absence of Something is Everything o [Corollary] Absence of Evidence is not the Evidence of Absence § Simple Model § High Error line that cannot be compensated with more data § Gets to a lower error rate with less data points § Complex Model § Lower Error Line § But needs more data points to reach decent error New to Machine Learning? Avoid these three mistakes, James Faghmous https://medium.com/about-data/73258b3848a4 Ref: Andrew Ng/Stanford, Yaser S./CalTech
  • 33. Importance of feature selection & weak models o “Good features allow a simple model to beat a complex model” - Ben Lorica1 o “… using many weak predictors will always be more accurate than using a few strong ones …” – Vladimir Vapnik2 o “A good decision rule is not a simple one, it cannot be described by a very few parameters” 2 o “Machine learning science is not only about computers, but about humans, and the unity of logic, emotion, and culture.” 2 o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you” – Hadley Wickham3 1 http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html 2 http://nautil.us/issue/6/secret-codes/teaching-me-softly 3 http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/ Updated Slide
  • 34. Check your assumptions o The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data o For example, for regression one should check that: ① Variables are normally distributed • Test for normality via visual inspection, skew & kurtosis, outlier inspections via plots, z-scores et al ② There is a linear relationship between the dependent & independent variables • Inspect residual plots, try quadratic relationships, try log plots et al ③ Variables are measured without error ④ Assumption of Homoscedasticity § Homoscedasticity assumes constant or near constant error variance § Check the standard residual plots and look for heteroscedasticity § For example in the figure, left box has the errors scattered randomly around zero; while the right two diagrams have the errors unevenly distributed Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2
  • 35. Data Science “folk knowledge” (5 of A) Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/ The World: Knowns & Unknowns; You: Known & Unknown o Others know, you don’t o What we do o Facts, outcomes or scenarios we have not encountered, nor considered o “Black swans”, outliers, long tails of probability distributions o Lack of experience, imagination o Potential facts, outcomes we are aware of, but not with certainty o Stochastic processes, Probabilities o Known Knowns o There are things we know that we know o Known Unknowns o That is to say, there are things that we now know we don't know o But there are also Unknown Unknowns o There are things we do not know we don't know
  • 36. Data Science “folk knowledge” (6 of A) - Pipeline Reason Model Deploy o Scalable Model Deployment o Big Data automation & purpose built appliances (soft/hard) o Manage SLAs & response times Collect Store Transform o Volume o Velocity o Streaming Data o Canonical form o Data catalog o Data Fabric across the organization o Access to multiple sources of data o Think Hybrid – Big Data Apps, Appliances & Infrastructure o Metadata o Monitor counters & Metrics o Structured vs. Multi-structured o Flexible & Selectable § Data Subsets § Attribute sets o Refine model with § Extended Data subsets § Engineered Attribute sets o Validation run across a larger data set Data Management Data Science Visualize Recommend Predict Explore o Dynamic Data Sets o 2 way key-value tagging of datasets o Extended attribute sets o Advanced Analytics o Performance o Scalability o Refresh Latency o In-memory Analytics o Advanced Visualization o Interactive Dashboards o Map Overlay o Infographics ¤ Bytes to Business a.k.a. Build the full stack ¤ Find Relevant Data For Business ¤ Connect the Dots
  • 37. Velocity Volume Variety Data Science “folk knowledge” (7 of A) Context Connectedness Interface Inference Intelligence “Data of unusual size” that can't be brute forced o Three Amigos o Interface = Cognition o Intelligence = Compute (CPU) & Computational (GPU) o Infer Significance & Causality
  • 38. Data Science “folk knowledge” (8 of A) Jeremy’s Axioms o Iteratively explore data o Tools • Excel Format, Perl, Perl Book o Get your head around data • Pivot Table o Don’t over-complicate o If people give you data, don’t assume that you need to use all of it o Look at pictures ! o History of your submissions – keep a tab o Don’t be afraid to submit simple solutions • We will do this during this workshop Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 39. Data Science “folk knowledge” (9 of A) ① Common Sense (some features make more sense than others) ② Carefully read these forums to get a peek at other people’s mindset ③ Visualizations ④ Train a classifier (e.g. logistic regression) and look at the feature weights ⑤ Train a decision tree and visualize it ⑥ Cluster the data and look at what clusters you get out ⑦ Just look at the raw data ⑧ Train a simple classifier, see what mistakes it makes ⑨ Write a classifier using handwritten rules ⑩ Pick a fancy method that you want to apply (Deep Learning/Nnet) -- Maarten Bosma -- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
  • 40. Data Science “folk knowledge” (A of A) Lessons from Kaggle Winners ① Don’t over-fit ② All predictors are not needed • All data rows are not needed, either ③ Tuning the algorithms will give different results ④ Reduce the dataset (Average, select transition data,…) ⑤ Test set & training set can differ ⑥ Iteratively explore & get your head around data ⑦ Don’t be afraid to submit simple solutions ⑧ Keep a tab & history of your submissions
  • 41. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician – Josh Wills (Cloudera) Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer – Will Cukierski (Kaggle) http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/ Large is hard; Infinite is much easier ! – Titus Brown
  • 42. Essential Reading List o A few useful things to know about machine learning - by Pedro Domingos • http://dl.acm.org/citation.cfm?id=2347755 o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf o http://www.no-free-lunch.org/ o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y. • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o Avoid these three mistakes, James Faghmous • https://medium.com/about-data/73258b3848a4 o Leakage in Data Mining: Formulation, Detection, and Avoidance • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 43. For your reading & viewing pleasure … An ordered List ① An Introduction to Statistical Learning • http://www-bcf.usc.edu/~gareth/ISL/ ② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning • http://online.stanford.edu/course/statistical-learning-winter-2014 ③ Prof. Pedro Domingos • https://class.coursera.org/machlearning-001/lecture/preview ④ Prof. Andrew Ng • https://class.coursera.org/ml-003/lecture/preview ⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 ⑥ Mathematicalmonk @ YouTube • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA ⑦ The Elements Of Statistical Learning • http://statweb.stanford.edu/~tibs/ElemStatLearn/ http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 44. Of Models, Performance, Evaluation & Interpretation 9:10
  • 45. Bias/Variance (1 of 2) o Model Complexity • Complex Model increases the training data fit • But then it overfits & doesn't perform as well with real data o Bias vs. Variance o Classical diagram o From ELSII, By Hastie, Tibshirani & Friedman o Bias – Model learns wrong things; not complex enough; error gap small; more data by itself won’t help o Variance – Different dataset will give different error rate; over fitted model; larger error gap; more data could help Learning Curve Prediction Error Training Error Ref: Andrew Ng/Stanford, Yaser S./CalTech
  • 46. Bias/Variance (2 of 2) o High Bias • Due to Underfitting • Add more features • More sophisticated model • Quadratic Terms, complex equations,… • Decrease regularization o High Variance • Due to Overfitting • Use fewer features • Use more training sample • Increase Regularization Learning Curve Prediction Error Training Error Ref: Strata 2013 Tutorial by Olivier Grisel Need more features or more complex model to improve Need more data to improve 'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
  • 47. Data Partition & Cross-Validation Partition Data ! • Training (60%) • Validation (20%) • “Vault” Test (20%) Data sets k-fold Cross-Validation • Split data into k equal parts • Fit model to k-1 parts & calculate prediction error on kth part • Non-overlapping dataset — Goal ◦ Model Complexity (-) ◦ Variance (-) ◦ Prediction Accuracy (+) Train Validate Test #2 #3 #4 #5 #1 #1 #2 #3 #5 #4 #1 #2 #4 #5 #3 #1 #3 #4 #5 #2 #2 #3 #4 #5 #1 K-fold CV (k=5) Train Validate
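A hand-rolled 5-fold CV sketch, assuming the Titanic train data frame from earlier (rpart and the three predictors are illustrative choices):

  library(rpart)
  set.seed(42)
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(train)))    # random, non-overlapping folds
  acc <- numeric(k)
  for (i in 1:k) {
    fit    <- rpart(Survived ~ Pclass + Sex + Age, data = train[folds != i, ],
                    method = "class")                    # fit on k-1 parts
    pred   <- predict(fit, newdata = train[folds == i, ], type = "class")
    acc[i] <- mean(pred == train$Survived[folds == i])   # score on the held-out kth part
  }
  mean(acc)                                              # cross-validated accuracy estimate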
  • 48. Bootstrap & Bagging — Goal ◦ Model Complexity (-) ◦ Variance (-) ◦ Prediction Accuracy (+) Bootstrap • Draw datasets (with replacement) and fit model for each dataset • Remember : Data Partitioning (#1) & Cross Validation (#2) are without replacement Bagging (Bootstrap aggregation) ◦ Average prediction over a collection of bootstrap-ed samples, thus reducing variance
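Bagging by hand, as a sketch (assumes Survived is coded 0/1; B = 25 and the small formula are arbitrary choices for illustration):

  library(rpart)
  set.seed(42)
  B <- 25
  votes <- replicate(B, {
    idx <- sample(nrow(train), replace = TRUE)           # one bootstrap draw
    fit <- rpart(Survived ~ Pclass + Sex + Age, data = train[idx, ], method = "class")
    as.integer(as.character(predict(fit, newdata = test, type = "class")))
  })
  pred <- ifelse(rowMeans(votes) > 0.5, 1, 0)            # averaging the B models cuts variance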
  • 49. Boosting — Goal ◦ Model Complexity (-) ◦ Variance (-) ◦ Prediction Accuracy (+) ◦ “Output of weak classifiers into a powerful committee” ◦ Final Prediction = weighted majority vote ◦ Later classifiers get misclassified points – With higher weight, – So they are forced – To concentrate on them ◦ AdaBoost (Adaptive Boosting) ◦ Boosting vs Bagging – Bagging – independent trees – Boosting – successively weighted
  • 50. Random Forests+ — Goal ◦ Model Complexity (-) ◦ Variance (-) ◦ Prediction Accuracy (+) ◦ Builds large collection of de-correlated trees & averages them ◦ Improves Bagging by selecting i.i.d* random variables for splitting ◦ Simpler to train & tune ◦ “Do remarkably well, with very little tuning required” – ESLII ◦ Less susceptible to over fitting (than boosting) ◦ Many RF implementations – Original version - Fortran-77 ! By Breiman/Cutler – Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab * i.i.d – independent identically distributed + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  • 51. ◦ Two Step – Develop a set of learners – Combine the results to develop a composite predictor ◦ Ensemble methods can take the form of: – Using different algorithms, – Using the same algorithm with different settings – Assigning different parts of the dataset to different classifiers ◦ Bagging & Random Forests are examples of ensemble method Ref: Machine Learning In Action Ensemble Methods — Goal ◦ Model Complexity (-) ◦ Variance (-) ◦ Prediction Accuracy (+)
  • 52. Random Forests o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables o Simpler because it requires two variables – no. of Predictors (typically √k) & no. of trees (500 for large dataset, 150 for smaller) o Error prediction • For each iteration, predict for dataset that is not in the sample (OOB data) • Aggregate OOB predictions • Calculate Prediction Error for the aggregate, which is basically the OOB estimate of error rate • Can use this to search for optimal # of predictors • We will see how close this is to the actual error in the Heritage Health Prize o Assumes equal cost for mis-prediction. Can add a cost function o Proximity matrix & applications like adding missing data, dropping outliers Ref: R News Vol 2/3, Dec 2002 Statistical Learning from a Regression Perspective : Berk A Brief Overview of RF by Dan Steinberg
  • 53. Regression Classifiers Continuous Variables Categorical Variables Decision Trees k-NN (Nearest Neighbors) Bias Variance Model Complexity Over-fitting Bagging Boosting CART
  • 54. Model Evaluation & Interpretation – Relevant Digression 9:20
  • 55. Cross Validation o Reference: • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience • Chris Clark ‘s blog : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ • Predictive Modelling in Python with scikit-learn, Olivier Grisel, Strata 2013 • titanic from pycon2014/parallelmaster/An introduction to Predictive Modeling in Python
  • 56. Model Evaluation - Accuracy Predicted=1 Predicted=0 Actual=1 True+ (tp) False- (fn) – Type II Actual=0 False+ (fp) – Type I True- (tn) o Accuracy = (tp + tn) / (tp + fp + fn + tn) o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate ! o Hence the F-measure is a better reflection of the model strength
  • 57. Model Evaluation – Precision & Recall Precision = tp / (tp+fp) • Accuracy • Relevancy Recall = tp / (tp+fn) • True +ve Rate • Coverage • Sensitivity • Hit Rate o Precision = How many items we identified are relevant o Recall = How many relevant items did we identify o Inverse relationship – Tradeoff depends on situations • Legal – Coverage is more important than correctness • Search – Accuracy is more important • Fraud • Support cost (high fp) vs. wrath of credit card co. (high fn) False Alarm Rate = fp / (fp+tn) • Type 1 Error Rate • False +ve Rate • Specificity = 1 – fp rate • Type 1 Error = fp • Type 2 Error = fn http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html Predicted=1 Predicted=0 Actual=1 True+ (tp) False- (fn) – Type II Actual=0 False+ (fp) – Type I True- (tn)
  • 58. Confusion Matrix (rows = Actual, columns = Predicted) C1 C2 C3 C4 C1 10 5 9 3 C2 4 20 3 7 C3 6 4 13 3 C4 2 1 4 15 Correct ones are on the diagonal (cii) Precision for class i = cii / Σj cji (sum down column i) Recall for class i = cii / Σj cij (sum across row i)
  • 59. Model Evaluation : F-Measure Predicted=1 Predicted=0 Actual=1 True+ (tp) False- (fn) – Type II Actual=0 False+ (fp) – Type I True- (tn) Precision = tp / (tp+fp) : Recall = tp / (tp+fn) F-Measure = Balanced, Combined, Weighted Harmonic Mean; measures effectiveness: F = (β² + 1)PR / (β²P + R), equivalently F = 1 / ( α(1/P) + (1 – α)(1/R) ) Common Form (Balanced F1) : β=1 (α = ½) ; F1 = 2PR / (P+R)
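The slide's formulas, sketched as one small R helper (assumes 0/1 coded actual and predicted vectors; the function name is our own):

  eval.binary <- function(actual, predicted) {
    tp <- sum(actual == 1 & predicted == 1); tn <- sum(actual == 0 & predicted == 0)
    fp <- sum(actual == 0 & predicted == 1); fn <- sum(actual == 1 & predicted == 0)
    precision <- tp / (tp + fp); recall <- tp / (tp + fn)
    c(accuracy  = (tp + tn) / (tp + fp + fn + tn),
      precision = precision, recall = recall,
      F1        = 2 * precision * recall / (precision + recall))  # balanced F, beta=1
  }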
  • 60. Hands-on Walkthru - Model Evaluation 712 (80%) Train & 179 Test out of 891 http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval; the Kappa measure is interesting Refer to 2-Model_Evaluation.R at https://github.com/xsankar/cloaked-ironman/
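The 80/20 split the slide describes, as a sketch (2-Model_Evaluation.R in the repo is the authoritative version):

  set.seed(42)
  idx <- sample(nrow(train), size = floor(0.8 * nrow(train)))  # ~712 of the 891 rows
  local.train <- train[idx, ]
  local.test  <- train[-idx, ]                                 # ~179 rows with known labels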
  • 61. ROC Analysis o “How good is my model?” o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf o “A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance” o Much better than evaluating a model based on simple classification accuracy o Plots tp rate vs. fp rate
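A sketch of an ROC curve with the ROCR package, using class-1 probabilities from a tree on the local split above (any model that emits scores works the same way):

  library(rpart); library(ROCR)
  fit    <- rpart(Survived ~ Pclass + Sex + Age, data = local.train, method = "class")
  scores <- predict(fit, newdata = local.test)[, "1"]   # P(survived) column
  pred   <- prediction(scores, local.test$Survived)
  perf   <- performance(pred, "tpr", "fpr")             # tp rate vs. fp rate
  plot(perf); abline(0, 1, lty = 2)                     # dashed diagonal = chance line
  performance(pred, "auc")@y.values[[1]]                # area under the curve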
  • 62. ROC Graph - Discussion o E = Conservative, Everything NO o H = Liberal, Everything YES o Am not making any political statement ! o F = Ideal o G = Worst o The diagonal is the chance line o North West Corner is good o South-East is bad • For example E • Believe it or Not - I have actually seen a graph with the curve in this region ! Actual=1 True+ (tp) False- (fn) Actual=0 False+ (fp) True- (tn) F E Predicted=1 Predicted=0 H G
  • 63. ROC Graph – Clinical Example IFCC : Measures of diagnostic accuracy: basic definitions
  • 64. ROC Graph Walk thru Refer to 2-Model_Evaluation.R at https://github.com/xsankar/cloaked-ironman/
  • 66. 10:00 The Art of a Competition – Session I : Bike Sharing at Washington DC
  • 67. Few interesting Links - Comb the forums o Quick First prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing • Solution by Brandon Harris o Random forest http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction o GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm o Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare o ggplot http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances o Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour o Casual & Registered Users : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count o RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r o Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data o Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402 o Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
  • 68. Data Organization – train, test & submission • datetime - hourly date + timestamp • season • 1 = spring, 2 = summer, 3 = fall, 4 = winter • holiday - whether the day is considered a holiday • workingday - whether the day is neither a weekend nor holiday • weather • 1: Clear, Few clouds, Partly cloudy • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog • temp - temperature in Celsius • atemp - “feels like” temperature in Celsius • humidity - relative humidity • windspeed - wind speed • casual - number of non-registered user rentals initiated • registered - number of registered user rentals initiated • count - number of total rentals
  • 69. Approach o Convert to factors o Engineer new features from date o Explore other synthetic features
  • 70. #1 : ctree Refer to 3-Session-I-Bikes.R at https://github.com/xsankar/cloaked-ironman/
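A hedged sketch of the ctree step with the party package (file paths and the predictor list are assumptions of this sketch; 3-Session-I-Bikes.R is the authoritative version):

  library(party)
  bike <- read.csv("train.csv")                 # the bike-sharing train file
  bike$season  <- as.factor(bike$season)        # the factor conversion from slide 69
  bike$weather <- as.factor(bike$weather)
  fit <- ctree(count ~ season + holiday + workingday + weather +
               temp + atemp + humidity + windspeed, data = bike)
  bike.test <- read.csv("test.csv")
  bike.test$season  <- factor(bike.test$season,  levels = levels(bike$season))
  bike.test$weather <- factor(bike.test$weather, levels = levels(bike$weather))
  pred <- as.numeric(predict(fit, newdata = bike.test))
  pred[pred < 0] <- 0                           # rental counts cannot be negative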
  • 71. #2 : Add Month + year
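Engineering those date features, as a sketch (the datetime column is "YYYY-MM-DD HH:MM:SS" in this competition's csv files; hour/month/year are our own column names):

  dt <- as.POSIXlt(bike$datetime, format = "%Y-%m-%d %H:%M:%S")
  bike$hour  <- as.factor(dt$hour)
  bike$month <- as.factor(dt$mon + 1)           # $mon is 0-based
  bike$year  <- as.factor(dt$year + 1900)       # $year counts from 1900
  # then refit the ctree with + hour + month + year added to the formula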
  • 73. The Art of a Competition - Session II : Walmart Store Prediction 10:45
  • 74. Data Organization & Approach o TBD
  • 75. Comb the forums o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8125/first-place-entry/ o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8023/thank-you-and-2-rank-model o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8095/r-codes-for-arima-naive-exponential-smoothing-forecasting/ o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8055/6-bad-models-make-1-good-model-power-of-ensemble-learning/
  • 76. The Beginning As The End 11:15
  • 77. The Beginning As the End — We started with a set of goals — Homework ◦ For you – Go through the slides – Do the walkthrough – Work on more competitions – Submit entries to Kaggle
  • 78. References: o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn o Just The Basics, Strata 2013, William Cukierski & Ben Hamner • http://strataconf.com/strata2013/public/schedule/detail/27291 o The Problem of Multiple Testing • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf