1. Who will win XLIX?
R, Data Wrangling & Data Science
January 18, 2015
@ksankar // doubleclix.wordpress.com
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." - Friedrich Nietzsche
3. Goals & non-goals
Goals
¤ Get familiar with the R
language & dplyr
¤ Work on a couple of interesting
data science problems
¤ Give you a focused time to
work
§ Work with me. I will wait
if you want to catch-up
¤ Less theory, more usage - let
us see if this works
¤ As straightforward as possible
§ The programs can be
optimized
Non-goals
¡ Go deep into the algorithms
• We don’t have
sufficient time. The topic
can be easily a 5 day
tutorial !
¡ Dive into R internals
• That is for another day
¡ A passive talk
• Nope. Interactive &
hands-on
4. Activities & Results
o Activities:
• Get familiar with R, R Studio
• Work on a couple of data sets
• Get familiar with the mechanics of Data Science Competitions
• Explore the intersection of Algorithms, Data, Intelligence, Inference &
Results
• Discuss Data Science Horse Sense ;o)
o Results :
• Hands-on R
• Familiar with some of the interesting algorithms
• Submitted entries for 1 competition
• Knowledge of Model Evaluation
• Cross Validation, ROC Curves
5. About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, Pydata et al
o Reviewing Packt Book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech),
• Written Books (Web 2.0, Wireless, Java,…)
• Standards, some work in AI,
• Guest Lecturer at Naval PG School,…
• Planning MS-CFinance or Statistics
• Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com
The Nuthead band !
6. Setup & Data
R & IDE
o Install R
o Install R Studio
Tutorial Materials
o Github : https://github.com/xsankar/hairy-octo-hipster
o Clone or download zip
Setup an account in Kaggle (www.kaggle.com)
We will be using the data from 2 Kaggle competitions
① Titanic: Machine Learning from Disaster
Download data from http://www.kaggle.com/c/titanic-gettingStarted
Directory ~/hairy-octo-hipster/titanic-r
② Predicting Bike Sharing @ Washington DC
Download data from http://www.kaggle.com/c/bike-sharing-demand/data
Directory ~/hairy-octo-hipster/bike
③ 2014 NFL Boxscore
http://www.pro-football-reference.com/years/2014/games.htm
Directory ~/hairy-octo-hipster/nfl
7. Agenda
o Jan 18 : 9:00-12:30 3 hrs
o Intro, Goals, Logistics, Setup [10] [9:00-9:10)
o Introduction to R & dplyr [30] [9:10-9:40)
o Who will win Super Bowl XLIX ?
The Art of ELO Ranking [30] [9:40-10:10)
• The Algorithm
• The Data
• The Results (compare with FiveThirtyEight)
o Anatomy of a Kaggle Competition [40] [10:10-10:50)
• Competition Mechanics
• Register, download data, create sub-directories
• Trial Run : Submit Titanic
o Break [20] [10:50-11:10)
o Algorithms for the Amateur Data Scientist [20] [11:10-11:30)
• Algorithms, Tools & frameworks in perspective
• “Folk Wisdom”
o Model Evaluation & Interpretation [30] [11:30 - 12:00)
• Confusion Matrix, ROC Graph
o Homework : The Art of a Competition – Bike Sharing
o Homework : The Art of a Competition – Walmart
8. Overload Warning … There is enough material for a week’s training … which is good & bad !
Read thru at your pace, refer, ponder & internalize
9. Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthrough
— 3rd
◦ Listen To Lectures
◦ More competitions …
11. R Syntax – A quick overview
o aString <- "A String"
o aNumber <- 12
o class(aString)
o class(aNumber)
o aVector <- c(1,2,3,4)
o class(aVector)
o aVector * 2
o sqrt(aVector)
o Packages : dplyr & tidyr
12. Data wrangling with dplyr
o dplyr – versatile package for various data operations
o We will see dplyr in use
o Resources:
• “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 Tutorial Slides
• http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
• Slides https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
• Slides of Tutorial by RStudio’s Garrett Grolemund
• https://github.com/rstudio/webinars
• And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
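Before we put dplyr to work, here is a minimal sketch of the core verbs we will use, run against the built-in mtcars dataset so it works without any downloads (the pipeline itself is illustrative, not from the tutorial scripts):
  library(dplyr)
  mtcars %>%
    filter(cyl >= 6) %>%                          # keep 6 & 8 cylinder cars
    select(mpg, cyl, hp, wt) %>%                  # keep only the columns we need
    mutate(hp_per_ton = hp / wt) %>%              # derive a new column
    group_by(cyl) %>%                             # split ...
    summarise(avg_mpg = mean(mpg), n = n()) %>%   # ... apply & combine
    arrange(desc(avg_mpg))                        # sort the summary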
17. The Art of ELO Ranking
& Super Bowl XLIX
o Let us look at this from 3 angles:
• The Algorithm
• The R program
• The Data
• The Results
• Comparing with the FiveThirtyEight Results
http://www.imdb.com/title/tt1285016/trivia?item=qt1318850
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
19. The ELO Algorithm (1 of 3)
1. Basic Chess Algorithm proposed by Elo
• Arpad Emrick Elo proposed the system for Chess ranking
• R_new = R_old + K(S − μ) ; μ_ij = 1 / (1 + 10^((R_j,old − R_i,old)/400))
• K – varies depending on the match
• S_ij = 1, ½ or 0 (win, draw, loss)
2. Soccer Ranking
• http://www.eloratings.net/system.html
3. NFL Ranking with adjusted factor for scores, 538
Ranking
Ref : Who is #1, Princeton University Press
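To make the update rule concrete, a minimal R sketch of a single ELO update (the function name is mine; S = 1, ½ or 0 for win/draw/loss):
  elo_update <- function(r_i, r_j, s_i, k = 20) {
    mu_i <- 1 / (1 + 10 ^ ((r_j - r_i) / 400))   # expected score of team i
    c(i = r_i + k * (s_i - mu_i),                # winner gains ...
      j = r_j + k * ((1 - s_i) - (1 - mu_i)))    # ... what the loser gives up
  }
  elo_update(1500, 1600, s_i = 1)   # a 1500-rated team upsets a 1600-rated team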
20. The ELO Algorithm (2 of 3)
NFL Ranking
http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
27. Wisdom from Nate Silver & the 538 Gang …
o [Homework #1] Improve our core algorithm
to add the Margin of victory from the 538
gang !
• Remember, kFactor = 20
o [Homework #2] Weigh recent games more
heavily w/ Exponential Decay
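A sketch of where the homework could start. The margin-of-victory multiplier below follows the form in the 538 write-up referenced on these slides, but verify the constants against the article; the half-life in the decay is purely my assumption:
  # Homework #1 : margin-of-victory multiplier (check constants vs. the 538 post)
  mov_multiplier <- function(point_margin, elo_diff_winner) {
    log(abs(point_margin) + 1) * (2.2 / (elo_diff_winner * 0.001 + 2.2))
  }
  # Homework #2 : weigh a game played age_weeks ago with exponential decay
  decay_weight <- function(age_weeks, half_life = 8) {
    0.5 ^ (age_weeks / half_life)   # half_life = 8 is an assumption - tune it
  }
  # combined update : r_new <- r_old + kFactor * mov * decay * (S - mu)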
28. The Art of ELO Ranking
& Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
Ref : Who is #1, Princeton University Press
29. References:
o ELO ranking – NFL, Soccer
• http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
• http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/
• http://www.eloratings.net/system.html
o dplyr
• http://www.rstudio.com/resources/webinars/ <- GitHub for the slides
• http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/
• http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/
• http://www.rstudio.com/resources/cheatsheets/
• http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
31. Kaggle Data Science Competitions
o Hosts Data Science Competitions
o Competition Attributes:
• Dataset
• Train
• Test (Submission)
• Final Evaluation Data Set (We don’t
see)
• Rules
• Time boxed
• Leaderboard
• Evaluation function
• Discussion Forum
• Private or Public
32. Titanic
Titanic Passenger Metadata
• Small
• 3 Predictors : Class, Sex, Age
• Target : Survived?
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction (Washington DC)
Walmart Store Forecasting
33. Train.csv – taken from the Titanic Passenger Manifest
o Survived : 0 = No, 1 = Yes
o Pclass : Passenger Class (1st, 2nd, 3rd)
o Sibsp : Number of Siblings/Spouses Aboard
o Parch : Number of Parents/Children Aboard
o Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
35. Approach
o This is a classification problem - 0 or 1
o Comb the forums !
o Opportunity for us to try different algorithms & compare them
• Simple Model
• CART [Classification & Regression Tree]
• Greedy, top-down binary, recursive partitioning that divides feature space into sets
of disjoint rectangular regions
• RandomForest
• Different parameters
• SVM
• Multiple kernels
• Table the results
o Use cross validation to predict our model performance & correlate with what Kaggle
says
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
36. Simple Model – Our First Submission
o #1 : Simple Model (M=survived)
o #2 : Simple Model (F=survived)
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
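The gender model is only a few lines of R. A sketch (the Kaggle column names PassengerId/Sex/Survived are real; the paths and file name are mine):
  train <- read.csv("titanic-r/train.csv")
  test  <- read.csv("titanic-r/test.csv")
  # Model #2 : every female survives, every male perishes
  pred <- ifelse(test$Sex == "female", 1, 0)
  write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred),
            "gender_model.csv", row.names = FALSE)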
37. #3 : Simple CART Model
o CART (Classification & Regression Tree)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
May be better, because we have improved on the survival of men !
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
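A sketch of the CART fit with rpart (the exact formula in 1-Intro_to_Kaggle.R may differ; rpart handles missing Age via surrogate splits):
  library(rpart)
  fit <- rpart(Survived ~ Pclass + Sex + Age, data = train, method = "class")
  plot(fit); text(fit)   # inspect the splits
  pred <- predict(fit, newdata = test, type = "class")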
38. #4 : Random Forest Model
o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
• Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
o https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
o https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
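A sketch of the random forest fit; randomForest cannot handle NAs directly, so Age is left out here, and again the repo script may differ:
  library(randomForest)
  fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + SibSp + Parch,
                      data = train, ntree = 500, importance = TRUE)
  varImpPlot(fit)   # which predictors pulled their weight ?
  pred <- predict(fit, newdata = test)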
39. #5 : SVM
o Multiple Kernels
o kernel = ‘radial’ #Radial Basis Function
o kernel = ‘sigmoid’
o agconti's blog - Ultimate Titanic !
o http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
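A sketch of the e1071 fit; swap the kernel argument to compare the two kernels (predictor set is mine, mirroring the random forest sketch):
  library(e1071)
  fit <- svm(as.factor(Survived) ~ Pclass + Sex + SibSp + Parch,
             data = train, kernel = "radial")   # try kernel = "sigmoid" too
  pred <- predict(fit, newdata = test)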
40. Feature Engineering - Homework
o Add attribute : Age
• In train 714/891 have age; in test 332/418 have age
• Missing values can be just Mean Age of all passengers
• We could be more precise and calculate Mean Age based on Title (Ms,
Mrs, Master et al)
• Box plot age
o Add attribute : Mother, Family size et al
o Feature engineering ideas
• http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o More ideas at
http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
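A sketch of the age imputation, blunt overall mean first, then mean-by-title (the gsub pattern for pulling the title out of Name is a common forum trick, not from the repo):
  # blunt : overall mean age
  # train$Age[is.na(train$Age)] <- mean(train$Age, na.rm = TRUE)
  # finer : mean age per title (Mr, Mrs, Miss, Master, ...)
  train$Title <- gsub("(.*, )|(\\..*)", "", train$Name)
  title_means <- tapply(train$Age, train$Title, mean, na.rm = TRUE)
  miss <- is.na(train$Age)
  train$Age[miss] <- title_means[train$Title[miss]]
  boxplot(Age ~ Pclass, data = train)   # the box plot homework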
41. What does it mean ? Let us ponder ….
o We have a training data set representing a domain
• We reason over the dataset & develop a model to predict outcomes
o How good is our prediction when it comes to real life scenarios ?
o The assumption is that the dataset is taken at random
• Or Is it ? Is there a Sampling Bias ?
• i.i.d ? Independent ? Identically Distributed ?
• What about homoscedasticity ? Do they have the same finite variance ?
o Can we assure that another dataset (from the same domain) will give us the same
result ?
o Will our model & its parameters remain the same if we get another data set ?
o How can we evaluate our model ?
o How can we select the right parameters for a selected model ?
43. Algorithms for the
Amateur Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any
man who can hitch the length and breadth of the Galaxy, rough it … win through, and still
know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …
44. Ref: Anthony’s Kaggle Presentation
Data Scientists apply different techniques
• Support Vector Machine
• adaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forest
• Logistic Regression
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filter
• Evolutionary Fuzzy Modelling
• Neural Networks
Quora
• http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
45. Algorithm spectrum (Machine Learning → Cute Math → Artificial Intelligence)
o Machine Learning : Regression, Logit, CART, Ensemble (Random Forest), Clustering, KNN
o Cute Math : Genetic Alg, Simulated Annealing, Collab Filtering
o Artificial Intelligence : SVM, Kernels, SVD, NNet, Boltzmann Machine, Feature Learning
46. Classifying Classifiers
o Statistical : Regression, Logistic Regression¹, Naïve Bayes, Bayesian Networks
o Structural :
• Rule-based : Production Rules, Decision Trees
• Distance-based :
§ Functional : Linear, Spectral, Wavelet
§ Nearest Neighbor : kNN, Learning Vector Quantization
• Neural Networks : Multi-layer Perceptron
o Ensemble : Random Forests, Boosting
o SVM
¹ Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
49. Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer
Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond
it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
50. Data Science “folk knowledge” (2 of A)
o Overfitting has many faces
• Bias – Model not strong enough. So the learner has the tendency to learn the
same wrong things
• Variance – Learning too much from one dataset; model will fall apart (i.e. be much
less accurate) on a different dataset
• Sampling Bias
o Intuition Fails in high Dimensions – Bellman
• Blessing of non-conformity & lower effective dimension; many applications
have examples not uniformly spread but concentrated near a lower dimensional
manifold e.g. the space of digits is much smaller than the space of images
o Theoretical Guarantees are not What they seem
• One of the major developments of recent decades has been the realization that
we can have guarantees on the results of induction, particularly if we are
willing to settle for probabilistic guarantees.
o Feature engineering is the Key
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
51. Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm
• Or conversely select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not Just One
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• Just because a function can be represented does not mean
it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
52. Data Science “folk knowledge” (4 of A)
o The simplest hypothesis that fits the data is also the most
plausible
• Occam’s Razor
• Don’t go for a 4 layer Neural Network unless
you have that complex data
• But that doesn’t also mean that one should
choose the simplest hypothesis
• Match the impedance of the domain, data & the
algorithms
o Think of over fitting as memorizing as opposed to learning.
o Data leakage has many forms
o Sometimes the Absence of Something is Everything
o [Corollary] Absence of Evidence is not the Evidence of
Absence
New to Machine Learning? Avoid these three mistakes, James Faghmous
https://medium.com/about-data/73258b3848a4
§ Simple Model
• High error line that cannot be compensated with more data
• Gets to a lower error rate with less data points
§ Complex Model
• Lower error line
• But needs more data points to reach decent error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
53. Importance of feature selection & weak models
o “Good features allow a simple model to beat a complex model” – Ben Lorica¹
o “… using many weak predictors will always be more accurate than using a few
strong ones …” – Vladimir Vapnik²
o “A good decision rule is not a simple one, it cannot be described by a very few
parameters” ²
o “Machine learning science is not only about computers, but about humans, and
the unity of logic, emotion, and culture.” ²
o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well,
but it can’t surprise you” – Hadley Wickham³
¹ http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html
² http://nautil.us/issue/6/secret-codes/teaching-me-softly
³ http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
54. Check your assumptions
o The decisions a model makes are directly related to its assumptions about the
statistical distribution of the underlying data
o For example, for regression one should check that:
① Variables are normally distributed
• Test for normality via visual inspection, skew & kurtosis, outlier inspections via
plots, z-scores et al
② There is a linear relationship between the dependent & independent
variables
• Inspect residual plots, try quadratic relationships, try log plots et al
③ Variables are measured without error
④ Assumption of Homoscedasticity
§ Homoscedasticity assumes constant or near constant error variance
§ Check the standard residual plots and look for heteroscedasticity
§ For example in the figure, left box has the errors scattered randomly around zero; while the
right two diagrams have the errors unevenly distributed
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test,
http://pareonline.net/getvn.asp?v=8&n=2
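R's built-in lm diagnostics cover most of these checks. A sketch, using a hypothetical regression on the bike-sharing columns described later in this deck:
  fit <- lm(count ~ temp + humidity + windspeed, data = bike)   # placeholder model
  par(mfrow = c(2, 2))
  plot(fit)   # residuals vs. fitted (linearity, heteroscedasticity),
              # Q-Q plot (normality), scale-location, leverage
  # numeric normality checks : e1071::skewness(bike$count), e1071::kurtosis(bike$count)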
55. Data Science “folk knowledge” (5 of A)
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
The Knowns/Unknowns quadrant (The World : Knowns vs. Unknowns × You : Known vs. Unknown):
o Known Knowns – What we do
o Known Unknowns – Potential facts, outcomes we are aware of, but not with certainty; Stochastic processes, Probabilities
o Unknown Knowns – Others know, you don’t
o Unknown Unknowns – Facts, outcomes or scenarios we have not encountered, nor considered; “Black swans”, outliers, long tails of probability distributions; Lack of experience, imagination
o Known Knowns
o There are things we know that we know
o Known Unknowns
o That is to say, there are things that we
now know we don't know
o But there are also Unknown Unknowns
o There are things we do not know we
don't know
56. Data Science “folk knowledge” (6 of A) - Pipeline
Data Management :
o Collect – Access to multiple sources of data; Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Store – Volume, Velocity, Streaming Data; Canonical form; Data catalog; Data Fabric across the organization
o Transform – Metadata; Monitor counters & Metrics; Structured vs. Multi-structured
Data Science :
o Reason – Flexible & Selectable Data Subsets & Attribute sets; Dynamic Data Sets; 2-way key-value tagging of datasets; Extended attribute sets; Advanced Analytics
o Model – Refine model with Extended Data subsets & Engineered Attribute sets; Validation run across a larger data set
o Deploy – Scalable Model Deployment; Big Data automation & purpose-built appliances (soft/hard); Manage SLAs & response times
Visualize – Explore – Recommend – Predict :
o Performance; Scalability; Refresh Latency; In-memory Analytics
o Advanced Visualization; Interactive Dashboards; Map Overlay; Infographics
¤ Bytes to Business
a.k.a. Build the full
stack
¤ Find Relevant Data
For Business
¤ Connect the Dots
57. Data Science “folk knowledge” (7 of A)
o Volume, Velocity, Variety – “Data of unusual size” that can't be brute forced
o Context, Connectedness, Intelligence, Interface, Inference
o Three Amigos :
• Interface = Cognition
• Intelligence = Compute(CPU) & Computational(GPU)
• Infer Significance & Causality
58. Data Science “folk knowledge” (8 of A)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
59. Data Science “folk knowledge” (9 of A)
① Common Sense (some features make more sense than others)
② Carefully read these forums to get a peek at other peoples’ mindset
③ Visualizations
④ Train a classifier (e.g. logistic regression) and look at the feature weights
⑤ Train a decision tree and visualize it
⑥ Cluster the data and look at what clusters you get out
⑦ Just look at the raw data
⑧ Train a simple classifier, see what mistakes it makes
⑨ Write a classifier using handwritten rules
⑩ Pick a fancy method that you want to apply (Deep Learning/Nnet)
-- Maarten Bosma
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
60. Data Science “folk knowledge” (A of A)
Lessons from Kaggle Winners
① Don’t over-fit
② All predictors are not needed
• All data rows are not needed, either
③ Tuning the algorithms will give different results
④ Reduce the dataset (Average, select transition data,…)
⑤ Test set & training set can differ
⑥ Iteratively explore & get your head around data
⑦ Don’t be afraid to submit simple solutions
⑧ Keep a tab & history of your submissions
61. The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
Data Scientist (noun): Person who is better at
statistics than any software engineer & better
at software engineering than any statistician
– Josh Wills (Cloudera)
Data Scientist (noun): Person who is worse at
statistics than any statistician & worse at
software engineering than any software
engineer – Will Cukierski (Kaggle)
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
62. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
63. For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Yaser Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
65. Bias/Variance (1 of 2)
o Model Complexity
• Complex Model increases the
training data fit
• But then it overfits & doesn't
perform as well with real data
o Bias vs. Variance
o Classical diagram
o From ESL II, by Hastie, Tibshirani & Friedman
o Bias – Model learns wrong things; not
complex enough; error gap small; more
data by itself won’t help
o Variance – Different dataset will give
different error rate; over fitted model;
larger error gap; more data could help
[Figure: classical learning curve – prediction error vs. training error]
Ref: Andrew Ng/Stanford, Yaser S./CalTech
66. Bias/Variance (2 of 2)
o High Bias
• Due to Underfitting
• Add more features
• More sophisticated model
• Quadratic Terms, complex equations,…
• Decrease regularization
o High Variance
• Due to Overfitting
• Use fewer features
• Use more training sample
• Increase Regularization
[Figure: learning curves – high bias: need more features or a more complex model to improve; high variance: need more data to improve]
Ref: Strata 2013 Tutorial by Olivier Grisel
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
67. Partition Data !
• Training (60%)
• Validation(20%) &
• “Vault” Test (20%) Data sets
k-fold Cross-Validation
• Split data into k equal parts
• Fit model to k-1 parts &
calculate prediction error on kth
part
• Non-overlapping dataset
Data Partition &
Cross-Validation
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
[Figure: Train / Validate / Test partition, and k-fold CV (k=5) rotating the validation fold across partitions #1–#5]
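A base-R sketch of the 5-fold machinery (caret::createFolds does the same more carefully; the logit model on the Titanic train set is just a placeholder):
  set.seed(13)
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(train)))   # non-overlapping parts
  errs <- sapply(1:k, function(i) {
    fit  <- glm(Survived ~ Pclass + Sex, family = binomial,
                data = train[folds != i, ])             # fit on k-1 parts
    prob <- predict(fit, train[folds == i, ], type = "response")
    mean((prob > 0.5) != train$Survived[folds == i])    # error on the k-th part
  })
  mean(errs)   # cross-validated estimate of the prediction error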
68. Bootstrap
• Draw datasets (with replacement) and fit model for each dataset
• Remember : Data Partitioning (#1) & Cross Validation (#2) are without
replacement
Bootstrap & Bagging
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
Bagging (Bootstrap aggregation)
◦ Average prediction over a collection of
bootstrap-ed samples, thus reducing
variance
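A sketch of bagging by hand, to make the with-replacement point concrete (B = 25 and the tree model are my placeholders):
  library(rpart)
  B <- 25
  boot_preds <- replicate(B, {
    idx <- sample(nrow(train), replace = TRUE)   # bootstrap : WITH replacement
    fit <- rpart(Survived ~ Pclass + Sex + Age, data = train[idx, ],
                 method = "class")
    as.integer(as.character(predict(fit, test, type = "class")))
  })
  pred <- as.integer(rowMeans(boot_preds) > 0.5)   # average = majority vote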
69. Boosting
◦ “Output of weak classifiers into a powerful committee”
◦ Final Prediction = weighted majority vote
◦ Later classifiers get misclassified points
– With higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs Bagging
– Bagging – independent trees
– Boosting – successively weighted
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
70. Random Forests+
◦ Builds large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to over fitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77 ! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
71. Ensemble Methods
◦ Two Step
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble method
Ref: Machine Learning In Action
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
72. Random Forests
o While Boosting splits based on best among all variables, RF splits based on best among
randomly chosen variables
o Simpler because it requires two variables – no. of Predictors (typically √k) & no. of trees
(500 for large dataset, 150 for smaller)
o Error prediction
• For each iteration, predict for dataset that is not in the sample (OOB data)
• Aggregate OOB predictions
• Calculate Prediction Error for the aggregate, which is basically the OOB
estimate of error rate
• Can use this to search for optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
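The OOB machinery is exposed directly by the randomForest package. A sketch of reading the OOB error and searching for the optimal number of predictors (the predictor set is the one from the Titanic walkthrough):
  library(randomForest)
  fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + SibSp + Parch,
                      data = train, ntree = 500)
  print(fit)   # includes the OOB estimate of error rate
  plot(fit)    # OOB error vs. number of trees
  tuneRF(train[, c("Pclass", "Sex", "SibSp", "Parch")],
         as.factor(train$Survived), ntreeTry = 150)   # search optimal mtry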
76. Model Evaluation - Accuracy
o Accuracy = (tp + tn) / (tp + fp + fn + tn)
o For cases where tn is large compared to tp, a degenerate return(false) will be
very accurate !
o Hence the F-measure is a better reflection of the model strength
Confusion matrix :
| Predicted=1 | Predicted=0
Actual=1 | True+ (tp) | False- (fn) – Type II error
Actual=0 | False+ (fp) – Type I error | True- (tn)
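In R the whole confusion matrix falls out of table(). A minimal sketch, assuming pred and actual are 0/1 vectors from any of the Titanic models above:
  cm <- table(actual = actual, predicted = pred)   # 2x2 confusion matrix
  accuracy <- sum(diag(cm)) / sum(cm)              # (tp + tn) / total
  accuracy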
77. Model Evaluation – Precision & Recall
o Precision = How many items we identified are relevant
o Recall = How many relevant items did we identify
o Inverse relationship – Tradeoff depends on situations
• Legal – Coverage is more important than correctness
• Search – Accuracy is more important
• Fraud
• Support cost (high fp) vs. wrath of credit card co. (high fn)
o Precision = tp / (tp + fp) – a.k.a. Accuracy, Relevancy
o Recall = tp / (tp + fn) – a.k.a. True +ve Rate, Coverage, Sensitivity, Hit Rate
o False Alarm Rate = fp / (fp + tn) – a.k.a. Type 1 Error Rate, False +ve Rate
• Specificity = 1 – fp rate
• Type 1 Error = fp ; Type 2 Error = fn
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
Confusion matrix :
| Predicted=1 | Predicted=0
Actual=1 | True+ (tp) | False- (fn) – Type II error
Actual=0 | False+ (fp) – Type I error | True- (tn)
79. Model Evaluation : F-Measure
Precision = tp / (tp+fp) ; Recall = tp / (tp+fn)
F-Measure : Balanced, Combined, Weighted Harmonic Mean, measures effectiveness
F = 1 / (α(1/P) + (1-α)(1/R)) = (β² + 1)PR / (β²P + R)
Common Form (Balanced F1) : β=1 (α = ½) ; F1 = 2PR / (P+R)
Confusion matrix :
| Predicted=1 | Predicted=0
Actual=1 | True+ (tp) | False- (fn) – Type II error
Actual=0 | False+ (fp) – Type I error | True- (tn)
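Continuing the sketch from the accuracy slide, precision, recall & F1 fall out of the same table() output (this assumes the positive class is labeled 1):
  tp <- cm["1", "1"]; fp <- cm["0", "1"]; fn <- cm["1", "0"]
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)   # β = 1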
80. Hands-on Walkthru - Model Evaluation
Train : 712 (80%) ; Test : 179 ; Total : 891
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf – model eval
Kappa measure is interesting
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
81. ROC Analysis
o “How good is my model?”
o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing,
organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate
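A minimal ROCR sketch (ROCR is on CRAN; prob is a vector of predicted scores and actual the true 0/1 labels, both placeholders):
  library(ROCR)
  pred_obj <- prediction(prob, actual)
  perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
  plot(perf); abline(0, 1, lty = 2)            # dashed diagonal = chance
  performance(pred_obj, "auc")@y.values[[1]]   # area under the curve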
82. ROC Graph - Discussion
o E = Conservative, Everything NO
o H = Liberal, Everything YES
o Am not making any
political statement !
o F = Ideal
o G = Worst
o The diagonal is the chance
o North West Corner is good
o South-East is bad
• For example E
• Believe it or Not - I have
actually seen a graph
with the curve in this
region !
[Figure: ROC graph with classifiers E (conservative), F (ideal), G (worst), H (liberal) plotted in tp-rate vs. fp-rate space]
83. ROC Graph – Clinical Example
IFCC : Measures of diagnostic accuracy: basic definitions
84. ROC Graph Walk thru
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
86. References:
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
89. Few interesting Links - Comb the forums
o Quick First prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing
• Solution by Brandon Harris
o Random forest http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction
o GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm
o Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare
o Ggplot http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances
o Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour
o Casual & Registered Users :
http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count
o RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r
o Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data
o Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402
o Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
90. Data Organization – train, test & submission
• datetime - hourly date + timestamp
• Season
• 1 = spring, 2 = summer, 3 = fall, 4 = winter
• holiday - whether the day is considered a holiday
• workingday - whether the day is neither a weekend nor holiday
• Weather
• 1: Clear, Few clouds, Partly cloudy, Partly cloudy
• 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
• 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds,
Light Rain + Scattered clouds
• 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
• temp - temperature in Celsius
• atemp - "feels like" temperature in Celsius
• humidity - relative humidity
• windspeed - wind speed
• casual - number of non-registered user rentals initiated
• registered - number of registered user rentals initiated
• count - number of total rentals
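A sketch of loading the data and pulling the hour out of datetime, a recurring tip in the forum links above (the path is mine):
  bike <- read.csv("bike/train.csv")
  # datetime looks like "2011-01-01 00:00:00" ; characters 12-13 are the hour
  bike$hour    <- as.integer(substr(bike$datetime, 12, 13))
  bike$season  <- factor(bike$season, labels = c("spring", "summer", "fall", "winter"))
  bike$weather <- factor(bike$weather)
  aggregate(count ~ hour, data = bike, FUN = mean)   # demand by hour of day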