This document is an introduction to machine/statistical learning. The talk aims to provide a sufficient basis for applied predictive modeling rather than a deep treatment or derivation of ML algorithms. The outline covers model purpose (prediction vs. explanation); the basic study design of ML, including model representation, classification vs. regression problems, and supervised vs. unsupervised learning; model assessment and selection, including the interplay between bias, variance, and complexity, and the wrong and correct ways of doing cross-validation; and it closes with the single algorithm hypothesis and deep learning.
2. The purpose of this talk
• Not to develop robust understanding of ML
algorithms nor to derive them
• But to provide sufficient basis to do applied
predictive modeling
• Our goal is predictive modeling: building
accurate models by applying statistical
principles, feature engineering, model tuning,
appropriate ML algorithms, and error analysis
3. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
5. Ex. Models for Explanation
Wong, P. T. P. (2014). Viktor Frankl’s meaning seeking model and positive psychology.
9. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
19. Abstract supervised setup
• Training set: {(x_i, y_i), i = 1, ..., m}
• x : input vector
• y : response variable
– y ∈ {0, 1} : binary classification
– y ∈ R : regression
– what we want to be able to predict, having observed some new x
• x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n})^T , x_{i,j} ∈ R
• x : Independent Variables / Predictors / Features
• y : Dependent Variables / Responses
(A short code sketch of this setup follows.)
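A minimal sketch of this setup in Python (illustrative only; the synthetic data, dimensions, and scikit-learn estimators are my assumptions, not part of the talk):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# m = 100 training examples, n = 4 features: X stacks the input vectors x_i
# as rows, so X[i] = (x_{i,1}, ..., x_{i,n}).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Binary classification: y takes values in {0, 1}
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)

# Regression: y takes values in R
y_reg = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)

# Predict the response for a new, previously unseen x
x_new = rng.normal(size=(1, 4))
print(clf.predict(x_new), reg.predict(x_new))
```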
20. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
27. To recap: some definitions
• Variance
– the amount by which the prediction would change if we
estimated it using a different training data set
• Bias
– the error that is introduced by approximating a real-
life problem
– more flexible methods result in less bias, but more
variance
• Flexibility = degrees of freedom ~ Complexity
– can be adjusted via a regularization parameter
– or by increasing/reducing the number of features (see the sketch below)
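A minimal sketch of this trade-off (my own illustration, not from the slides): fitting polynomials of increasing degree to noisy data, where a higher degree means more flexibility, lower bias, and higher variance, visible as a growing gap between training and test error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy nonlinear truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree controls flexibility: low degree -> high bias, high degree -> high variance
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```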
28. Study design – training/test sets
Training and Quiz/Test sets come from different distributions. Since submissions to the competition can only be done once per day, this Probe set allows for a tighter feedback loop for evaluation of promising models.
An Introduction to Statistical Learning, Ch 5 Resampling Methods
29. In practice – training/CV/test set
• Training set
– used to fit the models
• Validation set
– used to estimate prediction error for model selection
• Test set
– used for assessment of the generalization error of the final chosen
model.
The Elements of Statistical Learning ch7. Model Assessment and Selection
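A minimal sketch of this three-way split (the 60/20/20 proportions and synthetic data are my assumptions, not from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# First hold out 20% as the test set, used only once for the final chosen model
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Then split the rest into training (fit models) and validation (model selection)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```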
37. Compare these two CV methods,
what’s different and what’s wrong ?
Left method:
1. Screen the predictors
– find a subset of “good” predictors that show fairly strong
(univariate) correlation with the class labels
2. Build a multivariate classifier
– using just this subset of predictors
3. Apply cross-validation
– to estimate the unknown tuning parameters and to estimate the
prediction error of the final model.
Right method:
1. Divide the samples into K cross-validation folds (groups) at random
2. For each fold k = 1, 2, ..., K
a. Find a subset of “good” predictors that show fairly strong
(univariate) correlation with the class labels, using all of the
samples except those in fold k.
b. Using just this subset of predictors, build a multivariate
classifier, using all of the samples except those in fold k.
c. Use the classifier to predict the class labels for the samples in fold k.
38. The predictors chosen by the left
method have an unfair advantage
• they were chosen in step
(1) on the basis of all of
the samples.
• Leaving samples out after
the variables have been
selected does not
correctly mimic the
application of the
classifier to a completely
independent test set
• these predictors “have
already seen” the left-out
samples.
The Elements of Statistical Learning ch7. Model Assessment and Selection
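A minimal sketch of the right-hand procedure (my own illustration using scikit-learn; the univariate screening via SelectKBest and the pure-noise data are assumptions): putting the predictor screening inside a pipeline means it is re-run within each training fold, so the held-out fold never influences which predictors are chosen.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5000))   # many noise predictors, few samples
y = rng.integers(0, 2, size=50)   # labels independent of X

# Screening is a pipeline step, so it happens inside each CV training fold
model = make_pipeline(SelectKBest(f_classif, k=100), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

# With pure-noise predictors this accuracy hovers around 0.5, as it should.
# Screening the predictors on ALL samples first and cross-validating only the
# classifier afterwards would report a misleadingly high accuracy instead.
print(scores.mean())
```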
39. Recap principles from Statistics
– K-fold CV is a form of random sampling
Coursera Course, Data Analysis and Statistical Inference by Dr. Mine Çetinkaya-Rundel
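A minimal sketch of that point (illustrative only): K-fold CV randomly partitions the sample indices into K folds, i.e. random sampling without replacement into groups.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # random partition into 5 folds

for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {i}: held-out samples {test_idx}")
```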
40. ML algorithm performance is
dependent on the underlying data
An Introduction to Statistical Learning, Ch 8 Tree methods
41. More issues to be covered in next talk
• Remedies for Severe Class Imbalance
• Measuring Predictor Importance
• Factors That Can Affect Model Performance
42. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
43. Back then, the prevailing wisdom
• MIT's Marvin Minsky – a “Society of Mind”
– To achieve AI, it was believed, engineers would
have to build and combine thousands of individual
computing units or agents.
– One group of agents, or module, would handle
vision, another language, and so on…
44. The Single Algorithm Hypothesis
• Human intelligence stems from a single learning
algorithm
– In a 1978 paper by Vernon Mountcastle: An Organizing
Principle for Cerebral Function
– Jeff Hawkins' “memory-prediction framework”
• Origin
– Neuroplasticity during brain development
– Potential of other cortical areas to take over previously
lost function after brain injury (e.g. stroke)
45. Deep Learning - 1
• Single Algorithm
– neural networks to mimic human brain behavior
• A basic layer of artificial neurons that can detect simple
things like the edges of a particular shape
• The next layer could then piece together these edges
to identify the larger shape
• Then the shapes could be strung together to
understand an object
• Key: the software does all this on its own
– give the system a lot of data, so it can discover by
itself what some of the concepts in the world are
The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, Wired
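A minimal sketch of that layered idea (my own illustration in Keras; the layer sizes and the binary "object present" task are assumptions, not the system from the article): early layers can learn simple local features such as edges, and later layers combine them into shapes and objects.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),           # small grayscale image patches
    layers.Conv2D(16, 3, activation="relu"),  # low-level features, e.g. edges
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),  # combinations of edges -> shapes
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # higher-level combinations
    layers.Dense(1, activation="sigmoid"),    # e.g. "object present or not"
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```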
46. Deep Learning - 2
• This approach is inspired by how scientists believe that
humans learn.
– The algorithm didn’t know the word “cat” — Ng had to
supply that — but over time, it learned to identify the
furry creatures we know as cats, all on its own.
– As babies, we watch our environments and start to
understand the structure of the objects we encounter, but
until a parent tells us what they are, we can't put names to them.
• Building High-level Features Using Large Scale
Unsupervised Learning
The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, Wired
Building High-level Features Using Large Scale Unsupervised Learning, QV Le, et al