This document is an introduction to machine/statistical learning. The talk aims to provide a sufficient basis for applied predictive modeling rather than a deep treatment or derivation of ML algorithms. The outline covers model purpose (prediction vs. explanation); the basic study design of ML, including model representation, classification vs. regression problems, and supervised vs. unsupervised learning; model assessment and selection, including the interplay between bias, variance, and complexity, and the wrong and correct ways of doing cross-validation; and it closes with the single algorithm hypothesis and deep learning.
2. The purpose of this talk
• Not to develop robust understanding of ML
algorithms nor to derive them
• But to provide sufficient basis to do applied
predictive modeling
• Our goal is predictive modeling: building
accurate models by applying statistical
principles, feature engineering, model tuning,
appropriate ML algorithms, and error analysis
3. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
5. Ex. Models for Explanation
Wong, P. T. P. (2014). Viktor Frankl’s meaning seeking model and positive psychology.
9. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
19. Abstract supervised setup
• Training set: {(x_i, y_i), i = 1, ..., m}
• x : input vector
• y : response variable
– y ∈ {0, 1} : binary classification
– y ∈ R : regression
– what we want to be able to predict, having observed some new x
• x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n})^T , x_{i,j} ∈ R
• x : Independent Variables / Predictors / Features
• y : Dependent Variables / Responses
(A short code sketch of this setup follows.)
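A minimal sketch of this setup in Python (illustrative only; the synthetic data, dimensions, and scikit-learn estimators are my assumptions, not part of the talk):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# m = 100 training examples, n = 4 features: X stacks the input vectors x_i
# as rows, so X[i] = (x_{i,1}, ..., x_{i,n}).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Binary classification: y takes values in {0, 1}
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)

# Regression: y takes values in R
y_reg = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)

# Predict the response for a new, previously unseen x
x_new = rng.normal(size=(1, 4))
print(clf.predict(x_new), reg.predict(x_new))
```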
20. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
27. To recap: some definitions
• Variance
– the amount by which the prediction would change if we
estimated it using a different training data set
• Bias
– the error that is introduced by approximating a real-
life problem
– more flexible methods result in less bias, but more
variance
• Flexibility = degrees of freedom ~ Complexity
– can be adjusted via a regularization parameter
– or by increasing/reducing the number of features (see the sketch below)
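A minimal sketch of this trade-off (my own illustration, not from the slides): fitting polynomials of increasing degree to noisy data, where a higher degree means more flexibility, lower bias, and higher variance, visible as a growing gap between training and test error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy nonlinear truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree controls flexibility: low degree -> high bias, high degree -> high variance
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```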
28. Study design – training/test sets
Training and Quiz/Test sets come from different distributions. Since submissions to the competition can only be done once per day, this Probe set allows for a tighter feedback loop for evaluation of promising models.
An Introduction to Statistical Learning, Ch 5 Resampling Methods
29. In practice – training/CV/test set
• Training set
– used to fit the models
• Validation set
– used to estimate prediction error for model selection
• Test set
– used for assessment of the generalization error of the final chosen
model.
The Elements of Statistical Learning ch7. Model Assessment and Selection
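A minimal sketch of this three-way split (the 60/20/20 proportions and synthetic data are my assumptions, not from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# First hold out 20% as the test set, used only once for the final chosen model
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Then split the rest into training (fit models) and validation (model selection)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```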
37. Compare these two CV methods,
what’s different and what’s wrong ?
Left method:
1. Screen the predictors
– find a subset of “good” predictors that show fairly strong
(univariate) correlation with the class labels
2. Build a multivariate classifier
– using just this subset of predictors
3. Apply cross-validation
– to estimate the unknown tuning parameters and to estimate the
prediction error of the final model.
Right method:
1. Divide the samples into K cross-validation folds (groups) at random
2. For each fold k = 1, 2, ..., K
a. Find a subset of “good” predictors that show fairly strong
(univariate) correlation with the class labels, using all of the
samples except those in fold k.
b. Using just this subset of predictors, build a multivariate
classifier, using all of the samples except those in fold k.
c. Use the classifier to predict the class labels for the samples in fold k.
38. The predictors chosen by the left
method have an unfair advantage
• they were chosen in step
(1) on the basis of all of
the samples.
• Leaving samples out after
the variables have been
selected does not
correctly mimic the
application of the
classifier to a completely
independent test set
• these predictors “have
already seen” the left-out
samples.
The Elements of Statistical Learning ch7. Model Assessment and Selection
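A minimal sketch of the right-hand procedure (my own illustration using scikit-learn; the univariate screening via SelectKBest and the pure-noise data are assumptions): putting the predictor screening inside a pipeline means it is re-run within each training fold, so the held-out fold never influences which predictors are chosen.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5000))   # many noise predictors, few samples
y = rng.integers(0, 2, size=50)   # labels independent of X

# Screening is a pipeline step, so it happens inside each CV training fold
model = make_pipeline(SelectKBest(f_classif, k=100), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

# With pure-noise predictors this accuracy hovers around 0.5, as it should.
# Screening the predictors on ALL samples first and cross-validating only the
# classifier afterwards would report a misleadingly high accuracy instead.
print(scores.mean())
```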
39. Recap principles from Statistics
– K-fold CV is a form of random sampling
Coursera Course, Data Analysis and Statistical Inference by Dr. Mine Çetinkaya-Rundel
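A minimal sketch of that point (illustrative only): K-fold CV randomly partitions the sample indices into K folds, i.e. random sampling without replacement into groups.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # random partition into 5 folds

for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {i}: held-out samples {test_idx}")
```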
40. ML algorithm performance is
dependent on the underlying data
An Introduction to Statistical Learning, Ch 8 Tree methods
41. More issues to be covered in next talk
• Remedies for Severe Class Imbalance
• Measuring Predictor Importance
• Factors That Can Affect Model Performance
42. Preliminary outline
• Model purpose – for prediction, for explanation
• The basic study design of Machine learning
– Model Representation
– Classification vs. Regression Problems
– Supervised vs. Unsupervised Learning
• Model Assessment & Selection
– Interplay between Bias, Variance & Complexity
– Cross Validation: The wrong/correct way of doing it
• The Single Algorithm Hypothesis & Deep Learning
43. Back then, the prevailing wisdom
• MIT's Marvin Minsky – a “Society of Mind”
– To achieve AI, it was believed, engineers would
have to build and combine thousands of individual
computing units or agents.
– One group of agents, or module, would handle
vision, another language, and so on…
44. The Single Algorithm Hypothesis
• Human intelligence stems from a single learning
algorithm
– In a 1978 paper by Vernon Mountcastle: An Organizing
Principle for Cerebral Function
– Jeff Hawkins' “memory-prediction framework”
• Origin
– Neuroplasticity during brain development
– Potential of other cortical areas to take over previously
lost function after brain injury (e.g. stroke)
45. Deep Learning - 1
• Single Algorithm
– neural networks to mimic human brain behavior
• A basic layer of artificial neurons that can detect simple
things like the edges of a particular shape
• The next layer could then piece together these edges
to identify the larger shape
• Then the shapes could be strung together to
understand an object
• Key: the software does all this on its own
– give the system a lot of data, so it can discover by
itself what some of the concepts in the world are
The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, Wired
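A minimal sketch of that layered idea (my own illustration in Keras; the layer sizes and the binary "object present" task are assumptions, not the system from the article): early layers can learn simple local features such as edges, and later layers combine them into shapes and objects.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),           # small grayscale image patches
    layers.Conv2D(16, 3, activation="relu"),  # low-level features, e.g. edges
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),  # combinations of edges -> shapes
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # higher-level combinations
    layers.Dense(1, activation="sigmoid"),    # e.g. "object present or not"
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```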
46. Deep Learning - 2
• This approach is inspired by how scientists believe that
humans learn.
– The algorithm didn’t know the word “cat” — Ng had to
supply that — but over time, it learned to identify the
furry creatures we know as cats, all on its own.
– As babies, we watch our environments and start to
understand the structure of the objects we encounter, but
until a parent tells us what they are, we can't put names to them.
• Building High-level Features Using Large Scale
Unsupervised Learning
The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, Wired
Building High-level Features Using Large Scale Unsupervised Learning, QV Le, et al