The first lecture from the Machine Learning course lecture series. The lecture covers basic principles of machine learning, such as the difference between supervised and unsupervised learning; several classifiers: nearest neighbour (k-NN), decision trees, random forest; the major obstacles in machine learning: overfitting and the curse of dimensionality; followed by the cross-validation algorithm and the general ML pipeline. Practicals that I have designed for this course, in both R and Python, are available on my GitHub (https://github.com/skyfallen/MachineLearningPracticals). I can share the keynote files; contact me via e-mail: dmytro.fishman@ut.ee.
28. Classification task
Predicting a discrete-valued output using previously labelled examples.
Also: binary classification. Any time you have to distinguish between TWO CLASSES, it is a binary classification.
39. You are running a company which has two problems, namely:
1. For each user in the database, predict whether this user will continue using your company's product or will move to competitors (churn).
2. Predict the profit of your company at the end of this year based on previous records.
How would you approach these problems?
Q:
a. Both problems are examples of classification problems
b. The first one is a classification task and the second one a regression problem
c. The first one is a regression problem and the second one a classification task
d. Both problems are regression problems
40. Answer: b. Churn is a yes/no decision (classification), while profit is a continuous value (regression).
41. Unsupervised Learning
Example slides: clustering with Google queries; gene expression clustering.
In contrast to the first category, we do not have labels for our classes (the graphs with two features from the previous examples turn into unlabelled ones).
Quiz question: "Of the following examples, which would you address using an unsupervised learning algorithm?"
56. Q1: A telecommunications company wants to segment their customers into distinct groups in order to send appropriate subscription offers; this is an example of ...
Q2: You are given data about seismic activity in Japan, and you want to predict the magnitude of the next earthquake; this is an example of ...
Q3: Assume you want to perform supervised learning and to predict the number of newborns according to the size of the stork population (http://www.brixtonhealth.com/storksBabies.pdf); it is an example of ...
Q4: Discriminating between spam and ham e-mails is a classification task, true or false?
57-60. Answers:
Q1: Segmenting customers into distinct groups is an example of clustering.
Q2: Predicting the magnitude of the next earthquake is an example of regression.
Q3: Predicting the number of newborns from the size of the stork population is an example of (stupidity) regression: technically regression, but built on a spurious correlation.
Q4: True, discriminating between spam and ham e-mails is a classification task.
75-83. How to quantitatively say which of these pairs is more similar: A & B, or A & C?
What about computing their pixel-wise difference, summed over all 784 pixels?
Σ_{i=1}^{784} |A_i − B_i| = 137.03  versus  Σ_{i=1}^{784} |A_i − C_i| = 107.38
A is more similar (closer) to C than to B.
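As a minimal sketch in Python (NumPy), assuming the digits are 28×28 grayscale images flattened to 784-dimensional vectors; the random arrays here just stand in for real digit images:

import numpy as np

def l1_distance(u, v):
    # Pixel-wise (L1 / Manhattan) distance between two flattened images
    return np.abs(u.astype(float) - v.astype(float)).sum()

# Stand-in 28x28 images flattened to 784 values; real digits would come from data
rng = np.random.default_rng(0)
a, b, c = (rng.integers(0, 256, size=784) for _ in range(3))

# The pair with the smaller sum of absolute differences is the more similar one
print(l1_distance(a, b), l1_distance(a, c))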
84. For each new instance: Instance → Label ? (compared against the Dataset)
We asked our friend to write a bunch of new digits so that we would have something to recognise; here is the first one of them.
92. Nearest Neighbour classifier
For each new instance:
1. Compute the pixel-wise distance to all training examples in the dataset.
2. Find the closest training example.
3. Report its label.
Instance → Label: 3
103-104. Advantages of NN: fast training time O(C); very easy to implement; could be a good choice for low-dimensional problems.
Disadvantages of NN: very slow classification time; suffers from the curse of dimensionality.
NN is rarely used in practice. Can we find a better algorithm?
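The whole classifier fits in a few lines; a sketch in NumPy, where X_train, y_train and x_new are hypothetical arrays (scikit-learn's KNeighborsClassifier with n_neighbors=1 implements the same idea):

import numpy as np

def nn_classify(x_new, X_train, y_train):
    # 1. Compute the pixel-wise (L1) distance to all training examples
    distances = np.abs(X_train.astype(float) - x_new).sum(axis=1)
    # 2. Find the closest training example
    nearest = int(np.argmin(distances))
    # 3. Report its label
    return y_train[nearest]

# "Training" is just storing the data - hence the O(C) training time; all the
# work (one pass over the whole training set) happens at classification time.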
115-118. Decision (classification) tree algorithm
1. Construct a decision tree based on the training examples, e.g.:
   pixel #213: > 163 vs <= 163
   pixel #216: > 30 vs <= 30
For each new instance:
2. Make the corresponding comparisons (first pixel #213, then pixel #216).
3. Report the label: 6.
119. Depth = 2: once the tree is constructed, at most 2 comparisons are needed to test a new example.
120. In general, decision trees are *always faster than the NN algorithm.
*remember, shit happens
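A hedged scikit-learn sketch of the same scheme, using the library's small built-in digits dataset (8×8 images rather than the 28×28 ones on the slides):

from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # flattened digit images and labels
tree = DecisionTreeClassifier(max_depth=2)   # depth 2, as in the slide's tree
tree.fit(X, y)                               # 1. construct the tree from training examples
print(tree.predict(X[:1]))                   # 2-3. run the comparisons, report the label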
121-123. Can we find a better algorithm?
Disadvantages of NN: very slow classification time; suffers from the curse of dimensionality.
Disadvantages of DT: also suffers from the curse of dimensionality.
126-131. The decision tree algorithm is non-parametric and deterministic.
(pixel #213: > 163 vs <= 163; pixel #216: > 30 vs <= 30)
The shape of the tree is determined by the data, not by our choice. This means that we will always get the same output given the same input…
Are all input dimensions equally important for classification?
How about building a lot of trees from random parts of the data and then merging their predictions?
Random forest algorithm
140-143. Random forest algorithm
Build several decision trees from random parts of the data (here three trees, splitting on pixels such as #213 > 163, #214 > 253, #216 > 0 or > 30).
For each new instance, use all constructed trees to generate predictions (Tree #1, Tree #2, Tree #3) and average them: 2 of the 3 trees vote for the label 6, so the average is 2/3 = 66.6% and the instance is labelled 6.
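In scikit-learn the construct-many-trees-and-average scheme looks roughly like this (again on the built-in digits data; three trees only to mirror the slides, real forests use many more):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
# Each tree is grown on a random part of the data (bootstrap sample, random features)
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X[:-1], y[:-1])
print(forest.predict(X[-1:]))        # majority vote of the three trees
print(forest.predict_proba(X[-1:]))  # vote fractions, e.g. 2/3 = 66.6%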
144. Quiz time
145. Q1:
Which classification algorithm(s) has (have) the following weaknesses:
• It takes more time to train the classifier than to classify a new instance
• It suffers from the curse of dimensionality
A. Nearest neighbour algorithm
B. Decision tree
C. Random forest algorithm
D. None of the above
E. All of the above
146. Answer: B and C. Trees (and forests of trees) are slow to build but quick to query, and both suffer from the curse of dimensionality; the nearest neighbour algorithm is the other way around: training is O(C), classification is slow.
147. Q2:
Which of the following statements best defines the curse of dimensionality?
A. Prohibitively slow running time at training given a lot of data
B. Highly biased classification due to the prevalence of one of the classes
C. High classification error due to an excessively complex classifier
D. Poor performance of the classifier trained on data with a large number of features
E. None of the above
148. Answer: D.
149-150. Q3:
Which of the following algorithm(s) would you prefer if you had to classify instances from low-dimensional data?
A. Nearest neighbour algorithm
B. Decision tree algorithm
C. Random forest algorithm
D. All mentioned would cope
E. None of the above are suitable
160-166. Support Vector Machine (SVM)
Instances (features → labels):
Pixel #213   Pixel #215   Label
254          254          3
254          193          6
254          0            6
163          202          3
227          84           6
Plotted with pixel #213 and pixel #215 on the axes (0 to 254):
1. Identify the right hyper-plane. Is it A, B or C?
2. Maximise the distance between the nearest points and the hyper-plane. This distance is the margin; the closest points that define the hyper-plane are called support vectors.
3. The larger the distance from the hyper-plane to an instance, the more confident the classifier is about its prediction.
167-170. For each new instance, the side of the hyper-plane it falls on gives the prediction: Instance → Label 6.
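A scikit-learn sketch of the maximum-margin idea on the two pixel features from the table above (a linear kernel; C is the regularisation parameter discussed later; the new instance's values are hypothetical):

import numpy as np
from sklearn.svm import SVC

# (pixel #213, pixel #215) feature pairs and their labels, as in the table
X = np.array([[254, 254], [254, 193], [254, 0], [163, 202], [227, 84]])
y = np.array([3, 6, 6, 3, 6])

svm = SVC(kernel="linear", C=1).fit(X, y)    # find the maximum-margin hyper-plane
print(svm.support_vectors_)                  # the points that define the hyper-plane
print(svm.predict([[250, 100]]))             # label for a new instance
print(svm.decision_function([[250, 100]]))   # signed distance: larger = more confident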
176-177. Support Vector Machine (SVM)
When the classes are not separable by a straight line in the original (x, y) space, let's make another dimension:
z = a·x² + b·y²
In the (x, z) space the classes become linearly separable. This transformation is called the kernel trick, and the function z is the kernel (strictly speaking, z is a new feature computed from x and y; the kernel is the corresponding similarity function evaluated implicitly).
Wow, wow, wow, hold on!
How does this actually work?
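Here is how it plays out on synthetic data: one class inside a disk, the other on a ring around it, so no straight line separates them in (x, y). Adding the hand-made dimension z = a·x² + b·y² (with a = b = 1 below) makes a linear separation possible, which is what a non-linear kernel does implicitly. A sketch, with hypothetical data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Inner disk = class 0, outer ring = class 1: not linearly separable in (x, y)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.concatenate([np.zeros(100), np.ones(100)])

z = X[:, 0] ** 2 + X[:, 1] ** 2                      # the explicit extra dimension
explicit = SVC(kernel="linear").fit(np.c_[X, z], y)  # linear SVM in (x, y, z)
implicit = SVC(kernel="rbf").fit(X, y)               # the kernel lifts the data for us
print(explicit.score(np.c_[X, z], y), implicit.score(X, y))  # both close to 1.0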
178-181. Comparison with SVM
(Recap — advantages of NN: fast training time O(C); very easy to implement; could be a good choice for low-dimensional problems. Disadvantages of NN: very slow classification time; suffers from the curse of dimensionality.)
Disadvantages of SVM: very slow classification time; suffers from the curse of dimensionality; it might be tricky to choose the right kernel.
192-195. Can we trust this model? Consider the following example: 100% accurate!
A trivial model that says "whatever happens, predict 0" gets Accuracy = 49/50 = 98%.
196-198. What if my data is unbalanced? A histogram of label counts can help you figure out if your dataset is unbalanced. There are a few ways to deal with imbalance; we are going to discuss them later.
199. In our case the data is balanced.
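A quick sketch of the trap, with counts matching the slide's example (49 examples of one class, 1 of the other):

import numpy as np

y_true = np.array([0] * 49 + [1])    # unbalanced labels: 49 zeros, one 1
y_pred = np.zeros(50, dtype=int)     # "whatever happens, predict 0"
print((y_true == y_pred).mean())     # 0.98 - impressive accuracy, useless model

# A histogram of label counts exposes the imbalance
values, counts = np.unique(y_true, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 49, 1: 1}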
220-222. Split into train and test
Normally we would split the data into an 80% train and a 20% test set. As we have a lot of data, we can afford a 50/50 ratio.
Can we do better than 90%?
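With scikit-learn the split is one call; a sketch on the built-in digits data (test_size=0.2 for the usual 80/20 split, 0.5 when data is plentiful):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # hold the test set out until the very end
print(X_train.shape, X_test.shape)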
228-231. C = 1 (plot of pixel #213 vs pixel #215, axes 0 to 254)
In red are the areas where a penalty is applied to instances close to the line; in green are the areas where no penalty is applied.
The total amount of penalty applied to the classifier is called the loss. Classifiers try to minimise the loss by adjusting their parameters.
An instance inside the red area increases the penalty; after the line is adjusted, it is in a green area.
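Assuming the red/green picture corresponds to the standard hinge loss of a soft-margin SVM (zero penalty once an instance is safely beyond the margin, growing penalty inside it or on the wrong side), a sketch with hypothetical margin values:

import numpy as np

def hinge_loss(margins):
    # margins = y * f(x): >= 1 in the "green" area (no penalty),
    # < 1 in the "red" area (penalty grows as the point nears or crosses the line)
    return np.maximum(0.0, 1.0 - margins)

margins = np.array([2.5, 1.0, 0.3, -0.7])  # hypothetical signed margins
print(hinge_loss(margins))                 # [0.  0.  0.7 1.7]
print(hinge_loss(margins).sum())           # the total penalty: the loss to minimise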
232. Parameter tuning
Algorithm             Hyper-parameters
K-nearest neighbour   k, the number of neighbours (1, …, 100)
Decision tree         split metric ('gini', 'information gain')
Random forest         number of trees (3, …, 100; more is better), split metric ('gini', 'information gain')
SVM                   C (10^-5, …, 10^2) and gamma (10^-15, …, 10^2)
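A scikit-learn sketch of searching such a grid for the SVM (a coarse sampling of the C and gamma ranges from the table, on the built-in digits data; GridSearchCV uses cross-validation internally, as described below):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_grid = {
    "C": np.logspace(-5, 2, 4),       # a coarse sample of 10^-5 ... 10^2
    "gamma": np.logspace(-15, 2, 6),  # a coarse sample of 10^-15 ... 10^2
}
search = GridSearchCV(SVC(), param_grid, cv=4)  # 4-fold CV for each combination
search.fit(X, y)
print(search.best_params_, search.best_score_)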
245-250. The whole dataset (100%) is split into three parts:
Training (60%) — for fitting the initial model.
Validation (20%) — for parameter tuning & performance evaluation.
Testing (20%) — for a one-shot evaluation of the trained model.
Tuning on the validation set improves its score from 5/7 to 7/7.
251-256. But what happens when you overfit the validation set? The validation score keeps telling you "You're doing great! 🙂" (5/5), while the one-shot evaluation on the test set comes out at only 4/5 😒.
261-266. Cross Validation (CV) Algorithm
Split the training data (80% of the whole dataset) into four folds of 20% each. In each round, train on 60% of the data (three folds) and validate on the remaining 20% fold:
Train | Train | Train | Val   → 0.75
Val   | Train | Train | Train → 0.85
Train | Val   | Train | Train → 0.91
Train | Train | Val   | Train → 0.68
267-268. MEAN(0.75, 0.85, 0.91, 0.68) ≈ 0.80
269. Choose the best model/parameters based on this estimate and then apply it to the test set.
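A sketch of the same procedure with scikit-learn (4 folds as on the slides; the per-fold scores and their mean come straight out of cross_val_score):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=4)
print(scores)          # one validation score per fold, i.e. four numbers
print(scores.mean())   # the estimate used to compare models/parameters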
279-282. Machine Learning pipeline
Problem → Raw data → Preprocessing → Feature extraction → Split into train & test (the test set is set aside) → Choose a model → Find the best parameters using CV → Train the model on the whole training set → Evaluate the final model on the test set → Report your results.
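The whole pipeline, compressed into a hedged scikit-learn sketch (the built-in digits data stands in for raw data, and preprocessing/feature extraction are reduced to scaling for brevity):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                   # raw data + features
X_train, X_test, y_train, y_test = train_test_split(  # split; test set goes aside
    X, y, test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), SVC())        # preprocessing + chosen model
search = GridSearchCV(model, {"svc__C": [0.1, 1, 10]}, cv=4)
search.fit(X_train, y_train)                          # tune with CV, then refit on all of train
print(search.score(X_test, y_test))                   # one-shot evaluation on the test set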
283-285. A machine learning algorithm usually corresponds to a combination of the following 3 elements:
1. The choice of a specific family of mapping functions F (k-NN, SVM, DT, RF, neural networks, etc.).
2. A way to evaluate the quality of a function f from F: a way of saying how badly/well this function f is doing at classifying real-world objects.
3. A way to search for a better function f within F: how to choose the parameters so that the performance of f improves.
288. References
• Machine Learning by Andrew Ng (https://www.coursera.org/learn/machine-learning)
• Introduction to Machine Learning by Pascal Vincent, given at the Deep Learning Summer School, Montreal 2015 (http://videolectures.net/deeplearning2015_vincent_machine_learning/)
• Welcome to Machine Learning by Konstantin Tretyakov, delivered at the AACIMP Summer School 2015 (http://kt.era.ee/lectures/aacimp2015/1-intro.pdf)
• Stanford CS class: Convolutional Neural Networks for Visual Recognition by Andrej Karpathy (http://cs231n.github.io/)
• Data Mining Course by Jaak Vilo at the University of Tartu (https://courses.cs.ut.ee/MTAT.03.183/2017_spring/uploads/Main/DM_05_Clustering.pdf)
• Machine Learning Essential Concepts by Ilya Kuzovkin (https://www.slideshare.net/iljakuzovkin)
• From the Brain to Deep Learning and Back by Raul Vicente Zafra and Ilya Kuzovkin (http://www.uttv.ee/naita?id=23585&keel=eng)