
1 Supervised learning

The first lecture from the Machine Learning course series of lectures. The lecture covers basic principles of machine learning, such as the difference between supervised and unsupervised learning, several classifiers (nearest neighbour (k-NN), decision trees, random forest), and major obstacles in machine learning (overfitting and the curse of dimensionality), followed by the cross-validation algorithm and a general ML pipeline. A link to my GitHub (https://github.com/skyfallen/MachineLearningPracticals) with practicals that I have designed for this course in both R and Python is included. I can share Keynote files; contact me via e-mail: dmytro.fishman@ut.ee.


  1. 1. Introduction to Machine Learning (Supervised learning) Dmytro Fishman (dmytro@ut.ee)
  2. 2. This is an introduction to the topic
  3. 3. This is an introduction to the topic We will try to provide a beautiful scenery
  4. 4. “We love you, Mummy!”
  5. 5. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 “We love you, Mummy!”
  6. 6. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 “We love you, Mummy!” Petal Sepal
  7. 7. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 “We love you, Mummy!” Petal Sepal Word 1 25 Word 2 23 Word 3 12 … …
  8. 8. Petal Sepal 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 Word 1 25 Word 2 23 Word 3 12 … … “We love you, Mummy!”
  9. 9. Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  10. 10. Big Data: Astronomical? Genomical? Youtubical? (Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195)
  11. 11. Big Data: Astronomical? (1 Exabyte/year) Genomical? (2-40 Exabyte/year) Youtubical? (1-2 Exabyte/year) (Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195)
  12. 12. Big Data: Astronomical? (1 Exabyte/year) Genomical? (2-40 Exabyte/year) Youtubical? (1-2 Exabyte/year), where 1 Exabyte = 10^12 MB (Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195)
  13. 13. There is a lot of data produced nowadays But there are also a vast number of potential ways to use this data
  14. 14. Supervised Learning
  15. 15. Benign Malignant Skin cancer example
  16. 16. Malignant? Tumour size Benign Malignant Skin cancer example Yes(1) No(0)
  17. 17. Malignant? Tumour size Benign Malignant Skin cancer example Yes(1) No(0)
  18. 18. Malignant? Tumour size Benign Malignant Skin cancer example Yes(1) No(0)
  19. 19. Malignant? Tumour size Yes(1) Benign Malignant Skin cancer example No(0)
  20. 20. Malignant? Tumour size Yes(1) No(0) Benign Malignant Skin cancer example Malignant?
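The slides above fit a yes/no "malignant?" decision from tumour size alone without naming a particular method. As a hedged illustration only, here is a minimal sketch of one way such a binary prediction could be fit, using logistic regression on a single feature; the sizes and labels below are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

tumour_size = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])   # invented sizes
malignant = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                      # No(0) / Yes(1)

clf = LogisticRegression().fit(tumour_size, malignant)
print(clf.predict([[2.5]]))   # predicted class for a new tumour size
```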
  21. 21. Tumour size Age
  22. 22. Tumour size Age Malignant?
  23. 23. Tumour size Age Malignant?
  24. 24. Tumour size Age Malignant? Other features: Lesion type Lesion configuration Texture Location Distribution … Potentially infinitely many features!
  25. 25. Classification task Predicting discrete value output using previously labeled examples Binary classification
  26. 26. Classification task Predicting discrete value output using previously labeled examples also binary classification
  27. 27. Classification task Predicting discrete value output using previously labeled examples also binary classification Every time you have to distinguish between TWO CLASSES it is a binary classification
  28. 28. Classification task Multiclass classification Predicting discrete value output using previously labeled examples
  29. 29. Housing price prediction
  30. 30. Housing price prediction Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500
  31. 31. Housing price prediction Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500
  32. 32. Housing price prediction Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500 Price?
  33. 33. Housing price prediction Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500 Price?
  34. 34. Housing price prediction Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500 Price
  35. 35. Housing price prediction Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500 Price
  36. 36. Regression task Size in m2 Price in 1000’s ($) 400 100 200 300 100 200 300 400 500 Price
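As a hedged sketch of the regression task above: fit a line to size-vs-price data and read off the predicted price for a new size. The sizes and prices below are invented illustration values, not the points from the plot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[100], [150], [200], [250], [300], [400]])   # size in m^2 (invented)
prices = np.array([110, 150, 205, 240, 290, 390])              # price in 1000's of $ (invented)

model = LinearRegression().fit(sizes, prices)
print(model.predict([[350]]))   # estimated price for a 350 m^2 house
```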
  37. 37. Malignant? Tumour size Yes(1) No(0) Benign Malignant Malignant? VS Classification Regression Supervised Learning Size in m2 Pricein1000’s($) 400 100 200 300 100 200 300 400 500 Price?
  38. 38. You are running a company which has two problems, namely:
 Q: a. Both problems are examples of classification problems b. The first one is a classification task and the second one a regression problem c. The first one is a regression problem and the second one a classification task d. Both problems are regression problems 1. For each user in the database, predict if this user will continue using your company’s product or will move to competitors (churn). 2. Predict the profit of your company at the end of this year based on previous records. How would you approach these problems?
  39. 39. You are running a company which has two problems, namely:
 Q: a. Both problems are examples of classification problems b. The first one is a classification task and the second one a regression problem c. The first one is a regression problem and the second one a classification task d. Both problems are regression problems 1. For each user in the database, predict if this user will continue using your company’s product or will move to competitors (churn). 2. Predict the profit of your company at the end of this year based on previous records. How would you approach these problems?
  40. 40. Tumour size Age Supervised Learning
  41. 41. Tumour size Age Unsupervised Learning
  42. 42. Tumour size Age Unsupervised Learning Is there any interesting hidden structure in this data?
  43. 43. Tumour size Age Unsupervised Learning Is there any interesting hidden structure in this data? What does this hidden structure correspond to?
  44. 44. Gene expression
  45. 45. Gene expression Two interesting groups of species
  46. 46. Gene expression Two interesting groups of genes
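The slides do not name a specific unsupervised method here, but the "hidden groups" question is typically answered with clustering (the quiz that follows names clustering). A minimal sketch with k-means on made-up two-dimensional data standing in for two hidden groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# two made-up clouds of points standing in for two hidden groups
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])   # cluster assignment (0 or 1) for the first ten instances
```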
  47. 47. Q1: Some telecommunication company wants to segment their customers into distinct groups in order to send appropriate subscription offers; this is an example of ... Q2: You are given data about seismic activity in Japan, and you want to predict the magnitude of the next earthquake; this is an example of ... Q3: Assume you want to perform supervised learning and to predict the number of newborns according to the size of the stork population (http://www.brixtonhealth.com/storksBabies.pdf); it is an example of ... Q4: Discriminating between spam and ham e-mails is a classification task, true or false?
  48. 48. Q1: Some telecommunication company wants to segment their customers into distinct groups in order to send appropriate subscription offers, this is an example of clustering Q2: You are given data about seismic activity in Japan, and you want to predict a magnitude of the next earthquake, this is in an example of ... Quiz: Assume you want to perform supervised learning and to predict number of newborns according to size of stork population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of ... Quiz: Discriminating between spam and ham e-mails is a classification task, true or false?
  49. 49. Q1: Some telecommunication company wants to segment their customers into distinct groups in order to send appropriate subscription offers, this is an example of clustering Q2: You are given data about seismic activity in Japan, and you want to predict a magnitude of the next earthquake, this is in an example of regression Q3: Assume you want to perform supervised learning and to predict number of newborns according to size of stork population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of ... Quiz: Discriminating between spam and ham e-mails is a classification task, true or false?
  50. 50. Q3: Assume you want to perform supervised learning and to predict number of newborns according to size of stork population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of stupidity regression Q4: Discriminating between spam and ham e-mails is a classification task, true or false? Q2: You are given data about seismic activity in Japan, and you want to predict a magnitude of the next earthquake, this is in an example of regression Q1: Some telecommunication company wants to segment their customers into distinct groups in order to send appropriate subscription offers, this is an example of clustering
  51. 51. Q4: Discriminating between spam and ham e-mails is a classification task. Q3: Assume you want to perform supervised learning and to predict number of newborns according to size of stork population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of stupidity regression Q2: You are given data about seismic activity in Japan, and you want to predict a magnitude of the next earthquake, this is in an example of regression Q1: Some telecommunication company wants to segment their customers into distinct groups in order to send appropriate subscription offers, this is an example of clustering
  52. 52. MNIST dataset (10000 images) Instance Label 28px 28px 3
  53. 53. MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px 3
  54. 54. MNIST dataset (10000 images) In total 784 pixel values Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) 3
  55. 55. Pixel values Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values
  56. 56. Pixel values Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values Feature
  57. 57. Pixel values Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values Feature
  58. 58. Pixel values Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values Feature
  59. 59. Pixel values Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel valuesFeatures are also some times referred to as dimensions
  60. 60. Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel valuesFeatures are also some times referred to as dimensions This images are 784 dimensional
  61. 61. Pixel values Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values
  62. 62. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values Data is loaded. What should we do now?
  63. 63. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values Data is loaded. What should we do now? We would like to build a tool that would be able to automatically recognise handwritten images
  64. 64. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1 Instances MNIST dataset (10000 images) Instance Label0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 255 255 255 155 0 255 255 255 255 255 255 255 255 155 78 78 155 255 255 255 0 0 0 0 155 255 28px 28px (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) could be downloaded from: http://yann.lecun.com/exdb/mnist/ 3 In total 784 pixel values Data is loaded. What should we do now? We would like to build a tool that would be able to automatically recognise handwritten images Let’s get to the first algorithm
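Before the first algorithm, a small sketch of the data representation described above: each 28x28 image flattened into a 784-dimensional feature vector paired with a digit label. The arrays here are random stand-ins; the real MNIST data can be taken from the URL on the slide (or, as one alternative, via sklearn.datasets.fetch_openml("mnist_784")):

```python
import numpy as np

# random stand-ins for 10000 28x28 greyscale digit images and their labels
images = np.random.randint(0, 256, size=(10000, 28, 28))
labels = np.random.randint(0, 10, size=10000)

feature_vectors = images.reshape(len(images), -1)   # each row: one image as 784 pixel values
print(feature_vectors.shape, labels.shape)          # (10000, 784) (10000,)
```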
  65. 65. How to quantitatively say which of these pairs are more similar? & & A B CA OR
  66. 66. How to quantitatively say which of these pairs are more similar? & & A B CA What about computing their pixel-wise difference? OR
  67. 67. How to quantitatively say which of these pairs are more similar? & & A B CA OR Σ 784 |Ai - Ci|Σ 784 |Ai - Bi|i i
  68. 68. How to quantitatively say which of these pairs are more similar? & & A B CA OR Σ 784 |Ai - Ci|Σ 784 |Ai - Bi|i i
  69. 69. How to quantitatively say which of these pairs are more similar? & & A B CA OR Σ 784 |Ai - Ci|Σ 784 |Ai - Bi|i i
  70. 70. How to quantitatively say which of these pairs are more similar? & & A B CA OR Σ 784 |Ai - Ci| = 107.38Σ 784 |Ai - Bi| = 137.03i i
  71. 71. How to quantitatively say which of these pairs are more similar? & & A B CA OR Σ 784 |Ai - Ci| = 107.38Σ 784 |Ai - Bi| = 137.03i i A is more similar to C than B
  72. 72. How to quantitatively say which of these pairs are more similar? & & A B CA OR Σ 784 |Ai - Ci| = 107.38Σ 784 |Ai - Bi| = 137.03i i A is more similar (closer) to C than B
  73. 73. Σ 784 |Ai - Ci| = 107.38Σ 784 |Ai - Bi| = 137.03 How to quantitatively say which of these pairs are more similar? & & A B CA i i OR A is more similar (closer) to C than B
  74. 74. Σ 784 |Ai - Ci| = 107.38Σ 784 |Ai - Bi| = 137.03 How to quantitatively say which of these pairs are more similar? & & A B CA i i OR A is more similar (closer) to C than B
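A minimal sketch of the pixel-wise distance used above: the sum over all 784 pixels of the absolute differences (an L1/Manhattan distance). The images here are random stand-ins, so the numbers will not match the 107.38 and 137.03 from the slides, whose exact scaling is not spelled out:

```python
import numpy as np

rng = np.random.RandomState(0)
A, B, C = (rng.randint(0, 256, 784).astype(float) for _ in range(3))   # stand-ins for the three digits

dist_AB = np.sum(np.abs(A - B))   # sum over 784 pixels of |A_i - B_i|
dist_AC = np.sum(np.abs(A - C))   # sum over 784 pixels of |A_i - C_i|
print(dist_AB, dist_AC)           # the smaller sum marks the more similar pair
```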
  75. 75. Instance Label ? DatasetFor each new instance We asked our friend to write a bunch of new digits so that we can have something to recognise, here is the first one of them
  76. 76. Instance Label ? DatasetFor each new instance
  77. 77. Instance Label ? 1.Compute pixel-wise distance to all training examples For each new instance Dataset
  78. 78. Instance Label ? 1.Compute pixel-wise distance to all training examples For each new instance Dataset
  79. 79. Instance Label ? 1.Compute pixel-wise distance to all training examples For each new instance Dataset
  80. 80. Instance Label ? 1.Compute pixel-wise distance to all training examples For each new instance Dataset
  81. 81. Instance Label ? 1.Compute pixel-wise distance to all training examples 2. Find the closest training example For each new instance Dataset
  82. 82. Instance Label ? 1.Compute pixel-wise distance to all training examples 2. Find the closest training example For each new instance Dataset
  83. 83. Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Nearest Neighbour classifier For each new instance Dataset
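A hedged sketch of those three steps as code, using the same pixel-wise (L1) distance; the training data here is random and only stands in for the MNIST instances:

```python
import numpy as np

def predict_1nn(x, train_X, train_y):
    distances = np.abs(train_X - x).sum(axis=1)   # pixel-wise (L1) distance to every training example
    return train_y[np.argmin(distances)]          # report the label of the closest one

rng = np.random.RandomState(0)
train_X = rng.randint(0, 256, (100, 784)).astype(float)   # made-up training "images"
train_y = rng.randint(0, 10, 100)                          # made-up labels
new_instance = rng.randint(0, 256, 784).astype(float)
print(predict_1nn(new_instance, train_X, train_y))
```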
  84. 84. Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example Advantages of NN Disadvantages of NN For each new instance 3. Report its label
  85. 85. Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example Advantages of NN Disadvantages of NN Very easy to implement For each new instance 3. Report its label
  86. 86. Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example Advantages of NN Disadvantages of NN Very easy to implement Very slow classification time For each new instance 3. Report its label
  87. 87. Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example Advantages of NN Disadvantages of NN Very easy to implement Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems For each new instance Very slow classification time 3. Report its label
  88. 88. Curse of dimensionality Remember we said that our instances are 784 dimensional?
  89. 89. Curse of dimensionality Remember we said that our instances are 784 dimensional? This is a lot!
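One way to get an intuition for the curse of dimensionality (this illustration is not from the slides): with random data, as the number of dimensions grows, the nearest and the farthest point end up almost equally far away, so "the closest example" carries less and less information:

```python
import numpy as np

rng = np.random.RandomState(0)
for dims in (2, 784):
    points = rng.rand(1000, dims)                          # random points in `dims` dimensions
    dists = np.abs(points - points[0]).sum(axis=1)[1:]     # L1 distances to one reference point
    print(dims, dists.min() / dists.max())                 # ratio creeps towards 1 as dims grows
```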
  90. 90. http://cs231n.github.io/classification/
  91. 91. http://cs231n.github.io/classification/
  92. 92. Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example Advantages of NN Disadvantages of NN Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems For each new instance 3. Report its label
  93. 93. For each test example Instance Label 3 1. Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems NN is rarely used in practice
  94. 94. For each test example Instance Label 3 1. Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Can we find a better algorithm? Very slow classification time
  95. 95. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification *for a sec
  96. 96. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification *for a sec pixel #213 pixel #213
  97. 97. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification *for a sec pixel #213 > 163 <= 163 pixel #213
  98. 98. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification *for a sec pixel #213 > 163 <= 163 pixel #216 pixel #216 > 30 <= 30
  99. 99. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances pixel #216 VS Back to binary classification *for a sec pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree
  100. 100. Instances pixel #216 Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 pixel #216 > 30 <= 30 VS Back to binary classification *for a sec pixel #213 > 163 <= 163 Split
  101. 101. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification pixel #213 > 163 <= 163 pixel #216 pixel #216 > 30 <= 30
  102. 102. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification *for a sec pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 How do you know which features to use for best splits? Split
  103. 103. VS Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Back to binary classification *for a sec pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 How do you know which features to use for best splits? Split Using various goodness metrics such as information gain or gini impurity to define “best”
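As a small sketch of one of the goodness metrics mentioned above, Gini impurity: it is 0 for a perfectly pure set of labels and 0.5 for an even two-class mix, and a split is considered good when the resulting subsets have low impurity:

```python
import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum over classes of p_class^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([3, 3, 3, 3]))   # 0.0 -> perfectly pure subset
print(gini_impurity([3, 6, 3, 6]))   # 0.5 -> maximally mixed for two classes
```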
  104. 104. Decision (classification) tree algorithm 1.Construct a decision tree based on training examples
  105. 105. Decision (classification) tree algorithm 1.Construct a decision tree based on training examples pixel #213 > 163 <= 163 pixel #216 > 30 <= 30
  106. 106. Decision (classification) tree algorithm pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Instance Label ? 2.Make corresponding comparisons 1.Construct a decision tree based on training examples #213 For each new instance
  107. 107. Decision (classification) tree algorithm pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Instance Label ? 2.Make corresponding comparisons 1.Construct a decision tree based on training examples #213 #216 For each new instance
  108. 108. Decision (classification) tree algorithm pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Instance Label 6 3. Report label 1.Construct a decision tree based on training examples 2.Make corresponding comparisons #213 #216 For each new instance
  109. 109. Decision (classification) tree algorithm pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Instance Label 6 Depth=2 Once the tree is constructed maximum 2 comparisons would be needed to test a new example3. Report label 1.Construct a decision tree based on training examples 2.Make corresponding comparisons #213 #216 For each new instance
  110. 110. Decision (classification) tree algorithm pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Instance Label 6 Depth=2 In general decision trees are *always faster than NN algorithm3. Report label 1.Construct a decision tree based on training examples 2.Make corresponding comparisons *remember, shit happens #213 #216 For each new instance
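The slides construct the depth-2 tree by hand on pixels #213 and #216. A hedged sketch of the same step with scikit-learn's DecisionTreeClassifier, which chooses its own split features and thresholds, on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
train_X = rng.randint(0, 256, (200, 784)).astype(float)   # made-up 784-pixel instances
train_y = rng.randint(0, 10, 200)                          # made-up digit labels

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # depth 2, as on the slide
tree.fit(train_X, train_y)
print(tree.predict(train_X[:5]))                            # predicted labels for five instances
```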
  111. 111. Can we find a better algorithm? Disadvantages of NN Very slow classification time Suffers from the curse of dimensionality
  112. 112. Can we find a better algorithm? Disadvantages of NN Disadvantages of DT Very slow classification time Very slow classification time Suffers from the curse of dimensionality
  113. 113. Can we find a better algorithm? Disadvantages of NN Disadvantages of DT Also suffers from the curse of dimensionality Very slow classification time Very slow classification time Suffers from the curse of dimensionality
  114. 114. Is there a way to break the curse?
  115. 115. Is there a way to break the curse?
  116. 116. pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree algorithm is non-parametric and deterministic
117. 117. pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree algorithm is non-parametric and deterministic The shape of the tree is determined by the data, not by our choice
118. 118. pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree algorithm is non-parametric and deterministic This means that we will always have the same output given the same input… The shape of the tree is determined by the data, not by our choice
119. 119. The shape of the tree is determined by the data, not by our choice pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree algorithm is non-parametric and deterministic This means that we will always have the same output given the same input… Are all input dimensions equally important for classification?
120. 120. The shape of the tree is determined by the data, not by our choice pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree algorithm is non-parametric and deterministic This means that we will always have the same output given the same input… How about building a lot of trees from random parts of the data and then merging their predictions? Are all input dimensions equally important for classification?
121. 121. The shape of the tree is determined by the data, not by our choice pixel #213 > 163 <= 163 pixel #216 > 30 <= 30 Decision tree algorithm is non-parametric and deterministic This means that we will always have the same output given the same input… How about building a lot of trees from random parts of the data and then merging their predictions? Are all input dimensions equally important for classification? Random forest algorithm
  122. 122. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm
  123. 123. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm Randomly discard some rows
  124. 124. Randomly discard some rows and columns Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm
  125. 125. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm Build a decision tree based on remaining data pixel #213 > 163 <= 163 pixel #216 > 0 = 0
  126. 126. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 Build a decision tree based on remaining data Repeat N times until N trees are constructed
  127. 127. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163
  128. 128. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30
  129. 129. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance
  130. 130. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance Use all constructed trees to generate predictions
  131. 131. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance Predictions Tree #2 Tree #1 Tree #3
  132. 132. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label For each new instance Predictions Tree #2 Tree #1 Tree #3? Average 2/3
  133. 133. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label For each new instance Predictions Tree #2 Tree #1 Tree #36 Average 2/3 = 66.6%
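A hedged sketch of the same idea with scikit-learn's RandomForestClassifier (scikit-learn randomises rows via bootstrap samples and columns via a random feature subset at every split, a slight variation on "discard some rows and columns"); the toy data is the same hypothetical stand-in as before:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.array([[0, 254, 253, 82],
                        [0, 254, 193, 30],
                        [0, 254,  87,  0],
                        [0, 254, 254, 18],
                        [0, 254,  84,  0]])
    y_train = np.array([3, 6, 6, 3, 6])

    # Build N = 3 trees, each on a bootstrap sample of the rows with a random
    # subset of the columns considered at every split.
    forest = RandomForestClassifier(n_estimators=3, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)

    # For a new instance, every tree votes and the votes are combined.
    new_instance = np.array([[0, 254, 90, 0]])
    print(forest.predict(new_instance))         # majority vote over the 3 trees
    print(forest.predict_proba(new_instance))   # vote shares, e.g. a 2/3 vs 1/3 split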
  134. 134. Random forest algorithm Instance Label 6 For each new instance Predictions Tree #2 Tree #1 Tree #3 Average 2/3 = 66.6% Quiz time pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30
135. 135. Q1: Which classification algorithm(s) have the following weaknesses: • It takes more time to train the classifier than to classify a new instance • It suffers from the curse of dimensionality A. Nearest neighbour algorithm B. Decision tree C. Random forest algorithm D. None of the above E. All of the above
136. 136. Q1: A. Nearest neighbour algorithm B. Decision tree C. Random forest algorithm D. None of the above E. All of the above • It takes more time to train the classifier than to classify a new example • It suffers from the curse of dimensionality pixel #213 > 163 <= 163 pixel #216 > 0 = 0 Which classification algorithm(s) have the following weaknesses:
137. 137. Q2: A. Prohibitively slow running time at training given a lot of data B. Highly biased classification due to the prevalence of one of the classes C. High classification error due to an excessively complex classifier D. Poor performance of a classifier trained on data with a large number of features E. None of the above Which of the following statements best defines the curse of dimensionality?
138. 138. Q2: Which of the following statements best defines the curse of dimensionality? A. Prohibitively slow running time at training given a lot of data B. Highly biased classification due to the prevalence of one of the classes C. High classification error due to an excessively complex classifier D. Poor performance of a classifier trained on data with a large number of features E. None of the above
139. 139. Q3: Which of the following algorithms would you prefer if you had to classify instances from low-dimensional data? A. Nearest neighbour algorithm B. Decision tree algorithm C. Random forest algorithm D. All mentioned would cope E. None of the above are suitable
140. 140. A. Nearest neighbour algorithm B. Decision tree algorithm C. Random forest algorithm D. All mentioned would cope E. None of the above are suitable Q3: Which of the following algorithms would you prefer if you had to classify instances from low-dimensional data?
  141. 141. Support Vector Machine
  142. 142. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 202 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Let us go primitive, and focus only on two pixels
143. 143. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 202 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Let us go primitive, and focus only on two pixels It does not really matter which ones. I will take these two because we are used to them already :)
  144. 144. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Let us go primitive, and focus only on two pixels
  145. 145. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let’s visualise them on a 2-D plotPixel#215 Pixel #213 254 2540 0
  146. 146. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let’s visualise them on a 2-D plotPixel#215 Pixel #213 254 2540 0
  147. 147. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let’s visualise them on a 2-D plotPixel#215 Pixel #213 254 2540 0
  148. 148. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let’s visualise them on a 2-D plotPixel#215 Pixel #213 254 2540 0
  149. 149. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Support Vector Machine (SVM)Pixel#215 Pixel #213 254 2540 0
  150. 150. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane A B C Is it A, B or C? Support Vector Machine (SVM)
  151. 151. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM)
  152. 152. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Margin Support Vector Machine (SVM)
  153. 153. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Margin Support Vector Machine (SVM)
  154. 154. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM) Closest points that define hyper- plane are called support vectors
155. 155. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane 3. The larger the distance from the hyper-plane to the instance, the more confident the classifier is about its prediction more confidence Support Vector Machine (SVM) Closest points that define hyper- plane are called support vectors
156. 156. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM) 3. The larger the distance from the hyper-plane to the instance, the more confident the classifier is about its prediction Closest points that define hyper- plane are called support vectors
157. 157. Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM) 3. The larger the distance from the hyper-plane to the instance, the more confident the classifier is about its prediction Closest points that define hyper- plane are called support vectors Instance Label ? For each new instance
158. 158. Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM) 3. The larger the distance from the hyper-plane to the instance, the more confident the classifier is about its prediction Closest points that define hyper- plane are called support vectors Instance Label ? For each new instance
159. 159. Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM) 3. The larger the distance from the hyper-plane to the instance, the more confident the classifier is about its prediction Closest points that define hyper- plane are called support vectors Instance Label 6 For each new instance
160. 160. Pixel#215 Pixel #213 254 2540 0 1.Identify the right hyper- plane 2. Maximise the distance between nearest points and a hyper-plane Support Vector Machine (SVM) 3. The larger the distance from the hyper-plane to the instance, the more confident the classifier is about its prediction Closest points that define hyper- plane are called support vectors Instance Label 6 For each new instance
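Here is a minimal sketch of the two-pixel example with scikit-learn's SVC; it assumes the toy (pixel #213, pixel #215) table above and a linear kernel, and the new instance is made up:

    import numpy as np
    from sklearn.svm import SVC

    # (pixel #213, pixel #215) pairs and their digit labels from the toy table.
    X_train = np.array([[254, 254], [254, 193], [254, 0], [163, 202], [227, 84]])
    y_train = np.array([3, 6, 6, 3, 6])

    # A linear SVM chooses the hyper-plane with the largest margin;
    # the closest training points become the support vectors.
    svm = SVC(kernel="linear", C=1)
    svm.fit(X_train, y_train)

    print(svm.support_vectors_)                # the points that define the margin
    print(svm.predict([[230, 60]]))            # label for a new instance
    print(svm.decision_function([[230, 60]]))  # signed distance from the hyper-plane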
  161. 161. Pixel#215 Pixel #213 254 2540 0 Support Vector Machine (SVM) What should we do now?
162. 162. y x 254 2540 0 Support Vector Machine (SVM) Let's make another dimension: z = a*x^2 + b*y^2
163. 163. y x 254 2540 0 Support Vector Machine (SVM) Let's make another dimension: z = a*x^2 + b*y^2 z x 2540 0
164. 164. y x 254 2540 0 Support Vector Machine (SVM) Let's make another dimension: z = a*x^2 + b*y^2 z x 2540 0
165. 165. y x 254 2540 0 Support Vector Machine (SVM) Let's make another dimension: z = a*x^2 + b*y^2 z x 2540 0
166. 166. y x 254 2540 0 Support Vector Machine (SVM) Let's make another dimension: z = a*x^2 + b*y^2 z x 2540 0 This transformation is called the kernel trick and the function z is the kernel
167. 167. y x 254 2540 0 Support Vector Machine (SVM) Let's make another dimension: z = a*x^2 + b*y^2 z x 2540 0 This transformation is called the kernel trick and the function z is the kernel Wow, wow, wow, hold on! How does this actually work?
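A small sketch of the kernel idea: the data below is a hypothetical ring-shaped toy set (not the pixel data); lifting it with z = x^2 + y^2 makes it linearly separable, and the RBF kernel achieves the same effect without building the new dimension explicitly:

    import numpy as np
    from sklearn.svm import SVC

    # One class near the origin, the other on a ring of radius 4 around it.
    rng = np.random.default_rng(0)
    inner = rng.normal(0, 1, size=(20, 2))
    angles = rng.uniform(0, 2 * np.pi, size=20)
    outer = np.c_[4 * np.cos(angles), 4 * np.sin(angles)]
    X = np.vstack([inner, outer])
    y = np.array([0] * 20 + [1] * 20)

    # Explicit feature map: add z = x^2 + y^2 as a third dimension,
    # where a flat plane can separate the two classes.
    X_lifted = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]
    print(SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))

    # Kernel trick: let the RBF kernel do the lifting implicitly.
    print(SVC(kernel="rbf").fit(X, y).score(X, y))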
168. 168. For each test example Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality
169. 169. For each test example Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality
170. 170. For each test example Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality
171. 171. For each test example Instance Label 3 1.Compute pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality It might be tricky to choose the right kernel
  172. 172. Quiz time
173. 173. Q: How would you approach a multi-class classification task using SVM?
174. 174. Q: How would you approach a multi-class classification task using SVM? Pixel#215 Pixel #213 254 2540 0
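One common answer (a sketch, not necessarily the one intended on the following slides): train one binary SVM per class, "this digit vs the rest", and predict the class whose hyper-plane is most confident; scikit-learn's SVC can alternatively do one-vs-one between every pair of classes.

    from sklearn.datasets import load_digits
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # scikit-learn's small 8x8 digits dataset stands in for the MNIST-style data.
    X, y = load_digits(return_X_y=True)

    # One binary SVM per class; the most confident decision function wins.
    ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
    print(ovr.predict(X[:5]), y[:5])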
  175. 175. Support Vector Machine (SVM)
  176. 176. Support Vector Machine (SVM)
  177. 177. Support Vector Machine (SVM)
  178. 178. Support Vector Machine (SVM) 100% accurate!
  179. 179. 100% accurate!
180. 180. accuracy = correctly classified instances / total number of instances 100% accurate!
  181. 181. Can we trust this model? 100% accurate!
  182. 182. Can we trust this model? Consider the following example: 100% accurate!
  183. 183. Can we trust this model? Consider the following example: Whatever happens, predict 0 100% accurate!
  184. 184. Can we trust this model? Consider the following example: Whatever happens, predict 0 Accuracy = 49/50 100% accurate!
  185. 185. Can we trust this model? Consider the following example: Whatever happens, predict 0 Accuracy = 98% 100% accurate!
186. 186. Can we trust this model? Consider the following example: A class-count histogram could help you figure out whether your dataset is unbalanced 100% accurate!
187. 187. Can we trust this model? Consider the following example: What if my data is unbalanced? A class-count histogram could help you figure out whether your dataset is unbalanced 100% accurate!
188. 188. Can we trust this model? Consider the following example: There are a few ways; we are going to discuss them later What if my data is unbalanced? A class-count histogram could help you figure out whether your dataset is unbalanced 100% accurate!
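The "predict 0 whatever happens" example in plain NumPy, plus the class-count check; the numbers follow the slide (49 of one class, 1 of the other):

    import numpy as np

    def accuracy(y_true, y_pred):
        # accuracy = correctly classified instances / total number of instances
        return np.mean(np.array(y_true) == np.array(y_pred))

    y_true = np.array([0] * 49 + [1])          # unbalanced: 49 zeros, 1 one
    y_pred = np.zeros(50, dtype=int)           # "whatever happens, predict 0"
    print(accuracy(y_true, y_pred))            # 0.98 despite a useless classifier

    # A quick class histogram reveals whether the dataset is unbalanced.
    values, counts = np.unique(y_true, return_counts=True)
    print(dict(zip(values, counts)))           # {0: 49, 1: 1}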
  189. 189. Can we trust this model? In our case data is balanced: 100% accurate!
  190. 190. 100% accurate! Can we trust this model? We have balanced data:
  191. 191. 100% accurate! Can we trust this model? We have balanced data:
  192. 192. 100% accurate! Can we trust this model? We have balanced data:
  193. 193. 100% accurate! Can we trust this model? We have balanced data: 😒
  194. 194. So, what happened? 100% accurate!
  195. 195. Training the model Feature#2 Feature #1 Let’s add more examples
  196. 196. Training the model Feature#2 Feature #1
  197. 197. Training the model Still linearly separable Feature#2 Feature #1
  198. 198. Still linearly separable Training the model Feature#2 Feature #1
  199. 199. Training the model Feature#2 Feature #1
  200. 200. Training the model Feature#2 Feature #1 How about now?
  201. 201. Feature#2 Feature #1 Training the model Feature#2 Feature #1
202. 202. Simple; not a perfect fit Complicated; ideal fit Which model should we use? Training the model Feature#2 Feature #1 Feature#2 Feature #1
203. 203. Simple; not a perfect fit Complicated; ideal fit Training the model Which model should we use? Feature#2 Feature #1 Feature#2 Feature #1
204. 204. Simple; not a perfect fit Complicated; ideal fit Training the model Which model should we use? Feature#2 Feature #1 Feature#2 Feature #1
205. 205. Simple; not a perfect fit Complicated; ideal fit Training the model Which model should we use? Feature#2 Feature #1 Feature#2 Feature #1
  206. 206. So, what happened? Overfitting 100% accurate!
  207. 207. So, what happened? Too general model Just right! Overfitting 100% accurate!
  208. 208. So, what happened? Too general model Just right! Overfitting We should split our data into train and test sets 100% accurate!
  209. 209. Split into train and test
210. 210. Split into train and test Normally we would split the data into 80% train and 20% test sets
211. 211. Split into train and test Normally we would split the data into 80% train and 20% test sets As we have a lot of data we can afford a 50/50 ratio
212. 212. Split into train and test Can we do better than 90%? Normally we would split the data into 80% train and 20% test sets As we have a lot of data we can afford a 50/50 ratio
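A minimal sketch of the split (and of why the "100% accurate" model was suspect), using scikit-learn's digits dataset as a stand-in; the gamma values are arbitrary illustrations of a simple vs an overly complex model:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    # Hold out 20% as a test set; with lots of data a 50/50 ratio also works.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    simple = SVC(kernel="rbf", gamma=0.001).fit(X_train, y_train)
    complex_model = SVC(kernel="rbf", gamma=10).fit(X_train, y_train)

    # The overly complex model typically scores ~100% on its own training data
    # but much worse on the held-out test data, i.e. it overfits.
    print(simple.score(X_train, y_train), simple.score(X_test, y_test))
    print(complex_model.score(X_train, y_train), complex_model.score(X_test, y_test))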
  213. 213. Parameter tuning
  214. 214. Model hyper-parameter
  215. 215. Pixel#215 Pixel #213 254 2540 0 Model hyper-parameter
  216. 216. Pixel#215 Pixel #213 254 2540 0 C = 1
217. 217. Pixel#215 Pixel #213 254 2540 0 In red are areas where a penalty is applied to instances close to the line C = 1
218. 218. Pixel#215 Pixel #213 254 2540 0 In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied C = 1
219. 219. Pixel#215 Pixel #213 254 2540 0 In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied The total amount of penalty applied to the classifier is called the loss Classifiers try to minimise the loss by adjusting their parameters C = 1
220. 220. Pixel#215 Pixel #213 254 2540 0 In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied The total amount of penalty applied to the classifier is called the loss Classifiers try to minimise the loss by adjusting their parameters C = 1 This instance increases the penalty
221. 221. Pixel#215 Pixel #213 254 2540 0 The total amount of penalty applied to the classifier is called the loss Classifiers try to minimise the loss by adjusting their parameters Now it is in a green area In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied C = 1
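To see the effect of C in code (a sketch on the digits dataset; the exact numbers will vary): a small C makes penalties cheap and keeps a wide margin with many support vectors, while a large C makes penalties expensive and fits the training data harder:

    from sklearn.datasets import load_digits
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    for C in (0.01, 1, 100):
        model = SVC(kernel="linear", C=C).fit(X, y)
        # Number of support vectors and accuracy on the training data itself.
        print(C, model.n_support_.sum(), model.score(X, y))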
222. 222. Parameter tuning
Algorithm: Hyper-parameters
K-nearest neighbour: K, the number of neighbours (1,…,100)
Decision Tree: split metric (‘gini’, ‘information gain’)
Random Forest: number of trees (3,…,100; the more the better), split metric (‘gini’, ‘information gain’)
SVM: C (10^-5,…,10^2) and gamma (10^-15,…,10^2)
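The simplest way to tune such a hyper-parameter is to try several values and keep the one that scores best on held-out data, which is exactly what the next slides do with C; a hedged sketch (gamma is fixed to an arbitrary illustrative value):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # Try a few candidate values of C and compare their held-out scores.
    for C in (1e-5, 1e-3, 1e-1, 10, 100):
        score = SVC(kernel="rbf", C=C, gamma=0.001).fit(X_train, y_train).score(X_val, y_val)
        print(C, score)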
  223. 223. Let’s try different C maybe our score will improve
  224. 224. Let’s try different C maybe our score will improve Nope…
  225. 225. Let’s try different C maybe our score will improve Fail again…
  226. 226. Let’s try different C maybe our score will improve It is getting depressing…
  227. 227. Let’s try different C maybe our score will improve Hurrah!
  228. 228. Let’s try different C maybe our score will improve Hurrah! You may not have noticed but…
  229. 229. Let’s try different C maybe our score will improve Hurrah! You may not have noticed but… We are overfitting again…
  230. 230. The whole dataset 100%
  231. 231. Training 60% The whole dataset 100%
  232. 232. Training 60% For fitting initial model The whole dataset 100%
  233. 233. Training 60% For fitting initial model Validation 20% The whole dataset 100%
  234. 234. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  235. 235. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  236. 236. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  237. 237. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  238. 238. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7
  239. 239. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7 Testing 20%
240. 240. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7 Testing 20% For one-shot evaluation of the trained model 5/5
241. 241. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7 Testing 20% For one-shot evaluation of the trained model 5/5 But what happens when you overfit the validation set?
242. 242. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one-shot evaluation of the trained model 5/5 You're doing great! 🙂
243. 243. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one-shot evaluation of the trained model 5/5 You're doing great! 🙂
244. 244. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one-shot evaluation of the trained model 4/5 You're doing great! 🙂
245. 245. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one-shot evaluation of the trained model 4/5 You're doing great! 🙂 😒
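A sketch of carving out the 60/20/20 split with two calls to train_test_split (the digits dataset is again just a stand-in):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)

    # First put 20% aside as the untouched test set ...
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0)
    # ... then split the remaining 80% into training and validation
    # (0.25 of the remaining 80% equals 20% of the original data).
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=0)

    print(len(X_train), len(X_val), len(X_test))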
  246. 246. The whole dataset 100% Cross Validation (CV) Algorithm
  247. 247. Training data 80% Cross Validation (CV) Algorithm Test 20%
  248. 248. Training data 80% Cross Validation (CV) Algorithm
  249. 249. 20%20%20% 20% Training data 80% Cross Validation (CV) Algorithm
  250. 250. Training data 80% Cross Validation (CV) Algorithm 20%20%20% 20% Train on 60% of data Validate on 20% 20%20%20% 20%
  251. 251. 20%20%20% 20% Training data 80% Cross Validation (CV) Algorithm TrainTrainTrain Val Train on 60% of data Validate on 20%
  252. 252. Cross Validation (CV) Algorithm 0.75 20%20%20% 20% Training data 80% TrainTrainTrain Val Train on 60% of data Validate on 20%
  253. 253. Cross Validation (CV) Algorithm 0.75 ValTrainTrain Train 0.85 20%20%20% 20% Training data 80% TrainTrainTrain Val
  254. 254. 20%20%20% 20% Training data 80% Cross Validation (CV) Algorithm 0.75 ValTrainTrain Train 0.85 TrainTrainTrain Val TrainValTrain Train 0.91
  255. 255. Cross Validation (CV) Algorithm 0.75 0.85 TrainTrainVal Train 0.91 0.68 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train
  256. 256. TrainTrainVal Train 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train Cross Validation (CV) Algorithm 0.75 0.85 0.91 0.68 MEAN (0.75, 0.85, 0.91, 0.68) = ?
257. 257. TrainTrainVal Train 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train Cross Validation (CV) Algorithm 0.75 0.85 0.91 0.68 MEAN (0.75, 0.85, 0.91, 0.68) ≈ 0.80
258. 258. TrainTrainVal Train 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train Cross Validation (CV) Algorithm 0.75 0.85 0.91 0.68 MEAN (0.75, 0.85, 0.91, 0.68) ≈ 0.80 Choose the best model/parameters based on this estimate and then apply it to the test set
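The same 4-fold procedure in one call, sketched with scikit-learn's cross_val_score (the fold scores on the slide are illustrative; these will differ):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Four folds: each takes a turn as the validation set while the other
    # three are used for training; the mean score drives model/parameter choice.
    scores = cross_val_score(SVC(kernel="rbf", C=1), X_train, y_train, cv=4)
    print(scores, np.mean(scores))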
  259. 259. Machine Learning pipeline
  260. 260. Raw Data Machine Learning pipeline
  261. 261. Raw Data Preprocessing Machine Learning pipeline
  262. 262. Raw Data Preprocessing Feature extraction Machine Learning pipeline
  263. 263. Raw Data Preprocessing Feature extraction Split into train & test Machine Learning pipeline
  264. 264. Raw Data Preprocessing Feature extraction Split into train & test test set Machine Learning pipeline
  265. 265. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Machine Learning pipeline
  266. 266. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Machine Learning pipeline
  267. 267. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Machine Learning pipeline
  268. 268. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Machine Learning pipeline
  269. 269. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Evaluate final model on the test set test set Machine Learning pipeline
  270. 270. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Evaluate final model on the test set test set Machine Learning pipeline Report your results
  271. 271. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Evaluate final model on the test set test set Machine Learning pipeline Report your results Problem
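The whole pipeline, end to end, as a hedged scikit-learn sketch (the digits dataset replaces the raw-data and feature-extraction steps, and the parameter grid is illustrative):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Raw data -> features (already extracted in this dataset).
    X, y = load_digits(return_X_y=True)

    # Split into train & test; the test set is touched exactly once, at the end.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Choose a model (preprocessing + SVM), find its best parameters with CV,
    # then refit on the whole training set (GridSearchCV does this by default).
    pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
    search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)

    # Evaluate the final model on the test set and report the result.
    print(search.best_params_, search.score(X_test, y_test))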
272. 272. A machine learning algorithm usually corresponds to a combination of the following 3 elements: The choice of a specific mapping function family F (K-NN, SVM, DT, RF, Neural Networks, etc.).
273. 273. A machine learning algorithm usually corresponds to a combination of the following 3 elements: A way to evaluate the quality of a function f out of F, i.e. a way of saying how badly or how well this function f classifies real-world objects. The choice of a specific mapping function family F (K-NN, SVM, DT, RF, Neural Networks, etc.).
274. 274. A machine learning algorithm usually corresponds to a combination of the following 3 elements: A way to search for a better function f out of F, i.e. how to choose parameters so that the performance of f improves. A way to evaluate the quality of a function f out of F, i.e. a way of saying how badly or how well this function f classifies real-world objects. The choice of a specific mapping function family F (K-NN, SVM, DT, RF, Neural Networks, etc.).
  275. 275. https://github.com/sugyan/tensorflow-mnist
276. 276. References • Machine Learning by Andrew Ng (https://www.coursera.org/learn/machine-learning) • Introduction to Machine Learning by Pascal Vincent, given at the Deep Learning Summer School, Montreal 2015 (http://videolectures.net/deeplearning2015_vincent_machine_learning/) • Welcome to Machine Learning by Konstantin Tretyakov, delivered at the AACIMP Summer School 2015 (http://kt.era.ee/lectures/aacimp2015/1-intro.pdf) • Stanford CS class: Convolutional Neural Networks for Visual Recognition by Andrej Karpathy (http://cs231n.github.io/) • Data Mining Course by Jaak Vilo at the University of Tartu (https://courses.cs.ut.ee/MTAT.03.183/2017_spring/uploads/Main/DM_05_Clustering.pdf) • Machine Learning Essential Concepts by Ilya Kuzovkin (https://www.slideshare.net/iljakuzovkin) • From the brain to deep learning and back by Raul Vicente Zafra and Ilya Kuzovkin (http://www.uttv.ee/naita?id=23585&keel=eng)
  277. 277. www.biit.cs.ut.ee www.ut.ee www.quretec.ee
278. 278. You guys rock!
