IntroML_4_Classification

Introduction to
Machine Learning
(5 ECTS)
Giovanni Di Liberto
Asst. Prof. in Intelligent Systems, SCSS
Room G.15, O’Reilly Institute ©Trinity College Dublin

Trinity College Dublin, The University of Dublin
Overview of the previous lecture
• Inspecting the data
• Data visualisation
• Descriptive and inferential statistics
• Central limit theorem
• Correlation

Overview of this lecture
• More about correlation
• Classification – a bit of theory
• Binary classification
• Baseline
• Multiclass
• This is not in the test!

Why is the Normal distribution so important?

Central Limit Theorem
https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1
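A core theorem for statistics and statistical inference
[Figure: a population and the subsamples drawn from it]
Consider the mean value of each subsample. The distribution of these sample means is the sampling distribution.
The CLT tells us that the sampling distribution is approximately normal when the subsamples are large enough, even if the population itself is not normally distributed.
Sampling -> e.g., elections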
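The CLT statement above can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the exponential population, the subsample size of 50 and the number of subsamples are all illustrative assumptions.

```python
import random
import statistics

def sample_means(n_subsamples, subsample_size, seed=0):
    """Draw subsamples from a skewed population and return each subsample's mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_subsamples):
        # Assumed population: exponential with mean 1 and standard deviation 1,
        # which is clearly not normally distributed
        subsample = [rng.expovariate(1.0) for _ in range(subsample_size)]
        means.append(statistics.fmean(subsample))
    return means

means = sample_means(n_subsamples=2000, subsample_size=50)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

print(f"mean of the sample means: {mean_of_means:.3f}")
print(f"sd of the sample means:   {sd_of_means:.3f}")
print(f"sd predicted by the CLT:  {1 / 50 ** 0.5:.3f}")  # sigma / sqrt(n)
```

The sample means should cluster around the population mean, with a spread close to sigma / sqrt(n), even though the population is strongly skewed.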

Correlation coefficient
If y tends to be above its mean when x is above its mean (and below its mean when x is below), then the correlation is positive. If y tends to be below its mean when x is above its mean, the correlation is negative.
Example: The higher you go on a mountain, the colder it usually gets (negative correlation between altitude and temperature)
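The "above or below the mean" idea maps directly onto how Pearson's r is computed. The sketch below is illustrative only: the altitude and temperature readings are made-up values, not data from the lecture.

```python
import math

def pearson_r(x, y):
    """Pearson's r: sum of products of deviations from the mean, normalised."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    # A product is positive when x and y are on the same side of their means
    # and negative when they are on opposite sides
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical readings: altitude in metres, temperature in degrees Celsius
altitude = [0, 500, 1000, 1500, 2000]
temperature = [15.0, 12.1, 8.4, 5.9, 2.0]

r = pearson_r(altitude, temperature)
print(f"r = {r:.3f}")  # close to -1: strong negative correlation
```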

Correlation coefficient
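Pearson’s linear correlation vs. Spearman’s rank correlation
Original data:
x 2 5 4 1
y 3 1 6 3
Let’s sort x:
x 1 2 4 5
y 3 3 6 1
Ranks of the original data (r = rank of x, s = rank of y; tied values share a rank):
r 2 4 3 1
s 2 1 3 2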
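As a rough sketch, Spearman's rank correlation can be computed as Pearson's correlation applied to the ranks. The x and y values below are the ones from the slide. One assumption to flag: this sketch gives tied values the average of their ranks (2.5 for the two 3s), a common convention, whereas the slide gives both the rank 2.

```python
import math

def pearson_r(x, y):
    """Pearson's linear correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

x = [2, 5, 4, 1]
y = [3, 1, 6, 3]

pearson = pearson_r(x, y)
spearman = pearson_r(average_ranks(x), average_ranks(y))
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```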

Correlation is NOT causation!
NEGATIVE CORRELATION (the x-axis was flipped here)

Correlation script (see Blackboard week 4) –
let’s play with it

In brief
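Correlation coefficient (r-value): How strong is the (linear) relationship?
Correlation statistical significance (p-value): How likely is it that a correlation at least this strong would appear by chance alone, if there were no real relationship? The smaller the p-value, the more confident we are that the correlation is not there by chance
(both the correlation strength and the number of data-points affect the p-value)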
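One way to see how the number of data-points affects the p-value is a permutation test: shuffle y many times and count how often the shuffled data gives a correlation at least as strong as the observed one. This is a minimal sketch using a technique chosen for illustration; it is not the Blackboard script, and the data are made up. Both datasets have exactly the same r (0.6), but the second has five times as many points.

```python
import math
import random

def pearson_r(x, y):
    """Pearson's linear correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate how often shuffled data is at least as correlated as the real data."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: 4 points, then the same 4 points repeated 5 times
x_small, y_small = [1, 2, 3, 4], [2, 1, 4, 3]
x_large, y_large = x_small * 5, y_small * 5

p_small = permutation_p_value(x_small, y_small)
p_large = permutation_p_value(x_large, y_large)
print(f"n=4,  r={pearson_r(x_small, y_small):.2f}, p~{p_small:.3f}")
print(f"n=20, r={pearson_r(x_large, y_large):.2f}, p~{p_large:.3f}")
```

With only 4 points, a correlation of 0.6 is easily produced by chance; with 20 points it is much harder to explain away.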

Classification
• Supervised learning
• Spam filter example
• Binary classification (2 classes: “spam” vs. “not spam”)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019

Classification
• What could our features be?
• Particular keywords (e.g., “Dear Respected Dr., greetings”)
• Particular senders
• Email sent simultaneously to thousands of people
• Each of those features is insufficient on its own (e.g., an email sent to thousands of people is not necessarily spam)
• Building an ML model based on all those features simultaneously would be much better
• Combine them somehow. Some features are more useful for the classification than others, so they get a higher classification weight

Classification
For example, using only one feature: the number of possible spam keywords, n
With a threshold nk:
n > nk: spam
n < nk: not spam
Email ID   Number of spam keywords
1          3
2          2
3          25
4          3
…          …
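The single-feature rule above can be sketched in a few lines. The keyword counts are the four rows shown in the table; the threshold value of 10 is a hypothetical choice, since the slides do not give a value for nk. The slides also leave the case n = nk open; this sketch treats it as "not spam".

```python
# Spam-keyword counts per email ID, taken from the table above
keyword_counts = {1: 3, 2: 2, 3: 25, 4: 3}

N_K = 10  # hypothetical threshold; the slides do not give a value for nk

def classify(n, threshold=N_K):
    """Label an email as spam when its keyword count n exceeds the threshold."""
    return "spam" if n > threshold else "not spam"

labels = {email_id: classify(n) for email_id, n in keyword_counts.items()}
for email_id, label in labels.items():
    print(email_id, label)
```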

Classification
For example, two features can be used simultaneously to identify a better classification boundary than with either feature individually
But how do we do that? ML!

Classification
Binary classification task:
- Is this a number five or not?
Dataset classes: 1 0 0 0 0
Multiclass classification task:
- What digit is this?
Dataset classes: 5 0 4 1 9
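The two label sets above are related: the binary labels can be derived from the multiclass ones. A minimal sketch using the five digit labels from the example:

```python
digits = [5, 0, 4, 1, 9]  # multiclass labels from the example above

# Binary task "is this a five?": 1 for a five, 0 for any other digit
is_five = [1 if d == 5 else 0 for d in digits]
print(is_five)  # [1, 0, 0, 0, 0]

# The multiclass task keeps one class per distinct digit
classes = sorted(set(digits))
print(classes)  # [0, 1, 4, 5, 9]
```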

Classification
Features
- Each pixel is a feature.
- For example, 64x64 pixels would mean 4096 features.
- Each feature is a number from 0 to 1 (greyscale),
where 1 is white, 0 is black, and in-between there are
various shades of grey.
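As a sketch of how an image becomes a feature vector: the code below flattens a 64x64 image row by row and rescales each pixel to the range 0 to 1. The 8-bit input (0 to 255) and the gradient image are assumptions made for illustration; a real dataset may already store pixels as values between 0 and 1.

```python
WIDTH, HEIGHT = 64, 64

# Hypothetical 8-bit greyscale image (0 = black, 255 = white) stored as a list of rows;
# a simple left-to-right gradient stands in for a real digit image
image = [[(col * 255) // (WIDTH - 1) for col in range(WIDTH)] for _ in range(HEIGHT)]

# Flatten row by row and rescale each pixel to the range 0..1
features = [pixel / 255 for row in image for pixel in row]

print(len(features))                 # 4096
print(min(features), max(features))  # 0.0 1.0
```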

Classification
Class: 1
- A LINEAR classifier decides whether the class should be 1 or 0 by computing a linear combination of all features and comparing it with a threshold
- A linear combination is a weighted sum of all features
- E.g., weight1*pixel1 + weight2*pixel2 + …
- The weights are chosen to maximise the classification accuracy
How do we even plot 4096 features??
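The weighted sum used by a linear classifier can be sketched as follows. The weights here are illustrative values, not learned from data, and the example uses 4 pixel features instead of 4096 to stay readable. The bias term (an offset that plays the role of the decision threshold) is an assumption added for this sketch; the slides mention only the weights.

```python
def linear_score(weights, features, bias=0.0):
    """Weighted sum of the features: weight1*pixel1 + weight2*pixel2 + ..."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(weights, features, bias=0.0):
    """Return class 1 when the score is positive, otherwise class 0."""
    return 1 if linear_score(weights, features, bias) > 0 else 0

# Tiny hypothetical example with 4 pixel features
weights = [0.8, -0.5, 0.3, -0.2]  # illustrative values, not learned from data
bias = -0.25

image_a = [1.0, 0.0, 1.0, 0.0]
image_b = [0.0, 1.0, 0.0, 1.0]

class_a = predict(weights, image_a, bias)
class_b = predict(weights, image_b, bias)
print(class_a, class_b)
```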

Classification
Hypothesis-driven feature extraction? E.g., the average pixel value in selected areas of the image
[Figure: two selected image areas, labelled Feature 1 and Feature 2]
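Averaging the pixels in a selected area reduces thousands of pixel features to a handful that can be plotted. A minimal sketch, with a made-up 4x4 image and two hand-picked areas; the slides do not say which areas were actually used.

```python
def region_mean(image, rows, cols):
    """Average pixel value inside the rectangle given by row and column ranges."""
    values = [image[r][c] for r in rows for c in cols]
    return sum(values) / len(values)

# Hypothetical 4x4 greyscale image with values in 0..1
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

# Two hand-picked areas (an assumption for illustration)
feature_1 = region_mean(image, range(0, 2), range(2, 4))  # top-right quadrant
feature_2 = region_mean(image, range(2, 4), range(0, 2))  # bottom-left quadrant
print(feature_1, feature_2)  # 1.0 0.5
```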

Classification
[Figure: ‘Five’ and ‘Not a five’ samples separated by a linear classification boundary]

Classification
X is the data matrix (features)
y is the class (‘five’ or ‘not a five’)
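To make the roles of X and y concrete, here is a sketch that learns a linear boundary from a small data matrix. It uses the perceptron update rule, chosen here only because it is short; it is one simple way of learning weights and not necessarily the method or library used in the lecture. The data values, learning rate and number of epochs are hypothetical.

```python
def train_perceptron(X, y, epochs=20, learning_rate=0.1):
    """Learn weights and a bias for a linear boundary with the perceptron rule."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            score = sum(w * f for w, f in zip(weights, features)) + bias
            predicted = 1 if score > 0 else 0
            error = target - predicted  # 0 when correct, +1 or -1 when wrong
            weights = [w + learning_rate * error * f for w, f in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

def predict(weights, bias, features):
    """Return class 1 when the weighted sum plus bias is positive, otherwise 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

# Hypothetical data matrix: one row per image, two features per row
X = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.1], [0.2, 0.8], [0.3, 0.9], [0.1, 0.7]]
y = [1, 1, 1, 0, 0, 0]  # 1 = 'five', 0 = 'not a five'

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, row) for row in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(predictions, accuracy)
```

This toy dataset is linearly separable, so the rule finds a boundary that classifies every training row correctly. On real data, accuracy should be measured on examples not used for training.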

Confusion matrix
Prediction Accuracy = (3+5)/(3+5+1+2) = 8/11 ~ 0.73
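The accuracy above can be reproduced from lists of true and predicted labels. The labels below are hypothetical, built to give the same four counts (3, 5, 1, 2). Which count is the true positives and which cell holds the false positives or the false negatives is an assumption; the arithmetic alone does not say.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Hypothetical labels giving 3 TP, 5 TN, 1 FP and 2 FN (assumed assignment)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```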

Confusion matrix
S = Sick
H = Healthy
[Figure: individuals marked S or H, with the annotations "4 out of 7" and "4 out of 6"]
Not very good for diagnosis: ideally, recall should be almost 100%. We don’t want to miss diagnosing a person who is sick.

Confusion matrix
Prediction Accuracy = (3+5)/(3+5+1+2) = 8/11 ~ 0.73
S = Sick
H = Healthy
A better model
[Figure: individuals marked S or H, with the annotations "6 out of 7" and "4 out of 6"]

What can go wrong? What’s a bad prediction?
Predictive models make errors! We want to minimise the error rate, but we also have to decide what kind of
errors are acceptable. Do we prefer false positives (e.g., a red flag for a healthy person) or false negatives
(e.g., not detecting a person who is sick)?
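The trade-off between the two kinds of error can be made visible by moving a decision threshold. This is an illustrative sketch: the risk scores and true labels are made up, and flagging a person when the score is at or above the threshold is an assumed rule.

```python
# Hypothetical risk scores from some model (higher = more likely sick) and true labels
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
sick = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

def error_counts(threshold):
    """Return (false positives, false negatives) when flagging scores >= threshold."""
    flagged = [s >= threshold for s in scores]
    fp = sum(1 for f, t in zip(flagged, sick) if f and t == 0)
    fn = sum(1 for f, t in zip(flagged, sick) if not f and t == 1)
    return fp, fn

for threshold in (0.7, 0.5, 0.35):
    fp, fn = error_counts(threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Lowering the threshold misses fewer sick people but raises more red flags for healthy ones; which balance is right depends on the cost of each error.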
The same question applies beyond classification. For example, consider an
overbooking system (e.g., Ryanair vs. American Airlines).
Goal? To maximise profit!
Let’s simplify this as much as we can:
Profit ≈ Total revenue - costs
Profit ≈ Σ(ticketCost + onboardPurchases) – Σ(fixed costs + sum(costPerPassenger) + ?)
Profit ≈ Σ(ticketCost + onboardPurchases) – Σ(fixed costs + sum(costPerPassenger) + costVouchers)
Trade-off / optimise profit: Probability of error vs. probability of having empty seats
We are going to make mistakes and pay vouchers. The question is how many we want to make.
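The trade-off can be sketched numerically. The code below follows the simplified profit formula above, with two extra simplifications that are assumptions of this sketch: onboard purchases are left out, and each ticket holder shows up independently with the same probability. All numbers (seats, prices, show-up probability, voucher cost) are hypothetical.

```python
from math import comb

def expected_profit(tickets_sold, seats, p_show, ticket_price, fixed_cost,
                    cost_per_passenger, voucher_cost):
    """Expected profit when each ticket holder shows up independently with p_show."""
    expected_variable_cost = 0.0
    for shows in range(tickets_sold + 1):
        # Binomial probability that exactly `shows` passengers turn up
        probability = (comb(tickets_sold, shows)
                       * p_show ** shows
                       * (1 - p_show) ** (tickets_sold - shows))
        boarded = min(shows, seats)
        bumped = max(shows - seats, 0)  # passengers who receive a voucher
        expected_variable_cost += probability * (
            boarded * cost_per_passenger + bumped * voucher_cost)
    revenue = tickets_sold * ticket_price
    return revenue - fixed_cost - expected_variable_cost

# Hypothetical flight: every value here is an illustrative assumption
params = dict(seats=100, p_show=0.9, ticket_price=100.0, fixed_cost=5000.0,
              cost_per_passenger=10.0, voucher_cost=300.0)

profits = {n: expected_profit(n, **params) for n in range(100, 121)}
best = max(profits, key=profits.get)
print(f"no overbooking: expected profit {profits[100]:.0f}")
print(f"best in range:  {best} tickets, expected profit {profits[best]:.0f}")
```

Under these assumptions, selling somewhat more tickets than seats raises the expected profit, until the expected voucher payments outweigh the extra ticket revenue.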

Profit ≈ Σ(ticketCost + onboardPurchases) – Σ(fixed costs + sum(costPerPassenger) + costVouchers)
In that case, the goal was to maximise the profit
We could have other goals, e.g., minimising pollution or minimising the time that cars spend on the road
