2. Trinity College Dublin, The University of Dublin
Overview of the previous lecture
• Inspecting the data
• Data visualisation
• Descriptive and inferential statistics
• Central limit theorem
• Correlation
Overview of this lecture
• More about correlation
• Classification – a bit of theory
• Binary classification
• Baseline
• Multiclass
• This is not in the test!
Why is the Normal distribution so important?
Central Limit Theorem
https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1
A core theorem for statistics and statistical inference
[Figure: subsamples drawn from a population]
Consider the mean value of each subsample. The distribution of these sample means is the sampling distribution. The CLT tells us that, for large enough subsamples, it is approximately normally distributed.
Sampling -> e.g., elections
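As a rough illustration of the CLT, the sketch below draws many subsamples from a clearly non-normal population and looks at the distribution of their means. The exponential population, the subsample size and the number of subsamples are all assumptions chosen for the demo, not values from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

def draw_subsample(size):
    """Draw one subsample from a hypothetical skewed population (exponential, mean 1)."""
    return [random.expovariate(1.0) for _ in range(size)]

n_subsamples = 2000   # how many subsamples we draw (arbitrary choice)
subsample_size = 50   # observations per subsample (arbitrary choice)

# Sampling distribution = distribution of the subsample means.
sample_means = [statistics.mean(draw_subsample(subsample_size))
                for _ in range(n_subsamples)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# For a normal distribution, about 68% of values lie within one standard
# deviation of the mean; we check how close the sample means come to that.
within_one_sd = sum(abs(m - mean_of_means) <= sd_of_means
                    for m in sample_means) / n_subsamples

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: 1/sqrt(50), about 0.141)")
print(f"share within 1 sd:    {within_one_sd:.2f} (normal: about 0.68)")
```

Even though the population itself is strongly skewed, the sample means cluster symmetrically around the population mean, which is what the CLT predicts.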
Correlation coefficient
If y tends to be above its mean when x is above its mean, and below its mean when x is below, then the correlation is positive.
Example: The higher you go on a mountain, the colder it usually gets (negative correlation between altitude and temperature).
Correlation coefficient
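Pearson’s linear correlation (computed on the raw values):
x: 2 5 4 1
y: 3 1 6 3
Let’s sort by x:
x: 1 2 4 5
y: 3 3 6 1
Spearman’s rank correlation (computed on the ranks r of x and s of y, in the original order):
r: 2 4 3 1
s: 2 1 3 2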
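To make the difference between the two coefficients concrete, here is a minimal sketch that computes both on the four (x, y) pairs from the slide. It assumes Spearman's coefficient is obtained by applying Pearson's formula to the ranks, and it gives tied values the average of their ranks (a common convention; the slide instead gives both tied y values the rank 2). The function names are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson's formula applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [2, 5, 4, 1]
y = [3, 1, 6, 3]

print("ranks of x:", average_ranks(x))
print("ranks of y:", average_ranks(y))
print("Pearson r: ", round(pearson(x, y), 3))
print("Spearman:  ", round(spearman(x, y), 3))
```

On these four points both coefficients come out weakly negative, with the rank-based value somewhat further from zero than the linear one.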
Correlation is NOT causation!
NEGATIVE CORRELATION (the x-axis was flipped here)
Correlation script (see Blackboard week 4) – let’s play with it
In brief
Correlation coefficient (r-value): how strong the (linear) relationship is
Statistical significance of the correlation (p-value): how likely a correlation at least this strong would be if there were no real relationship. The smaller the p-value, the more confident we are that the correlation is not there by chance
(both the correlation strength and the number of data-points affect the p-value)
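One way to see why the number of data-points matters is a permutation test: shuffle y many times and count how often chance alone gives a correlation at least as strong as the observed one. This is a sketch of one possible way to estimate a p-value, not necessarily how the lecture's correlation script computes it. The toy data, the number of shuffles and the function names are assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_shuffled)  # break any real link between x and y
        # Small tolerance so that exact ties with the observed value count.
        if abs(pearson(x, y_shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / n_permutations

# Hypothetical toy data: the same pattern with 5 points and with 10 points.
x_small = [1, 2, 3, 4, 5]
y_small = [1, 3, 2, 5, 4]
x_large = x_small * 2
y_large = y_small * 2

r_small = pearson(x_small, y_small)
r_large = pearson(x_large, y_large)
p_small = permutation_p_value(x_small, y_small)
p_large = permutation_p_value(x_large, y_large)

print(f"n = 5:  r = {r_small:.2f}, p = {p_small:.3f}")
print(f"n = 10: r = {r_large:.2f}, p = {p_large:.3f}")
```

Both datasets have the same r-value, but the larger one yields a much smaller p-value: the same correlation strength is harder to obtain by chance with more data-points.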
Classification
• Supervised learning
• Spam filter example
• Binary classification (2 classes: “spam” vs. “not spam”)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
Classification
• What could our features be?
• Particular keywords (e.g., “Dear Respected Dr., greetings”)
• Particular senders
• Email sent simultaneously to thousands of people
• Each of those features on its own is insufficient (e.g., an email sent to thousands of people is not necessarily spam)
• Building an ML model based on all those features simultaneously would be much better
• Combine them somehow. Some features are more useful for the classification than others (higher classification weight)
Classification
For example, using only one feature: the number of possible spam keywords, n
• n > nk: spam
• n < nk: not spam
where nk is the threshold on the number of keywords
Email ID   Number of spam keywords
1          3
2          2
3          25
4          3
…          …
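The single-feature rule can be sketched in a few lines. The keyword counts are the ones listed for email IDs 1 to 4; the threshold value of 10 is a hypothetical choice (the slide does not give a value for nk), and the function name is illustrative.

```python
# Hypothetical threshold nk; in practice it would be chosen from labelled data.
KEYWORD_THRESHOLD = 10

def classify_by_keyword_count(n_keywords, threshold=KEYWORD_THRESHOLD):
    """Return 'spam' if the keyword count exceeds the threshold, else 'not spam'."""
    return "spam" if n_keywords > threshold else "not spam"

# Number of spam keywords per email ID, as listed on the slide.
keyword_counts = {1: 3, 2: 2, 3: 25, 4: 3}

predictions = {email_id: classify_by_keyword_count(n)
               for email_id, n in keyword_counts.items()}

for email_id, label in predictions.items():
    print(email_id, keyword_counts[email_id], label)
```

With this threshold only email 3 is flagged. Moving the threshold up or down trades missed spam against legitimate emails wrongly flagged.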
Classification
For example, two features can be used simultaneously to identify a better classification boundary than with either feature individually
But how do we do that? ML!
Classification
Binary classification task:
- Is this a number five or not?
- Classes in the dataset: 1 0 0 0 0
Multiclass classification task:
- What digit is this?
- Classes in the dataset: 5 0 4 1 9
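The two tasks use the same images and differ only in the labels. A minimal sketch, using the five example labels from the slide (the variable names are illustrative):

```python
# Multiclass labels: which digit each image shows.
digit_labels = [5, 0, 4, 1, 9]

# Binary labels for the "is this a five?" task: 1 for a five, 0 otherwise.
is_five_labels = [int(digit == 5) for digit in digit_labels]

print("multiclass:", digit_labels)
print("binary:    ", is_five_labels)
```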
Classification
Features
- Each pixel is a feature.
- For example, 64x64 pixels would mean 4096 features.
- Each feature is a number from 0 to 1 (greyscale),
where 1 is white, 0 is black, and in-between there are
various shades of grey.
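Turning an image into features amounts to flattening the pixel grid into one long list. The sketch below assumes the raw pixels are stored as integers from 0 to 255 (a common convention, not stated on the slide) and rescales them to the 0 to 1 range. The test image is made up.

```python
def image_to_features(image, max_value=255):
    """Flatten a 2-D greyscale image into one list of values scaled to 0..1."""
    return [pixel / max_value for row in image for pixel in row]

# Hypothetical 64x64 image: black background (0) with a white 20x20 square (255).
size = 64
image = [[255 if 20 <= r < 40 and 20 <= c < 40 else 0 for c in range(size)]
         for r in range(size)]

features = image_to_features(image)

print("number of features:", len(features))
print("min and max value: ", min(features), max(features))
```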
Classification
- A LINEAR classifier determines whether the class should be 1 or 0 by computing a linear combination of all features and comparing it with a threshold
- A linear combination is a weighted sum of all features
- E.g., weight1*pixel1 + weight2*pixel2 + …
- The weights are chosen to maximise the classification accuracy
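The weighted sum can be sketched as follows. The weights and the bias term are hand-picked, hypothetical values for a tiny 4-pixel "image". In a real classifier they would be learned from the training data, and the decision threshold of 0 is an assumption.

```python
def linear_score(weights, features, bias=0.0):
    """Weighted sum of the features: bias + weight1*pixel1 + weight2*pixel2 + ..."""
    return bias + sum(w * f for w, f in zip(weights, features))

def predict(weights, features, bias=0.0):
    """Class 1 if the weighted sum is positive, class 0 otherwise."""
    return 1 if linear_score(weights, features, bias) > 0 else 0

# Hypothetical weights for 4 pixels (not learned, chosen by hand for the demo).
weights = [0.8, -0.5, 0.3, -0.9]
bias = -0.2

image_a = [1.0, 0.0, 1.0, 0.0]  # bright where the weights are positive
image_b = [0.0, 1.0, 0.0, 1.0]  # bright where the weights are negative

for name, pixels in [("image_a", image_a), ("image_b", image_b)]:
    score = linear_score(weights, pixels, bias)
    print(name, "score:", round(score, 2), "class:", predict(weights, pixels, bias))
```

Pixels with a large positive weight push the prediction towards class 1 when they are bright, and pixels with a negative weight push it towards class 0.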
How do we even plot 4096 features??
Classification
Hypothesis-driven feature extraction? For example, the average pixel value in selected areas of the image (Feature 1, Feature 2)
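A minimal sketch of this idea reduces an image to two features by averaging the pixel values inside two hand-picked regions. The 4x4 image and the choice of regions (top-left and bottom-right quadrants) are hypothetical.

```python
def region_mean(image, row_start, row_end, col_start, col_end):
    """Mean pixel value over rows [row_start, row_end) and columns [col_start, col_end)."""
    values = [image[r][c]
              for r in range(row_start, row_end)
              for c in range(col_start, col_end)]
    return sum(values) / len(values)

# Hypothetical 4x4 greyscale image with values between 0 (black) and 1 (white).
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]

feature_1 = region_mean(image, 0, 2, 0, 2)  # top-left quadrant
feature_2 = region_mean(image, 2, 4, 2, 4)  # bottom-right quadrant

print("Feature 1:", feature_1)
print("Feature 2:", feature_2)
```

With only two features per image, every image becomes a point in a 2-D plot, which is much easier to inspect than thousands of raw pixel values.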
Classification
[Figure: “Five” and “Not a five” examples separated by a linear classification boundary]
Classification
X is the data matrix (features)
y is the class (‘five’ or ‘not a five’)
Confusion matrix
Prediction accuracy = (3+5)/(3+5+1+2) = 8/11 ≈ 0.73
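The accuracy calculation can be reproduced from true and predicted labels. The slide text only gives the four counts, so the sketch below assumes 3 true positives, 5 true negatives, 1 false positive and 2 false negatives. The label lists are made up to produce exactly those counts.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = positive class)."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "TP": pairs[(1, 1)],  # positive, predicted positive
        "TN": pairs[(0, 0)],  # negative, predicted negative
        "FP": pairs[(0, 1)],  # negative, predicted positive
        "FN": pairs[(1, 0)],  # positive, predicted negative
    }

def accuracy(counts):
    """Share of all predictions that are correct."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Hypothetical labels giving TP=3, TN=5, FP=1, FN=2 (assumed assignment).
y_true = [1] * 3 + [0] * 5 + [0] * 1 + [1] * 2
y_pred = [1] * 3 + [0] * 5 + [1] * 1 + [0] * 2

counts = confusion_counts(y_true, y_pred)
print(counts)
print("accuracy:", round(accuracy(counts), 2))
```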
Confusion matrix
S = Sick
H = Healthy
Not very good for diagnosis: ideally, we want almost 100% recall, because we don’t want to miss diagnosing a person who is sick.
[Figure: confusion matrix for 7 healthy (H) and 7 sick (S) people; annotations: “4 out of 7” and “4 out of 6”]
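Recall and precision can be computed directly from the counts. The sketch assumes that "4 out of 7" is the recall (4 of the 7 sick people are detected) and "4 out of 6" is the precision (4 of the 6 people flagged as sick really are sick). The slide text does not state which annotation is which, so treat this reading as an assumption.

```python
def recall(true_positives, false_negatives):
    """Share of the truly sick people that the model detects."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Share of the people flagged as sick who really are sick."""
    return true_positives / (true_positives + false_positives)

# Assumed reading of the figure: 7 sick people, 4 detected and 3 missed;
# 6 people flagged as sick, of whom 2 are actually healthy.
true_positives = 4
false_negatives = 3
false_positives = 2

model_recall = recall(true_positives, false_negatives)
model_precision = precision(true_positives, false_positives)

print(f"recall:    {model_recall:.2f}")
print(f"precision: {model_precision:.2f}")
```

Under this reading the model misses 3 of the 7 sick people, which is why it is a poor choice for diagnosis even though most of its alarms are correct.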
Confusion matrix
S = Sick
H = Healthy
A better model:
[Figure: confusion matrix for 7 healthy (H) and 7 sick (S) people; annotations: “6 out of 7” and “4 out of 6”]
What can go wrong? What’s a bad prediction?
Predictive models make errors! We want to minimise the error rate, but we also have to decide what kind of errors are acceptable. Do we prefer false positives (e.g., a red flag for a healthy person) or false negatives (e.g., not detecting a person who is sick)?
The same question applies beyond classification. For example, an airline overbooking system (e.g., Ryanair vs. American Airlines).
Goal? To maximise profit!
Let’s simplify this as much as we can:
Profit ≈ Total revenue - costs
Profit ≈ Σ(ticketCost + onboardPurchases) – Σ(fixed costs + sum(costPerPassenger) + ?)
Profit ≈ Σ(ticketCost + onboardPurchases) – Σ(fixed costs + sum(costPerPassenger) + costVouchers)
Trade-off / optimise profit: Probability of error vs. probability of having empty seats
We are going to make mistakes and pay vouchers. The question is how many we want to make.
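The trade-off can be explored numerically. The sketch below computes the expected profit for a given number of tickets sold, assuming each ticket holder shows up independently with the same probability (a binomial model, which is an assumption). All prices, costs and probabilities are made-up numbers, and onboard purchases are left out to keep the example short.

```python
import math

def show_up_probability(tickets_sold, k, show_prob):
    """Binomial probability that exactly k of the ticket holders show up."""
    return (math.comb(tickets_sold, k)
            * show_prob ** k * (1 - show_prob) ** (tickets_sold - k))

def expected_profit(tickets_sold, capacity, ticket_price, show_prob,
                    voucher_cost, cost_per_passenger, fixed_costs):
    """Expected profit = ticket revenue - fixed costs - passenger costs - vouchers."""
    revenue = tickets_sold * ticket_price
    expected_costs = fixed_costs
    for k in range(tickets_sold + 1):
        boarded = min(k, capacity)     # passengers who get a seat
        bumped = max(0, k - capacity)  # passengers who get a voucher instead
        expected_costs += show_up_probability(tickets_sold, k, show_prob) * (
            boarded * cost_per_passenger + bumped * voucher_cost)
    return revenue - expected_costs

def best_tickets_to_sell(capacity, max_overbooking=30, **params):
    """Number of tickets (from capacity up to capacity + max_overbooking) with the highest expected profit."""
    candidates = range(capacity, capacity + max_overbooking + 1)
    return max(candidates, key=lambda n: expected_profit(n, capacity, **params))

# Hypothetical numbers, for illustration only.
params = dict(ticket_price=100, show_prob=0.9, voucher_cost=400,
              cost_per_passenger=10, fixed_costs=5000)
capacity = 100

no_overbooking_profit = expected_profit(capacity, capacity, **params)
best_n = best_tickets_to_sell(capacity, **params)

print("expected profit without overbooking:", round(no_overbooking_profit, 2))
print("most profitable number of tickets:  ", best_n)
print("expected profit at that number:     ",
      round(expected_profit(best_n, capacity, **params), 2))
```

Raising `voucher_cost` pushes the most profitable number of tickets back towards the capacity, while lowering it makes more aggressive overbooking pay off.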
Profit ≈ Σ(ticketCost + onboardPurchases) – Σ(fixed costs + sum(costPerPassenger) + costVouchers)
In that case, the goal was to maximise the profit.
We could have other goals, e.g., minimising pollution or minimising the time cars spend on the road.