2. Trinity College Dublin, The University of Dublin
Overview of the previous lecture
• Programming languages (High-level vs. low-level, compiled vs. interpreted)
• Timeline of programming languages
• Why Python
• Setting up Anaconda
• Basic Python instructions
3. Trinity College Dublin, The University of Dublin
Overview of this lecture
• More about Python coding (functions, libraries, IDE, Jupyter notebook)
• Basic plots in Python (Excel style)
• Looking into a real dataset
• The importance of looking at the data in ML
• More complicated plots
• Descriptive statistics
4. Trinity College Dublin, The University of Dublin 4
Introduction to Machine Learning – 2022-2023
Schedule (columns on the slide: Week, Lecture, Lab, Additional hour and deadlines)
Week 1: Overview; Introduction to programming and Python programming tutorial
Week 2: Descriptive stats vs. ML. Data visualisation. Discussion board task explanation; Python programming and basic visualisation tutorial
Week 3: Supervised learning. Simple classifiers; Data visualisation and simple classification tutorial
Week 4: Classification – part 1; Supervised learning: model quality evaluation tutorial; Data preparation and visualisation test (1h), immediately after the tutorial
Week 5: Classification – part 2: algorithms. Homework 1 explanation; Classification tutorial
Week 6: Classification – part 3 and regression; Regression and time-series tutorial
Week 7: Reading week
Week 8: Regression. Homework 2 explanation; Unsupervised learning tutorial; Homework 1 deadline
Week 9: Regression and unsupervised learning; Homework hour with Q&A
Week 10: Recap and feature selection; Anomaly detection tutorial
Week 11: Data sharing, storage, and privacy; Homework hour with Q&A; Homework 2 deadline
Week 12: Guest lecture; Discussion
Written test (2h)
5. Trinity College Dublin, The University of Dublin 5
End-to-end ML pipeline
Problem/question -> Data collection -> Preprocessing/cleaning -> Analysing -> Interpretation/outcome -> Improve
(Diagram labels: "ML" and, at three points along the pipeline, "Visualisation")
Running the pipeline without looking at the data or understanding it is a big mistake.
6. Trinity College Dublin, The University of Dublin 6
End-to-end ML pipeline
- Running ML algorithms without understanding the problem or the data is dangerous and not very useful.
- ML is a tool, not magic! We are the users. We need to understand:
1. How the ML algorithm we use works (let’s not use ML as a black box)
2. What the problem/question is
3. What the challenges/issues with the dataset are (otherwise, the results may be misleading)
7. Trinity College Dublin, The University of Dublin 7
City Bikes in Smart Cities
IoT (Internet of Things): a network of physical objects (e.g., sensors)
Oliveira F. et al., Survey of Technologies and Recent Developments for Sustainable Smart Cycling. Sustainability 2021
8. Trinity College Dublin, The University of Dublin 8
City Bikes in Smart Cities
Oliveira F. et al., Survey of Technologies and Recent
Developments for Sustainable Smart Cycling. Sustainability 2021
- Predict the number of bike-shares for a particular station: when does the company need to move the bikes, and how can it minimise cost?
9. Trinity College Dublin, The University of Dublin 9
Looking at the data
First, look into what exactly the dataset is
10. Trinity College Dublin, The University of Dublin 10
Looking at the data
Look at the specifics of the data structure
Numerical vs. categorical attributes
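A minimal sketch of how this check might look in pandas. The table below is a made-up stand-in for the bike dataset: all column names and values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Hypothetical bike-share records; column names and values are invented.
df = pd.DataFrame({
    "station_id": [1, 2, 3, 4],
    "available_bikes": [12, 0, 7, 25],
    "temperature_c": [11.5, 12.0, 9.8, 14.2],
    "status": ["open", "open", "closed", "open"],
    "weekday": ["Mon", "Mon", "Tue", "Tue"],
})

# Data type that pandas inferred for each column.
print(df.dtypes)

# Split the columns by type: numbers vs. everything else.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()
print("Numerical:", numerical)
print("Categorical:", categorical)

# Caveat: station_id is stored as a number but is really an identifier,
# so the stored type alone does not settle the question.
```

The split by stored type is only a starting point. Whether an attribute is numerical or categorical still has to be decided from what it means.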
11. Trinity College Dublin, The University of Dublin 11
Looking at the data
Look at a portion of the data. Is everything as stated in the specifics?
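One possible way to do this with pandas, as a sketch. The data, the column names, and the rule taken from the "specification" (0 <= available_bikes <= bike_stands) are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of station records; values are invented.
df = pd.DataFrame({
    "station_id": [1, 2, 3, 4, 5],
    "bike_stands": [20, 30, 40, 20, 30],
    "available_bikes": [12, 0, 45, 7, -1],
})

print(df.head(3))     # first few rows
print(df.describe())  # count, mean, std, min, quartiles, max per numerical column

# Assumed rule from the specification: 0 <= available_bikes <= bike_stands.
too_low = df["available_bikes"] < 0
too_high = df["available_bikes"] > df["bike_stands"]
violations = df[too_low | too_high]
print(violations)     # rows that contradict the stated specification
```

Rows that break the stated rules are worth investigating before any model is trained on them.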
12. Trinity College Dublin, The University of Dublin 12
Visualising/Inspecting the data
Maybe the company moved the bikes?
13. Trinity College Dublin, The University of Dublin 13
Data visualisation in Python
Many libraries. More than one approach.
Typical library:
- Matplotlib
(Seaborn for better graphics)
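A minimal Matplotlib sketch of a basic line plot. The hourly counts are made-up values, and the output file name is just an example.

```python
import matplotlib.pyplot as plt

# Hypothetical hourly counts of available bikes at one station (invented values).
hours = list(range(6, 13))
available_bikes = [18, 15, 6, 2, 5, 9, 12]

fig, ax = plt.subplots()
ax.plot(hours, available_bikes, marker="o")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Available bikes")
ax.set_title("Available bikes at one station")

# Save to a file; in a Jupyter notebook, plt.show() displays the figure inline instead.
fig.savefig("available_bikes.png")
```

This uses the object-oriented interface (`fig`, `ax`). Calling `plt.plot(...)` directly is another common approach.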
15. Trinity College Dublin, The University of Dublin 15
Visualising/Inspecting the data
https://colab.research.google.com/notebooks/charts.ipynb
20. Trinity College Dublin, The University of Dublin 20
Visualising/Inspecting the data
All districts in California
“Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, Aurélien Géron, 2019
21. Trinity College Dublin, The University of Dublin 21
Visualising/Inspecting the data
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
22. Trinity College Dublin, The University of Dublin 22
Visualising/Inspecting the data
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
23. Trinity College Dublin, The University of Dublin
Using preprocessed data?
Salary may not be just a number: it changes over time. If we have the monthly salary for the last 5 years, one approach is to calculate the mean or median monthly income. This is a form of preprocessing.
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
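As a small illustration of the salary example, the sketch below reduces 60 monthly salaries to a single number in two ways. The salary figures are invented.

```python
from statistics import mean, median

# Hypothetical monthly salaries over 5 years (60 values): 3000 for four years,
# then 3600 after a raise, plus one month with a large bonus. All numbers are made up.
monthly_salary = [3000] * 48 + [3600] * 11 + [9000]

mean_income = mean(monthly_salary)      # pulled upwards by the raise and the bonus
median_income = median(monthly_salary)  # the "typical" month

print("Mean monthly income:  ", mean_income)
print("Median monthly income:", median_income)
```

The two summaries give different values for the same person, which is why it matters to know how a preprocessed attribute was computed.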
24. Trinity College Dublin, The University of Dublin 24
Using preprocessed data?
It’s fine to use preprocessed data, but check how it was produced: understand exactly how the preprocessing was done. Otherwise, your results may be misleading!
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
25. Trinity College Dublin, The University of Dublin 25
Descriptive vs. Inferential Statistics
A descriptive statistic: a quantitative summary of selected features from a dataset (e.g., mean, median, std).
Descriptive statistics: the process of using and analysing those statistics.
Inferential statistics (or statistical inference): inferring properties of a population from a dataset sampled from that larger population.
(Diagram: Population -> Dataset. Descriptive stats summarise the dataset; inferential stats go from the dataset back to the population.)
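A short sketch of computing these descriptive statistics with Python's standard `statistics` module. The daily rental counts are a made-up sample.

```python
import statistics

# Hypothetical sample: bikes rented per day at one station over ten days.
rentals = [32, 41, 38, 29, 45, 38, 120, 35, 40, 37]

summary = {
    "mean": statistics.mean(rentals),
    "median": statistics.median(rentals),
    "std": statistics.stdev(rentals),  # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The single unusual day (120 rentals) raises the mean well above the median. A summary like this describes only the ten days in the sample; saying something about all days would be inference.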
26. Trinity College Dublin, The University of Dublin 26
Descriptive Statistics
Source: Wikipedia
The Normal distribution
https://www.youtube.com/watch?v=mtbJbDwqWLE
CLT (Central Limit Theorem) real-world example
https://www.youtube.com/watch?v=N7wW1dlmMaE
28. Trinity College Dublin, The University of Dublin 28
Descriptive Statistics
The Normal distribution
https://www.youtube.com/watch?v=mtbJbDwqWLE
(Figure: bell curve of height, with the mean and standard deviation marked)
The Normal distribution = Bell curve = Gaussian distribution (Carl Friedrich Gauss)
Mean = median = mode
Symmetrical, with asymptotic tails
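These properties can be checked numerically with `statistics.NormalDist` from the standard library. The mean of 170 cm and standard deviation of 10 cm are assumed values for illustration, not measurements.

```python
from statistics import NormalDist

# Hypothetical height distribution: mean 170 cm, standard deviation 10 cm (assumed).
height = NormalDist(mu=170, sigma=10)

# Symmetry: the density is the same at equal distances on either side of the mean.
left, right = height.pdf(160), height.pdf(180)
print(left, right)

# Mean = median = mode for a normal distribution.
print(height.mean, height.median, height.mode)

# Asymptotic tails: the density gets very small far from the mean but never reaches zero.
print(height.pdf(230))

# Share of the population within one standard deviation of the mean (about 68%).
within_one_sd = height.cdf(180) - height.cdf(160)
print(round(within_one_sd, 3))
```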
29. Trinity College Dublin, The University of Dublin 29
Why is the Normal distribution so important?
30. Trinity College Dublin, The University of Dublin 30
Central Limit Theorem
https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1
A core theorem for statistics and statistical inference.
(Diagram: Population -> Subsamples)
Take the mean of each subsample. The distribution of these sample means is called the sampling distribution. The CLT tells us that it is approximately normal when the subsamples are large enough, even if the population itself is not normally distributed.
Sampling -> e.g., elections
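The idea can be tried out with a small simulation, sketched here with the standard library only. The exponential population, the subsample size of 50, and the number of subsamples are all arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # reproducible run

# Skewed, clearly non-normal population (exponential with mean 1); a toy example.
population = [random.expovariate(1.0) for _ in range(20_000)]

# Draw many subsamples and keep the mean of each one.
sample_size = 50
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(2_000)
]

mu = statistics.fmean(sample_means)
sd = statistics.stdev(sample_means)

# For a normal distribution, about 68% of values lie within one std of the mean.
share_within_one_sd = sum(abs(m - mu) <= sd for m in sample_means) / len(sample_means)

print("Population mean:         ", round(statistics.fmean(population), 3))
print("Mean of sample means:    ", round(mu, 3))
print("Std of sample means:     ", round(sd, 3))
print("Population std / sqrt(n):", round(statistics.pstdev(population) / sample_size ** 0.5, 3))
print("Share within one std:    ", round(share_within_one_sd, 3))
```

Plotting a histogram of `sample_means` next to a histogram of `population` makes the contrast visible: the population is strongly skewed, while the sample means form a bell shape.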
31. Trinity College Dublin, The University of Dublin 31
Correlation coefficient
If y tends to be above its mean when x is above its mean, and below its mean when x is below its mean, then the correlation is positive. If y tends to be below its mean when x is above its mean, the correlation is negative.
Example: the higher you go on a mountain, the colder it usually gets (negative correlation between altitude and temperature).
Don’t be afraid of names or formulas. For example, correlation, covariance, and cosine distance sound like very different things, but they are closely related.
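The link between these quantities can be seen by computing them by hand. The sketch below uses made-up altitude and temperature readings; the helper names (`centre`, `dot`) are ours. It relies on the standard definitions: the Pearson correlation is the covariance divided by both standard deviations, and it equals the cosine similarity of the mean-centred vectors (cosine distance is 1 minus cosine similarity).

```python
import math

# Hypothetical measurements on a mountain: altitude (m) and temperature (deg C).
altitude = [0, 500, 1000, 1500, 2000, 2500]
temperature = [15.0, 12.1, 8.4, 5.9, 2.2, -1.0]

def centre(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

x_c, y_c = centre(altitude), centre(temperature)
n = len(altitude)

# Sample covariance: average product of the deviations from the means (n - 1 denominator).
covariance = dot(x_c, y_c) / (n - 1)

# Pearson correlation: covariance divided by both standard deviations.
std_x = math.sqrt(dot(x_c, x_c) / (n - 1))
std_y = math.sqrt(dot(y_c, y_c) / (n - 1))
correlation = covariance / (std_x * std_y)

# Cosine similarity of the centred vectors gives the same number.
cosine_of_centred = dot(x_c, y_c) / (math.sqrt(dot(x_c, x_c)) * math.sqrt(dot(y_c, y_c)))

print("Covariance: ", round(covariance, 2))
print("Correlation:", round(correlation, 4))
print("Cosine similarity of centred vectors:", round(cosine_of_centred, 4))
```

The covariance is negative and depends on the units (metres times degrees). The correlation rescales it to lie between -1 and 1.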
32. Trinity College Dublin, The University of Dublin 32
Correlation coefficient
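Interesting relationships, but no (linear) correlation: the correlation coefficient can be close to zero even when y clearly depends on x.
(Figure: scatter plots of y against x)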
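A tiny constructed example of such a relationship: y is fully determined by x (y = x squared), yet the correlation coefficient is zero because the relationship is not linear. The data points are chosen for illustration only.

```python
import math

# x is symmetric around zero; y depends on x exactly, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Pearson correlation computed from deviations around the means.
numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
denominator = math.sqrt(
    sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
)
correlation = numerator / denominator

print("Correlation between x and x^2:", correlation)
```

The positive and negative halves of x cancel out in the numerator, so the coefficient alone would suggest that x and y are unrelated. A scatter plot shows the parabola immediately.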
In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed.