Introduction to Machine Learning (5 ECTS)
Giovanni Di Liberto
Asst. Prof. in Intelligent Systems, SCSS
Room G.15, O’Reilly Institute
© Trinity College Dublin
Trinity College Dublin, The University of Dublin
Overview of the previous lecture
• Programming languages (High-level vs. low-level, compiled vs. interpreted)
• Timeline of programming languages
• Why Python
• Setting up Anaconda
• Basic Python instructions
Overview of this lecture
• More about Python coding (functions, libraries, IDE, Jupyter notebook)
• Basic plots in Python (Excel style)
• Looking into a real dataset
• The importance of looking at the data in ML
• More complicated plots
• Descriptive statistics
Introduction to Machine Learning, 2022-2023
Week | Lecture | Lab | Additional hour and deadlines
1 | Overview | Introduction to programming and Python programming tutorial |
2 | Descriptive stats vs. ML. Data visualisation. Discussion board task explanation | Python programming and basic visualisation tutorial |
3 | Supervised learning. Simple classifiers | Data visualisation and simple classification tutorial |
4 | Classification, part 1 | Supervised learning: model quality evaluation tutorial | Data preparation and visualisation test (1h), immediately after the tutorial
5 | Classification, part 2: algorithms. Homework 1 explanation | Classification tutorial |
6 | Classification, part 3, and Regression | Regression and time-series tutorial |
7 | Reading week | - |
8 | Regression. Homework 2 explanation | Unsupervised learning tutorial | Homework 1 deadline
9 | Regression and Unsupervised learning | Homework hour with Q&A |
10 | Recap and feature selection | Anomaly detection tutorial |
11 | Data sharing, storage, and privacy | Homework hour with Q&A | Homework 2 deadline
12 | Guest lecture | Discussion |
Written test (2h)
End-to-end ML pipeline
[Diagram: Problem/question, Data collection, Preprocessing/cleaning, Analysing, Interpretation/outcome, with the labels "ML" and "Improve"; "Visualisation" is marked at three points along the pipeline]
Running the pipeline without looking at the data and without understanding it is a big mistake.
- Running ML algorithms without understanding the problem or the data is dangerous and not very useful.
- ML is a tool, not magic! We are the users. We need to understand:
1. How the ML algorithm we use works (let’s not use ML as a black box)
2. What the problem/question is
3. What the challenges/issues with the dataset are (otherwise, the results may be misleading)
City Bikes in Smart Cities
Oliveira F. et al., Survey of Technologies and Recent Developments for Sustainable Smart Cycling. Sustainability 2021
IoT (Internet of Things): a network of physical objects (e.g., sensors)
- Predict the number of bike-shares at a particular station, so that the company knows when it needs to move the bikes and can minimise cost
Looking at the data
- First, look into what exactly the dataset is
- Look at the specifics of the data structure: numerical vs. categorical attributes
- Look at a portion of the data. Does everything match what the specification states?
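A minimal sketch of these first inspection steps, assuming the data is loaded into a pandas DataFrame (pandas is an assumption here; the slides do not name it). The tiny table and its column names (`station`, `hour`, `weather`, `rentals`) are invented for illustration and are not the dataset used in the lecture.

```python
import pandas as pd

# Hypothetical bike-share records; a real dataset will have other columns.
df = pd.DataFrame(
    {
        "station": ["A", "A", "B", "B", "C"],
        "hour": [8, 9, 8, 9, 8],
        "weather": ["rain", "clear", "clear", "rain", "clear"],
        "rentals": [12, 30, 7, 3, 18],
    }
)

# 1. What exactly is in the dataset? Size and column names.
print(df.shape)
print(list(df.columns))

# 2. Specifics of the structure: numerical vs. categorical attributes.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numerical)
print("categorical:", categorical)

# 3. Look at a portion of the data and compare it with the description.
print(df.head(3))
print(df["weather"].value_counts())  # which categories actually occur
print(df.isna().sum())               # missing values per column
```

Note that a column stored as a number (such as `hour`) may still be better treated as categorical, which is exactly the kind of question this inspection step should raise.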
Visualising/Inspecting the data
Maybe the company moved the bikes?
Data visualisation in Python
Many libraries exist, and there is more than one approach.
Typical library:
- Matplotlib (with Seaborn, which builds on it, for better graphics)
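As a small illustration of the Matplotlib workflow, the sketch below draws a basic line plot. The hourly counts are invented numbers. The `Agg` backend is selected only so that the script also runs without a display; in a Jupyter notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Invented example data: rentals at one station over a morning.
hours = [6, 7, 8, 9, 10, 11, 12]
rentals = [2, 9, 25, 31, 14, 10, 12]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(hours, rentals, marker="o", label="Station A")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Number of rentals")
ax.set_title("Rentals per hour (made-up data)")
ax.legend()
fig.tight_layout()
# In an interactive session, plt.show() displays the figure.
```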
Visualising/Inspecting the data
https://colab.research.google.com/notebooks/charts.ipynb
Create figures with subplots
Useful for clustering analysis
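A possible sketch of a figure with subplots, using two invented groups of 2-D points as a stand-in for clusters. The data, group names, and feature names are assumptions made for illustration only.

```python
import random
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

random.seed(0)
# Two made-up groups of 2-D points, standing in for clusters.
group_a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
group_b = [(random.gauss(4, 1), random.gauss(3, 1)) for _ in range(50)]

# One figure, two panels side by side.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: scatter plot of both groups.
for points, name in [(group_a, "group A"), (group_b, "group B")]:
    xs, ys = zip(*points)
    axes[0].scatter(xs, ys, s=10, label=name)
axes[0].set_xlabel("feature 1")
axes[0].set_ylabel("feature 2")
axes[0].legend()

# Right panel: histogram of feature 1 across all points.
all_x = [p[0] for p in group_a + group_b]
axes[1].hist(all_x, bins=20)
axes[1].set_xlabel("feature 1")
axes[1].set_ylabel("count")

fig.tight_layout()
```

Placing the scatter plot next to the histogram makes it easier to see whether the two groups separate along a single feature or only in the 2-D view.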
Visualising/Inspecting the data
[Figures: all districts in California, from “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, Aurélien Géron, 2019]
Using preprocessed data?
“Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, Aurélien Géron, 2019
Salary may not be just a number: it changes over time. If we have the monthly salary for the last 5 years, one approach is to calculate the mean or median monthly income. This is a form of preprocessing.
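The salary example can be sketched as follows. The monthly figures are invented, and only 12 months are used instead of 5 years to keep the example short. It also hints at why the exact preprocessing matters: one unusual month moves the mean far more than the median.

```python
from statistics import mean, median

# Invented monthly salaries (EUR) for one person; one month includes a bonus.
monthly_salary = [3000, 3000, 3100, 3100, 3100, 9000,
                  3200, 3200, 3200, 3300, 3300, 3300]

# Preprocessing: collapse the time series into a single number.
mean_salary = mean(monthly_salary)
median_salary = median(monthly_salary)

print(f"mean:   {mean_salary:.2f}")
print(f"median: {median_salary:.2f}")
```

A dataset that only reports "income" does not say which of these two summaries was used, so the two could differ noticeably for the same person.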
It’s fine to use preprocessed data, but check it: understand the specifics, i.e. exactly how the preprocessing was done. Otherwise your results may be misleading!
Descriptive vs. Inferential Statistics
A descriptive statistic: a quantitative summary of selected features of a dataset (e.g., mean, median, std).
Descriptive statistics: the process of using and analysing those statistics.
Inferential statistics (or statistical inference): inferring properties of a population from a dataset sampled from that larger population.
[Diagram: descriptive stats apply to the dataset; inferential stats go from the dataset to the population]
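A short sketch of descriptive statistics for one numerical attribute, using Python's standard `statistics` module. The heights are invented values.

```python
import statistics

# Invented dataset: heights (cm) of ten people.
heights = [158, 162, 165, 165, 168, 170, 172, 175, 180, 185]

summary = {
    "n": len(heights),
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "std": statistics.stdev(heights),  # sample standard deviation (divides by n - 1)
    "min": min(heights),
    "max": max(heights),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

These numbers only describe the ten values in the dataset. Saying something about the heights of a whole population from them would be a step into inferential statistics; `stdev` (rather than `pstdev`) is the usual choice when the dataset is treated as a sample from a larger population.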
Descriptive Statistics
The Normal distribution
Source: Wikipedia
https://www.youtube.com/watch?v=mtbJbDwqWLE
CLT (Central Limit Theorem) real-world example
https://www.youtube.com/watch?v=N7wW1dlmMaE
[Figure: bell curve annotated with its mean and standard deviation; example variable: height]
The Normal distribution = bell curve = Gaussian distribution (named after Carl Friedrich Gauss)
Mean = median = mode
Symmetrical, with tails that approach zero asymptotically
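These properties can be checked numerically with a small simulation. The mean of 170 and standard deviation of 10 are assumed values for a height-like variable, not figures from the lecture.

```python
import random
import statistics

random.seed(1)
mu, sigma = 170.0, 10.0  # assumed mean and standard deviation (cm)
samples = [random.gauss(mu, sigma) for _ in range(50_000)]

sample_mean = statistics.fmean(samples)
sample_median = statistics.median(samples)
sample_std = statistics.pstdev(samples)

# Fraction of samples within one standard deviation of the mean.
within_1sd = sum(abs(x - mu) <= sigma for x in samples) / len(samples)

print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median:.2f}")
print(f"std    = {sample_std:.2f}")
print(f"within 1 std of the mean: {within_1sd:.3f}")  # theory: about 0.683
```

The mean and median land close to each other because the distribution is symmetrical, and roughly 68% of the values fall within one standard deviation of the mean.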
Why is the Normal distribution so important?
Central Limit Theorem
https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1
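A core theorem for statistics and statistical inference
[Diagram: a population from which several subsamples are drawn]
Take the mean of each subsample. The distribution of these sample means is called the sampling distribution. The CLT tells us that, in many situations and for sufficiently large independent samples, it is approximately normal even if the population itself is not normally distributed.
Sampling, e.g., in elections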
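A simulation sketch of the CLT. The population is assumed to be exponential (strongly skewed, clearly not normal); the sample size and number of subsamples are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)

def draw_sample(n):
    """Draw n values from an exponential population with mean 1 and std 1."""
    return [random.expovariate(1.0) for _ in range(n)]

sample_size = 50   # observations per subsample
n_samples = 5_000  # number of subsamples

# Sampling distribution: one mean per subsample.
sample_means = [statistics.fmean(draw_sample(sample_size)) for _ in range(n_samples)]

centre = statistics.fmean(sample_means)
spread = statistics.pstdev(sample_means)

# CLT prediction: centre close to the population mean (1.0),
# spread close to population std / sqrt(sample_size) = 1 / sqrt(50), about 0.141.
print(f"mean of sample means: {centre:.3f}")
print(f"std of sample means:  {spread:.3f}")

# For a normal distribution, about 68% of values lie within one std of the mean.
within = sum(abs(m - centre) <= spread for m in sample_means) / n_samples
print(f"fraction within 1 std: {within:.3f}")
```

Although single draws from the population are heavily skewed, the sample means cluster symmetrically around the population mean, which is what makes normal-based reasoning about sample means possible.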
Correlation coefficient
If y tends to be above its mean when x is above its mean, and below its mean when x is below, then the correlation is positive. If y tends to be below its mean when x is above its mean, the correlation is negative.
Example: the higher you go on a mountain, the colder it usually gets (negative correlation between altitude and temperature).
Do not be afraid of names. Don’t be afraid of formulas.
For example, correlation, covariance, and cosine distance sound like very different things, but they are closely related in many ways.
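The sketch below makes the relationship concrete. The altitude and temperature values are invented. It uses cosine similarity rather than cosine distance (distance = 1 - similarity), and shows that the Pearson correlation is the covariance rescaled by the two standard deviations, which equals the cosine similarity of the mean-centred vectors.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    # Correlation = covariance rescaled by both standard deviations.
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Invented measurements: altitude (m) and temperature (deg C) on a mountain.
altitude = [0, 500, 1000, 1500, 2000, 2500]
temperature = [15.0, 12.1, 8.4, 5.9, 1.8, -1.2]

r = pearson(altitude, temperature)

# Cosine similarity of the mean-centred vectors equals the Pearson correlation.
alt_centred = [a - mean(altitude) for a in altitude]
temp_centred = [t - mean(temperature) for t in temperature]
r_cos = cosine_similarity(alt_centred, temp_centred)

print(f"covariance  = {covariance(altitude, temperature):.1f}")
print(f"correlation = {r:.3f}")
print(f"cosine similarity (centred) = {r_cos:.3f}")
```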
Interesting relationships between x and y, but no correlation: the correlation coefficient only captures linear association.
Correlation is NOT causation!

Editor's notes

  1. generalise
  2. In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed.