Machine Learning Deep Dive
This course content is being actively
developed by Delta Analytics, a 501(c)3
Bay Area nonprofit that aims to
empower communities to leverage their
data for good.
Please reach out with any questions or
feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Delta Analytics builds technical capacity
around the world.
Module 2:
Machine learning
building blocks.
Course overview:
Now let’s turn to the data we will be using...
✓ Module 1: Introduction to Machine Learning
✓ Module 2: Machine Learning Deep Dive
❏ Module 3: Model Selection and Evaluation
❏ Module 4: Linear Regression
❏ Module 5: Decision Trees
❏ Module 6: Ensemble Algorithms
❏ Module 7: Unsupervised Learning Algorithms
❏ Module 8: Natural Language Processing Part 1
❏ Module 9: Natural Language Processing Part 2
❏ Model Development
❏ Defining the machine learning task
❏ Measuring performance of your model
❏ Supervised vs. unsupervised learning methods
❏ Model Validation
Module Checklist
Modeling Phase
Now that we have our research question, we are able to
start modeling!
Research
Question
Exploratory
Analysis
Modeling
Phase
Performance
Data
Cleaning
Learning
Methodology
Task
Research question from Module 1:
● How does loan amount requested vary by town?
Now that we have our research question, we are able to
start modeling!
Research
Question
Exploratory
Analysis
Modeling
Phase
Performance
Data
Cleaning
We are here!
Learning
Methodology
Task
We will discuss model
performance in the
next module
In this module we introduce the
first two steps of modeling:
● Defining the machine learning task
● Understanding how the machine
learns
Machine learning allows us to tackle
tasks for which it is too difficult to
hand-code every possible approach on our own.
By allowing machines to learn from
experience, we avoid the need for
humans to specify all the knowledge
a computer needs.
Let’s start at the basics. Why do we
want to build a model?
Modeling
Phase
Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning
Based on our
experience of the
world, we have an
understanding of
relationships between
features
Machine learning models quantify and learn the
patterns we observe in data.
Computers acquire
human intuition and
quantify it, by extracting
patterns from raw data
Human Intuition Machine Learning
Model
Task
What is the problem we want our
model to solve?
Performance
Measure
Quantitative measure we use to
evaluate the model’s performance.
Learning
Methodology
ML algorithms can be supervised or
unsupervised. This determines the
learning methodology.
All models have 3 key components: a task, a
learning methodology and a performance measure
Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning
Modeling
Phase
Task
What is the problem we want our
model to solve?
Defining f(x)
What function will map our x (input)
as close as possible to the true Y
(output)?
Feature
engineering
& selection
What is x? How do we decide what
explanatory features to include in
our model?
Today we are looking closer at each component of the
framework:
What assumptions does our model
make about the data? Do we have to
transform the data?
Is our f(x)
correct for
this problem?
Learning
Methodology
Is our model supervised or
unsupervised; how does that affect
the learning process?
What is our
loss function?
Every supervised model has a loss
function it wants to minimize.
Optimization
process
How does the model minimize the
loss function?
Learning methodology: how does the model learn the
function that best maps x to the true Y?
How does
our ML
model learn?
Overview of how the model
teaches itself.
Performance
Quantitative measure we use to
evaluate the model’s performance.
Measures of
performance
R2, Adjusted
R2, MSE
Feature
Performance
Statistical
significance,
p-values
Overfitting,
underfitting,
bias, variance
Ability to
generalize to
unseen data
Performance: How do we evaluate how useful the model
is, and how we can improve it?
Task and learning
methodology
Supervised
learning
Unsupervised
learning
Hold on! Important disclosure:
ML
Algorithm
Labelled Data
(we have Y in our
data)
No Labelled Data
(we don’t have Y in
our data)
For the next few slides, we will be introducing the
intuition behind machine learning models using
supervised learning examples. Later in this module,
we will explore how unsupervised learning is
different.
1. Task
Task
What is the problem we want our
model to solve?
Research
Question
Recap: How does loan amount
requested on Kiva vary by town in
Kenya?
We have KIVA data about
loan amount requested by
borrowers all over Kenya.
We want to know how the
loan amount requested
varies by town.
Building a model involves turning your
research question into a machine learning
question.
Task
Research question
Machine learning
task
How does loan
amount requested on
Kiva vary by town?
??
Firstly, let’s establish a common vocabulary
to talk about the data.
Source: Subset of Kenyan loans dataset.
Features
Observations
Task
Location.town is an
example of a
feature. Every
column in our data
set is a feature!
Every row of our
dataset is an
observation. When we
include the
observation in our
model it is part of
our training set.
A machine learning task has explanatory
features and an outcome feature.
Explanatory feature: Town borrower lives in
Outcome feature: Loan amount requested
Task
An outcome feature is the feature we expect to change when the explanatory
features are manipulated. In this example, we expect the loan amount to change
when we change the location.
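To make this concrete in code, here is a minimal sketch (in Python with pandas) of separating explanatory features from the outcome feature. The file and column names (kiva_loans.csv, location_town, loan_amount) are hypothetical placeholders for illustration, not the actual Kiva schema.

```python
import pandas as pd

# Hypothetical file and column names -- substitute the real dataset schema.
loans = pd.read_csv("kiva_loans.csv")

# Explanatory feature(s): the columns we think explain the outcome.
X = loans[["location_town"]]

# Outcome feature: the value we expect to change when the explanatory
# features are manipulated.
y = loans["loan_amount"]

print(X.head())
print(y.head())
```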
What would the outcome features and explanatory
features be in the research questions below?
Try identifying some:
● What will the price of a stock be tomorrow?
● Does this patient have malaria?
● Would this person buy a car?
Solutions:
Explanatory features → Outcome feature
Price of a stock market index today → Company X’s stock price tomorrow
Age, symptoms, travel history → Whether or not a patient has malaria
Income, location → Whether or not a person would buy a car
The outcome feature might be continuous
(e.g. $12), which makes this a regression task,
or categorical (e.g. Yes or No), which makes it a
classification task. We’ll talk about this more later!
Let’s define our explanatory and
outcome features for this task
Problem:
I am the mayor of a 30,000
person town and need to justify
spending budget on mosquito
nets.
I want evidence on how the
number of mosquito nets
affects the number of cases of
malaria. Can you help?
Task
Let’s start by identifying the
research question!
The research question is what
we want to find out from the
data, formally stated.
1.Research
Question
2.Task
How does the number of
cases of malaria change
when the number of
mosquito nets changes?
Next let’s define our task!
Task
Define explanatory
and outcome
feature
Define f(x)
Bring it all
together
1 2 3
explanatory feature(s): x = Number of mosquito nets
outcome feature: Y = Number of people with malaria

Year: x (nets), Y (malaria cases)
2007: 1,000 nets, 80 cases
2008: 2,200 nets, 40 cases
2009: 6,600 nets, 42 cases
2010: 12,600 nets, 35 cases

We also call our explanatory features x, and our outcome feature Y.
Looks like as mosquito nets increase, the number of malaria cases decreases.
How does the number of cases of malaria
change when the number of mosquito nets
changes?
Define explanatory
and outcome
feature
What would you conclude from looking at this
data? How many nets would you recommend?
Task
x = Number of mosquito nets: 2007: 1,000; 2008: 2,200; 2009: 6,600; 2010: 12,600
Y = Number of people with malaria: 2007: 80; 2008: 40; 2009: 42; 2010: 35
You came to a conclusion by recognizing a pattern in the data. This is similar
to how a machine learning algorithm would approach the same problem.
Machine learning allows us to learn
from historical patterns.
If Mr. Mayor had no machine learning methods to
use, he could find an answer by trying a different #
of nets year after year.
But this has an obvious human cost, and it would
be very hard to update the model to account for,
for example, new residents to his town.
Machine learning algorithms help answer
questions without this human cost - we are
learning from data, or in other words, learning
from history!
Task
“Over four years,
increasing the number of
mosquito nets decreased
the number of malaria
cases.”
● Humans form rules based upon observation and pattern
recognition.
● ML model takes input x and maps it to the output Y.
An increase in x
(mosquito nets) causes a
decrease in Y (malaria
cases).
Human Intuition Machine Learning Model
explanatory feature(s): x = Number of mosquito nets
model: f(x) + e
model predicted outcome: Y* = Predicted number of people with malaria

Our model f(x) is a function that maps
our input x to a predicted Y*:
f(x) + e = Y*, where e is the irreducible error.
Define f(x)
The goal of f(x) is to predict a Y* as
close to the true Y as possible.
e is irreducible error. This captures error caused by factors like measurement error,
randomness in the data, and inappropriate model choice. No matter how well you
optimize your model, this will never be reduced to 0.
My job is to make
the predictions as
useful as possible!
Our function f(x) maps an input x to
a predicted Y, which we refer to as
Y*. We want to choose an f(x) that
will map x as close to the true Y as
possible.
Define f(x)
True Y
We want Y* to be close to true Y
because we want the function to
output useful predictions.
Y*=f(x)+e
In this example,
predicted Y appears
close to the true Y.
We will talk about
how to quantify this
in the next section.
Define f(x)
True Y
We want Y* to be close to true Y because
we want the function to output useful
predictions.
Y*=f(x)+e
In this example,
predicted Y appears
far from the true Y.
This is probably not
very useful. We will
talk about how to
quantify this in the
next section.
Define f(x)
What is f(x)? It depends on the machine
learning algorithm we choose.
Define f(x)
x f(x) Y*
explanatory
feature(s), like
number of
mosquito nets
predicted outcome,
e.g. Number of
people with malaria
Examples of f(x):

Supervised learning algorithms (when you have labelled data):
- Linear regression
- Decision tree
- Random forest
- ...

Unsupervised learning algorithms (when you don’t have labelled data):
- K-means clustering
- Hierarchical clustering
- ...
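As a rough sketch of what “choosing an f(x)” looks like in code, assuming scikit-learn is available, the algorithms listed above map to estimators like these:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans, AgglomerativeClustering

# Supervised candidates for f(x): fit with both X and a labelled Y,
# i.e. model.fit(X, y).
supervised_models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(),
}

# Unsupervised candidates: fit with X alone (no labelled Y),
# i.e. model.fit(X).
unsupervised_models = {
    "k_means": KMeans(n_clusters=3),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
}
```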
Let’s bring everything together.
Bring it all
together
Research question: How does the number of cases of malaria
change when the number of mosquito nets changes?
Machine learning task: x (Number of mosquito nets) → f(x) → Y* (Predicted number of people with malaria)
Regression
Classification
Continuous variable
Categorical variable
A classification problem is when
we are trying to predict whether
something belongs to a category,
such as “red” or “blue” or
“disease” and “no disease”.
A regression problem is when we
are trying to predict a numerical
value, such as “dollars” or
“weight”.
Source: Andrew Ng, Stanford CS229 Machine Learning Course
The task function depends upon the data type you
want to predict. Supervised learning problems fall into
two main categories: regression & classification.
Supervised
learning task
The Task
2. Learning Methodology
Learning
Methodology
ML algorithms can be supervised or
unsupervised. This determines the
learning methodology.
Learning
Methodology
Is our model supervised or
unsupervised? How does that affect
the learning process?
What is our
loss function?
Every supervised model has a
loss function it wants to
minimize.
Optimization
process
How does the model minimize
the loss function?
Learning methodology: how does the model learn the
function that best maps x to the true Y?
How does
our ML
model learn?
Overview of how the model
teaches itself.
Recall that machine learning is a subset of data science
that allows machines to learn from raw data.
Traditional software programming involves giving machines instructions which they perform.
Machine learning involves allowing machines to learn from raw data so that the computer
program can change when exposed to new data (learning from experience).
Source: https://www.youtube.com/watch?v=IpGxLWOIZy4
Machine learning
Machine learning is a subset of
data science that allows machines
to learn from raw data.
Learning
Methodology
What do we mean when we say a machine
“learns from experience”?
How does the
model learn from
raw data?
How the algorithm learns depends upon
the type of data you have.
START:
Research question
Do I have data that may answer my question?
  - N: Gather or find more data
  - Y: Do you have labelled data?
      - N: Unsupervised learning methods
      - Y: Supervised learning methods
You are here!
Learning
Methodology
What does labelled data mean?
Do you have labelled data?
The outcome feature (Y)
is not recorded in the
data. You do not have a
labelled Y.
The outcome feature (Y)
you are interested in
predicting is recorded in
the data. If you have a
labelled Y, you can use
supervised learning
methods.
Learning
Methodology
Yes (labelled): Y = Number of people with malaria is recorded: 2007: 80, 2008: 40, 2009: 42, 2010: 35
No (unlabelled): Y = Number of people with malaria is missing: 2007: ?, 2008: ?, 2009: ?, 2010: ?
Whether or not you have labelled data
determines whether it is a Supervised or
Unsupervised Learning problem
Unsupervised Learning
● For every x, there is no Y
● Goal is not to predict, but to
investigate x
Supervised Learning
● For every x, there is a Y
● Goal is to predict Y using x
Learning
Methodology
Do you have labelled data?
Yes
No
f(x) + e = Y*
Y is in your data
Y is not in your
data
Most problems you will initially
encounter are supervised
learning problems.
How do supervised algorithms learn?
Intuitive explanation for how supervised
algorithms learn:
supervised learning
Number of people
with malaria
Y
2007: 80
2008: 40
2009: 42
2010: 35
Imagine you are a teacher and you ask
your students a question.
The labels Y provide the correct answer
for the problem the students are trying to
solve. Since you know the correct answer,
you can reward good student
performance and punish poor
performance. This encourages ongoing
learning!
Extending this example, you (the
researcher) are the teacher and the Model
is the student.
supervised learning
Model
I want to get the correct
answer for predicting Y
and be the best student in
the class.
Great Mr. Model!
Once you give me
your answer I will
let you know the
correct answer.
Every time Mr. Model predicts Y*, you compare
Y* to the true Y to see how well he did.
Our Model starts trying to provide an
estimated Y* by guessing.
supervised learning
Model
Unsurprisingly, the results are terrible: the predicted numbers
are very different from the actual numbers. To quantify how good
or bad the results are, we use Y-Y*.
I have never seen this
problem before! I’ll
just start by randomly
guessing an answer
and see what happens.
Predicted (Y*) vs. actual (Y) number of people with malaria:
2007: Y* = 1, Y = 80
2008: Y* = 2,000, Y = 40
2009: Y* = 300, Y = 42
2010: Y* = 40, Y = 35
Which model is more useful at
mapping x close to true Y?
Which prediction was worse, a) or b)?
[Plots a) and b): true Y vs. the prediction Y* = f(x) + e]
Learning
Methodology
What is our
loss function?
We can immediately tell that
b is better!
Learning
Methodology
What is our
loss function?
We can see that the f(x) in b maps x to a Y* much closer to the
true Y. A loss function allows us to quantify this difference.
[Plots a) and b): true Y vs. the prediction Y* = f(x) + e]
A loss function quantifies how
unhappy you would be if you used
f(x) to predict Y* when the
correct output is Y. It is what we
want to minimize.
Another way to think about it is
that a loss function quantifies
how well our f(x) fits our data.
The Task The Loss
Function
A model’s goal is to minimize the loss function.
Source: Stanford ML Lecture 1
We have already seen one simple example: Y-Y*, or the difference between the predicted Y and the
actual Y. Later, we will see more sophisticated loss functions.
supervised learning
We first decide on how to measure how unhappy we are with
these results. We call this our loss function. On the next slide,
we show a few different possible loss functions we can use to
assess Mr. Model.
Since I know the right
answer, I can compare
predicted Y* to the
true Y to help guide
Mr. Model.
Predicted (Y*) vs. actual (Y) number of people with malaria:
2007: Y* = 1, Y = 80
2008: Y* = 2,000, Y = 40
2009: Y* = 300, Y = 42
2010: Y* = 40, Y = 35
Regression
Classification
Continuous variable
Categorical variable
A classification problem is when we
are trying to predict whether
something belongs to a category,
such as “red” or “blue” or
“disease” and “no disease”.
A regression problem is when we
are trying to predict a numerical
value given some input, such as
“dollars” or “weight”.
Source: Andrew Ng, Stanford CS229 Machine Learning Course
Recall that there are two different types of tasks:
Supervised
learning task
The Task
Regression
Classification
Continuous variable
Categorical variable
Source: Andrew Ng, Stanford CS229 Machine Learning Course
The choice of loss function depends upon
the type of task. We will discuss loss
functions for both types of task.
Learning
Methodology
What is our
loss function?
log loss
absolute error
(L1)
least squares error
(L2)
hinge loss
Including mean squared
error (MSE), root mean
squared error (RMSE)
Learning
Methodology
What is our
loss function?
absolute error
(L1)
least squares error
(L2)
mean squared error
(MSE)
There are a few different loss
functions we could choose from,
depending on the problem we are
trying to solve.
log loss
hinge loss
root mean squared
error (RMSE)
Regression Classification
Continuous variable Categorical variable
Regression loss functions
1. L1 norm (mean absolute error)
2. L2 norm (least squares error)
- Mean squared error
Our outcome feature is continuous:
the number of people who have
malaria.
Learning
Methodology
What is our
loss function?
Outcome feature in
data is continuous
Number of people
with malariaY
2007: 80
2008: 40
2009: 42
2010: 35
Regression
task
L1 or L2 loss
function
L1 and L2 loss are two possible options
for assessing how unhappy we are with
Mr. Model’s choice of f(x).
Learning
Methodology
What is our
loss
function?
absolute error
(L1)
least squares error
(L2)
mean squared error
Also called L1 loss, this
minimizes the sum of
absolute errors between
True Y and predicted Y*.
Also called L2 loss, this
minimizes the square of
the error between True Y
and predicted Y*.
Takes the mean of the L2
loss over all observations.
Source: L1 and L2
What is our
loss function?
How bad were Mr. Model’s initial results?
Let’s compute the L1 norm.
Learning
Methodology
How well did Mr.
Model's random
guess perform?
Source: Intro to Stat - Introduction to Linear Regression
Predicted (Y*) vs. actual (Y) number of people with malaria:
2007: Y* = 1, Y = 80
2008: Y* = 2,000, Y = 40
2009: Y* = 300, Y = 42
2010: Y* = 40, Y = 35
absolute error (L1)
(|1-80|+|2000-40|+|300-42|+|40-35|)
= 2,302
What is our
loss function?
How bad were Mr. Model’s initial
results? Let’s compute the L2 norm.
Learning
Methodology
How did Mr. Model’s
initial random guess
do?
Source: Intro to Stat -Introduction to Linear Regression
Predicted (Y*) vs. actual (Y) number of people with malaria:
2007: Y* = 1, Y = 80
2008: Y* = 2,000, Y = 40
2009: Y* = 300, Y = 42
2010: Y* = 40, Y = 35
Least squares error
(L2)
(80-1)^2+(40-2000)^2+(42-300)^2
+(35-40)^2=3,914,430
We can normalize our L2 loss by
computing mean squared error or root
mean squared error.
Learning
Methodology
What is our
loss
function?
least squares error
(L2)
mean squared error
Also called L2 loss,
minimizes the square of
the error between True Y
and predicted Y*.
Takes the mean of the L2
loss over all observations.
Source: L1 and L2,
root mean squared
error
Takes the square root of
the mean of the L2 loss.
MSE = mean(S); RMSE = sqrt(mean(S))
What is our
loss function?
Mean squared error takes the
average L2 error per
observation.
Learning
Methodology
How did Mr. Model’s
initial random guess
do?
Source: Intro to Stat -Introduction to Linear Regression
Predicted (Y*) vs. actual (Y) number of people with malaria:
2007: Y* = 1, Y = 80
2008: Y* = 2,000, Y = 40
2009: Y* = 300, Y = 42
2010: Y* = 40, Y = 35
Mean squared error
((80-1)^2+(40-2000)^2+(42-300)^2
+(35-40)^2)/4
= 978,607.5
MSE = mean(S)
What is our
loss function?
Root mean squared error takes
the square root of the average
L2 error per observation.
Learning
Methodology
Source: Intro to Stat -Introduction to Linear Regression
Predicted (Y*) vs. actual (Y) number of people with malaria:
2007: Y* = 1, Y = 80
2008: Y* = 2,000, Y = 40
2009: Y* = 300, Y = 42
2010: Y* = 40, Y = 35
Root mean squared
error
(((80-1)^2+(40-2000)^2+(42-
300)^2+(35-40)^2)/4)^(½)
= 989.25
How did the model’s
initial random guess
do?
RMSE = sqrt(mean(S))
We can compute, for each loss function, how
unhappy we are with the model’s initial random guess.
Learning
Methodology
absolute error (L1): Also called L1 loss, minimizes the sum of absolute errors between True Y and predicted Y*. Here: 2,302
least squares error (L2): Also called L2 loss, minimizes the square of the error between True Y and predicted Y*. Here: 3,914,430
mean squared error (MSE = mean(S)): Takes the average L2 loss per observation in the data. Here: 978,608
root mean squared error (RMSE = sqrt(mean(S))): Takes the square root of the average L2 loss per observation in the data. Here: 989
Source: L1 and L2
Don’t worry about these numbers. What’s important is you understand how we are transforming them step by step.
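The four numbers above can be reproduced in a few lines; below is a sketch using NumPy with the example Y and Y* values from the previous slides.

```python
import numpy as np

y_true = np.array([80, 40, 42, 35])      # actual number of people with malaria (Y)
y_pred = np.array([1, 2000, 300, 40])    # Mr. Model's random first guess (Y*)

l1 = np.sum(np.abs(y_pred - y_true))     # absolute error (L1): 2,302
l2 = np.sum((y_pred - y_true) ** 2)      # least squares error (L2): 3,914,430
mse = np.mean((y_pred - y_true) ** 2)    # mean squared error: 978,607.5
rmse = np.sqrt(mse)                      # root mean squared error: ~989.25

print(l1, l2, mse, rmse)
```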
This job isn’t done
until I reduce
RMSE.
Source: Intro to Stat -Introduction to Linear Regression
There are five steps to RMSE:
Y-Y* For every observation in our
dataset, measure the difference
between true Y and predicted Y.
^2 Square each Y-Y* so that every
difference is positive and negative
errors don’t cancel out positive
ones when we sum.
Sum Sum across all observations so we
get the total error.
mean Divide the sum by the number of
observations we have.
root Take the square root of the mean
calculated above.
Model
RMSE is the square root of the average L2 loss per
observation.
MSE
Learning
Methodology
Which loss function should we use?
1. L1 norm (mean absolute error)
2. L2 norm (least squares error)
Each loss function has important pros
and cons.
Learning
Methodology
What is our
loss
function?
absolute error (L1)
least squares error
(L2)
Source: L1 and L2
Robust? / Stable solution? / How many solutions?
L1: Robust / Not stable / Multiple possible solutions
L2: Not very robust / Stable / One possible solution
vs.
MSE and RMSE are both normalized versions of L2
error. If we decide to use L2, we will choose MSE or
RMSE.
If we decide to use least squares
error (L2), we may decide to report
RMSE OR MSE
Learning
Methodology
What is our
loss
function?
MSE vs. RMSE
The key difference between RMSE and MSE is that taking the root in
RMSE returns the error to the same units as the outcome feature.
This makes the error term more interpretable.
Both MSE and RMSE amplify and severely penalize large
errors more than small ones by squaring the error.
Classification loss functions
1. Log loss
2. Hinge loss
Let’s define a slightly different task so
we can discuss hinge and log loss.
Define explanatory
and outcome
feature
X = Temperature of patient; Y = Does the patient have malaria?
39.5°C: No
37.8°C: Yes
37.2°C: Yes
37.2°C: No
Task: We want to predict whether or not a patient has malaria
using their temperature.
This is a classification task, so we can
use either log loss or hinge loss.
But first, what is a classification task?
Our outcome feature is categorical:
we want to predict whether or not
someone has malaria. This is a binary
classification problem.
Learning
Methodology
What is our
loss function?
Classification tasks output the probability of
belonging to a class. Normally, based upon a
threshold of 50% we then assign the predicted
class.
Classify based upon a probability threshold:
Outcome feature (Y): Does the patient have malaria? No, Yes, Yes, No
Predicted probability: What is the probability that the patient has malaria? 0.55, 0.80, 0.85, 0.2
Predicted outcome (Y*): Does the model predict that the patient has malaria? Yes, Yes, Yes, No
We can evaluate accuracy by looking just at the
predicted outcome vs. the actual outcome. Here,
accuracy is 75%!
Classify based upon a probability threshold:
Outcome feature (Y): Does the patient have malaria? No, Yes, Yes, No
Predicted probability: What is the probability that the patient has malaria? 0.55, 0.80, 0.85, 0.2
Predicted outcome (Y*): Does the model predict that the patient has malaria? Yes, Yes, Yes, No
However, we are missing out on using probability, which is important
information about how certain the model is about its prediction. Let’s look at
some loss functions that utilize this metric.
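A minimal sketch (using NumPy) of applying a 50% threshold to the predicted probabilities above and computing the 75% accuracy:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])            # Does the patient have malaria? (No, Yes, Yes, No)
probs = np.array([0.55, 0.80, 0.85, 0.2])  # model's predicted probability of malaria

y_pred = (probs >= 0.5).astype(int)        # classify with a 50% threshold -> Yes, Yes, Yes, No
accuracy = np.mean(y_pred == y_true)       # 3 of 4 correct = 0.75

print(y_pred, accuracy)
```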
For every prediction the model makes, we can
measure the logarithmic loss. What is logarithmic
loss?
Source: https://www.r-bloggers.com/making-sense-of-logarithmic-loss/
Log loss
I’m only 60%
sure that the
answer is
Yes…
Model
Log loss
evaluates the
probability of
belonging in a
class
- The smaller the log loss, the smaller the
uncertainty, the better the model
- A perfect classifier would have log loss = 0
- Log loss heavily penalizes classifiers that are
confident about an incorrect classification
- Ways to improve log loss:
- Are there problematic errors in dataset?
- Do we want to smooth the probabilities?
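A sketch of computing log loss for the four example predictions above, both by hand and with scikit-learn’s log_loss (assuming scikit-learn is installed); the probabilities are the illustrative values from the previous slides, not real model output.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])            # actual classes (No, Yes, Yes, No)
probs = np.array([0.55, 0.80, 0.85, 0.2])  # predicted probability of class 1 (malaria)

# Log loss by hand: the average of -log(probability assigned to the true class).
manual = -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))

print(manual, log_loss(y_true, probs))     # the two values should match
```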
For every prediction our model makes, we can also
measure the hinge loss. What is hinge loss?
Hinge loss
Hinge loss is the logical extension of the regression loss function, absolute loss.
Absolute loss: Y - Y*, where Y and Y* are numbers.
Hinge loss: max(0, 1 - (Y*)(Y)), where Y can equal -1 (no) or 1 (yes) for each class.
For each observation, if Y* == Y (both are 1 or both are -1), hinge loss = 0. If Y =/= Y*, hinge loss increases.
The cumulative hinge loss is therefore an upper bound on the number of mistakes made by the classifier.
Sources: https://en.wikipedia.org/wiki/Hinge_loss;
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html
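A small sketch of the hinge loss formula above for the four example patients, with classes coded as -1/1; the predicted labels are illustrative only.

```python
import numpy as np

y_true = np.array([-1, 1, 1, -1])   # actual classes coded as -1 (no) / 1 (yes)
y_pred = np.array([1, 1, 1, -1])    # predicted classes, also coded -1 / 1

# Hinge loss per observation: max(0, 1 - Y* * Y).
# 0 when prediction and truth agree; 2 when hard -1/1 predictions disagree.
per_obs = np.maximum(0, 1 - y_pred * y_true)
total = per_obs.sum()

# Total is 2 here, an upper bound on the single mistake the classifier made.
print(per_obs, total)
```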
How do we choose a loss function for a
classification problem?
Learning
Methodology
What is our
loss function?
Log Loss vs. Hinge Loss
Leads to more exact
probabilities, but at the
cost of accuracy
Leads to better accuracy,
but at the cost of exact
probabilities
Depends on the question you want to answer!
E.g. For a problem where we are trying to assess patient health, we know
that false positives (the model predicts you do have malaria, but you actually
don’t) are safer and generally preferable to false negatives (the
model predicts you don’t have malaria, but you actually do).
Therefore it is probably safer to evaluate our output as a probability of
whether or not you have malaria. We will use log loss.
How do we choose a loss function for a
classification problem?
Learning
Methodology
What is our
loss function?
We provide only a broad conceptual overview of log loss
and hinge loss as we will not be using them in our coding
lab. However, we encourage you to explore them further.
More resources can be found at the end of this module.
A note on categorical loss functions ...
Learning
Methodology
What is our
loss function?
Our model made an initial guess,
which we evaluated using RMSE.
How does he improve on his initial
guess?
What is our
loss function?
Our initial RMSE is very high. Our model
tries a different f(x) and compares
RMSE.
Learning
Methodology
Source: Intro to Stat -Introduction to Linear Regression
If the new f(x) reduces the loss, our
model keeps changing the f(x) in
that direction. After every change,
the model measures whether the
loss has increased, decreased or
stayed the same.
Oh no! That wasn’t
very good, let’s try
something else!
1st initial guess: # nets = 300, RMSE = 1,000
2nd update: # nets = 100, RMSE = 1,300
3rd update: # nets = 400, RMSE = 800
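A toy sketch of this “keep the change only if the loss goes down” idea, assuming a deliberately simple one-parameter f(x) = w·x; real learning algorithms update parameters far more efficiently, so this is illustration only, not how OLS actually optimizes.

```python
import numpy as np

x = np.array([1000, 2200, 6600, 12600], dtype=float)   # mosquito nets
y_true = np.array([80, 40, 42, 35], dtype=float)       # people with malaria

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Toy f(x): predict malaria cases as w * x, with a single parameter w.
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1)                        # random initial guess
best_loss = rmse(y_true, w * x)

for _ in range(10000):
    candidate = w + rng.normal(scale=0.001)   # try a small random change to w
    loss = rmse(y_true, candidate * x)
    if loss < best_loss:                      # keep the change only if the loss goes down
        w, best_loss = candidate, loss

print(w, best_loss)
```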
As the model updates the prediction at
each step, we see that the model is
learning - it changes direction in
response to a higher RMSE.
Learning
Methodology
How does
our ML
model learn?
The process of changing f(x) to reduce
the loss function is called learning. It is
what makes ordinary least squares (OLS)
regression a machine learning algorithm.
For every f(x) we choose there is
an associated loss.
The learning process involves
updating f(x) in order to reach
the global minimum loss.
Our
model’s
initial
random
f(x)
gives us
a high
initial
error
Our model starts with a random f(x) and
updates f(x) to make our loss as small as
possible.
Learning
Methodology
Our model’s job is to change
the parameters so that
every time he changes f(x),
the loss goes down.
Our model succeeds when he reduces the
error to its minimum.
How does
our ML
model learn?
When does our model stop?
Learning
Methodology
Our model’s job is to change
the parameters so that
every time he changes f(x),
the loss goes down.
Our model succeeds when he reduces the
error to its minimum.
How does
our ML
model learn?
What if we don’t have labelled
data?
Learning
Methodology
How does
our ML
model learn?
Do you have labelled
data?
The outcome feature (Y)
is not recorded in the
data. You do not have a
labelled Y.
The outcome feature (Y)
you are interested in
predicting is recorded in
the data. If you have a
labelled Y, you can use
supervised learning
methods.
Yes (labelled): Y = Number of people with malaria is recorded: 2007: 80, 2008: 40, 2009: 42, 2010: 35
No (unlabelled): Y = Number of people with malaria is missing: 2007: ?, 2008: ?, 2009: ?, 2010: ?
When does our model stop if we
don’t have labelled data?
Learning
Methodology
How does
our ML
model learn?
Our model can update the
parameters only if we have
labelled data.
What happens if we don’t actually
have Y in our data?
We turn to unsupervised learning
techniques
Unsupervised Algorithms
● For every x, there is no Y.
● We do not know the correct answers so we
cannot act as a teacher.
● Instead, we try to get an understanding of
the distribution of x to draw inference about
Y.
Source: Machine Learning Mastery - Supervised and Unsupervised Learning
Learning
Methodology
Unsupervised learning does not have labelled data. However,
it is the most promising current area of research in
machine learning. Unlocking unsupervised learning will
fundamentally change our world.
To reach true machine intelligence, ML needs to
get better at unsupervised learning.
Humans learn mostly through unsupervised
learning: we absorb vast amounts of data from
our surroundings without needing a label.
Learning
Methodology
Why is unsupervised learning important?
Unsupervised
learning is the
cake itself
Supervised
learning is the
icing on the
cake
Yann LeCun, a deep learning researcher, made the analogy that
if intelligence were a cake, unsupervised learning would be the
cake and supervised learning would be the icing on the cake.
We know how to make the icing, but we don’t know how to
make the cake. Unsupervised learning is the holy grail of
machine learning.
Learning
Methodology
Examples of unsupervised learning: This clustering
algorithm predicts a user’s friends based upon their
activity on social networks.
Source: Quora - What is unsupervised learning with an example.
We don’t have any labelled data that tells us any
node is friends with another node (i.e., x is friends
with Y).
Instead, we can use user interactions to provide
the labels. The assumption is that if you are
interacting heavily with someone they are more
likely to be your friend.
Learning
Methodology
The clustering algorithm uses email patterns to predict
the hierarchy of a business organization.
Source: Automated Social Hierarchy Detection Through Email Network Analysis - Academic Paper
This algorithm takes into account not only
the people involved in an interaction but
also the directionality of the interaction.
Based on email
interactions, who is
the boss?
In the next class, we will
introduce model
selection and evaluation.
Let’s quickly recap what
we have learnt in this
module.
Task
Performance
Measure
What is the task and the performance measure
for our malaria net example?
Task
Accurately predict the decrease in
malaria cases for a given increase in
mosquito nets
Performance
Measure
Mean squared error
What is the task and the performance measure
for our malaria net example?
Task
Performance
Measure
What is the task and the performance measure
for our malaria patient example?
Task
Accurately predict whether or not a
given patient is likely to have
malaria, given his or her
temperature.
Performance
Measure
Log Loss
What is the task and the performance measure
for our malaria patient example?
You are on fire! Go straight to the
next module here.
Need to slow down and digest? Take a
minute to write us an email about
what you thought about the course. All
feedback small or large welcome!
Email: sara@deltanalytics.org
Find out more about
Delta’s machine
learning for good
mission here.
Additional resources
Rosasco, Lorenzo, et al. "Are loss functions all the same?." Neural Computation 16.5
(2004): 1063-1076. http://web.mit.edu/lrosasco/www/publications/loss.pdf
“Loss Functions for Regression and Classification”, David Rosenberg, NYU:
https://davidrosenberg.github.io/ml2015/docs/3a.loss-functions.pdf
  • 16. Supervised learning Unsupervised learning Hold on! Important disclosure: ML Algorithm Labelled Data (we have Y in our data) No Labelled Data (we don’t have Y in our data) For the next few slides, we will be introducing the intuition behind machine learning models using supervised learning examples. Later in this module, we will explore how unsupervised learning is different.
  • 17. 1. Task Task What is the problem we want our model to solve?
  • 18. Research Question Recap: How does loan amount requested on Kiva vary by town in Kenya? We have KIVA data about loan amount requested by borrowers all over Kenya. We want to know how the loan amount requested varies by town.
  • 19. Building a model involves turning your research question into a machine learning question. Task Research question Machine learning task How does loan amount requested on Kiva vary by town? ??
  • 20. Firstly, let’s establish a common vocabulary to talk about the data. Source: Subset of Kenyan loans dataset. Features Observations Task Location.town is an example of a feature. Every column in our data set is a feature! Every row of our dataset is an observation. When we include the observation in our model it is part of our training set.
  • 21. A machine learning task has explanatory features and an outcome feature. Outcome featureExplanatory features Town borrower lives in Loan amount requested Task An outcome feature is the feature we expect to change when the explanatory features are manipulated. In this example, we expect the loan amount to change when we change the location.
  • 22. What would the outcome features and explanatory features be in the research questions below? Try identifying some: ● What will the price of a stock be tomorrow? ● Does this patient have malaria? ● Would this person buy a car?
  • 23. Solutions: Explanatory features Outcome feature Price of a stock market index today Company X’s stock price tomorrow Age, symptoms, travel history Whether or not a patient has malaria Income, location Whether or not a person would buy a car The outcome feature might be regression (e.g. $12) or a classification (e.g. Yes or No). We’ll talk about this more later!
  • 24. Let’s define our explanatory and outcome features for this task Problem: I am the mayor of a 30,000 person town and need to justify spending budget on mosquito nets. I want evidence on how the number of mosquito nets affects the number of cases of malaria. Can you help? Task
  • 25. Let’s start by identifying the research question! The research question is what we want to find out from the data, formally stated. 1.Research Question 2.Task How does the number of cases of malaria change when the number of mosquito nets changes?
  • 26. Next let’s define our task! Task Define explanatory and outcome feature Define f(x) Bring it all together 1 2 3
  • 27. explanatory feature(s) outcome feature x Number of mosquito nets Number of people with malariaY 2007: 80 2008: 40 2009: 42 2010: 35 We also call our explanatory features x, and our outcome feature Y. Looks like as mosquito nets increase, the number of malaria cases decreases. 2007: 1000 2008: 2200 2009: 6600 2010: 12600 How does the number of cases of malaria change when the number of mosquito nets changes? Define explanatory and outcome feature
  • 28. What would you conclude from looking at this data? How many nets would you recommend? Task x Number of mosquito nets Number of people with malariaY 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1000 2008: 2200 2009: 6600 2010: 12600 You came to a conclusion by recognizing a pattern in the data. This is similar to how a machine learning algorithm would approach the same problem.
  • 29. Machine learning allows us to learn from historical patterns. If Mr. Mayor had no machine learning methods to use, he could find an answer by trying a different # of nets year after year. But this has an obvious human cost, and it would be very hard to update the model to account for, for example, new residents to his town. Machine learning algorithms help answer questions without this human cost - we are learning from data, or in other words, learning from history! Task
  • 30. “Over four years, increasing number of mosquito nets decrease the number of malaria cases.” ● Humans form rules based upon observation and pattern recognition. ● ML model takes input x and maps it to the output Y. An increase in x (mosquito nets) causes a decrease in Y (malaria cases). Human Intuition Machine Learning Model
  • 31. explanatory feature(s) model predicted outcome x Number of mosquito nets Predicted Number of people with malariaY*f(x) + e Our model f(x) is a function that maps our input x to a predicted Y*. irreducible error Define f(x)
  • 32. Predicted Number of people with malariaf(x) + e = Y* The goal of f(x) is to predict a Y* as close to the true Y as possible. e is irreducible error. This captures error caused by factors like measurement error, randomness in the data, and inappropriate model choice. No matter how well you optimize your model, this will never be reduced to 0. My job is to make the predictions as useful as possible! Our function f(x) maps an input x to a predicted Y, which we refer to as Y*. We want to choose an f(x) that will map x as close to the true Y as possible. Define f(x)
  • 33. True Y We want Y* to be close to true Y because we want the function to output useful predictions. Y*=f(x)+e In this example, predicted Y appears close to the true Y. We will talk about how to quantify this in the next section. Define f(x)
  • 34. True Y We want Y* to be close to true Y because we want the function to output useful predictions. Y*=f(x)+e In this example, predicted Y appears far from the true. This is probably not very useful. We will talk about how to quantify this in the next section. Define f(x)
  • 35. What is f(x)? It depends on the machine learning algorithm we choose. Define f(x) x f(x) Y* explanatory feature(s), like number of mosquito nets predicted outcome, e.g. Number of people with malaria Supervised learning algorithms: - Linear regression - Decision tree - Random forest - ... Unsupervised learning algorithms: - K-means clustering - Hierarchical clustering - ... Examples of f(x): When you have labelled data When you don’t have labelled data
  • 36. Research question Machine learning task x Number of mosquito nets Number of people with malariaY* Let’s bring everything together. f(x) Bring it all together How does the number of cases of malaria change when the number of mosquito nets changes?
  • 37. Regression Classification Continuous variable Categorical variable A classification problem is when we are trying to predict whether something belongs to a category, such as “red” or “blue” or “disease” and “no disease”. A regression problem is when we are trying to predict a numerical value, such as “dollars” or “weight”. Source: Andrew Ng, Stanford CS229 Machine Learning Course The task function depends upon the data type you want to predict. Supervised learning problems fall into two main categories: regression & classification. Supervised learning task The Task
  • 38. 2. Learning Methodology Learning Methodology ML algorithms can be supervised or unsupervised. This determines the learning methodology.
  • 39. Learning Methodology Is our model supervised or unsupervised? How does that affect the learning processing? What is our loss function? Every supervised model has a loss function it wants to minimize. Optimization process How does the model minimize the loss function? Learning methodology: how does the model learn the function that best maps x to the true Y? How does our ML model learn? Overview of how the model teaches itself.
  • 40. Recall that machine learning is a subset of data science that allows machines to learn from raw data. Traditional software programing involves giving machines instructions which they perform. Machine learning involves allowing machines to learn from raw data so that the computer program can change when exposed to new data (learning from experience). Source: https://www.youtube.com/watch?v=IpGxLWOIZy4 Machine learning
  • 41. Machine learning is a subset of data science that allows machines to learn from raw data. Learning Methodology What do we mean when we say a machine “learns from experience”? How does the model learn from raw data?
  • 42. How the algorithm learns depends upon type of data you have. START: Research question Do I have data that may answer my question? Gather or find more data Do you have labelled data? Unsupervised learning methods Supervised learning methods You are here! YN YN Learning Methodology
  • 43. What does labelled data mean? Do you have labelled data? The outcome feature (Y) is not recorded in the data. You do not have a labelled Y. The outcome feature (Y) you are interested in predicting is recorded in the data. If you have a labelled Y, you can use supervised learning methods. Learning Methodology Y=Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 Y=Number of people with malaria 2007: 2008: 2009: 2010: No Yes
  • 44. Whether or not you have labelled data determines whether it is a Supervised or Unsupervised Learning problem Unsupervised Learning ● For every x, there is no Y ● Goal is not to predict, but to investigate x Supervised Learning ● For every x, there is a Y ● Goal is to predict Y using x Learning Methodology Do you have labelled data? Yes No f(x) + e = Y* Y is in your data Y is not in your data
  • 45. Most problems you will initially encounter are supervised algorithms. How do supervised algorithms learn?
  • 46. Intuitive explanation for how supervised algorithms learn: supervised learning Number of people with malaria Y 2007: 80 2008: 40 2009: 42 2010: 35 Imagine you are a teacher and you ask your students a question. The labels Y provide the correct answer for the problem the students are trying to solve. Since you know the correct answer, you can reward good student performance and punish poor performance. This encourages ongoing learning!
  • 47. Extending this example, you (the researcher) are the teacher and the Model is the student. supervised learning Model I want to get the correct answer for predicting Y and be the best student in the class. Great Mr. Model! Once you give me your answer I will let you know the correct answer. Every time Mr. Model predicts Y*, you compare Y* to the true Y to see how well he did.
  • 48. Our Model starts trying to provide an estimated Y* by guessing. supervised learning Model Unsurprisingly, the results appear terrible, judging from the fact that actual numbers are very different from the predicted numbers. To quantify how bad or good results are, we use Y-Y*. I have never seen this problem before! I’ll just start by randomly guessing an answer and see what happens. Predicted Number of people with malaria Y* Y Actual Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1 2008: 2000 2009: 300 2010: 40
  • 49. True Y Which model is more useful at mapping x close to true Y? Y*=f(x)+e True Y Y*=f(x)+e Learning Methodology What is our loss function? Which prediction was worse, a) or b)? a) b)
  • 50. We can immediately tell that b is better! Learning Methodology What is our loss function? We can see that the f(x) in b maps x to a Y* much closer to the true Y. A loss function allows us to quantify this difference. True Y Y*=f(x)+e True Y Y*=f(x)+e a) b)
  • 51. A loss function quantifies how unhappy you would be if you used f(x) to predict Y* when the correct output is y. It is what we want to minimize. Another way to think about it is that a loss function quantifies how well our f(x) fits our data. The Task The Loss Function A model’s goal is to minimize the loss function. Source: Stanford ML Lecture 1 We have already seen one simple example: Y-Y*, or the difference between the predicted Y and the actual Y. Later, we will see more sophisticated loss functions.
  • 52. supervised learning We first decide on how to measure how unhappy we are with these results. We call this our loss function. On the next slide, we show a few different possible loss functions we can use to assess Mr. Model. Since I know the right answer, I can compare predicted Y* to the true Y to help guide Mr. Model. Predicted Number of people with malaria Y* Y Actual Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1 2008: 2000 2009: 300 2010: 40
  • 53. Regression Classification Continuous variable Categorical variable A classification problem is when are trying to predict whether something belongs to a category, such as “red” or “blue” or “disease” and “no disease”. A regression problem is when we are trying to predict a numerical value given some input, such as “dollars” or “weight”. Source: Andrew Ng, Stanford CS229 Machine Learning Course Recall that there are two different types of tasks: Supervised learning task The Task
  • 54. Regression Classification Continuous variable Categorical variable Source: Andrew Ng, Stanford CS229 Machine Learning Course The choice of loss function depends upon the type of task. We will discuss loss functions for both types of task. Learning Methodology What is our loss function? log loss absolute error (L1) least squares error (L2) hinge loss Including mean squared error (MSE), root mean squared error (RMSE)
  • 55. Learning Methodology What is our loss function? absolute error (L1) least squares error (L2) mean squared error (MSE) There are a few different loss functions we could choose from, depending on the problem we are trying to solve. log loss hinge loss root mean squared error (RMSE) Regression Classification Continuous variable Categorical variable
  • 56. Regression loss functions 1. L1 norm (mean absolute error) 2. L2 norm (least squares error) - Mean squared error
  • 57. Our outcome feature is continuous: the number of people who have malaria. Learning Methodology What is our loss function? Outcome feature in data is continuous Number of people with malariaY 2007: 80 2008: 40 2009: 42 2010: 35 Regression task L1 or L2 loss function
  • 58. L1 and L2 loss are two possible options for assessing how unhappy we are with Mr. Model’s choice of f(x). Learning Methodology What is our loss function? absolute error (L1) least squares error (L2) mean squared error Also called L1 loss, this minimizes the sum of absolute errors between True Y and predicted Y*. Also called L2 loss, this minimizes the square of the error between True Y and predicted Y*. Takes the mean of the L2 loss over all observations. Source: L1 and L2
  • 59. What is our loss function? How bad were Mr. Model’s initial results? Let’s compute the L1 norm. Learning Methodology How well did Mr. Model's random guess perform? Source: Intro to Stat - Introduction to Linear Regression Predicted Number of people with malaria Y* Y Actual Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1 2008: 2000 2009: 300 2010: 40 absolute error (L1) (|1-80|+|2000-40|+|300-42|+|40-35|) = 2,302
  • 60. What is our loss function? How bad where Mr. Model’s initial results? Let’s compute the L2 norm. Learning Methodology How did Mr. Model’s initial random guess do? Source: Intro to Stat -Introduction to Linear Regression Predicted Number of people with malaria Y* Y Actual Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1 2008: 2000 2009: 300 2010: 40 Least squares error (L2) (80-1)^2+(40-2000)^2+(42-300)^2 +(35-40)^2=3,914,430
  • 61. We can normalize our L2 loss by computing mean squared error or root mean squared error. Learning Methodology What is our loss function? least squares error (L2) mean squared error Also called L2 loss, minimizes the square of the error between True Y and predicted Y*. Takes the mean of the L2 loss over all observations. Source: L1 and L2, root mean squared error Takes the square root of the mean of the L2 loss. RMSE = sqrt(mean(S))MSE = mean(S)
  • 62. What is our loss function? Mean squared error takes the average L2 error per observation. Learning Methodology How did Mr. Model’s initial random guess do? Source: Intro to Stat -Introduction to Linear Regression Predicted Number of people with malaria Y* Y Actual Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1 2008: 2000 2009: 300 2010: 40 Mean squared error ((80-1)^2+(40-2000)^2+(42-300)^2 +(35-40)^2)/4 = 978,607.5 MSE = mean(S)
  • 63. What is our loss function? Root mean squared error takes the square root of the average L2 error per observation. Learning Methodology Source: Intro to Stat -Introduction to Linear Regression Predicted Number of people with malaria Y* Y Actual Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 2007: 1 2008: 2000 2009: 300 2010: 40 Root mean squared error (((80-1)^2+(40-2000)^2+(42- 300)^2+(35-40)^2)/4)^(½) = 989.25 How did the models initial random guess do? RMSE = sqrt(mean(S))
  • 64. We can compute for each loss functions how unhappy we are with models initial random guess. Learning Methodology absolute error (L1) least squares error (L2) mean squared error Also called L1 loss, minimizes the sum of absolute errors between True Y and predicted Y*. Also called L2 loss, minimizes the square of the error between True Y and predicted Y*. Takes the average L2 loss per observation in the data. Source: L1 and L2 2,302 3,914,430 978,608 root mean squared error Takes the square root of the average L2 loss per observation in the data. 989 RMSE = sqrt(mean(S))MSE = mean(S) Don’t worry about these numbers. What’s important is you understand how we are transforming them step by step.
  • 65. This job isn’t done until I reduce RMSE. Source: Intro to Stat -Introduction to Linear Regression There are five steps to RMSE: Y-Y* For every observation in our dataset, measure the difference between true Y and predicted Y. ^2 Square each Y-Y* to get the absolute distance, so positive values don’t cancel out negative ones when we sum. Sum Sum across all observations so we get the total error. mean Divide the sum by the number of observations we have. root Take the square root of the mean calculated above. Model RMSE is the square root of the average L2 loss per observation. MSE Learning Methodology
  • 66. Which loss function should we use? 1. L1 norm (mean absolute error) 2. L2 norm (least squares error)
  • 67. Each loss function has important pros and cons. Learning Methodology What is our loss function? absolute error (L1) least squares error (L2) Source: L1 and L2 Robust? Stable Solution? How many solutions? L1 Robust Not stable Multiple possible solutions L2 Not very robust Stable One possible solution vs. MSE and RMSE are both normalized versions of L2 error. If we decide to use L2, we will choose MSE or RMSE.
  • 68. If we decide to use least squares error (L2), we may decide to report RMSE OR MSE Learning Methodology What is our loss function? MSE RMSEvs. The key difference between RMSE and MSE is that taking the root in RMSE normalizes the error to the same units of measurement. This makes the error term more interpretable. Both MSE and RMSE amplify and severely penalize large errors more than small ones by squaring the error.
  • 69. Classification loss functions 1. Log loss 2. Hinge loss
  • 70. X Y Temperature Let’s define a slightly different task so we can discuss hinge and log loss. Define explanatory and outcome feature Of patient 39.5°C 37.8°C 37.2°C 37.2°C Does the patient have malaria? No Yes Yes No Task: We want to predict whether or not a patient has malaria using their temperature.
  • 71. This is a classification task, so we can use either log loss or hinge loss. But first, what is a classification task? Our outcome feature is categorical: we want to predict whether or not someone has malaria. This is a binary classification problem. Learning Methodology What is our loss function?
  • 72. Classification tasks output the probability of belonging to a class. Normally, based upon a threshold of 50% we then assign the predicted class. Does the model predict that the patient has malaria? Yes Yes Yes No Classify based upon probability threshold outcome feature predicted probability predicted outcome Y Y*Does the patient have malaria? No Yes Yes No What is the probability that the patient has malaria? 0.55 0.80 0.85 0.2
  • 73. We can evaluate accuracy by looking just at the predicted outcome vs. the actual outcome. Here, accuracy is 75%! Does the model predict that the patient has malaria? Yes Yes Yes No Classify based upon probability threshold outcome feature predicted probability predicted outcome Y Y*Does the patient have malaria? No Yes Yes No What is the probability that the patient has malaria? 0.55 0.80 0.85 0.2 However, we are missing out on using probability, which is important information about how certain the model is about its prediction. Let’s look at some loss functions that utilize this metric.
  • 74. For every prediction the model Makes, we can measure the logarithmic loss. What is logarithmic loss? Source: https://www.r-bloggers.com/making-sense-of-logarithmic-loss/ Log loss I’m only 60% sure that the answer is Yes… Model Log loss evaluates the probability of belonging in a class - The smaller the log loss, the smaller the uncertainty, the better the model - A perfect classifier would have log loss = 0 - Log loss heavily penalizes classifiers that are confident about an incorrect classification - Ways to improve log loss: - Are there problematic errors in dataset? - Do we want to smooth the probabilities?
  • 75. For every prediction our model Makes, we can also measure the hinge loss. What is hinge loss?Hinge loss Hinge loss is the logical extension of the regression loss function, absolute loss. Absolute loss: Y- Y*, where Y and Y* are integers. Hinge loss: max(0,1-(Y*)(Y)) Where Y can equal -1 (no) or 1 (yes) for each class. For each observation, if Y* == Y (both are 1 or both are -1), hinge loss = 0. If Y =/= Y*, hinge loss increases. The cumulated hinge loss is therefore the upper bound of the number of mistakes made by the classifier. Sources: https://en.wikipedia.org/wiki/Hinge_loss; http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html
  • 76. How do we choose a loss function for a classification problem? Learning Methodology What is our loss function? Log Loss Hinge Lossvs. Leads to more exact probabilities, but at the cost of accuracy Leads to better accuracy, but at the cost of exact probabilities
  • 77. Depends on the question you want to answer! E.g. For a problem where we are trying to assess patient health, we know that false positives (the model predicts you do have malaria, but you actually don’t) are safer and generally more preferable than false negatives (the model predicts you don’t have malaria, but you actually do.) Therefore it is probably safer to evaluate our output as a probability of whether or not you have malaria. We will use log loss. How do we choose a loss function for a classification problem? Learning Methodology What is our loss function?
  • 78. We provide only a broad conceptual overview of log loss and hinge loss as we will not be using them in our coding lab. However, we encourage you to explore them further. More resources can be found at the end of this module. A note on categorical loss functions ...Learning Methodology What is our loss function?
  • 79. Our model computed an initial guess using RMSE. How does he improve on his initial guess?
  • 80. What is our loss function? Our initial RMSE is very high. Our model tries a different f(x) and compares RMSE. Learning Methodology Source: Intro to Stat -Introduction to Linear Regression If the new f(x) reduces the loss, our model keeps changing the f(x) in that direction. After every change, the model measures whether the loss has increased, decreased or stayed the same. Oh no! That wasn’t very good, let’s try something else! 1st initial guess 2nd update 3rd update RMSE 1,000 1,300 800 # nets 300 100 400 As the model updates the prediction at each step, we see that themodel is learning - it changes in response to a higher MSE.
  • 81. Learning Methodology How does our ML model learn? The process of changing f(x) to reduce the loss function is called learning. It is what makes ordinary least squares (OLS) regression a machine learning algorithm. For every f(x) we choose there is an associated loss. The learning process involves updating f(x) in order to reach the global minimum loss. Our model’s initial random f(x) gives us a high initial error
  • 82. Our model starts with a random f(x) and update f(x) to make our loss as small as possible. Learning Methodology Our model’s job is to change the parameters so that every time he changes f(x), the loss goes down. Our model succeeds when he reduces the error to its minimum. How does our ML model learn?
  • 83. When does our model stop?Learning Methodology Our model’s job is to change the parameters so that every time he changes f(x), the loss goes down. Our model succeeds when he reduces the error to its minimum. How does our ML model learn?
  • 84. What if we don’t have labelled data? Learning Methodology How does our ML model learn? Do you have labelled data? The outcome feature (Y) is not recorded in the data. You do not have a labelled Y. The outcome feature (Y) you are interested in predicting is recorded in the data. If you have a labelled Y, you can use supervised learning methods. Y=Number of people with malaria 2007: 80 2008: 40 2009: 42 2010: 35 Y=Number of people with malaria 2007: 2008: 2009: 2010: No Yes
  • 85. When does our model stop if we don’t have labelled data? Learning Methodology How does our ML model learn? Our model can update the parameters only if we have labelled data. What happens if we don’t actually have Y in our data? We turn to unsupervised learning techniques
  • 86. Unsupervised Algorithms ● For every x, there is no Y. ● We do not know the correct answers so we cannot act as a teacher. ● Instead, we try to get an understanding of the distribution of x to draw inference about Y. Source: Machine Learning Mastery - Supervised and Unsupervised Learning Learning Methodology Unsupervised learning does not have labelled data. However, it is the most promising current area of research in machine learning. Unlocking unsupervised learning will fundamentally change our world.
  • 87. To reach true machine intelligence, ML needs to get better at unsupervised learning. Humans learn mostly through unsupervised learning: we absorb vast amounts of data from our surroundings without needing a label. Learning Methodology Why is unsupervised learning important? Unsupervised learning is the cake itself Supervised learning is the icing on the cake Yan Lecun, a deep learning researcher, made the analogy that if intelligence was a cake, unsupervised learning would be the cake and supervised learning would be the icing on the cake. We know how to make the icing, but we don’t know how to make the cake. Unsupervised learning is the holy grail of machine learning.
  • 88. Learning Methodology Examples of unsupervised learning: This clustering algorithm predicts a user’s friends based upon their activity on social networks. Source: Quora - What is unsupervised learning with an example. We don’t have any labelled data that tells us any node is friends with another node (i.e., x is friends with Y). Instead, we can use user interactions to provide the labels. The assumption is that if you are interacting heavily with someone they are more likely to be your friend.
  • 89. Learning Methodology The clustering algorithm uses email patterns to predict the hierarchy of a business organization. Source: Automated Social Hierarchy Detection Through Email Network Analysis - Academic Paper This algorithm not only takes in account the people involved in an interaction but also the directionality of the interaction. Based on email interactions, who is the boss?
  • 90. In the next class, we will introduce model selection and evaluation.
  • 91. Let’s quickly recap what we have learnt in this module.
  • 92. Task Performance Measure What is the task and the performance measure for our malaria net example?
  • 93. Task Accurately predict the decrease in malaria cases for a given increase in mosquito nets Performance Measure Mean squared error What is the task and the performance measure for our malaria net example?
  • 94. Task Performance Measure What is the task and the performance measure for our malaria patient example?
  • 95. Task Accurately predict whether or not a given patient is likely to have malaria, given his or her temperature. Performance Measure Log Loss What is the task and the performance measure for our malaria patient example?
  • 96. You are on fire! Go straight to the next module here. Need to slow down and digest? Take a minute to write us an email about what you thought about the course. All feedback small or large welcome! Email: sara@deltanalytics.org
  • 97. Find out more about Delta’s machine learning for good mission here.
  • 99. Additional resources Rosasco, Lorenzo, et al. "Are loss functions all the same?." Neural Computation 16.5 (2004): 1063-1076. http://web.mit.edu/lrosasco/www/publications/loss.pdf “Loss Functions for Regression and Classification”, David Rosenberg, NYU: https://davidrosenberg.github.io/ml2015/docs/3a.loss-functions.pdf