Unsupervised
Learning
This course content is being actively
developed by Delta Analytics, a 501(c)3
Bay Area nonprofit that aims to
empower communities to leverage their
data for good.
Please reach out with any questions or
feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Delta Analytics builds technical capacity
around the world.
Module 7:
Unsupervised learning
Module Checklist:
❏ Overview of unsupervised learning
❏ Intuition
❏ Pros and cons
❏ Clustering algorithms
❏ K-means clustering
❏ Principal component analysis (PCA)
In the first half of this course, we learned a lot about
supervised learning algorithms. Now, we turn to
unsupervised learning algorithms: what are they, when are
they useful, and what role do they play in the future of
machine learning?
Recap: Supervised v.
Unsupervised Learning
Supervised Learning
● For every x, there is a y
● Goal is to predict y using x
● Most methods used in practice are
supervised.
Unsupervised Learning
● For every x, there is no y
● Goal is not prediction, but to
investigate x.
● Unsupervised methods read data
first and then suggest which
classification scheme(s) might
apply.
Learning
Methodology
[Diagram: Data Set → Algorithm → Automated Clusters]
Unsupervised learning
involves identifying
patterns.
This is perhaps the most basic
human task - even babies can
do it!
Unsupervised
learning
Many unsupervised learning algorithms
involve identifying patterns.
[Diagram: Data Set → Algorithm → Automated Clusters]
It turns out that what’s
extremely intuitive for
human babies to do is
pretty difficult to have
computers do well.
Unsupervised
learning
Many unsupervised learning algorithms
involve identifying patterns.
Task
There is no
f(x).
Instead of trying to choose the best
f(x) to map x to the true Y, we just
want to understand raw patterns in x.
How is this model framework different than what
we’ve already seen?
[Diagram: Data Set → Algorithm → Automated Clusters; unlike supervised learning, there is no f(X) mapping x to Y*]
Task
There is no
f(x).
Instead of trying to choose the best
f(x) to map x to the true Y, we just
want to understand raw patterns in x.
Learning
Methodology
There is no
loss function.
Performance
We do not learn from the data by
trying to reduce our loss to 0.
Difficult to
evaluate
performance
Evaluating the performance of
clustering methods is notoriously
difficult.
How is this model framework different than what
we’ve already seen?
Task
What is the problem we want our
model to solve?
What is the
unsupervised
approach we
plan to use?
For example, is hierarchical
clustering better than k-means for
this task?
Feature
engineering
& selection
How do we decide what features to
include in our model?
Unsupervised algorithms still have a task, and involve
feature engineering and selection.
Learning
Methodology
Clustering algorithms are
unsupervised. How does that affect
the learning process?
Optimization
process
How do unsupervised learning
models optimize if there is no loss
function?
How does
our ML
model learn?
Overview of how the model
teaches itself.
Unsupervised algorithms are still machine learning
algorithms, which means they actively learn from the
data.
Performance
How do we evaluate the
performance of an unsupervised
algorithm?
With supervised learning we had a clear goal: predict our outcome
variable with high accuracy. Evaluating the performance of clustering
methods is difficult. Often, we rely on holistic evaluations of our clustering
algorithm.
Unlike with supervised learning, we cannot check the performance of
our model by comparing loss functions. Clustering quality is subjective,
and largely dependent on the assumptions made at the outset.
Unsupervised algorithms are unfortunately very
difficult to evaluate for performance.
Pros and cons of using
an unsupervised
approach
When would I turn to
unsupervised learning?
● You have extremely high dimensional data (i.e., many features)
that you want to investigate
● You have a research question but no labelled outcome feature
○ This is true for many datasets
● You want to detect any relationships or patterns in your data
○ E.g. customer behavior data
● You don’t have time to dive deep into defining an outcome
○ Use unsupervised learning as first exploratory step
As the amount of data in the world grows, we will
increasingly turn to unsupervised learning methods.
Learning
Methodology
Unsupervised learning is an important tool,
often used as part of your exploratory
analysis of the data.
● Infer complex properties of the data
(such as sub groups)
● Discover interesting and informative
methods of visualization
● Often used in exploratory phases of
data analysis
Unsupervised
Learning
Research
Question
Exploratory
Analysis
Modeling
Phase
Performance
Data
Cleaning
You can often leverage an
unsupervised algorithm to aid
feature selection during
exploratory analysis, before using
a supervised algorithm during the
modeling phase.
Humans learn mostly through unsupervised
learning: we absorb vast amounts of data from
our surroundings without needing a label.
To reach true machine intelligence (i.e., a
machine that thinks and learns for itself), ML
needs to get better at unsupervised learning - it
should learn without us having to feed it labels
or explicit instructions.
Unsupervised
Learning
The future of machine learning is unsupervised learning.
Unsupervised
learning is the
cake itself
Supervised
learning is the
icing on the
cake
We will have only scratched the surface in this class.
With that said, let’s turn to our first unsupervised
algorithms: clustering algorithms.
Clustering Algorithms
1. K-means clustering
2. Principal component analysis
Central concept: identify similar sub-groups
within the data.
Where are we?
[Course map: Supervised Algorithms (linear regression, decision trees, ensemble algorithms) → Unsupervised Algorithms (k-means, principal component analysis)]
We have finished discussing supervised
algorithms and now we will introduce and
discuss unsupervised algorithms. This will
help lay the foundation for our discussion
of natural language processing.
Unsupervised
Algorithms
Let’s start with K-means clustering, an algorithm that splits
data into k clusters along features that you select.
K-means clustering
Principal
Component
Analysis
1. K-means clustering
K-means clustering: model cheat sheet
Assumptions:
● Assumes existence of underlying groupings
Cons:
● Time-consuming to find the optimal number of clusters
● Time-consuming feature engineering (features must be numeric and normalized)
Pros:
● Easy to represent physically
● Does not assume any underlying distribution (e.g. no normal distribution assumption like in linear regression)
● Produces intuitive groupings
● Can work in many dimensions
Clustering splits data in
order to find out how
observations are similar on a
number of different features.
Clustering
Task
Clustering is a powerful unsupervised algorithm that
detects naturally occurring patterns in the data.
[Scatter plot: Feature 1 vs. Feature 2, with the data grouped into Cluster 1, Cluster 2, Cluster 3, and Cluster 4]
We are not predicting a true Y.
The clusters are the model.
We decide the number of clusters, represented as K.
[Diagram: a dataset of customer features feeding into unknown groups (???)]
Clustering
Task
Defining
Groups
Imagine that you are the owner of an
online clothing store. You want to
segment your customers by their buying
habits.
Clustering
Task
Defining
Groups
NOTE: As the owner of the store, you think that some customers are
bargain-hunters while others are price-insensitive; some come in every week
while others drop in during the holidays.
But you don’t actually know definitively which ones are which.
You could take a supervised approach to this problem, and try to predict how
much a customer will spend, or how frequently a customer will drop in. But
here, we’re just trying to see what the data will tell us. This is what makes
clustering an unsupervised problem.
Imagine that you are the owner of an
online clothing store. You want to
segment your customers by their buying
habits.
Dataset of
customer
features
Clustering
Task
Since we have an online website for our store, we have
valuable data about how our consumers behave online.
customer_id | traffic_type (X1) | avg_amt_spent (X2) | number_of_visits (X3) | %_of_visits_during_sales (X4)
1237482 | organic | $92 | 5 | 20%
1213345 | Email_sale | $35 | 50 | 100%
2323764 | Email_new_collection | $200 | 20 | 10%
2326734 | organic | $40 | 1 | 100%
What features will be relevant to our
clustering task?
Clustering
Task
We select features that will determine how clusters
are formed in our algorithm.
customer_id | traffic_type (X1) | avg_amt_spent (X2) | number_of_visits (X3) | %_of_visits_during_sales (X4)
1237482 | organic | $92 | 5 | 20%
1213345 | Email_sale | $35 | 50 | 100%
2323764 | Email_new_collection | $200 | 20 | 10%
2326734 | organic | $40 | 1 | 100%
Since we want to segment customers by their buying habits, we
probably want to form clusters using the features “number of
visits” and “average amount spent.” Let’s start with these.
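In code, this feature selection step is just subsetting the table before fitting. Below is a minimal sketch using scikit-learn, assuming the customer table above has been loaded into a pandas DataFrame (the column names are illustrative, not from the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer table matching the slide above.
customers = pd.DataFrame({
    "customer_id": [1237482, 1213345, 2323764, 2326734],
    "avg_amt_spent": [92, 35, 200, 40],
    "number_of_visits": [5, 50, 20, 1],
})

# Feature selection: cluster only on the two buying-habit features,
# scaled so that dollar amounts and visit counts are comparable.
X = StandardScaler().fit_transform(
    customers[["number_of_visits", "avg_amt_spent"]])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
customers["cluster"] = kmeans.labels_
print(customers)
```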
Clustering
Task
Defining
Groups
Using number of visits and amount spent, we can see groups below:
[Diagram: customers grouped into High value Buyers, Middle value Buyers, Low value Buyers, and Casual Buyers]
We use avg_amt_spent and number_of_visits to cluster our customers.
Defining
groups
We can demonstrate clustering using
two features in two dimensional
space!
Task
[Scatter plot: avg_amt_spent on the x-axis, number of visits on the y-axis, with four groups labeled High value buyers, Middle value buyers, Low value buyers, and Casual buyers]
Once we cluster using the features we select, we can say something about the value of
our customers.
Defining
groups
Using two features, we can say something about our customers:
● High value buyers: not price-sensitive, frequent buyers
● Middle value buyers: price-sensitive, frequent buyers
● Casual buyers: not price-sensitive, infrequent buyers
● Low value buyers: price-sensitive, infrequent buyers
Task
Feature
Engineering
& Selection
Task
In this simple example, we used 2 features to cluster, and
produced 4 different clusters (K=4).
But we can include as many features as we want, and determine
how many clusters to produce. We will return to this point later,
but first, let’s find out how this algorithm learns.
Feature selection is just as
important in unsupervised as in
supervised algorithms.
How does our k-means algorithm
learn from the data to group similar
observations together?
Learning
Methodology
Clustering algorithms are
unsupervised. How does that affect
the learning process?
Optimization
process
How do unsupervised learning models
optimize if there is no loss function?
How does
our ML
model learn?
Overview of how the model
teaches itself.
Unsupervised algorithms are still machine learning
algorithms, which means they actively learn from the
data!
Learning
Methodology
How does
our ML
model learn?
We start with a simple scatter
plot of number of visits
against average amount spent.
How do we take this scatter plot
and segment observations into K
distinct groups?
For now, let’s say K = 3.
Source: Introduction to Statistical Learning with Applications in R.
[Scatter plot: average amount spent vs. number of visits]
Learning
Methodology
How does
our ML
model learn?
Step 1: Randomly assign each
customer to a cluster
● Data points have
randomly been assigned
into pink, yellow, and
green groupings
There are 3 colors
(groups) here. Why?
[Scatter plot: average amount spent vs. number of visits]
Learning
Methodology
How does
our ML
model learn?
Step 1: Randomly assign each
customer to a cluster
● Data points have
randomly been assigned
into pink, yellow, and
green groupings
There are 3 colors
(groups) here because
we decided K=3.
[Scatter plot: average amount spent vs. number of visits]
Why were our data points randomly assigned
to three groupings? We set K, or number of
groups, equal to 3
The number of clusters (k) is an example of a
hyperparameter.
Hyperparameters are set by the researcher (you!), not
determined by the model.
Learning
Methodology
Hyperparameters
Higher level settings of
a model that are fixed
before training begins.
In our example we
specified 3 clusters,
but we can tell the
model whatever “K”
we want.
Learning
Methodology
What should k be? A very large number of clusters will cause overfitting (for
example, if you set K equal to the number of observations, you will have a
cluster for every observation!). This is not very useful for making sense of
subsets of our data.
We can set our hyperparameter K to be
2, 3, 4, ..., n.
Performance
If we are underfitting, we may be missing
natural subsets of similar customers. If
we are overfitting, we may have too many
clusters, so our model does not generalize
well to unseen data.
[Diagram: a spectrum from underfit to overfit, with the “sweet spot” in between]
Recall overfitting vs. underfitting? If K is too
large, we can have overfitting; too small, and
we may be underfitting.
An elbow plot helps us avoid underfitting
by showing how model error decreases with
the number of clusters.
How do we know what the optimal
number of clusters should be?
An Elbow Plot visualizes how the error
decreases as K increases.
This plot is useful because it visualizes
the trade-off between overfitting and
underfitting. We need to find a
balance, called the “elbow point”.
Learning
Methodology
Optimization
Process
[Elbow plot: error on the y-axis decreasing as K increases, with the bend marked “elbow”]
An elbow plot helps us avoid underfitting
by showing how model error decreases with
the number of clusters.
If K == n (number of observations), then
distance = 0 (each observation its own
cluster)! This, as we know, is a case of
overfitting.
In the graph to the right, we should
choose K=6. This appears to be the
number of clusters where the biggest
gains in reducing error have already
been made.
Learning
Methodology
Optimization
Process
[Elbow plot: error flattens out after K=6, the “elbow”]
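A sketch of how you might build such an elbow plot with scikit-learn, assuming X is an array of scaled features (as in the earlier sketch). KMeans exposes the total within-cluster sum of squared distances as inertia_, which serves as the error on the y-axis:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of K and record the within-cluster error.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow plot: pick K where the curve bends")
plt.show()
```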
To start with, let’s set k=3.
This means we want to find 3
clusters.
Learning
Methodology
How does
our ML
model learn?
We suspect these 3 clusters will correspond to:
1. Budget customers that shop with us
frequently
2. High end customers that shop with us
frequently
3. Customers that shop with us infrequently
However, it is possible and in fact likely that
there are naturally more than 3 subsets of
customers.
[Scatter plot: average amount spent vs. number of visits]
Learning
Methodology
How does
our ML
model learn?
Is our initial random allocation
of observations to clusters
useful?
Remember our first step was to
randomly assign each customer
to a cluster.
Clearly, our initial random
assignment to clusters is poor.
How poor? And how do we
improve?
[Scatter plot: average amount spent vs. number of visits]
Learning
Methodology
To evaluate how poor our clusters are, we
can use the distance between an
observation and its cluster’s centroid.
A centroid is the center point of a cluster.
Centroids
How does the
model learn?
● The “distance” here is the Euclidean distance (or spatial distance), where the
distance between two vectors u and v with n elements is:
d(u, v) = sqrt[ (u1 - v1)^2 + (u2 - v2)^2 + ... + (un - vn)^2 ]
● In this example, the distance between customer c6 (7, 2) and the
cluster center (4, 7) would be sqrt[ (7-4)^2 + (2-7)^2 ] ≈ 5.8.
Important note: Clustering does not take categorical features as inputs,
only continuous. Distance between categorical points would not be
meaningful.
Learning
Methodology
How does the
model learn?
Review! How do we calculate
distance between points?
Euclidean
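To make the arithmetic concrete, here is the same calculation in NumPy, using the c6 example from the slide above:

```python
import numpy as np

# Customer c6 and its cluster centroid, as (number_of_visits, avg_amt_spent).
c6 = np.array([7, 2])
centroid = np.array([4, 7])

# Euclidean distance: square root of the sum of squared differences.
print(np.sqrt(np.sum((c6 - centroid) ** 2)))  # 5.830..., i.e. sqrt(34)
print(np.linalg.norm(c6 - centroid))          # same result via NumPy's norm
```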
Learning
Methodology
We calculate a centroid for each
cluster. Our goal is to minimize the
distance from centroid to any
observation.
● The centroids are calculated as the
center of their randomly assigned
group.
● Here, centroids are really close
together, and distance to the outer
points is very large -- we can
definitely do better!
Goal is to minimize distance
from centroid to any of the
observations in its cluster
[Scatter plot: average amount spent vs. number of visits, with centroids marked]
How does
the model
learn?
Learning
Methodology
Step 2b: Re-assign each
observation to the centroid it is
nearest
● Now we re-assign each
observation to the cluster
group of the nearest
centroid. Now our clusters
are starting to look better!
Of the three centroids, this observation is
closest to the pink, so we assign it to the
pink cluster
[Scatter plot: average amount spent vs. number of visits, with observations re-assigned to the nearest centroid]
How does the
model learn?
Learning
Methodology
Iteration 2 Step 2a: Repeat
centroid calculations!
● Here we go again! Our
recalculated centroids are
now further apart, and we
can see the distance
between observations and
centroids shrinking
Overall distances are much smaller --
we’re getting closer!
[Scatter plot: average amount spent vs. number of visits, with recalculated centroids]
How does the
model learn?
Learning
Methodology
Optimization
process
Iteration 2 Step 2b: Re-assign
clusters based on nearest centroid
● This time around, fewer
observations changed
clusters.
● We keep repeating the
centroid calculation / cluster
re-assignments until
nothing moves anymore -
the model is complete!
[Scatter plot: average amount spent vs. number of visits, with final cluster assignments]
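The whole loop just described (random assignment, centroid recalculation, re-assignment, repeat until nothing moves) fits in a few lines. A simplified sketch, assuming X is an (n, 2) NumPy array of scaled features and that no cluster ever ends up empty; scikit-learn's KMeans handles such details, plus smarter initialization, for you:

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of k clusters.
    labels = rng.integers(k, size=len(X))
    while True:
        # Step 2a: recalculate each centroid as the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 2b: re-assign each observation to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # nothing moved: model complete
            return labels, centroids
        labels = new_labels
```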
Now that we know the mechanics of the K-means algorithm,
let’s think through an example that you can recreate in the Kiva
data.
How do we cluster types of loans requested on the Kiva
website?
Clustering
Exercise 1
Clustering task: how would we group these loans into groups most similar in characteristics?
● Loan 1: $25, 1 borrower, funded in 1 day
● Loan 2: $1000, 2 borrowers, funded in 30 days
● Loan 3: $50, 3 borrowers, funded in 30 days
● Loan 4: $25, 1 borrower, funded in 1 day
● Loan 5: $2000, 5 borrowers, funded in 1 day
● Loan 6: $1000, 1 borrower, funded in 1 day
● Loan 7: $75, 1 borrower, funded in 60 days
● Loan 8: $25, 1 borrower, funded in 1 day
● Loan 9: $50, 1 borrower, funded in 1 day
● Loan 10: $25, 1 borrower, funded in 1 day
● Loan 11: $3500, 3 borrowers, funded in 40 days
Clustering
Exercise 1
We can use the features “time to fund” and “size of loan”:
● Loan 1: $25, 1 borrower, funded in 1 day
● Loan 2: $1000, 2 borrowers, funded in 30 days
● Loan 3: $50, 3 borrowers, funded in 30 days
● Loan 4: $30, 2 borrowers, funded in 1 day
● Loan 5: $2000, 5 borrowers, funded in 1 day
● Loan 6: $1000, 1 borrower, funded in 1 day
● Loan 7: $75, 1 borrower, funded in 60 days
● Loan 8: $25, 1 borrower, funded in 1 day
● Loan 9: $500, 4 borrowers, funded in 2 days
● Loan 10: $25, 2 borrowers, funded in 1 day
● Loan 11: $3500, 3 borrowers, funded in 40 days
[Diagram: three clusters: big loans with many borrowers funded quickly; small loans with few borrowers funded quickly; loans with a long time to fund]
Clustering
Task
Feature
Engineering
& Selection
What if we want to use more
than two features to cluster?
We want to group similar loans using:
○ Loan Amount - How much $ did the borrower
request?
○ Borrower Count - Is this a request from a single
borrower or a group of borrowers?
○ Time to Fund - How long did it take for the request
to be funded on the site?
So far, our clusters have been
visualized in 2D for simplicity.
But this can be extended to n
dimensions. The plot to the
right uses 3 features to
cluster, which results in a 3D
visualization.
We can extend our clustering to higher
dimensions by adding more features.
Task
The true utility of clustering comes from
making sense of high dimensional data.
Clustering
Task
Feature
Engineering
& Selection
Example of clustering using 3 features:
time to fund, the number of borrowers
and the loan amount.
[3D plot: clusters labeled “last-minute funded loans,” “small loans with few borrowers that fund quickly,” and “big loans that fund quickly”]
Clustering Features:
● Time to Fund
● Borrower Count
● Loan Amount
Fascinating! If you were the Kiva CEO, what would you do with this
information?
We can change up how many features we use to cluster our data,
but we can also choose the number of clusters (k) we want to
see in our data.
In this example, we see three separate clusters…
Exercise 2
● Loan 1: $25, 1 borrower, funded in 1 day
● Loan 2: $1000, 2 borrowers, funded in 30 days
● Loan 3: $50, 3 borrowers, funded in 30 days
● Loan 4: $30, 2 borrowers, funded in 1 day
● Loan 5: $2000, 5 borrowers, funded in 1 day
● Loan 6: $1000, 1 borrower, funded in 1 day
● Loan 7: $75, 1 borrower, funded in 60 days
● Loan 8: $25, 1 borrower, funded in 1 day
● Loan 9: $500, 4 borrowers, funded in 2 days
● Loan 10: $25, 2 borrowers, funded in 1 day
● Loan 11: $3500, 3 borrowers, funded in 40 days
Exercise 2
● Loan 1: $25, 1 borrower, funded in 1 day
● Loan 2: $1000, 2 borrowers, funded in 30 days
● Loan 3: $50, 3 borrowers, funded in 30 days
● Loan 4: $30, 2 borrowers, funded in 1 day
● Loan 5: $2000, 5 borrowers, funded in 1 day
● Loan 6: $1000, 1 borrower, funded in 1 day
● Loan 7: $75, 1 borrower, funded in 60 days
● Loan 8: $25, 1 borrower, funded in 1 day
● Loan 9: $500, 4 borrowers, funded in 2 days
● Loan 10: $25, 2 borrowers, funded in 1 day
● Loan 11: $3500, 3 borrowers, funded in 40 days
[Diagram: four clusters: big loans with many borrowers funded quickly; small loans with few borrowers funded quickly; small loans with a long time to fund; big loans with a long time to fund]
What if we thought there were 4 distinct groupings?
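A sketch of this exercise in scikit-learn, using the eleven loans above (the feature names are ours, not from the Kiva dataset). Changing the hyperparameter K from 3 to 4 asks the algorithm for a finer split:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# The eleven loans from the exercise: amount ($), borrowers, days to fund.
loans = pd.DataFrame({
    "loan_amount":    [25, 1000, 50, 30, 2000, 1000, 75, 25, 500, 25, 3500],
    "borrower_count": [1, 2, 3, 2, 5, 1, 1, 1, 4, 2, 3],
    "days_to_fund":   [1, 30, 30, 1, 1, 1, 60, 1, 2, 1, 40],
})

X = StandardScaler().fit_transform(loans)

# The same data, clustered with K=3 and then K=4.
for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"K={k}:", labels)
```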
K-means clustering is useful when we want to identify groups in
our data based on similarities across specific features.
But what if you have 250 features? It’s difficult to wrap your head
around clusters based on this many features.
Principal Component Analysis can help! PCA is an
unsupervised algorithm that helps determine which features
have the most explanatory power, and are therefore the most
important to include.
2. Principal Component
Analysis
PCA finds sets of correlated, redundant features in a dataset and
condenses them, leaving you with a smaller number of uncorrelated
“important” directions - i.e., the “principal components.”
What is Principal Component Analysis?
PCA is helpful for feature selection and engineering
Consider the following mini dataset:
If you wanted to choose a single feature from this dataset to
represent the entire dataset, you would want to select the one that
contains the most information. Here, the feature “cumulative_gpa”
intuitively contains more information than “last_year_gpa.”
This is the fundamental idea of PCA: select the features that
contain the most information.
What is Principal Component Analysis?
cumulative_gpa | last_year_gpa | test_scores | attendance | tutoringYN
3.55 | 3.65 | 89% | 91% | Y
This is why PCA is sometimes called a “dimension reduction” technique: by
seeing which features are important, we can remove the unimportant
features. Recall the difference between low and high dimension data:
PCA tells us which features are important and
informative when we have very high dimensional
data!
PCA is a dimension reduction
technique.
PCA Task
Dimension
reduction
Low-dimensional data:
# rooms | House price
1 | 32,000
3 | 100,000
4 | 232,000
2 | 50,000
... | ...
High-dimensional data:
x | y | z | a | b | ...
John | 31 | M | 21st St. | CH | ...
Jane | 42 | F | 3rd Ave | KE | ...
When faced with a gigantic, high-dimensional dataset, you could pick and
choose the features you think are important.
But this can be inefficient, and even worse, can lead to us missing out on
potentially important data! PCA can provide a rigorous way of determining
which features are important.
PCA is more rigorous than picking
and choosing features yourself
PCA Task
Dimension
reduction
studentid | cumulative_gpa | last_year_gpa | test_scores | attendance | tutoringYN | ...
1 | 3.55 | 3.65 | 89% | 91% | Y | ...
2 | 2.76 | 2.50 | 73% | 90% | Y | ...
... | ... | ... | ... | ... | ... | ...
How does PCA work?
How does the
model learn?
Learning
Methodology
So how do we determine which features contain the most
information? We measure variation (how much a feature’s values
vary across observations) as a proxy for information.
studentid | cumulative_gpa | last_year_gpa | test_scores | attendance | tutoringYN
1 | 3.55 | 3.65 | 89% | 91% | Y
2 | 2.76 | 2.50 | 73% | 90% | Y
... | ... | ... | ... | ... | ...
How does variation relate to
information?
How does the
model learn?
Learning
Methodology
(Example: Province and Country, two highly correlated features)
Why does measuring variation make sense? Recall our discussion of
multicollinearity back in Module 4, Linear Regression. When two features are
correlated, it doesn’t make sense to include them both in a model, because you
only need one to capture the information of both.
In PCA, we use a similar concept to detect and remove
redundant and non-informative features. If a feature
contributes very little variation, it can be removed.
See Module 4, Linear Regression
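One quick way to see this redundancy in practice is a correlation matrix. A sketch using a made-up student table shaped like the one above (the extra rows are invented here so the correlation is computable):

```python
import pandas as pd

# Hypothetical data: cumulative_gpa and last_year_gpa move together.
students = pd.DataFrame({
    "cumulative_gpa": [3.55, 2.76, 3.20, 3.90],
    "last_year_gpa":  [3.65, 2.50, 3.10, 3.85],
    "test_scores":    [0.89, 0.73, 0.80, 0.95],
    "attendance":     [0.91, 0.90, 0.85, 0.98],
})

# Highly correlated columns carry overlapping information,
# exactly the redundancy PCA folds into shared components.
print(students.corr().round(2))
```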
Let’s step through the
algorithm…
How does the
model learn?
Learning
Methodology
Assume a simplified dataset with three features: Xa, Xb, and Xc.
1. We must first standardize the data. Let the new
standardized features be XA, XB, and XC.
For each datapoint x for each feature X, subtract the mean
of X and divide by the standard deviation of X.
Why? Because we will be measuring variation. If we do not standardize
data, we can convert one feature from km to cm, and cause that feature’s
variance to increase. Standardization ensures that our data variance is
independent of whatever transformations we might use.
[1] For more on why we need to standardize data, read here.
[3D scatter plot with axes XA, XB, and XC]
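Standardization in code, as a sketch: subtract each feature's mean and divide by its standard deviation, which is exactly what scikit-learn's StandardScaler does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: three features Xa, Xb, Xc on very different scales.
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 350.0, 0.1],
              [3.0, 120.0, 0.9]])

# Manual standardization: (x - mean of X) / std of X, per feature.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler computes the same thing.
X_scaled = StandardScaler().fit_transform(X)
assert np.allclose(X_manual, X_scaled)
```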
Finding Principal Component #1
How does the
model learn?
2. To find the first Principal Component, find the
direction of maximum variance. This is Principal
Component #1 (PC1)!
Formally, PC1 is the linear combination of the features…
PC1 = c1(XA) + c2(XB) + c3(XC)
… that has the largest variance.
In our simplified example, we eyeball the direction in
which the data varies most.
Learning
Methodology
[3D scatter plot with the PC1 direction drawn through the data]
Finding Principal Component #2
How does the
model learn?
Learning
Methodology
3. To find the second Principal Component, find the
second highest direction of variance. This is Principal
Component #2 (PC2)!
Formally, PC2 is the linear combination of the features…
PC2 = c4(XA) + c5(XB) + c6(XC)
… that has the second largest variance AND is
uncorrelated with PC1.
In our simplified example, we can roughly eyeball the
direction in which the variance is second highest.
[3D scatter plot with the PC2 direction drawn, orthogonal to PC1]
Finding Principal Component #3
How does the
model learn?
Learning
Methodology
4. To find the third and final Principal Component, find
the third highest direction of variance. This is Principal
Component #3 (PC3)!
Formally, PC3 is the linear combination of the features…
PC3 = c7(XA) + c8(XB) + c9(XC)
… that has the third largest variance AND is
uncorrelated with both PC1 and PC2.
In our simplified example, we can roughly eyeball the
direction in which the variance is third highest.
[3D scatter plot with the PC3 direction drawn, orthogonal to PC1 and PC2]
How many principal components
can you have? How many do we
want?
For the sake of completeness, we calculated all the
possible principal components (there can be as many
principal components as features).
However, as our main aim is dimensionality reduction,
we will keep the top principal components.
How many should we keep?
Recall our discussion of the elbow plot in k-means
clustering. We use similar logic to see how many
principal components we want.
How does the
model learn?
Learning
Methodology
[3D scatter plot with all three principal component directions drawn]
How many principal components
can you have? How many do we
want?
If we choose to keep only PC1, we will be
able to capture 80% of the original data’s
information. Pretty good!
If we choose to keep both PC1 and PC2, we
will be able to capture 95% of the original
data’s information. Even better!
Note that if we chose to keep PC1, PC2, and PC3, we would capture
100% of the original data’s information. However, we wouldn’t
choose to do this, as our main intention was dimension reduction.
How does the
model learn?
Learning
Methodology
[Scree plot: % variance explained per component: PC1 80%, PC2 15%, PC3 5%]
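With scikit-learn, the percentages on this plot come from explained_variance_ratio_, and a cumulative sum tells you how many components to keep. A sketch reusing the fitted pca from the previous snippet (the 80/15/5 split above is the slide's example; real data will differ):

```python
import numpy as np

# Reusing the fitted `pca` object from the previous sketch.
ratios = pca.explained_variance_ratio_   # fraction of variance per component
print(np.cumsum(ratios))                 # cumulative information captured

# Keep the smallest number of components capturing, say, 95% of the variance.
n_keep = int(np.searchsorted(np.cumsum(ratios), 0.95)) + 1
print(f"Keep {n_keep} principal components")
```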
For simplicity, our example had a dataset that comprised only 3
features. In reality, you probably wouldn’t apply PCA to such a
simple dataset. The true utility of PCA comes from when we
apply it to datasets with hundreds or thousands of features.
Consider, for example, image processing, where every pixel of
an image constitutes a feature.
For each of these 321 × 261 pixel images, every pixel is a feature,
resulting in 321 * 261 = 83,781 features for only 4 observations.
This is a very high-dimensional dataset,
but we can use PCA to simplify it.
Source: http://people.ciirc.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf
With PCA, we are able to reduce a dataset with 83,781 features and 4 observations
to a dataset with 4 principal components and 4 observations.
Compare the original images... … to images recreated using 4 principal
components
PCA is able to retain most of the information from the original
dataset, while condensing the data quite a bit.
Source: http://people.ciirc.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf
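A sketch of that compression, assuming image_list holds the four 321 × 261 grayscale images as NumPy arrays (with only 4 observations, PCA can produce at most 4 components):

```python
import numpy as np
from sklearn.decomposition import PCA

# Flatten each image so that every pixel becomes a feature (83,781 of them).
images = np.stack([img.reshape(-1) for img in image_list])

pca = PCA(n_components=4)
codes = pca.fit_transform(images)   # 4 numbers per image instead of 83,781

# Reconstruct approximate images from the principal components.
reconstructed = pca.inverse_transform(codes).reshape(-1, 321, 261)
```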
End of theory
Advanced
resources
Want to take this further? Here are
some resources we recommend:
● Textbooks
○ Unsupervised learning, Introduction to Statistical
Learning, Chapter 10.2
○ Hierarchical clustering, Introduction to Statistical
Learning, Chapter 10.3.2
● Web resources
○ Clustering optimization using Silhouette plots
○ More applications of k-means: Anomaly detection
You are on fire! Go straight to the
next module here.
Need to slow down and digest? Take a
minute to write us an email about
what you thought about the course. All
feedback small or large welcome!
Email: sara@deltanalytics.org
Congrats! You finished
module 7!
Find out more about
Delta’s machine
learning for good
mission here.

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Simplilearn
 
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Simplilearn
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Simplilearn
 
Machine Learning Course | Edureka
Machine Learning Course | EdurekaMachine Learning Course | Edureka
Machine Learning Course | Edureka
Edureka!
 

Was ist angesagt? (18)

Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1
 
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
 
Introduction to machine learning and deep learning
Introduction to machine learning and deep learningIntroduction to machine learning and deep learning
Introduction to machine learning and deep learning
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
 
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Machine learning and types
Machine learning and typesMachine learning and types
Machine learning and types
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning Algorithms
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Ways to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performanceWays to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performance
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Machine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersMachine Learning Interview Questions and Answers
Machine Learning Interview Questions and Answers
 
Machine learning - session 3
Machine learning - session 3Machine learning - session 3
Machine learning - session 3
 
Introduction to machine learning and model building using linear regression
Introduction to machine learning and model building using linear regressionIntroduction to machine learning and model building using linear regression
Introduction to machine learning and model building using linear regression
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Machine Learning Course | Edureka
Machine Learning Course | EdurekaMachine Learning Course | Edureka
Machine Learning Course | Edureka
 

Ähnlich wie Module 7: Unsupervised Learning

Machine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptxMachine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptx
iaeronlineexm
 
Types of Machine Learning- Tanvir Siddike Moin
Types of Machine Learning- Tanvir Siddike MoinTypes of Machine Learning- Tanvir Siddike Moin
Types of Machine Learning- Tanvir Siddike Moin
Tanvir Moin
 
Machine-Learning-Overview a statistical approach
Machine-Learning-Overview a statistical approachMachine-Learning-Overview a statistical approach
Machine-Learning-Overview a statistical approach
Ajit Ghodke
 

Ähnlich wie Module 7: Unsupervised Learning (20)

Machine Learning - Deep Learning
Machine Learning - Deep LearningMachine Learning - Deep Learning
Machine Learning - Deep Learning
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learning
 
It's Machine Learning Basics -- For You!
It's Machine Learning Basics -- For You!It's Machine Learning Basics -- For You!
It's Machine Learning Basics -- For You!
 
Machine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptxMachine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptx
 
Types of Machine Learning- Tanvir Siddike Moin
Types of Machine Learning- Tanvir Siddike MoinTypes of Machine Learning- Tanvir Siddike Moin
Types of Machine Learning- Tanvir Siddike Moin
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine Learning Contents.pptx
Machine Learning Contents.pptxMachine Learning Contents.pptx
Machine Learning Contents.pptx
 
Classification and Clustering
Classification and ClusteringClassification and Clustering
Classification and Clustering
 
Introduction to Machine Learning.pptx
Introduction to Machine Learning.pptxIntroduction to Machine Learning.pptx
Introduction to Machine Learning.pptx
 
Machine-Learning-Overview a statistical approach
Machine-Learning-Overview a statistical approachMachine-Learning-Overview a statistical approach
Machine-Learning-Overview a statistical approach
 
An Introduction to Machine Learning
An Introduction to Machine LearningAn Introduction to Machine Learning
An Introduction to Machine Learning
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
Lesson 1 - Overview of Machine Learning and Data Analysis.pptxLesson 1 - Overview of Machine Learning and Data Analysis.pptx
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
 
Supervised learning techniques and applications
Supervised learning techniques and applicationsSupervised learning techniques and applications
Supervised learning techniques and applications
 
Overview of machine learning
Overview of machine learning Overview of machine learning
Overview of machine learning
 
Human in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AIHuman in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AI
 
Introduction to data science.pdf
Introduction to data science.pdfIntroduction to data science.pdf
Introduction to data science.pdf
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Module 7: Unsupervised Learning

  • 2. This course content is being actively developed by Delta Analytics, a 501(c)3 Bay Area nonprofit that aims to empower communities to leverage their data for good. Please reach out with any questions or feedback to inquiry@deltanalytics.org. Find out more about our mission here. Delta Analytics builds technical capacity around the world.
  • 4. ❏ Overview of unsupervised learning ❏ Intuition ❏ Pros and cons ❏ Clustering algorithms ❏ K-means clustering ❏ Principal component analysis (PCA) Module Checklist:
  • 5. In the first half of this course, we learned a lot about supervised learning algorithms. Now, we turn to unsupervised learning algorithms: what are they, when are they useful, and what role do they play in the future of machine learning?
  • 6. Recap: Supervised v. Unsupervised Learning Supervised Learning ● For every x, there is a y ● Goal is to predict y using x ● Most methods used in practice are supervised. Unsupervised Learning ● For every x, there is no y ● Goal is not prediction, but to investigate x. ● Unsupervised methods read data first and then suggest which classification scheme(s) might apply. Learning Methodology
  • 7. Data Set Algorithm Automated Clusters Unsupervised learning involves identifying patterns. This is perhaps the most basic human task - even babies can do it! Unsupervised learning Many unsupervised learning algorithms involve identifying patterns.
  • 8. Data Set Algorithm Automated Clusters It turns out that what’s extremely intuitive for human babies to do is pretty difficult to have computers do well. Unsupervised learning Many unsupervised learning algorithms involve identifying patterns.
  • 9. Task There is no f(x). Instead of trying to choose the best f(x) to map x to the true Y, we just want to understand raw patterns in x. How is this model framework different than what we’ve already seen? Data Set Algorithm Automated Clusters f(X) x Y*
  • 10. Task There is no f(x). Instead of trying to choose the best f(x) to map x to the true Y, we just want to understand raw patterns in x. Learning Methodology There is no loss function. Performance We do not learn from the data by trying to reduce our loss to 0. Difficult to evaluate performance Evaluating the performance of clustering methods is notoriously difficult. How is this model framework different than what we’ve already seen?
  • 11. Task What is the problem we want our model to solve? What is the unsupervised approach we plan to use? For example, is hierarchical clustering better than k-means for this task? Feature engineering & selection How do we decide what features to include in our model? Unsupervised algorithms still have a task, and involve feature engineering and selection.
  • 12. Learning Methodology Clustering algorithms are unsupervised. How does that affect the learning process? Optimization process How do unsupervised learning models optimize if there is no loss function? How does our ML model learn? Overview of how the model teaches itself. Unsupervised algorithms are still machine learning algorithms, which means they actively learn from the data.
  • 13. Performance How do we evaluate the performance of an unsupervised algorithm? With supervised learning we had a clear goal: predict our outcome variable with high accuracy. Evaluating the performance of clustering methods is difficult. Often, we rely on holistic evaluations of our clustering algorithm. Unlike with supervised learning, we cannot check the performance of our model by comparing loss function. Clustering quality is subjective, and largely dependent on the assumptions made at the outset. Unsupervised algorithms are unfortunately very difficult to evaluate for performance.
  • 14. Pros and cons of using an unsupervised approach
  • 15. When would I turn to unsupervised learning? ● You have extremely high dimensional data (i.e., many features) that you want to investigate ● You have a research question but no labelled outcome feature ○ This is true for many datasets ● You want to detect any relationships or patterns in your data ○ E.g. customer behavior data ● You don’t have time to dive deep into defining an outcome ○ Use unsupervised learning as first exploratory step As the amount of data in the world grows, we will increasingly turn to unsupervised learning methods. Learning Methodology
  • 16. Unsupervised learning is an important tool, often used as part of your exploratory analysis of the data. ● Infer complex properties of the data (such as sub groups) ● Discover interesting and informative methods of visualization ● Often used in exploratory phases of data analysis Unsupervised Learning Research Question Exploratory Analysis Modeling Phase Performance Data Cleaning You can often leverage an unsupervised algorithm to aid feature selection during exploratory analysis, before using a supervised algorithm during the modeling phase.
  • 17. Humans learn mostly through unsupervised learning: we absorb vast amounts of data from our surroundings without needing a label. To reach true machine intelligence (i.e., a machine that thinks and learns for itself), ML needs to get better at unsupervised learning - it should learn without us having to feed it labels or explicit instructions. Unsupervised Learning The future of machine learning is unsupervised learning. Unsupervised learning is the cake itself Supervised learning is the icing on the cake We will have only scratched the surface in this class.
  • 18. With that said, let’s turn to our first unsupervised algorithms: clustering algorithms.
  • 19. Clustering Algorithms 1. K-means clustering 2. Principal component analysis Central concept: identify similar sub-groups within the data.
  • 20. Where are we? Supervised Algorithms Linear regression Decision tree Ensemble Algorithms Unsupervised Algorithms k-means Principal Component Analysis We have finished discussing supervised algorithms and now we will introduce and discuss unsupervised algorithms. This will help lay the foundation for our discussion of natural language processing.
  • 21. Unsupervised Algorithms Let’s start with K-means clustering, an algorithm that splits data into k clusters along features that you select. K-means clustering Principal Component Analysis
  • 23. K Means Clustering: model cheat sheet ● Assumes existence of underlying groupings ● Time-consuming to find the optimal number of clusters ● Time-consuming feature engineering (features must be numeric and normalized) ● Easy to represent physically ● Does not assume any underlying distribution (e.g. no normal distribution assumption like in linear regression) ● Produces intuitive groupings ● Can work in many dimensions Pros Cons Assumptions
  • 24. Clustering splits data in order to find out how observations are similar on a number of different features. Clustering Task Clustering is a powerful unsupervised algorithm that detects naturally occurring patterns in the data. model Feature 2 Feature1 We are not predicting a trueY. The clusters are the model. We decide the number of clusters, represented as K. Cluster 3 Cluster 2Cluster 1 Cluster 4
  • 25. ??? ??? Dataset of customer features ?????? Clustering Task Defining Groups Imagine that you are the owner of an online clothing store. You want to segment your customers by their buying habits.
  • 26. Clustering Task Defining Groups NOTE: As the owner of the store, you think that some customers are bargain-hunters while others are price-insensitive; some come in every week while others drop in during the holidays. But you don’t actually know definitively which ones are which. You could take a supervised approach to this problem, and try to predict how much a customer will spend, or how frequently a customer will drop in. But here, we’re just trying to see what the data will tell us. This is what makes clustering an unsupervised problem. Imagine that you are the owner of an online clothing store. You want to segment your customers by their buying habits.
  • 27. Dataset of customer features Clustering Task Since we have an online website for our store, we have valuable data about how our consumers behave online. organic Email_sale Email_new_collection organic $92 $35 $200 $40 traffic typeAvg_amt_ spent Number of visits 5 50 20 1 %_of_visits_durin g_sales 20% 100% 10% 100% X1 X2 X3 X4 customer_id 1237482 1213345 2323764 2326734 What features will be relevant to our clustering task?
  • 28. Clustering Task We select features that will determine how clusters are formed in our algorithm. organic Email_sale Email_new_collection organic $92 $35 $200 $40 traffic typeAvg_amt_ spent Number of visits 5 50 20 1 %_of_visits_durin g_sales 20% 100% 10% 100% X1 X2 X3 X4 customer_id 1237482 1213345 2323764 2326734 Since we want to segment customers by their buying habits, we probably want to form clusters using the features “number of visits” and “average amount spent.” Let’s start with these.
  • 29. Middle value Buyers Casual Buyers Dataset of customer features Low value BuyersHigh value Buyers Clustering Task Defining Groups Using number of visits and amount spent, we can see groups below:
  • 30. We use avg_price_clicked_on and number_of_visits to cluster our customers. Defining groups We can demonstrate clustering using two features in two dimensional space! Task model Number of visits avg_amt_spent Middle value buyers High value buyers Low Value Buyers Casual Buyers
  • 31. Once we cluster using the features we select, we can say something about the value of our customers. Middle value buyers Casual BuyersLow Value BuyersHigh value buyers Defining groups Using two features, we can say something about our customers. ● Not price-sensitive ● Frequent buyers ● Price-sensitive ● Infrequent buyers ● Not price-sensitive ● Infrequent buyers ● Price-sensitive ● Frequent buyers Task
  • 32. Feature Engineering & Selection Task In this simple example, we used 2 features to cluster, and produced 4 different clusters (K=4). But we can include as many features as we want, and determine how many clusters to produce. We will return to this point later, but first, let’s find out how this algorithm learns. Feature selection is just as important in unsupervised as in supervised algorithms.
  • 33. How does our k-mean algorithm learn from the data to group similar observations together?
  • 34. Learning Methodology Clustering algorithms are unsupervised. How does that affect the learning process? Optimization process How does the model minimize the loss function. How does our ML model learn? Overview of how the model teaches itself. Unsupervised algorithms are still machine learning algorithm, which means they actively learn from the data!
  • 35. Numberofvisits Learning Methodology How does our ML model learn? We start with a simple scatter plot of number of visits against average amount spent. How do we take this scatter plot and segment observations into K distinct groups? For now, let’s say K = 3. Source: Intro to Statistical Learning with Applications in R.pdf Average amount spent
  • 36. Learning Methodology How does our ML model learn? Step 1: Randomly assign each customer to a cluster ● Data points have randomly been assigned into pink, yellow, and green groupings There are 3 colors (groups) here. Why? Numberofvisits Average amount spent
  • 37. Learning Methodology How does our ML model learn? Step 1: Randomly assign each customer to a cluster ● Data points have randomly been assigned into pink, yellow, and green groupings There are 3 colors (groups) here because we decided K=3. Numberofvisits Average amount spent
  • 38. Why were our data points randomly assigned to three groupings? We set K, or number of groups, equal to 3 The number of clusters (k) is an example of a hyperparameter. Hyperparameters are set by the researcher (you!), not determined by the model. Learning Methodology Hyperparameters Higher level settings of a model that are fixed before training begins.
  • 39. In our example we specified 3 clusters, but we can tell the model whatever “K” we want. Learning Methodology What should k be? A very large number of clusters will cause overfitting (for example, if you set n equal to the number of observations you will have a cluster for every observation!) This is not very useful for making sense of subsets of our data. We can set our hyperparameter K to be 2,3,4...n.
  • 40. Performance If we are underfitting, we may be missing natural subsets of similar customers. If we are overfitting, we may have too many clusters, so our model does not generalize well to unseen data. Underfit Overfit Sweet spot Recall overfitting vs. overfitting? If K is too large, we can have overfitting; too small and we may be underfitting.
  • 41. An elbow plot helps us avoid underfitting by showing how model error decreases with the number of clusters. How do we know what the optimal number of clusters should be? An Elbow Plot visualizes how the error decreases as K increases. This plot is useful because it visualizes the trade-off between overfitting and underfitting. We need to find a balance, called the “elbow point”. Learning Methodology Optimization Process “elbow”
  • 42. An elbow plot helps us avoid underfitting by showing how model error decreases with the number of clusters. If K == n (number of observations), then distance = 0 (each observation its own cluster)! This, as we know, is a case of overfitting. In the graph to the right, we should choose K=6. This appears to be the number of clusters where the biggest gains in reducing error have already been made. Learning Methodology Optimization Process “elbow”
  • 43. To start with, let’s set k=3. This means we want to find 3 clusters. Learning Methodology How does our ML model learn? We suspect these 3 clusters will correspond to: 1. Budget customers that shop with us frequently 2. High end customers that shop with us frequently 3. Customers that shop with us infrequently However, it is possible and in fact likely that there are naturally more than 3 subsets of customers. Numberofvisits Average amount spent
  • 44. Learning Methodology How does our ML model learn? Is our initial random allocation of observations to clusters useful? Remember our first step was to randomly assign each customer to a cluster. Clearly, our initial random assignment to clusters is poor. How poor? And how do we improve? Numberofvisits Average amount spent
  • 45. Learning Methodology How does the model learn? Centroids To evaluate how poor our clusters are, we can use the distance between an observation and its cluster's centroid. A centroid is the center point of a cluster.
  • 46. Review! How do we calculate distance between points? Learning Methodology How does the model learn? ● The "distance" here is the Euclidean distance (or spatial distance), where the distance between two vectors u and v with n elements is: d(u, v) = sqrt[ (u1 - v1)^2 + (u2 - v2)^2 + ... + (un - vn)^2 ] ● In this example, the distance between customer c6 at (7, 2) and the cluster center at (4, 7) would be sqrt[ (7-4)^2 + (2-7)^2 ] = sqrt(34) ≈ 5.8. Important note: Clustering takes only continuous features as inputs, not categorical ones. Distance between categorical points would not be meaningful.
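As a quick sanity check, here is the same distance calculation done in NumPy; the coordinates come from the example above, everything else is just illustration.

```python
# Verify the Euclidean distance example with NumPy.
import numpy as np

c6 = np.array([7, 2])        # customer c6
centroid = np.array([4, 7])  # cluster center
dist = np.sqrt(np.sum((c6 - centroid) ** 2))
print(round(dist, 1))  # 5.8, matching sqrt[(7-4)^2 + (2-7)^2]
```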
  • 47. Learning Methodology How does the model learn? Step 2a: We calculate a centroid for each cluster. Our goal is to minimize the total distance from each observation to its cluster's centroid. ● The centroids are calculated as the center of their randomly assigned group. ● Here, the centroids are really close together, and the distance to the outer points is very large -- we can definitely do better!
  • 48. Learning Methodology How does the model learn? Step 2b: Re-assign each observation to the nearest centroid ● Now we re-assign each observation to the cluster of its nearest centroid. Now our clusters are starting to look better! ● Of the three centroids, this observation is closest to the pink one, so we assign it to the pink cluster.
  • 49. Learning Methodology How does the model learn? Iteration 2, Step 2a: Repeat the centroid calculations! ● Here we go again! Our recalculated centroids are now farther apart, and we can see the distances between observations and centroids shrinking. Overall distances are much smaller -- we're getting closer!
  • 50. Learning Methodology Optimization process Iteration 2, Step 2b: Re-assign clusters based on the nearest centroid ● This time around, fewer observations changed clusters. ● We keep repeating the centroid calculation / cluster re-assignment until nothing moves anymore - the model is complete!
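To make the loop concrete, here is a minimal from-scratch sketch of the procedure just described: random assignment, centroid calculation, re-assignment, repeat until nothing moves. It assumes X is a NumPy array with one row per customer; in practice you would reach for sklearn.cluster.KMeans rather than rolling your own.

```python
# From-scratch k-means sketch illustrating the steps on the slides above.
# For simplicity it ignores the rare case where a cluster ends up empty.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: randomly assign each observation to one of k clusters
    labels = rng.randint(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2a: recompute each centroid as the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 2b: re-assign each observation to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # nothing moved: converged
            break
        labels = new_labels
    return labels, centroids
```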
  • 51. Now that we know the mechanics of the K-means algorithm, let’s think through an example that you can recreate in the Kiva data. How do we cluster types of loans requested on the Kiva website?
  • 52. Clustering Exercise 1 Clustering task: How would we group these loans into groups most similar in characteristics? Loan 1: $25, 1 borrower, funded in 1 day. Loan 2: $1000, 2 borrowers, funded in 30 days. Loan 3: $50, 3 borrowers, funded in 30 days. Loan 4: $30, 2 borrowers, funded in 1 day. Loan 5: $2000, 5 borrowers, funded in 1 day. Loan 6: $1000, 1 borrower, funded in 1 day. Loan 7: $75, 1 borrower, funded in 60 days. Loan 8: $25, 1 borrower, funded in 1 day. Loan 9: $500, 4 borrowers, funded in 2 days. Loan 10: $25, 2 borrowers, funded in 1 day. Loan 11: $3500, 3 borrowers, funded in 40 days.
  • 53. Clustering Exercise 1 We can use the features "time to fund" and "size of loan". With the same 11 loans as above, three natural groupings emerge: big loans with many borrowers that fund quickly; small loans with few borrowers that fund quickly; and loans that take a long time to fund.
  • 54. Clustering Task Feature Engineering & Selection What if we want to use more than two features to cluster? We want to group similar loans using: ○ Loan Amount - How much $ did the borrower request? ○ Borrower Count - Is this a request from a single borrower or a group of borrowers? ○ Time to Fund - How long did it take for the request to be funded on the site?
  • 55. So far, our clusters have been visualized in 2D for simplicity, but this can be extended to n dimensions. The plot to the right uses 3 features to cluster, which results in a 3D visualization. We can extend our clustering to higher dimensions by adding more features. Task The true utility of clustering comes from making sense of high dimensional data.
  • 56. Clustering Task Feature Engineering & Selection Example of clustering using 3 features: time to fund, number of borrowers, and loan amount. Clustering Features: ● Time to Fund ● Borrower Count ● Loan Amount [3D plot, with cluster annotations: last-minute funded loans; small loans with few borrowers that fund quickly; big loans that fund quickly] Fascinating! If you were the Kiva CEO, what would you do with this information? (A code sketch of this clustering follows below.)
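Here is a hedged sketch of how you might reproduce this 3-feature clustering in Python. The column names are made up for illustration, not the actual Kiva schema; the values mirror the 11 example loans from Exercise 1.

```python
# Cluster the 11 example loans on three features with k-means.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

loans = pd.DataFrame({
    "loan_amount":    [25, 1000, 50, 30, 2000, 1000, 75, 25, 500, 25, 3500],
    "borrower_count": [1, 2, 3, 2, 5, 1, 1, 1, 4, 2, 3],
    "days_to_fund":   [1, 30, 30, 1, 1, 1, 60, 1, 2, 1, 40],
})

# The features are on very different scales, so standardize before clustering.
X = StandardScaler().fit_transform(loans)
loans["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(loans.sort_values("cluster"))
```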
  • 57. We can change up how many features we use to cluster our data, but we can also choose the number of clusters (k) we want to see in our data.
  • 58. Exercise 2 In this example, we see three separate clusters… (the same 11 loans as in Exercise 1).
  • 59. Exercise 2 What if we thought there were 4 distinct groupings? With the same 11 loans, the clusters become: big loans with many borrowers, funded quickly; small loans with few borrowers, funded quickly; small loans with a long time to fund; and big loans with a long time to fund.
  • 60. K-means clustering is useful when we want to identify groups in our data based on similarities across specific features. But what if you have 250 features? It’s difficult to wrap your head around clusters based on this many features. Principal Component Analysis can help! PCA is an unsupervised algorithm that helps determine which features have the most explanatory power, and are therefore the most important to include.
  • 62. What is Principal Component Analysis? PCA finds sets of highly correlated, redundant features in a dataset and condenses them, leaving you with a smaller number of "important" composite features - i.e., the "principal components." PCA is helpful for feature selection and engineering
  • 63. What is Principal Component Analysis? Consider the following mini dataset: cumulative_gpa: 3.55, last_year_gpa: 3.65, test_scores: 89%, attendance: 91%, tutoringYN: Y. If you wanted to choose a single feature from this dataset to represent the entire dataset, you would want to select the one that contains the most information. Here, the feature "cumulative_gpa" intuitively contains more information than "last_year_gpa." This is the fundamental idea of PCA: select the features that contain the most information.
  • 64. PCA is a dimension reduction technique. This is why PCA is sometimes called a "dimension reduction" technique: by seeing which features are important, we can remove the unimportant features. Recall the difference between low and high dimension data: a low-dimensional dataset might have just two columns (# rooms, house price), while a high-dimensional one has many (name, age, gender, address, country, ...). PCA tells us which features are important and informative when we have very high dimensional data! PCA Task Dimension reduction
  • 65. PCA is more rigorous than picking and choosing features yourself. PCA Task Dimension reduction When faced with a gigantic, high-dimensional dataset, you could pick and choose the features you think are important. But this can be inefficient and, even worse, can lead us to miss out on potentially important data! PCA provides a rigorous way of determining which features are important. [Example table: studentid, cumulative_gpa, last_year_gpa, test_scores, attendance, tutoringYN, ...]
  • 66. How does PCA work? How does the model learn? Learning Methodology So how do we determine which features contain the most information? We measure variation (how spread out a feature's values are) as a proxy for information. [Example table: studentid, cumulative_gpa, last_year_gpa, test_scores, attendance, tutoringYN]
  • 67. How does variation relate to information? How does the model learn? Learning Methodology Why does measuring variation make sense? Recall our discussion of multicollinearity back in Module 4, Linear Regression. When two features are correlated (for example, Province and Country), it doesn't make sense to include them both in a model, because you only need one to capture the information of both. In PCA, we use a similar concept to detect and remove redundant and non-informative features. If a feature contributes very little variation, it can be removed. See Module 4, Linear Regression
  • 68. Let’s step through the algorithm… How does the model learn? Learning Methodology Assume a simplified dataset with three features: Xa , Xb and Xc . 1. We must first standardize the data. Let the new standardized features be XA , XB and XC . For each datapoint x for each feature X, subtract the mean of X and divide by the standard deviation of X. Why? Because we will be measuring variation. If we do not standardize data, we can convert one feature from km to cm, and cause that feature’s variance to increase. Standardization ensures that our data variance is independent of whatever transformations we might use. [1] For more on why we need to standardize data, read here. XA XB XC
  • 69. Finding Principal Component #1 How does the model learn? Learning Methodology 2. To find the first principal component, find the direction of maximum variance. This is Principal Component #1 (PC1)! Formally, PC1 is the linear combination of the features… PC1 = c1(XA) + c2(XB) + c3(XC) … that has the largest variance. In our simplified example, we can eyeball the direction in which the data varies most.
  • 70. Finding Principal Component #2 How does the model learn? Learning Methodology 3. To find the second principal component, find the direction with the second highest variance. This is Principal Component #2 (PC2)! Formally, PC2 is the linear combination of the features… PC2 = c4(XA) + c5(XB) + c6(XC) … that has the second largest variance AND is uncorrelated with PC1. In our simplified example, we can roughly eyeball the direction in which the variance is second highest.
  • 71. Finding Principal Component #3 How does the model learn? Learning Methodology 4. To find the third and final principal component, find the direction with the third highest variance. This is Principal Component #3 (PC3)! Formally, PC3 is the linear combination of the features… PC3 = c7(XA) + c8(XB) + c9(XC) … that has the third largest variance AND is uncorrelated with both PC1 and PC2. In our simplified example, we can roughly eyeball the direction in which the variance is third highest.
  • 72. How many principal components can you have? How many do we want? For the sake of completeness, we calculated all the possible principal components (there can be as many principal components as there are features). However, as our main aim is dimensionality reduction, we will keep only the top principal components. How many should we keep? Recall our discussion of the elbow plot in k-means clustering: we use similar logic to decide how many principal components to keep. How does the model learn? Learning Methodology
  • 73. How many principal components can you have? How many do we want? If we choose to keep only PC1, we will be able to capture 80% of the original data's information. Pretty good! If we choose to keep both PC1 and PC2, we will be able to capture 95% of the original data's information. Even better! Note that if we kept PC1, PC2, and PC3, we would capture 100% of the original data's information. However, we wouldn't choose to do this, as our main intention was dimension reduction. How does the model learn? Learning Methodology [Bar plot, % variance explained by component: PC1 = 80%, PC2 = 15%, PC3 = 5%]
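A hedged sketch of this step in scikit-learn, using synthetic correlated data so that PC1 dominates, as in the plot above; any real dataset will have its own variance split.

```python
# Inspect the variance explained by each principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data: three features driven by one latent factor plus noise,
# standing in for an already-standardized dataset (step 1 above).
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[3.0, 2.0, 0.5]]) + 0.3 * rng.normal(size=(200, 3))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)             # variance captured per PC
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative, e.g. stop at ~95%
X_reduced = PCA(n_components=2).fit_transform(X) # keep only PC1 and PC2
```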
  • 74. For simplicity, our example used a dataset with only 3 features. In reality, you probably wouldn't apply PCA to such a simple dataset. The true utility of PCA comes when we apply it to datasets with hundreds or thousands of features. Consider, for example, image processing, where every pixel of an image constitutes a feature.
  • 75. For each of these 321 × 261 pixel images, every pixel is a feature, resulting in 321 × 261 = 83,781 features for only 4 observations. This is a very high-dimensional dataset, but we can use PCA to simplify it. Source: http://people.ciirc.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf
  • 76. With PCA, we are able to reduce a dataset with 83,781 features and 4 observations to a dataset with 4 principal components and 4 observations. Compare the original images... … to images recreated using 4 principal components
  • 77. PCA is able to retain most of the information from the original dataset, while condensing the data quite a bit. Source: http://people.ciirc.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf
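For intuition, here is a hedged sketch of that compression in scikit-learn. Random pixel values stand in for the real photos from the source above; the shapes are what matter.

```python
# Compress 4 flattened images (83,781 pixel features each) to 4 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
images = rng.rand(4, 321 * 261)    # 4 observations x 83,781 features

pca = PCA(n_components=4)          # at most min(n_samples, n_features)
codes = pca.fit_transform(images)  # each image is now just 4 numbers
reconstructed = pca.inverse_transform(codes)
print(codes.shape)                 # (4, 4)
print(reconstructed.shape)         # (4, 83781): recreated images
```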
  • 80. Want to take this further? Here are some resources we recommend: ● Textbooks ○ Unsupervised learning, Introduction to Statistical Learning, Chapter 10.2 ○ Hierarchical clustering, Introduction to Statistical Learning, Chapter 10.3.2 ● Web resources ○ Clustering optimization using Silhouette plots ○ More applications of k-means: Anomaly detection
  • 81. You are on fire! Go straight to the next module here. Need to slow down and digest? Take a minute to write us an email about what you thought of the course. All feedback, small or large, is welcome! Email: sara@deltanalytics.org
  • 82. Congrats! You finished module 7! Find out more about Delta’s machine learning for good mission here.