SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
Self-Organising Maps for
Customer Segmentation
Theory and worked examples using census and
customer data sets.
Talk for Dublin R Users Group
20/01/2014
Shane Lynn Ph.D. – Data Scientist
www.shanelynn.ie / @shane_a_lynn
Overview
1. Why customer
segmentation?
2. What is a self-organising
map?
– Algorithm
– Uses
– Advantages

3. Example using Irish
Census Data
4. Example using Ta-Feng
Grocery Shopping Data
Dublin R / SOMs / Shane Lynn

2
Customer Segmentation
• Customer segmentation is the application of clustering
techniques to customer data
• Identify cohorts of “similar” customers – common
needs, priorities, characteristics.
“Agent Loyal”

Homogeneous
Customer Base

“Non-traditional”
Dublin R / SOMs / Shane Lynn

“Budget-concious”
3
Customer Segmentation
Typical Uses
• Targeted marketing
• Customer retention
• Debt recovery

Techniques
•
•
•
•
•

Single discrete variable
K-means clustering
Hierarchical clustering
Finite mixture modelling
Self Organising Maps

Dublin R / SOMs / Shane Lynn

4
Self-Organising Maps
A Self-Organising Map (SOM) is a form of unsupervised neural
network that produces a low (typically two) dimensional
representation of the input space of the set of training samples.
• First described by Teuvo Kohonen
(1982) (“Kohonen Map”)
• Over 10k citations referencing SOMs –
most cited Finnish scientist.
• Multi-dimensional input data is
represented by a 2-D “map” of nodes
• Topological properties of the input
space are maintained in map

Dublin R / SOMs / Shane Lynn

5
Self-Organising Maps
• The SOM visualisation is made up of several nodes
• Input samples are “mapped” to the most similar node on the SOM.
All attributes in input data are used to determine similarity.
• Each node has a weight vector of same size as the input space
• There is no variable / meaning to the x and y axes.

All nodes have:
Position
Weight Vector
Associated input
samples
Dublin R / SOMs / Shane Lynn

6
Self-Organising Maps
• Everyone in this room stands in Croke Park.
• Each person compares attributes – e.g. age,
gender, salary, height.
• Everyone moves until they are closest to
other people with the most similar
attributes. (using a Euclidean distance)
• If everyone holds up a card indicating their
age – the result is a SOM heatmap
< 20

30

40

50

> 60

Dublin R / SOMs / Shane Lynn

7
Self-Organising Maps
Algorithm
1.
2.

3.
4.

Initialise all weight vectors randomly.
Choose a random datapoint from training data and present it to
the SOM
Find the “Best Matching Unit” (BMU) in the map – the most
similar node.
Determine the nodes within the “neighbourhood” of the BMU.
–

5.

The size of the neighbourhood decreases with each iteration.

Adjust weights of nodes in the BMU neighbourhood towards the
chosen datapoint.
–
–

6.

First Choose: Map size, map type

The learning rate decreases with each iteration
The magnitude of the adjustment is proportional to the proximity of
the node to the BMU.

Repeat Steps 2-5 for N iterations / convergence.
Dublin R / SOMs / Shane Lynn

8
Self-Organising Maps
Algorithm
1.
2.

3.
4.

Initialise all weight vectors randomly.
Choose a random datapoint from training data and present it to
the SOM
Find the “Best Matching Unit” (BMU) in the map – the most
similar node.
Determine the nodes within the “neighbourhood” of the BMU.
–

5.

The size of the neighbourhood decreases with each iteration.

Adjust weights of nodes in the BMU neighbourhood towards the
chosen datapoint.
–
–

6.

First Choose: Map size, map type

The learning rate decreases with each iteration
The magnitude of the adjustment is proportional to the proximity of
the node to the BMU.

Repeat Steps 2-5 for N iterations / convergence.
Dublin R / SOMs / Shane Lynn

9
Self-Organising Maps
Example – Color Classification

• SOM training on RGB values. (R,G,B) (255,0,0)
• 3-D dataset -> 2-D SOM representation
• Similar colours have similar RGB values / similar
position

Dublin R / SOMs / Shane Lynn

10
Self-Organising Maps

Dublin R / SOMs / Shane Lynn

11
Self-Organising Maps
• On the SOM, actual distance between nodes is not a
measure for quantitative (dis)similarity
• Unified distance matrix
– Distance between node weights of neighbouring nodes.
– Used to determine clusters of nodes on map

http://davis.wpi.edu/~matt/courses/soms/applet.html
SOM

U-matrix

Dublin R / SOMs / Shane Lynn

12
Self-Organising Maps
Heatmaps
• Heatmaps are used to discover patterns
between variables
• Heatmaps colour the map by chosen
variables – Each node coloured using
average value of all linked data points
• Can use variables not in in the training
set
Gender

Dummy Variables
• SOMs only operate with
numeric variables
• Categorical variables
must be converted to
dummy variables.

Gender_M

Gender_F

Male

1

0

Female

0

1

Female

0

1

Male

1

0

Dublin R / SOMs / Shane Lynn

13
Self-Organising Maps
Data collection
Form data with
samples relevant to
subject

Data scaling
Center and scale
data. Prevents any
one variable
dominating model

Data “capping”
Removing outliers
improves heatmap
scales

Examine heatmaps
Build SOM
Examine clusters

Dublin R / SOMs / Shane Lynn

Choose SOM
parameters and run
iterative algorithm

14
Self-Organising Maps
• “Kohonen” package in R
• Well documented with examples
• Key functions:
– somgrid()
• Initialise a SOM node set

– som()

Other Software
•
•
•
•
•
•

Viscovery
SPSS Modeler
MATLAB (NN toolbox)
WEKA
Synapse
Spice-SOM

• Can change – radius, learning rate, iterations,
neighbourhood type

– plot.kohonen()
• Visualisation of resulting SOM
• Supports heatmaps, node counts, properties, u-matrix etc.

Dublin R / SOMs / Shane Lynn

15
Census Data Example
• Data from Census 2011
• Population divided in 18,488 “small areas” (SA)
• 767 variables
available on 15
themes.
• Interactive viewing
software online.
• Data available via
csv download.
• Focusing on Dublin
area for this talk.
• Aim to segment
based on selected
variables.
Dublin R / SOMs / Shane Lynn

16
Census Data Example
data <- read.csv("../data/census-data-smallarea.csv")

• Data is formatted as person counts per SA
• Need to extract comparable statistics from count data
– Change count data to percentages / numbers etc.
marital_data <- data[, c("T1_2SGLT", "T1_2MART", "T1_2SEPT",
"T1_2DIVT", "T1_2WIDT", "T1_2T")]
marital_percents <- data.frame(t(apply(marital_data, 1,
function(x) {x[1:5]/x[6]})) * 100)

– Change “ranked” variables to numeric values
i.e. Education: [None, Primary, Secondary, 3rd Level,
Postgrad] changed to [0,1,2,3,4]
Allows us to calculate “mean education level”
Dublin R / SOMs / Shane Lynn

17
Census Data Example
# average household size
household_data <- data[, 331:339] #data on household sizes and totals
mean_household_size <- c(1:8)
#sizes 1-8 persons
results$avr_household_size <- apply(household_data, 1,
function(x){
size <- sum(x[1:length(mean_household_size)] *
mean_household_size) / x[length(x)]
}
)

• Calculated features for small areas:
> names(data)
[1] "id"
"avr_age"
"avr_education_level" "avr_num_cars"
[6] "avr_health"
"rented_percent"
"internet_percent"
"single_percent"
[11] "married_percent"
"separated_percent"
"widow_percent"

"avr_household_size"
"unemployment_percent"
"divorced_percent"

• Filtered for Fingal, Dublin City, South Dublin, & Dun Laoghaire (4806 SAs)
Dublin R / SOMs / Shane Lynn

18
Census Data Example
• SOM creation step
require(kohonen)
data_train <- data[, c(2,4,5,8)] #age, education, num cars, unemployment
data_train_matrix <- as.matrix(scale(data_train))
som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
som_model <- som(data_train_matrix,
grid=som_grid,
rlen=100,
alpha=c(0.05,0.01),
keep.data = TRUE,
n.hood=“circular” )

Dublin R / SOMs / Shane Lynn

19
Census Data Example
Kohonen functions accept
numeric matrices

• SOM creation step

scale() mean centers and
normalises all columns

require(kohonen)

data_train <- data[, c(2,4,5,8)] #age, education, num cars, unemployment
data_train_matrix <- as.matrix(scale(data_train))
som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
som_model <- som(data_train_matrix,
grid=som_grid,
rlen=100,
alpha=c(0.05,0.01),
keep.data = TRUE,
n.hood=“circular” )

Creating hexagonal SOM grid with
400 nodes (10 samples per node)
Params:
• Rlen – times to present data
• Alpha – learning rate
• Keep.data – store data in model
• n.hood – neighbourhood shape

Dublin R / SOMs / Shane Lynn

20
Census Data Example
• som_model object
>> summary(som_model)
som map of size 20x20 with a hexagonal topology.
Training data included; dimension is 4806 by 4
Mean distance to the closest unit in the map: 0.1158323

• Contains all mapping details between samples
and nodes.
• Accessible via “$” operator
• Various plotting options to visualise map
results.
Dublin R / SOMs / Shane Lynn

21
Census Data Example
plot(som_model, type = "changes")

plot(som_model, type = “counts")

Dublin R / SOMs / Shane Lynn

22
Census Data Example
Minimising “mean
distance to closest unit”

plot(som_model, type = "changes")

plot(som_model, type = “counts")

Not default colour palette. (coolBlueHotRed.R)
plot(… , palette.name=coolBluehotRed)

Dublin R / SOMs / Shane Lynn

23
Census Data Example
plot(som_model, type =
“dist.neighbours")

plot(som_model, type = “codes")

Dublin R / SOMs / Shane Lynn

24
• Fan diagram shows
distribution of variables
across map.
• Can see patterns by
examining dominant
colours etc.
• This type of
representation is useful
for SOMs when the
number of variables is
less than ~ 5
• Good to get a grasp of
general patterns in SOM

Dublin R / SOMs / Shane Lynn

25
Census Data Exampleof highest
Area
education

Area with highest
average ages

Area of highest
unemployment

Dublin R / SOMs / Shane Lynn

26
Census Data Example
plot(som_model, type = "property", property = som_model$codes[,4],
main=names(som_model$data)[4], palette.name=coolBlueHotRed)

Using training data, default
scales are normalised – not as
meaningful for users.

Dublin R / SOMs / Shane Lynn

27
Census Data Example
var <- 2 #define the variable to plot
var_unscaled <- aggregate(as.numeric(data_train[,var]),
by=list(som_model$unit.classif),
FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled,
main=names(data_train)[var], palette.name=coolBlueHotRed)

Dublin R / SOMs / Shane Lynn

28
Census Data Example
• Very simple to manually identify clusters of nodes.
• By visualising heatmaps, can spot relationships between variables.
• View heatmaps of variables not used to drive the SOM generation.
plotHeatMap(som_model, data, variable=0) #interactive window for selection

Dublin R / SOMs / Shane Lynn

29
Census Data Example
• Very simple to manually identify clusters of nodes.
• By visualising heatmaps, can spot relationships between variables.
• View heatmaps of variables not used to drive the SOM generation.
plotHeatMap(som_model, data, variable=0) #interactive window for selection

Dublin R / SOMs / Shane Lynn

30
Census Data Example
mydata <- som_model$codes
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)

• Cluster on som codebooks
• Hierarchical clustering
– Start with every node as a
cluster
– Combine most similar
nodes (once they are
neighbours)
– Continue until all combined

• User decision as to how
many clusters suit
application
Dublin R / SOMs / Shane Lynn

31
Census Data Example
som_cluster <- clusterSOM(ClusterCodebooks=som_model$codes,
LengthClustering=nrow(som_model$codes),
Mapping=som_model$unit.classif,
MapRow=som_model$grid$xdim,
MapColumn=som_model$grid$ydim,
StoppingCluster=2 )[[5]]
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main =
"Clusters")
add.cluster.boundaries(som_model, som_cluster)

Cluster 5

Cluster 4
Cluster 3

Clusters
numbered from
bottom left
- Cluster 1

Cluster 2

Dublin R / SOMs / Shane Lynn

32
Dublin R / SOMs / Shane Lynn

33
Census Data Example
- High level of unemployment
- Low average education
- Lower than average car
ownership
“Lower Demographics”

Dublin R / SOMs / Shane Lynn

34
Census Data Example
Cluster 5 details

Dublin R / SOMs / Shane Lynn

- High car ownership
- Well educated
- Relatively older demographic
- Larger households
“Commuter Belt”
35
Census Data Example

Dublin R / SOMs / Shane Lynn

36
Census Data Example
Conclusions
• Can build meaningful names
/ stories for clusters in
population
• Multiple iterations with
different variable
combinations and heatmaps
• Similar to work by
commercial data providers
“Experian”

Dublin R / SOMs / Shane Lynn

37
Ta-Feng Grocery Shopping
• More complex and realistic example of customer data
• Data set contains information on
– 817,741 grocery shopping transactions.
– 32,266 customers
customer_i
date
d
age_group
01/01/2001 141833
F
01/01/2001 1376753
E
01/01/2001 1603071
E
01/01/2001 1738667
E
01/01/2001 2141497
A
01/01/2001 1868685
J

address
F
E
G
F
B
E

product_su
bclass
130207
110217
100201
530105
320407
110109

product_id quantity
4.71E+12
2
4.71E+12
1
4.71E+12
1
4.71E+12
1
4.71E+12
1
4.71E+12
1

asset
44
150
35
94
100
144

price
52
129
39
119
159
190

• Similar steps as with census example, but with some
additional requirements
Dublin R / SOMs / Shane Lynn

38
Ta-Feng Grocery Shopping
• Feature generation using data.table - Data table
has significant speed enhancements over
standard data frame, or ddply.
> system.time(test1 <- data[, sum(price), by=customer_id])
user system elapsed
0.11
0.00
0.11
> system.time(test2 <- aggregate(data$price,
by=list(data$customer_id), FUN=sum))
user system elapsed
1.81
0.03
1.84
> system.time(test3 <- ddply(data[,c("price", "customer_id"),
with=F],
+
.variables="customer_id",
+
.fun=function(x){sum(x$price)}))
user system elapsed
5.86
0.00
5.86
Dublin R / SOMs / Shane Lynn

39
Ta-Feng Grocery Shopping
• Customer characteristics generated:
Customer ID

Total Spend

Max Item

Num Baskets

Total Items

Items per basket

Mean item cost

Profit per basket

Total Profit

Sales Items (%)

Profit Margin

Budget items (%)

Premium items (%)

Item rankings

Weekend baskets
(%)

Max basket cost

Outliers here will skew all
heat map colour scales

Dublin R / SOMs / Shane Lynn

40
Ta-Feng Grocery Shopping
• Solution is to cap all variables at 98% percentiles
• This removes extreme outliers from dataset and
prevents one node discolouring heatmaps
capVector <- function(x, probs = c(0.02,0.98)){
#remove things above and below the percentiles specified
ranges <- quantile(x, probs=probs, na.rm=T)
x[x < ranges[1]] <- ranges[1]
x[x > ranges[2]] <- ranges[2]
return(x)
}

Dublin R / SOMs / Shane Lynn

41
Ta-Feng Grocery Shopping

• Prior to capping the
variables – outliers
expand the range of
the variables
considerably

Dublin R / SOMs / Shane Lynn

42
Ta-Feng Grocery Shopping

• Histograms have
more manageable
ranges after
capping the
variables.

Dublin R / SOMs / Shane Lynn

43
Ta-Feng Grocery Shopping
• Explore heatmaps
and variable
distributions to find
patterns

Dublin R / SOMs / Shane Lynn

44
Dublin R / SOMs / Shane Lynn

45
Ta-Feng Grocery Shopping
“Steady shoppers”
• Many visits to store
• Profitable overall
• Smaller Baskets
“Bargain Hunters”
- Almost exclusively sale items
- Few visits
- Not profitable

“The Big Baskets”
- Few visits
- Largest baskets
Dublin R
- Large profit per basket/ SOMs / Shane Lynn

“The great unwashed”
- Single / small repeat visitors
- Small baskets
- Large number of customers
46
Conclusions
• SOMs are a powerful tool to have in your repertoire.
• Advantages include:
– Intuitive method to develop customer segmentation profiles.
– Relatively simple algorithm, easy to explain results to non-data
scientists
– New data points can be mapped to trained model for predictive
purposes.

• Disadvantages:
– Lack of parallelisation capabilities for VERY large data sets
– Difficult to represent very many variables in two dimensional
plane
– Requires clean, numeric data.

• Slides and code will be posted online shortly.
Dublin R / SOMs / Shane Lynn

47
Questions

Shane Lynn
www.shanelynn.ie / @shane_a_lynn
Dublin R / SOMs / Shane Lynn

48
Useful links
• Som package
– http://cran.r-project.org/web/packages/kohonen/kohonen.pdf

• Census data and online viewer
– http://census.cso.ie/sapmap/
– http://www.cso.ie/en/census/census2011smallareapopulationstatisticssaps/

• Ta Feng data set download
– http://recsyswiki.com/wiki/Grocery_shopping_datasets

Dublin R / SOMs / Shane Lynn

49

Weitere ähnliche Inhalte

Was ist angesagt?

Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Presentation on deformable model for medical image segmentation
Presentation on deformable model for medical image segmentationPresentation on deformable model for medical image segmentation
Presentation on deformable model for medical image segmentationSubhash Basistha
 
Use of data science in recommendation system
Use of data science in  recommendation systemUse of data science in  recommendation system
Use of data science in recommendation systemAkashPatil334
 
Multilayer & Back propagation algorithm
Multilayer & Back propagation algorithmMultilayer & Back propagation algorithm
Multilayer & Back propagation algorithmswapnac12
 
Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Hsing-chuan Hsieh
 
3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAMEdwardIm1
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Missing data
Missing dataMissing data
Missing datamandava57
 
CNN and its applications by ketaki
CNN and its applications by ketakiCNN and its applications by ketaki
CNN and its applications by ketakiKetaki Patwari
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 AlgorithmSelf Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 AlgorithmChenghao Jin
 
Arima model (time series)
Arima model (time series)Arima model (time series)
Arima model (time series)Kumar P
 
Deep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: Pooling
Deep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: PoolingDeep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: Pooling
Deep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: PoolingKirill Eremenko
 
A Simple Framework for Contrastive Learning of Visual Representations
A Simple Framework for Contrastive Learning of Visual RepresentationsA Simple Framework for Contrastive Learning of Visual Representations
A Simple Framework for Contrastive Learning of Visual RepresentationsSeunghyun Hwang
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Modelspetitegeek
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative ModelsKenta Oono
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 

Was ist angesagt? (20)

Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Presentation on deformable model for medical image segmentation
Presentation on deformable model for medical image segmentationPresentation on deformable model for medical image segmentation
Presentation on deformable model for medical image segmentation
 
Use of data science in recommendation system
Use of data science in  recommendation systemUse of data science in  recommendation system
Use of data science in recommendation system
 
Multilayer & Back propagation algorithm
Multilayer & Back propagation algorithmMultilayer & Back propagation algorithm
Multilayer & Back propagation algorithm
 
Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)
 
3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM3D Rigid Body Transformation for SLAM
3D Rigid Body Transformation for SLAM
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Missing data
Missing dataMissing data
Missing data
 
CNN and its applications by ketaki
CNN and its applications by ketakiCNN and its applications by ketaki
CNN and its applications by ketaki
 
Kmeans
KmeansKmeans
Kmeans
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 AlgorithmSelf Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
 
Arima model (time series)
Arima model (time series)Arima model (time series)
Arima model (time series)
 
Deep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: Pooling
Deep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: PoolingDeep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: Pooling
Deep Learning A-Z™: Convolutional Neural Networks (CNN) - Step 2: Pooling
 
A Simple Framework for Contrastive Learning of Visual Representations
A Simple Framework for Contrastive Learning of Visual RepresentationsA Simple Framework for Contrastive Learning of Visual Representations
A Simple Framework for Contrastive Learning of Visual Representations
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 

Ähnlich wie Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R

Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
Data mining techniques unit 2
Data mining techniques unit 2Data mining techniques unit 2
Data mining techniques unit 2malathieswaran29
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.pptchatbot9
 
Data science tips for data engineers
Data science tips for data engineersData science tips for data engineers
Data science tips for data engineersIBM Analytics
 
Review presentation for Orientation 2014
Review presentation for Orientation 2014Review presentation for Orientation 2014
Review presentation for Orientation 2014DUSPviz
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013Sanjeev Mishra
 
Preprocessing
PreprocessingPreprocessing
Preprocessingmmuthuraj
 
Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16Dhilsath Fathima
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmYu Liu
 
Algorithm explanations
Algorithm explanationsAlgorithm explanations
Algorithm explanationsnikita kapil
 
Air Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPA
Air Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPAAir Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPA
Air Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPASTEP_scotland
 
spatial and spatio-temporal analysis in small area
spatial and spatio-temporal analysis in small areaspatial and spatio-temporal analysis in small area
spatial and spatio-temporal analysis in small areaYonas992841
 
Salford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologySalford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologyVladyslav Frolov
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unitbhagathk
 

Ähnlich wie Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R (20)

19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Data mining techniques unit 2
Data mining techniques unit 2Data mining techniques unit 2
Data mining techniques unit 2
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
project
projectproject
project
 
project
projectproject
project
 
Data science tips for data engineers
Data science tips for data engineersData science tips for data engineers
Data science tips for data engineers
 
Review presentation for Orientation 2014
Review presentation for Orientation 2014Review presentation for Orientation 2014
Review presentation for Orientation 2014
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
FLY-SMOTE.pdf
FLY-SMOTE.pdfFLY-SMOTE.pdf
FLY-SMOTE.pdf
 
Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16Data extraction, cleanup &amp; transformation tools 29.1.16
Data extraction, cleanup &amp; transformation tools 29.1.16
 
EDA.pptx
EDA.pptxEDA.pptx
EDA.pptx
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize Algorithm
 
Algorithm explanations
Algorithm explanationsAlgorithm explanations
Algorithm explanations
 
Air Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPA
Air Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPAAir Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPA
Air Quality Modelling Tools (Aberdeen Pilot Project) Dr. Alan Hills, SEPA
 
spatial and spatio-temporal analysis in small area
spatial and spatio-temporal analysis in small areaspatial and spatio-temporal analysis in small area
spatial and spatio-temporal analysis in small area
 
Salford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologySalford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of Technology
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 

Kürzlich hochgeladen

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Kürzlich hochgeladen (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R

  • 1. Self-Organising Maps for Customer Segmentation Theory and worked examples using census and customer data sets. Talk for Dublin R Users Group 20/01/2014 Shane Lynn Ph.D. – Data Scientist www.shanelynn.ie / @shane_a_lynn
  • 2. Overview 1. Why customer segmentation? 2. What is a self-organising map? – Algorithm – Uses – Advantages 3. Example using Irish Census Data 4. Example using Ta-Feng Grocery Shopping Data Dublin R / SOMs / Shane Lynn 2
  • 3. Customer Segmentation • Customer segmentation is the application of clustering techniques to customer data • Identify cohorts of “similar” customers – common needs, priorities, characteristics. “Agent Loyal” Homogeneous Customer Base “Non-traditional” Dublin R / SOMs / Shane Lynn “Budget-concious” 3
  • 4. Customer Segmentation Typical Uses • Targeted marketing • Customer retention • Debt recovery Techniques • • • • • Single discrete variable K-means clustering Hierarchical clustering Finite mixture modelling Self Organising Maps Dublin R / SOMs / Shane Lynn 4
  • 5. Self-Organising Maps A Self-Organising Map (SOM) is a form of unsupervised neural network that produces a low (typically two) dimensional representation of the input space of the set of training samples. • First described by Teuvo Kohonen (1982) (“Kohonen Map”) • Over 10k citations referencing SOMs – most cited Finnish scientist. • Multi-dimensional input data is represented by a 2-D “map” of nodes • Topological properties of the input space are maintained in map Dublin R / SOMs / Shane Lynn 5
  • 6. Self-Organising Maps • The SOM visualisation is made up of several nodes • Input samples are “mapped” to the most similar node on the SOM. All attributes in input data are used to determine similarity. • Each node has a weight vector of same size as the input space • There is no variable / meaning to the x and y axes. All nodes have: Position Weight Vector Associated input samples Dublin R / SOMs / Shane Lynn 6
  • 7. Self-Organising Maps • Everyone in this room stands in Croke Park. • Each person compares attributes – e.g. age, gender, salary, height. • Everyone moves until they are closest to other people with the most similar attributes. (using a Euclidean distance) • If everyone holds up a card indicating their age – the result is a SOM heatmap < 20 30 40 50 > 60 Dublin R / SOMs / Shane Lynn 7
  • 8. Self-Organising Maps Algorithm 1. 2. 3. 4. Initialise all weight vectors randomly. Choose a random datapoint from training data and present it to the SOM Find the “Best Matching Unit” (BMU) in the map – the most similar node. Determine the nodes within the “neighbourhood” of the BMU. – 5. The size of the neighbourhood decreases with each iteration. Adjust weights of nodes in the BMU neighbourhood towards the chosen datapoint. – – 6. First Choose: Map size, map type The learning rate decreases with each iteration The magnitude of the adjustment is proportional to the proximity of the node to the BMU. Repeat Steps 2-5 for N iterations / convergence. Dublin R / SOMs / Shane Lynn 8
  • 9. Self-Organising Maps Algorithm 1. 2. 3. 4. Initialise all weight vectors randomly. Choose a random datapoint from training data and present it to the SOM Find the “Best Matching Unit” (BMU) in the map – the most similar node. Determine the nodes within the “neighbourhood” of the BMU. – 5. The size of the neighbourhood decreases with each iteration. Adjust weights of nodes in the BMU neighbourhood towards the chosen datapoint. – – 6. First Choose: Map size, map type The learning rate decreases with each iteration The magnitude of the adjustment is proportional to the proximity of the node to the BMU. Repeat Steps 2-5 for N iterations / convergence. Dublin R / SOMs / Shane Lynn 9
  • 10. Self-Organising Maps Example – Color Classification • SOM training on RGB values. (R,G,B) (255,0,0) • 3-D dataset -> 2-D SOM representation • Similar colours have similar RGB values / similar position Dublin R / SOMs / Shane Lynn 10
  • 11. Self-Organising Maps Dublin R / SOMs / Shane Lynn 11
  • 12. Self-Organising Maps • On the SOM, actual distance between nodes is not a measure for quantitative (dis)similarity • Unified distance matrix – Distance between node weights of neighbouring nodes. – Used to determine clusters of nodes on map http://davis.wpi.edu/~matt/courses/soms/applet.html SOM U-matrix Dublin R / SOMs / Shane Lynn 12
  • 13. Self-Organising Maps Heatmaps • Heatmaps are used to discover patterns between variables • Heatmaps colour the map by chosen variables – Each node coloured using average value of all linked data points • Can use variables not in in the training set Gender Dummy Variables • SOMs only operate with numeric variables • Categorical variables must be converted to dummy variables. Gender_M Gender_F Male 1 0 Female 0 1 Female 0 1 Male 1 0 Dublin R / SOMs / Shane Lynn 13
  • 14. Self-Organising Maps Data collection Form data with samples relevant to subject Data scaling Center and scale data. Prevents any one variable dominating model Data “capping” Removing outliers improves heatmap scales Examine heatmaps Build SOM Examine clusters Dublin R / SOMs / Shane Lynn Choose SOM parameters and run iterative algorithm 14
  • 15. Self-Organising Maps • “Kohonen” package in R • Well documented with examples • Key functions: – somgrid() • Initialise a SOM node set – som() Other Software • • • • • • Viscovery SPSS Modeler MATLAB (NN toolbox) WEKA Synapse Spice-SOM • Can change – radius, learning rate, iterations, neighbourhood type – plot.kohonen() • Visualisation of resulting SOM • Supports heatmaps, node counts, properties, u-matrix etc. Dublin R / SOMs / Shane Lynn 15
  • 16. Census Data Example • Data from Census 2011 • Population divided in 18,488 “small areas” (SA) • 767 variables available on 15 themes. • Interactive viewing software online. • Data available via csv download. • Focusing on Dublin area for this talk. • Aim to segment based on selected variables. Dublin R / SOMs / Shane Lynn 16
  • 17. Census Data Example data <- read.csv("../data/census-data-smallarea.csv") • Data is formatted as person counts per SA • Need to extract comparable statistics from count data – Change count data to percentages / numbers etc. marital_data <- data[, c("T1_2SGLT", "T1_2MART", "T1_2SEPT", "T1_2DIVT", "T1_2WIDT", "T1_2T")] marital_percents <- data.frame(t(apply(marital_data, 1, function(x) {x[1:5]/x[6]})) * 100) – Change “ranked” variables to numeric values i.e. Education: [None, Primary, Secondary, 3rd Level, Postgrad] changed to [0,1,2,3,4] Allows us to calculate “mean education level” Dublin R / SOMs / Shane Lynn 17
  • 18. Census Data Example # average household size household_data <- data[, 331:339] #data on household sizes and totals mean_household_size <- c(1:8) #sizes 1-8 persons results$avr_household_size <- apply(household_data, 1, function(x){ size <- sum(x[1:length(mean_household_size)] * mean_household_size) / x[length(x)] } ) • Calculated features for small areas: > names(data) [1] "id" "avr_age" "avr_education_level" "avr_num_cars" [6] "avr_health" "rented_percent" "internet_percent" "single_percent" [11] "married_percent" "separated_percent" "widow_percent" "avr_household_size" "unemployment_percent" "divorced_percent" • Filtered for Fingal, Dublin City, South Dublin, & Dun Laoghaire (4806 SAs) Dublin R / SOMs / Shane Lynn 18
  • 19. Census Data Example • SOM creation step require(kohonen) data_train <- data[, c(2,4,5,8)] #age, education, num cars, unemployment data_train_matrix <- as.matrix(scale(data_train)) som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal") som_model <- som(data_train_matrix, grid=som_grid, rlen=100, alpha=c(0.05,0.01), keep.data = TRUE, n.hood=“circular” ) Dublin R / SOMs / Shane Lynn 19
  • 20. Census Data Example Kohonen functions accept numeric matrices • SOM creation step scale() mean centers and normalises all columns require(kohonen) data_train <- data[, c(2,4,5,8)] #age, education, num cars, unemployment data_train_matrix <- as.matrix(scale(data_train)) som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal") som_model <- som(data_train_matrix, grid=som_grid, rlen=100, alpha=c(0.05,0.01), keep.data = TRUE, n.hood=“circular” ) Creating hexagonal SOM grid with 400 nodes (10 samples per node) Params: • Rlen – times to present data • Alpha – learning rate • Keep.data – store data in model • n.hood – neighbourhood shape Dublin R / SOMs / Shane Lynn 20
  • 21. Census Data Example • som_model object >> summary(som_model) som map of size 20x20 with a hexagonal topology. Training data included; dimension is 4806 by 4 Mean distance to the closest unit in the map: 0.1158323 • Contains all mapping details between samples and nodes. • Accessible via “$” operator • Various plotting options to visualise map results. Dublin R / SOMs / Shane Lynn 21
  • 22. Census Data Example plot(som_model, type = "changes") plot(som_model, type = “counts") Dublin R / SOMs / Shane Lynn 22
  • 23. Census Data Example Minimising “mean distance to closest unit” plot(som_model, type = "changes") plot(som_model, type = “counts") Not default colour palette. (coolBlueHotRed.R) plot(… , palette.name=coolBluehotRed) Dublin R / SOMs / Shane Lynn 23
  • 24. Census Data Example plot(som_model, type = “dist.neighbours") plot(som_model, type = “codes") Dublin R / SOMs / Shane Lynn 24
  • 25. • Fan diagram shows distribution of variables across map. • Can see patterns by examining dominant colours etc. • This type of representation is useful for SOMs when the number of variables is less than ~ 5 • Good to get a grasp of general patterns in SOM Dublin R / SOMs / Shane Lynn 25
  • 26. Census Data Exampleof highest Area education Area with highest average ages Area of highest unemployment Dublin R / SOMs / Shane Lynn 26
  • 27. Census Data Example plot(som_model, type = "property", property = som_model$codes[,4], main=names(som_model$data)[4], palette.name=coolBlueHotRed) Using training data, default scales are normalised – not as meaningful for users. Dublin R / SOMs / Shane Lynn 27
  • 28. Census Data Example var <- 2 #define the variable to plot var_unscaled <- aggregate(as.numeric(data_train[,var]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2] plot(som_model, type = "property", property=var_unscaled, main=names(data_train)[var], palette.name=coolBlueHotRed) Dublin R / SOMs / Shane Lynn 28
  • 29. Census Data Example • Very simple to manually identify clusters of nodes. • By visualising heatmaps, can spot relationships between variables. • View heatmaps of variables not used to drive the SOM generation. plotHeatMap(som_model, data, variable=0) #interactive window for selection Dublin R / SOMs / Shane Lynn 29
  • 30. Census Data Example • Very simple to manually identify clusters of nodes. • By visualising heatmaps, can spot relationships between variables. • View heatmaps of variables not used to drive the SOM generation. plotHeatMap(som_model, data, variable=0) #interactive window for selection Dublin R / SOMs / Shane Lynn 30
  • 31. Census Data Example mydata <- som_model$codes wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss) • Cluster on som codebooks • Hierarchical clustering – Start with every node as a cluster – Combine most similar nodes (once they are neighbours) – Continue until all combined • User decision as to how many clusters suit application Dublin R / SOMs / Shane Lynn 31
  • 32. Census Data Example som_cluster <- clusterSOM(ClusterCodebooks=som_model$codes, LengthClustering=nrow(som_model$codes), Mapping=som_model$unit.classif, MapRow=som_model$grid$xdim, MapColumn=som_model$grid$ydim, StoppingCluster=2 )[[5]] plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters") add.cluster.boundaries(som_model, som_cluster) Cluster 5 Cluster 4 Cluster 3 Clusters numbered from bottom left - Cluster 1 Cluster 2 Dublin R / SOMs / Shane Lynn 32
  • 33. Dublin R / SOMs / Shane Lynn 33
  • 34. Census Data Example - High level of unemployment - Low average education - Lower than average car ownership “Lower Demographics” Dublin R / SOMs / Shane Lynn 34
  • 35. Census Data Example Cluster 5 details Dublin R / SOMs / Shane Lynn - High car ownership - Well educated - Relatively older demographic - Larger households “Commuter Belt” 35
  • 36. Census Data Example Dublin R / SOMs / Shane Lynn 36
  • 37. Census Data Example Conclusions • Can build meaningful names / stories for clusters in population • Multiple iterations with different variable combinations and heatmaps • Similar to work by commercial data providers “Experian” Dublin R / SOMs / Shane Lynn 37
  • 38. Ta-Feng Grocery Shopping • More complex and realistic example of customer data • Data set contains information on – 817,741 grocery shopping transactions. – 32,266 customers customer_i date d age_group 01/01/2001 141833 F 01/01/2001 1376753 E 01/01/2001 1603071 E 01/01/2001 1738667 E 01/01/2001 2141497 A 01/01/2001 1868685 J address F E G F B E product_su bclass 130207 110217 100201 530105 320407 110109 product_id quantity 4.71E+12 2 4.71E+12 1 4.71E+12 1 4.71E+12 1 4.71E+12 1 4.71E+12 1 asset 44 150 35 94 100 144 price 52 129 39 119 159 190 • Similar steps as with census example, but with some additional requirements Dublin R / SOMs / Shane Lynn 38
  • 39. Ta-Feng Grocery Shopping • Feature generation using data.table - Data table has significant speed enhancements over standard data frame, or ddply. > system.time(test1 <- data[, sum(price), by=customer_id]) user system elapsed 0.11 0.00 0.11 > system.time(test2 <- aggregate(data$price, by=list(data$customer_id), FUN=sum)) user system elapsed 1.81 0.03 1.84 > system.time(test3 <- ddply(data[,c("price", "customer_id"), with=F], + .variables="customer_id", + .fun=function(x){sum(x$price)})) user system elapsed 5.86 0.00 5.86 Dublin R / SOMs / Shane Lynn 39
  • 40. Ta-Feng Grocery Shopping • Customer characteristics generated: Customer ID Total Spend Max Item Num Baskets Total Items Items per basket Mean item cost Profit per basket Total Profit Sales Items (%) Profit Margin Budget items (%) Premium items (%) Item rankings Weekend baskets (%) Max basket cost Outliers here will skew all heat map colour scales Dublin R / SOMs / Shane Lynn 40
  • 41. Ta-Feng Grocery Shopping • Solution is to cap all variables at 98% percentiles • This removes extreme outliers from dataset and prevents one node discolouring heatmaps capVector <- function(x, probs = c(0.02,0.98)){ #remove things above and below the percentiles specified ranges <- quantile(x, probs=probs, na.rm=T) x[x < ranges[1]] <- ranges[1] x[x > ranges[2]] <- ranges[2] return(x) } Dublin R / SOMs / Shane Lynn 41
  • 42. Ta-Feng Grocery Shopping • Prior to capping the variables – outliers expand the range of the variables considerably Dublin R / SOMs / Shane Lynn 42
  • 43. Ta-Feng Grocery Shopping • Histograms have more manageable ranges after capping the variables. Dublin R / SOMs / Shane Lynn 43
  • 44. Ta-Feng Grocery Shopping • Explore heatmaps and variable distributions to find patterns Dublin R / SOMs / Shane Lynn 44
  • 45. Dublin R / SOMs / Shane Lynn 45
  • 46. Ta-Feng Grocery Shopping “Steady shoppers” • Many visits to store • Profitable overall • Smaller Baskets “Bargain Hunters” - Almost exclusively sale items - Few visits - Not profitable “The Big Baskets” - Few visits - Largest baskets Dublin R - Large profit per basket/ SOMs / Shane Lynn “The great unwashed” - Single / small repeat visitors - Small baskets - Large number of customers 46
  • 47. Conclusions • SOMs are a powerful tool to have in your repertoire. • Advantages include: – Intuitive method to develop customer segmentation profiles. – Relatively simple algorithm, easy to explain results to non-data scientists – New data points can be mapped to trained model for predictive purposes. • Disadvantages: – Lack of parallelisation capabilities for VERY large data sets – Difficult to represent very many variables in two dimensional plane – Requires clean, numeric data. • Slides and code will be posted online shortly. Dublin R / SOMs / Shane Lynn 47
  • 48. Questions Shane Lynn www.shanelynn.ie / @shane_a_lynn Dublin R / SOMs / Shane Lynn 48
  • 49. Useful links • Som package – http://cran.r-project.org/web/packages/kohonen/kohonen.pdf • Census data and online viewer – http://census.cso.ie/sapmap/ – http://www.cso.ie/en/census/census2011smallareapopulationstatisticssaps/ • Ta Feng data set download – http://recsyswiki.com/wiki/Grocery_shopping_datasets Dublin R / SOMs / Shane Lynn 49