LondonR March 2014
Jonathan Sedar, Applied AI
@jonsedar
Customer Clustering for
Retailer Marketing
An exploratory data science project with
reference to useful R packages
Customer Clustering something
something something ...
● That's a convoluted title!
● Client problem is really: “Tell me something about my
customer base”
● Time to do some “Exploratory Data Science”
● aka data hacking
● Take a sample of data, fire up an R session and let's
see what we can learn…
Process R Toolkit
Sourcing, Cleaning & Exploration
What does the data let us do?
read.table {utils}
ggplot {ggplot2}
lubridate {lubridate}
Feature Creation
Extract additional information to enrich the set
data.table {data.table}
cut2 {Hmisc}
dcast {reshape2}
Feature Selection
Reduce to a smaller dataset to speed up computation
covMcd {robustbase}
scale {base}
prcomp {stats}
Clustering
Finding similar customers without prior information
… and interpreting the results
mclust {mclust}
Overview
“Who’s shopping at my stores
and how can I market to them?”
Intro
Let's try to group them by
similar shopping behaviours
Intro
Benefits: customer profiling enables
targeted marketing and operations
Retention offers
Product promotions
Loyalty rewards
Optimise stock levels &
store layout
Bob
Lives “SW15”
Age “40”
→
Bob
Lives “SW15”
Age “40”
Type: Family First
Intro
Turning existing data into insights
A real dataset:
32,000 customers
24,000 items
800,000
transactions
… over a 4-month period
Magic
Many ways to approach the problem!
R has tools to help reach a quick solution
Intro
Exploratory data science projects tend
to have a similar structure
Model
optimisation &
interpretation
Source your
raw data
Cleaning &
importing
Exploration &
visualisation
Feature
creation
Feature
selection
Model
creation
Intro
Data sources
Sourcing,
Cleansing,
Exploration
Sample dataset
Sourcing,
Cleansing,
Exploration
“Ta-Feng” grocery shopping dataset
800,000 transactions
32,000 customer ids
24,000 product ids
4-month period over Winter 2000-2001
http://recsyswiki.com/wiki/Grocery_shopping_datasets
trans_date cust_id age res_area product_id quantity price
1: 2000-11-01 00046855 D E 4710085120468 3 57
2: 2000-11-01 00539166 E E 4714981010038 2 48
3: 2000-11-01 00663373 F E 4710265847666 1 135
...
Data definition and audit (1 of 2)
A README file, excellent...
4 ASCII text files:
# D11: Transaction data collected in November, 2000
# D12: Transaction data collected in December, 2000
# D01: Transaction data collected in January, 2001
# D02: Transaction data collected in February, 2001
Curious choice of delimiter and an extended charset
# First line: Column definition in Traditional Chinese
# §È¥¡;∑|≠˚•d∏π;¶~ƒ÷;∞œ∞Ï;∞”´~§¿√˛;∞”´~ΩsΩX;º∆∂q;¶®•ª;æP∞‚
# Second line and the rest: data columns separated by ";"
Preclean with shell tools
awk -F";" 'gsub(":","",$1)' D02
Sourcing,
Cleansing,
Exploration
Data definition and audit (2 of 2)
Be prepared for undocumented gotchas:
# 1: Transaction date and time (time invalid and useless)
# 2: Customer ID
# 3: Age: 10 possible values,
# A <25, B 25-29, C 30-34, D 35-39, E 40-44, F 45-49, G 50-54, H 55-59, I 60-64, J >65
# actually there's 22362 rows with value K, will assume it's Unknown
# 4: Residence Area: 8 possible values,
# A-F: zipcode area: 105,106,110,114,115,221,G: others, H: Unknown
# Distance to store, from the closest: 115,221,114,105,106,110
# so we’ll factor this with levels "E","F","D","A","B","C","G","H"
# 5: Product subclass
# 6: Product ID
# 7: Amount
# 8: Asset # not explained, low values, not an id, will ignore
# 9: Sales price
Sourcing,
Cleansing,
Exploration
Import & preprocess (read.table)
Sourcing,
Cleansing,
Exploration
Read each file into a data.table, whilst applying basic data types
> dtnov <- data.table(read.table(fqn
,col.names=cl$names
,colClasses=cl$types
,encoding="UTF-8"
,stringsAsFactors=F));
Alternatives include RODBC / RPostgreSQL
> con <- dbConnect(dbDriver("PostgreSQL"), host="localhost", port=5432
,dbname="tafeng", user="jon", password="")
> dtnov <- dbGetQuery(con,"select * from NovTransactions")
Import & preprocess (lubridate)
Sourcing,
Cleansing,
Exploration
Convert some datatypes to be more useful:
String -> POSIXct datetime (UNIX time UTC) using lubridate
> dtraw[,trans_date:= ymd(trans_date)]
> cat(ymd("2013-11-05"))
1383609600
… also, applying factor levels to the residential area
> dtraw[,res_area:= factor(res_area
,levels=c("E","F","D","A","B","C","G","H") )]
Explore: Group By (data.table)
Sourcing,
Cleansing,
Exploration
How many transactions, dates, customers, products, product subclasses?
> nrow(dt[,.N,by=cust_id]) # 32,266
Using data.table's dt[i, j, by] structure where:
i subselects rows SQL WHERE
j selects / creates columns SQL SELECT
by groups by columns SQL GROUP BY
e.g. the above is:
select count (*)
from dt
group by cust_id
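A minimal sketch (not from the slides) combining all three parts in one call; the
December date filter is purely illustrative, the column names are those of the Ta-Feng set:
> # i: December rows only; j: total spend; by: per customer
> dt[trans_date >= ymd("2000-12-01") & trans_date <= ymd("2000-12-31")
,list(spend=sum(quantity*price))
,by=cust_id]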
Clean: logic-level cleaning example
Sourcing,
Cleansing,
Exploration
Product hierarchies: assumed many product_ids <-> one product_category,
but actually some product_ids belong to 2 or 3 product_cats:
> transcatid <- dt[,list(nbask=length(trans_id)),by=list(prod_cat,
prod_id)]
> transid <- transcatid[,list(ncat=length(prod_cat),nbask=sum(nbask))
,by=prod_id]
> transid[,length(prod_id),by=ncat]
ncat V1
1: 1 23557
2: 2 253
3: 3 2
Solution: dedupe. Keep prod_id-prod_cat combos with the largest nbask
> ids <- transid[ncat>1,prod_id]
> transcatid[prod_id %in% ids,rank :=rank(-nbask),by=prod_id]
> goodprodcat <- transcatid[is.na(rank) | rank ==1,list(prod_cat,
prod_id)]
> dt <- merge(dt,goodprodcat)
Explore: Visualise (1 of 4) (ggplot)
Sourcing,
Cleansing,
Exploration
e.g. transactions by date
p1 <- ggplot(dt[,list(num_trans=length(trans_id)),by=trans_date]) +
geom_bar(aes(x=trans_date,y=num_trans),stat='identity',alpha=0.8)
plot(p1)
Explore: Visualise (1.5 of 4) (ggplot)
Sourcing,
Cleansing,
Exploration
e.g. transactions by date (alternate plotting)
p1b <- ggplot(dt[,list(num_trans=length(trans_id)),by=trans_date]) +
geom_point(aes(x=trans_date,y=num_trans),stat='identity',alpha=0.8) +
geom_smooth(aes(x=trans_date,y=num_trans), method='loess',alpha=0.8)
plot(p1b)
Explore: Visualise (2 of 4) (ggplot)
Sourcing,
Cleansing,
Exploration
e.g. histogram count of customers with N items bought
p2 <- ggplot(dt[,list(numitem=length(trans_id)),by=cust_id]) +
geom_bar(aes(x=numitem),stat='bin',binwidth=10,alpha=0.8,fill=orange) +
coord_cartesian(xlim=c(0,200))
plot(p2)
Explore: Visualise (3 of 4) (ggplot)
Sourcing,
Cleansing,
Exploration
e.g. scatterplot of total items vs total baskets per customer
p4a <- ggplot(dttt) +
geom_point(aes(x=numbask,y=numitem),size=1,alpha=0.8) +
geom_smooth(aes(x=numbask,y=numitem),method="lm")
plot(p4a)
Explore: Visualise (4 of 4) (ggplot)
Sourcing,
Cleansing,
Exploration
e.g. scatterplot of total items vs total baskets per customer per res_area
p5 <- ggplot(dttt) +
geom_point(aes(x=numbask,y=numitem,color=res_area),size=1,alpha=0.8) +
geom_smooth(aes(x=numbask,y=numitem),method="lm",color=colorBlind[1]) +
facet_wrap(~res_area)
plot(p5)
A-F: zipcode area:
105,106,110,114,115,221
G: others
H: Unknown
Dist to store, from closest:
E < F < D < A < B < C
Can we find or extract more info?
Feature
Creation
Create New Features
Feature
Creation
Per customer (32,000 of them):
Counts:
# total baskets (==unique days)
# total items
# total spend
# unique prod_subclass, unique prod_id
Distributions (min - med - max will do):
# items per basket
# spend per basket
# product_ids , prod_cats per basket
# duration between visits
Product preferences
# prop. of baskets in the N bands of product cats & ids by item pop.
# prop. of baskets in the N bands of product ids by item price
Frequency Summaries (data.table)
Feature
Creation
Pretty straightforward use of group by with data.table
> counts <- dt[,list(nbask=length(trans_id)
,nitem=sum(quantity)
,spend=sum(quantity*price))
,by=list(cust_id)]
> setkey(counts,cust_id)
> counts
cust_id nbask nitem spend
1: 00046855 1 3 171
2: 00539166 4 8 300
3: 00663373 1 1 135
Frequency Distributions (data.table)
Feature
Creation
Again, making use of data.table and a list group-by to form a new data table
> dists_ispb <- dt[,list(nitem=sum(quantity)
,spend=sum(quantity*price))
,by=list(cust_id,trans_date)]
> dists_ispb <- dists_ispb[,list(ipb_max=max(nitem)
,ipb_med=median(nitem)
,ipb_min=min(nitem)
,spb_max=max(spend)
,spb_med=median(spend)
,spb_min=min(spend))
,by=cust_id]
> setkey(dists_ispb,cust_id)
Considerations: is it acceptable
to lose datapoints?
Feature
Creation
Feature: duration between visits
If customers visited once only, they have value NA - causes issues later
Solutions:
A: remove them from modelling? wasteful in this case (lose 30%!)
But maybe we don’t care about classifying one-time shoppers
B: or give them all the same value
But which value? all == 0 isn't quite true, and a large block of identical values will skew the clustering
C: impute values based on the global mean and SD of each col
Usually a reasonable fix, except for ratio columns, where it is clumsy
and likely misleading, requiring mirroring to keep the axes positive
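A minimal sketch (not from the slides) of building the visit-gap feature and applying
option C; the object and column names here (visits, gaps, gap_med) are illustrative:
> # distinct visit dates per customer (a basket == a unique day in this data)
> visits <- unique(dt[,list(cust_id,trans_date)])
> setkey(visits,cust_id,trans_date)
> visits[,day:=as.numeric(trans_date)/86400] # POSIXct seconds -> days
> gaps <- visits[,list(gap_med=median(diff(day))),by=cust_id]
> # one-visit customers get NA; option C: impute from the global mean and sd (clip at 0 if needed)
> m <- mean(gaps$gap_med,na.rm=T); s <- sd(gaps$gap_med,na.rm=T)
> gaps[is.na(gap_med),gap_med:=rnorm(.N,mean=m,sd=s)]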
Product Preferences (1 of 2) (Hmisc)
Feature
Creation
Trickier, since we don't have a product hierarchy
e.g. Food > Bakery > Bread > Sliced White > Hovis
But we do have price per unit & an inherent measure of popularity in the
transaction log, e.g.
> priceid <- dt[,list(aveprice=median(price)),by=prod_id]
> priceid[,prodid_pricerank:=LETTERS[as.numeric(cut2(aveprice,g=5))]]
# A low, E high
> priceid
prod_id aveprice prodid_pricerank
1: 4710085120468 21 A
2: 4714981010038 26 A
3: 4710265847666 185 D
Product Preferences (2 of 2) (dcast)
Feature
Creation
Now:
1. Merge the product price class back onto each transaction row
2. Reformat and sum transaction count in each class per customer id, e.g.
> dtpop_prodprice <- data.table(dcast(dtpop
,cust_id~prodid_pricerank
,value.var="trans_id"))
> dtpop_prodprice
cust_id A B C D E
1: 00001069 1 3 3 2 2
2: 00001113 7 1 4 5 1
3: 00001250 6 4 0 2 2
3. And further process to transform counts to proportions per row
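A minimal sketch of step 3 (assuming the count columns A-E from the cast above; the
ntot and prop_ column names are illustrative):
> pricecols <- c("A","B","C","D","E")
> dtpop_prodprice[,ntot:=rowSums(.SD),.SDcols=pricecols] # total baskets per customer
> propcols <- paste0("prop_",pricecols)
> dtpop_prodprice[,(propcols):=lapply(.SD,function(x) x/ntot),.SDcols=pricecols]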
Too many features, outliers?
Feature
Selection
We currently have a data.table of 20,000 customers x 40 synthetic features
… which we hope represent their behaviour sufficiently to distinguish them
However, more features -> heavier processing for clustering
Can we / should we lighten the load?
Clean: Removing Outliers
Feature
Selection
Clustering doesn't tend to like outliers; can we safely remove some?
Simple method: calculate a z-score per feature and threshold at some number of std devs
(Pre / post feature distributions shown as histograms on the slide)
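A minimal sketch of the simple z-score filter (the 3 s.d. cutoff is arbitrary; the later
PCA prep slide applies the same idea with a 6 s.d. cutoff), assuming cst is the
customer-level feature table:
> featcols <- setdiff(colnames(cst),"cust_id")
> z <- scale(cst[,featcols,with=F]) # centre and scale to s.d. units
> keep <- apply(abs(z),1,max) <= 3 # TRUE where no feature exceeds 3 s.d.
> cst_noout <- cst[keep]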
Clean: Removing Outliers (covMcd)
Feature
Selection
A fairer method: calculate the threshold based on Mahalanobis distance from
the group's 'central' value (Minimum Covariance Determinant).
We'll use covMcd, which implements Rousseeuw's FastMCD.
● Find a given proportion (h) of "good" observations which are not
outliers and compute their empirical covariance matrix.
● Rescale to compensate for the selection of observations.
● Assign weights to observations according to their Mahalanobis
distance across all features.
● Cutoff at a threshold value, conventionally h/2
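A minimal usage sketch (not from the slides), assuming a scaled numeric feature table
like the cstZall built in the PCA prep slide; the alpha value and chi-squared cutoff
shown here are common choices, not the talk's exact settings:
> library(robustbase)
> featcols <- setdiff(colnames(cstZall),"cust_id")
> X <- as.matrix(cstZall[,featcols,with=F]) # scaled numeric features only
> mcd <- covMcd(X,alpha=0.75) # h = 75% "good" observations
> d2 <- mcd$mah # robust squared Mahalanobis distances
> cst_clean <- cstZall[d2 <= qchisq(0.975,df=ncol(X))] # drop points beyond the cutoff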
Principal Components Analysis (PCA)
Feature
Selection
Standard method for reducing dimensionality, maps each datapoint to a
new coordinate system created from principal components (PCs).
PCs are ordered:
● 1st PC aligned to max variance
in all features
● 2nd PC aligned to remaining
max variance in orthogonal
plane
...etc.
Each datapoint had N features; now
it has N PC values which are a
composite of the original features.
We can now use the first few PCs
for clustering and keep the majority
of the variance.
PCA (1 of 2) (scale & prcomp)
Feature
Selection
Scale first, so we can remove extreme outliers in original features
> cstZi_sc <- scale(cst[,which(!colnames(cst) %in% c("cust_id")),with=F])
> cstZall <- data.table(cst[,list(cust_id)],cstZi_sc)
Now all features are in units of 1 s.d.
For each row, if any one feature has value > 6 s.d., record in a filter vector
> sel <- apply(cstZall[,colnames(cstZi),with=F]
,1
,function(x){if(max(abs(x))>6){T}else{F}})
> cstZoutliers <- cstZall[sel]
> nrow(cstZoutliers) # 830 (/20381 == 4% loss)
Health warning: we’ve moved the centre, but prcomp will re-center for us
PCA (2 of 2) (scale & prcomp)
Feature
Selection
And now run prcomp to generate PCs
> cstPCAi <- prcomp(cstZ[,which(!colnames(cstZ) %in% c("cust_id")),
with=F])
> cstPCAi$sdev # sqrt of eigenvalues
> cstPCAi$rotation # loadings
> cstPCAi$x # PCs (aka scores)
> summary(cstPCAi)
Importance of components:
PC1 PC2 PC3 PC4 … etc
Standard deviation 2.1916 1.8488 1.7923 1.37567
Proportion of Variance 0.1746 0.1243 0.1168 0.06882
Cumulative Proportion 0.1746 0.2989 0.4158 0.48457
Wait, prcomp vs princomp?
Apparently princomp is faster but potentially less accurate. Performance of
prcomp is very acceptable on this small dataset (20,000 x 40)
PCA quick visualisation (3 of 2)
Feature
Selection
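The slide shows plots only; a minimal sketch of two quick views (cumulative variance
explained and a PC1/PC2 scatter), assuming cstPCAi from the previous slide:
> # cumulative proportion of variance explained, by component
> plot(cumsum(cstPCAi$sdev^2)/sum(cstPCAi$sdev^2),type="b"
,xlab="PC",ylab="cumulative proportion of variance")
> # scatter of the first two PC scores
> pcs <- data.table(cstPCAi$x[,1:2])
> plot(ggplot(pcs) + geom_point(aes(x=PC1,y=PC2),size=1,alpha=0.3))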
Finally! Let's do some clustering
Clustering
Unsupervised learning
Many techniques inc.
Hierarchical hclust{stats}
K-Means kmeans{stats}
DBSCAN dbscan{fpc}
http://cran.r-project.org/web/views/Cluster.html
… we’re going to play with
mixture models
Finite Mixture Modelling
Clustering
Assume each datapoint
has a mixture of classes,
each explained by a
different model.
Pick a number of
models and fit to the
data, best fit wins
Gaussian Mixture Modelling (GMM)
Clustering
● Models have Gaussian dist., we can vary params
● Place N models at a random point, move and fit to data
using Expectation Maximisation (EM) algorithm
● EM is an iterative method
of finding the local
maximum likelihood
estimate.
● Slow but effective
● GMM advantage over
e.g. k-means is ability
to vary model params
to fit better
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
Of course there’s an R package (mclust)
Clustering
Mclust v4 provides:
● Clustering,
classification,
density estimation
● Auto parameter
estimation
● Excellent default
plotting to aid live
investigation
In CRAN; details at http://www.stat.washington.edu/mclust/
Finding the optimal # models (mclust)
Clustering
Will automatically iterate over a number of models
(components) and covariance params
Will use the
combination with
best fit (highest
BIC)
Best fit here: C5, VVV (5 components, 'VVV' covariance parameterisation)
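The slide shows the BIC comparison plot; a minimal sketch of the call, assuming the
first five PC scores are clustered (the G range and number of PCs here are illustrative):
> library(mclust)
> pcs <- cstPCAi$x[,1:5] # first 5 principal component scores
> fit <- Mclust(pcs,G=1:9) # fits 1-9 components across covariance models
> summary(fit) # best model by BIC
> plot(fit,what="BIC") # compare BIC across models / component counts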
Interpreting model fit (1 of 3) (mclust)
Clustering
The classification pairs-plot lets us view the clustering by
principal component axis-pairs
(Pairs plot of the clustering over PC1-PC5 shown on the slide)
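A minimal sketch of producing that view (fit as in the previous sketch):
> plot(fit,what="classification") # PC pairs-plot coloured by assigned cluster
> head(fit$classification) # hard cluster label per customer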
Interpreting model fit (2 of 3) (mclust)
Clustering
(PC1 vs PC2 view of the clustering shown on the slide)
'Read' the distributions w.r.t. PC components
PC1: “Variety axis”
Distinct products per basket and raw
count of distinct products overall
prodctpb_max 0.85
prodctpb_med 0.81
ipb_med 0.77
ipb_max 0.77
nprodcat 0.75
PC2: “Spendy axis”
Prop. baskets containing expensive
items, and simply raw count of items
and visits
popcat_nbaskE -0.71
popid_nbaskE -0.69
popcat_nbaskD 0.60
nbask -0.51
nitem -0.51
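A minimal sketch of pulling such loadings out of the PCA object (cstPCAi as before; the
top-5 cut is arbitrary):
> r <- cstPCAi$rotation
> round(head(r[order(-abs(r[,"PC1"])),"PC1"],5),2) # biggest PC1 loadings, with sign
> round(head(r[order(-abs(r[,"PC2"])),"PC2"],5),2) # biggest PC2 loadings, with sign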
Interpreting model fit (3 of 3) (mclust)
Clustering
PC2: Reduced selection of
expensive items, fewer items
PC1: Greater Variety
'Read' the distributions w.r.t. PC components
Bob
Spendier,
higher variety,
family oriented?
Charles
Thriftier,
reduced selection,
shopping to a
budget?
Process R Toolkit
Sourcing, Cleaning & Exploration
What does the data let us do?
read.table {utils}
ggplot {ggplot2}
lubridate {lubridate}
Feature Creation
Extract additional information to enrich the set
data.table {data.table}
cut2 {Hmisc}
dcast {reshape2}
Feature Selection
Reduce to a smaller dataset to speed up computation
covMcd {robustbase}
scale {base}
prcomp {stats}
Clustering
Finding similar customers without prior information
… and interpreting the results
mclust {mclust}
We covered:
LondonR March 2014
Jonathan Sedar, Applied AI
@jonsedar
Thank you
Any questions?