Copyright © 2006, SAS Institute Inc. All rights reserved.
Applications of Soft Clustering in
Fraud and Credit Risk Modeling
Vijay S. Desai
CSCC X, Edinburgh
August 29, 2007
SAS Proprietary, patents pending 2
Need for Segmentation In Risk Modeling
Can build better models for homogeneous
segments
Accounts exhibit different behavior
• Transactors use credit cards for convenience and rewards; revolvers use them for unsecured credit
• Business travelers use cards differently from infrequent
travelers
• For the same individual, behavior for card-of-choice
different from other cards owned by the individual
Data availability
• Short history for Young accounts versus Mature
accounts
Use of Clustering to Segment Accounts
Manual segmentation can be replaced by data-driven methods such as clustering
Use relevant data to cluster accounts
• Use transaction volume and revolving balance data to
cluster revolvers, transactors
• Use delinquency information to cluster clean, dirty
accounts
K-means clustering
• Starts with k randomly selected seeds and assigns data
to these clusters using some similarity measure
• Centroids of the k clusters are computed and data reassigned to clusters based upon the similarity measure; the two steps repeat until assignments stabilize
• Computationally efficient, hence the most popular
method in practice
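The two-step loop above can be sketched in a few lines. This is an illustrative Python version, not the SAS implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: random seeds, then alternating assign/update steps."""
    rng = np.random.default_rng(seed)
    # Start with k randomly selected data points as seeds
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster goes empty
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```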
Issues with Segmentation of Accounts
Behavior is not permanent
• Situational revolvers
• Business travelers use cards at home
• Card-of-choice changes with utilization, incentives
Data availability
• Hard segmentation exacerbates data availability
problems
− Especially true for rare events such as fraud
Censoring creates large gaps in data
• Issuers don’t offer high credit limits to low-credit individuals
K-means clustering
• Susceptible to local optima, outliers
• Only applicable for data with defined means
Ways to Alleviate Some Problems With Clustering
Use domain expertise to assign initial
cluster seeds
Use multiple starting points to alleviate
local optima problems
Soft clustering: each data point belongs to all clusters to a graded degree
• Cluster membership determined by
distance from center.
• Data-driven: Cluster centers and shape
updated intelligently
Soft clustering allows the same account to be used in multiple models
[Figure: cluster membership (y-axis, 0–1) versus fraction of months with a revolving balance (x-axis, 0–1), with curves for Transactors, Revolvers, and Situational Revolvers; a marked example account has memberships 0.79, 0.15, and 0.08.]
Soft Clustering Methods
Fuzzy clustering
• Fuzzy c-means clustering
• Fuzzy c-means clustering with extragrades
Possibilistic clustering
• Possibilistic c-means clustering
Kernel based clustering
• Kernel K-means
• Kernel fuzzy c-means clustering
• Kernel possibilistic c-means clustering
Fuzzy Clustering
Similar to k-means clustering
Extends the notion of membership: a data point may belong to more than one cluster
Examples:
• Situational revolvers should not be forced into revolvers or
transactors
• Accounts with a single late payment should not be forced into
current or delinquent segments
Bezdek (1981)
Fuzzy c-means Algorithm
Minimize
    J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} d_{ik}^{2}    Eq. (1)
s.t.
    \sum_{i=1}^{c} u_{ik} = 1, \quad \forall k = 1, \ldots, n    Eq. (2)
where
    0 \le u_{ik} \le 1
    v_i is the cluster center of fuzzy group i
    d_{ik} = \| x_k - v_i \| is the Euclidean distance from cluster center v_i to data point x_k
    m \in (1, \infty) is a weighting exponent
Fuzzy c-means Algorithm: Solution
Picard iterations update the centroids and memberships until the change in J(U,V) falls below a threshold
Picard iteration scheme:

Update centroids as
    v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}    Eq. (3)

Update membership as
    u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{jk} \right)^{2/(m-1)}}    Eq. (4)
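Eqs. (3) and (4) map directly onto code. A small illustrative Python version of the Picard scheme (not SAS's implementation):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means via Picard iterations on Eqs. (3) and (4)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial membership matrix, rows summing to 1 (Eq. 2)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Eq. (3): centroids as membership-weighted means
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Eq. (4): memberships from relative distances
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # avoid division by zero at a centroid
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```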
Fuzzy c-means clustering with Extragrades
McBratney and De Gruijter (1992)
Alleviates the influence of outliers by adding a
penalty term to the objective function
Potential outliers are put in “extragrades” classes
Solution involves Picard iterations updating the
centroids and membership matrices
Minimize
    J(U, V) = \alpha \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} d_{ik}^{2} + (1 - \alpha) \sum_{k=1}^{n} u_{*k}^{m} \sum_{i=1}^{c} d_{ik}^{-2}
where u_{*k} is the extragrade membership of data point k
Possibilistic Clustering
Relaxes the probabilistic constraint on membership
(Eq. (2))
Allows an observation to have low membership in all
clusters, e.g., outliers
Allows an observation to have high membership in
multiple clusters, e.g., overlapping clusters
Krishnapuram and Keller (1993), (1996)
Minimize
    J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} d_{ik}^{2} + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} (1 - u_{ik})^{m}    (1993)

Minimize
    J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} d_{ik}^{2} + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} (u_{ik} \ln u_{ik} - u_{ik})    (1996)
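Setting the derivative of the 1993 objective with respect to u_{ik} to zero gives a closed-form membership update, u_{ik} = 1 / (1 + (d_{ik}^2 / \eta_i)^{1/(m-1)}). A minimal Python sketch of that update:

```python
import numpy as np

def possibilistic_membership(d2, eta, m=2.0):
    """Membership update for the 1993 possibilistic objective:
    u_ik = 1 / (1 + (d_ik^2 / eta_i)^(1/(m-1))).
    d2:  (n, c) array of squared distances to cluster centers
    eta: (c,) cluster scale parameters"""
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))
```

Note that the rows need not sum to 1: the probabilistic constraint of Eq. (2) is relaxed, so an outlier can have low membership everywhere.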
Kernel Basics
Kernel definition K(x,y)
• A similarity measure
• Implicit mapping φ, from input space to
feature space
• Condition: K(x,y)=φ(x)•φ(y)
Theorem: K(x,y) is a valid kernel if K is
positive definite and symmetric (Mercer
Kernel)
• A function is P.D. if
    \iint K(\mathbf{x}, \mathbf{y}) f(\mathbf{x}) f(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \ge 0, \quad \forall f \in L_2
• In other words, the Gram matrix K (whose elements are K(xi,xj)) must be positive definite for all xi, xj of the input space
Kernel examples:
• Polynomial: K(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, \mathbf{y} \rangle^{d}
• Gaussian: K(\mathbf{x}, \mathbf{y}) = e^{-\| \mathbf{x} - \mathbf{y} \|^{2} / \sigma^{2}}
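As a numerical illustration (helper names are assumptions, not from the slides), the Gaussian kernel's Gram matrix can be built and the Mercer condition checked on a sample:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def gram_matrix(X, kernel):
    """Gram matrix with entries K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(0).normal(size=(5, 2))
K = gram_matrix(X, gaussian_kernel)
eigvals = np.linalg.eigvalsh(K)     # K is symmetric, so eigvalsh applies
print((eigvals >= -1e-10).all())    # Mercer condition holds: prints True
```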
Kernel Based Clustering
Kernel based clustering increases the separability of
the input data by mapping them to a new high
dimensional space
Kernel functions allow the exploration of data patterns in the new space, in contrast to K-means with Euclidean distance, which expects the data to fall into elliptical regions
Zhong, Xie and Yu (2003)
Minimize
    J_{\phi}(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} \| \Phi(x_k) - \Phi(v_i) \|^{2}
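In practice Φ is never evaluated: the feature-space distance to a cluster mean expands entirely in terms of the Gram matrix (the kernel trick). A sketch of that expansion, with assumed argument names:

```python
import numpy as np

def kernel_distance_sq(K, k, members):
    """Squared feature-space distance ||Phi(x_k) - m_i||^2, where m_i is the
    mean of Phi over the points in `members`, computed from the Gram matrix
    K alone (Phi itself is never evaluated):
        K[k,k] - (2/|C|) * sum_j K[k,j] + (1/|C|^2) * sum_{j,l} K[j,l]
    """
    members = np.asarray(members)
    return (K[k, k]
            - 2.0 * K[k, members].mean()
            + K[np.ix_(members, members)].mean())
```

With a linear kernel (Φ the identity) this reduces to the ordinary squared Euclidean distance, which gives a quick sanity check.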
Iris Data Set Description
4 variables, 3 classes, 50 observations per class
Class 1 easily separable, Classes 2 and 3 overlapping
Iris Data Set: K-Means Clustering Results
Classes 1 and 2
identified
Class 3 (Virginica)
misclassified
Iris Data: Soft Clustering Results
[Figure: four panels plot Class 2 and Class 3 membership against observation number: Fuzzy Clustering and Fuzzy Clustering with Extragrades on a membership-probability scale of 0–1; Possibilistic Clustering (1993) and Possibilistic Clustering (1996) on a membership-possibility scale of 0–2.]
Suggestions on When to Use Each Method
Very large data sets with many clusters
• K-means clustering
Data sets with thin shells, overlapping clusters
• Fuzzy K-means
• e.g., current, delinquent, dirty accounts
Data sets with outliers
• Fuzzy K-means with extragrades
Data sets with overlapping clusters, noisy data, other clustering
results not consistent with intuitive concepts of compatibility
• Possibilistic clustering
• e.g., revolvers, transactors, situational revolvers
Data sets with non-spherical density distributions, e.g., donuts,
spirals
• Kernel based clustering in feature space
• e.g., censored data
Using the Soft Clustering Results
Model building phase
• Build model for each cluster
• Use all accounts that have a significant cluster membership
probability
• Model building data weights each account in proportion to its membership probability
Scoring phase
• Score each account using all models from segments for
which account had a significant membership probability
• The final output score for the account is a weighted average
of all the segment based scores
Resulting scores are more stable, particularly for
accounts with changing behavior
Results more robust for clusters that would have few
accounts under traditional approaches
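The scoring rule described above can be sketched as follows (segment names, the model interface, and the significance threshold are all hypothetical, for illustration only):

```python
def soft_cluster_score(memberships, segment_models, account, threshold=0.05):
    """Weighted-average score across all segment models where the account
    has a significant membership probability.

    memberships:    dict segment -> membership probability (hypothetical)
    segment_models: dict segment -> callable(account) -> score (hypothetical)
    """
    # Keep only segments where membership is significant
    active = {seg: w for seg, w in memberships.items() if w >= threshold}
    total = sum(active.values())
    # Final score: membership-weighted average of the segment scores
    return sum(w * segment_models[seg](account) for seg, w in active.items()) / total
```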
Q & A
Thank you for the opportunity to present